Amazon Redshift Database Developer Guide


Amazon Redshift
Table of Contents
Welcome
- Are You a First-Time Amazon Redshift User?
- Are You a Database Developer?
- Prerequisites
Amazon Redshift System Overview
- Data Warehouse System Architecture
- Performance
- Columnar Storage
- Internal Architecture and System Operation
- Workload Management
- Using Amazon Redshift with Other Services
Getting Started Using Databases
- Step 1: Create a Database
- Step 2: Create a Database User
  - Delete a Database User
- Step 3: Create a Database Table
  - Insert Data Rows into a Table
  - Select Data from a Table
- Step 4: Load Sample Data
- Step 5: Query the System Tables
- Step 6: Cancel a Query
  - Cancel a Query from Another Session
  - Cancel a Query Using the Superuser Queue
- Step 7: Clean Up Your Resources
Building a Proof of Concept for Amazon Redshift
- Identifying the Goals of the Proof of Concept
- Setting Up Your Proof of Concept
  - Designing and Setting Up Your Cluster
  - Converting Your Schema and Setting Up the Datasets
- Cluster Design Considerations
- Amazon Redshift Evaluation Checklist
- Benchmarking Your Amazon Redshift Evaluation
- Additional Resources
Amazon Redshift Best Practices
- Amazon Redshift Best Practices for Designing Tables
- Amazon Redshift Best Practices for Loading Data
- Amazon Redshift Best Practices for Designing Queries
- Working with Recommendations from Amazon Redshift Advisor
  - Viewing Amazon Redshift Advisor Recommendations in the Console
  - Amazon Redshift Advisor Recommendations
Tutorial: Tuning Table Design
- Prerequisites
- Steps
- Step 1: Create a Test Data Set
  - To Create a Test Data Set
  - Next Step
- Step 2: Test System Performance to Establish a Baseline
  - To Test System Performance to Establish a Baseline
  - Next Step
- Step 3: Select Sort Keys
  - To Select Sort Keys
  - Next Step
- Step 4: Select Distribution Styles
- Step 5: Review Compression Encodings
  - To Review Compression Encodings
  - Next Step
- Step 6: Recreate the Test Data Set
  - To Recreate the Test Data Set
  - Next Step
- Step 7: Retest System Performance After Tuning
  - To Retest System Performance After Tuning
  - Next Step
- Step 8: Evaluate the Results
  - Next Step
- Step 9: Clean Up Your Resources
  - Next Step
- Summary
  - Next Step
Tutorial: Loading Data from Amazon S3
- Prerequisites
- Overview
- Steps
- Step 1: Launch a Cluster
  - Next Step
- Step 2: Download the Data Files
  - Next Step
- Step 3: Upload the Files to an Amazon S3 Bucket
  - Next Step
- Step 4: Create the Sample Tables
  - Next Step
- Step 5: Run the COPY Commands
  - COPY Command Syntax
  - Loading the SSB Tables
- Step 6: Vacuum and Analyze the Database
  - Next Step
- Step 7: Clean Up Your Resources
  - Next
- Summary
  - Next Step
Tutorial: Configuring Workload Management (WLM) Queues to Improve Query Processing
- Overview
  - Prerequisites
  - Sections
- Section 1: Understanding the Default Queue Processing Behavior
- Section 2: Modifying the WLM Query Queue Configuration
- Section 3: Routing Queries to Queues Based on User Groups and Query Groups
- Section 4: Using wlm_query_slot_count to Temporarily Override Concurrency Level in a Queue
  - Step 1: Override the Concurrency Level Using wlm_query_slot_count
    - To Override the Concurrency Level Using wlm_query_slot_count
  - Step 2: Run Queries from Different Sessions
    - To Run Queries from Different Sessions
- Section 5: Cleaning Up Your Resources
Tutorial: Querying Nested Data with Amazon Redshift Spectrum
- Overview
  - Prerequisites
- Step 1: Create an External Table That Contains Nested Data
- Step 2: Query Your Nested Data in Amazon S3 with SQL Extensions
- Nested Data Use Cases
- Nested Data Limitations
Managing Database Security
- Amazon Redshift Security Overview
- Default Database User Privileges
- Superusers
- Users
  - Creating, Altering, and Deleting Users
- Groups
  - Creating, Altering, and Deleting Groups
- Schemas
- Example for Controlling User and Group Access
Designing Tables
- Choosing a Column Compression Type
- Choosing a Data Distribution Style
- Choosing Sort Keys
- Defining Constraints
- Analyzing Table Design
Using Amazon Redshift Spectrum to Query External Data
- Amazon Redshift Spectrum Overview
  - Amazon Redshift Spectrum Regions
  - Amazon Redshift Spectrum Considerations
- Getting Started with Amazon Redshift Spectrum
- IAM Policies for Amazon Redshift Spectrum
- Creating Data Files for Queries in Amazon Redshift Spectrum
- Creating External Schemas for Amazon Redshift Spectrum
  - Working with Amazon Redshift Spectrum External Catalogs
- Creating External Tables for Amazon Redshift Spectrum
- Improving Amazon Redshift Spectrum Query Performance
- Monitoring Metrics in Amazon Redshift Spectrum
- Troubleshooting Queries in Amazon Redshift Spectrum
Loading Data
- Using a COPY Command to Load Data
- Updating Tables with DML Commands
- Updating and Inserting New Data
- Performing a Deep Copy
- Analyzing Tables
- Vacuuming Tables
- Managing Concurrent Write Operations
Unloading Data
- Unloading Data to Amazon S3
- Unloading Encrypted Data Files
- Unloading Data in Delimited or Fixed-Width Format
- Reloading Unloaded Data
Creating User-Defined Functions
- UDF Security and Privileges
- Creating a Scalar SQL UDF
  - Scalar SQL Function Example
- Creating a Scalar Python UDF
- Naming UDFs
  - Overloading Function Names
  - Preventing UDF Naming Conflicts
- Logging Errors and Warnings in UDFs
Tuning Query Performance
- Query Processing
- Analyzing and Improving Queries
- Troubleshooting Queries
Implementing Workload Management
- Defining Query Queues
- WLM Query Queue Hopping
- Short Query Acceleration
  - Maximum Run Time for Short Queries
  - Monitoring SQA
- Modifying the WLM Configuration
- WLM Queue Assignment Rules
  - Queue Assignments Example
- Assigning Queries to Queues
- WLM Dynamic and Static Configuration Properties
  - WLM Dynamic Memory Allocation
  - Dynamic WLM Example
- WLM Query Monitoring Rules
- WLM System Tables and Views
SQL Reference
- Amazon Redshift SQL
  - SQL Functions Supported on the Leader Node
    - Examples
  - Amazon Redshift and PostgreSQL
- Using SQL
- SQL Commands
- SQL Functions Reference
- Reserved Words
System Tables Reference
- System Tables and Views
- Types of System Tables and Views
- Visibility of Data in System Tables and Views
  - Filtering System-Generated Queries
- STL Tables for Logging
- STV Tables for Snapshot Data
- System Views
- System Catalog Tables
Configuration Reference
- Modifying the Server Configuration
- analyze_threshold_percent
- datestyle
- describe_field_name_in_uppercase
- enable_result_cache_for_session
  - Values (Default in Bold)
  - Description
- extra_float_digits
  - Values (Default in Bold)
  - Description
- max_cursor_result_set_size
  - Values (Default in Bold)
  - Description
- query_group
  - Values (Default in Bold)
  - Description
- search_path
- statement_timeout
- timezone
- wlm_query_slot_count
Sample Database
- CATEGORY Table
- DATE Table
- EVENT Table
- VENUE Table
- USERS Table
- LISTING Table
- SALES Table
Appendix: Time Zone Names and Abbreviations
- Time Zone Names
- Time Zone Abbreviations
Document History
- Earlier Updates

Amazon Redshift

Database Developer Guide

API Version 2012-12-01

Amazon Redshift: Database Developer Guide

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

Welcome
- Are You a First-Time Amazon Redshift User?
- Are You a Database Developer?
- Prerequisites

Amazon Redshift System Overview
- Data Warehouse System Architecture
- Performance
  - Massively Parallel Processing
  - Columnar Data Storage
  - Data Compression
  - Query Optimizer
  - Result Caching
  - Compiled Code
- Columnar Storage
- Internal Architecture and System Operation
- Workload Management
- Using Amazon Redshift with Other Services
  - Moving Data Between Amazon Redshift and Amazon S3
  - Using Amazon Redshift with Amazon DynamoDB
  - Importing Data from Remote Hosts over SSH
  - Automating Data Loads Using AWS Data Pipeline
  - Migrating Data Using AWS Database Migration Service (AWS DMS)

Getting Started Using Databases
- Step 1: Create a Database
- Step 2: Create a Database User
  - Delete a Database User
- Step 3: Create a Database Table
  - Insert Data Rows into a Table
  - Select Data from a Table
- Step 4: Load Sample Data
- Step 5: Query the System Tables
  - View a List of Table Names
  - View Database Users
  - View Recent Queries
  - Determine the Process ID of a Running Query
- Step 6: Cancel a Query
  - Cancel a Query from Another Session
  - Cancel a Query Using the Superuser Queue
- Step 7: Clean Up Your Resources

Proof of Concept Playbook
- Identifying the Goals of the Proof of Concept
- Setting Up Your Proof of Concept
  - Designing and Setting Up Your Cluster
  - Converting Your Schema and Setting Up the Datasets
- Cluster Design Considerations
- Amazon Redshift Evaluation Checklist
- Benchmarking Your Amazon Redshift Evaluation
- Additional Resources

Amazon Redshift Best Practices
- Best Practices for Designing Tables
  - Take the Tuning Table Design Tutorial
  - Choose the Best Sort Key
  - Choose the Best Distribution Style
  - Use Automatic Compression
  - Define Constraints
  - Use the Smallest Possible Column Size
  - Using Date/Time Data Types for Date Columns
- Best Practices for Loading Data
  - Take the Loading Data Tutorial
  - Take the Tuning Table Design Tutorial
  - Use a COPY Command to Load Data
  - Use a Single COPY Command
  - Split Your Load Data into Multiple Files
  - Compress Your Data Files
  - Use a Manifest File
  - Verify Data Files Before and After a Load
  - Use a Multi-Row Insert
  - Use a Bulk Insert
  - Load Data in Sort Key Order
  - Load Data in Sequential Blocks
  - Use Time-Series Tables
  - Use a Staging Table to Perform a Merge
  - Schedule Around Maintenance Windows
- Best Practices for Designing Queries
- Working with Advisor
  - Access Advisor
  - Advisor Recommendations

Tutorial: Tuning Table Design
- Prerequisites
- Steps
- Step 1: Create a Test Data Set
  - To Create a Test Data Set
  - Next Step
- Step 2: Establish a Baseline
  - To Test System Performance to Establish a Baseline
  - Next Step
- Step 3: Select Sort Keys
  - To Select Sort Keys
  - Next Step
- Step 4: Select Distribution Styles
  - Distribution Styles
  - To Select Distribution Styles
  - Next Step
- Step 5: Review Compression Encodings
  - To Review Compression Encodings
  - Next Step
- Step 6: Recreate the Test Data Set
  - To Recreate the Test Data Set
  - Next Step
- Step 7: Retest System Performance After Tuning
  - To Retest System Performance After Tuning
  - Next Step
- Step 8: Evaluate the Results
  - Next Step
- Step 9: Clean Up Your Resources
  - Next Step
- Summary
  - Next Step

Tutorial: Loading Data from Amazon S3
- Prerequisites
- Overview
- Steps
- Step 1: Launch a Cluster
  - Next Step
- Step 2: Download the Data Files
  - Next Step
- Step 3: Upload the Files to an Amazon S3 Bucket
  - Next Step
- Step 4: Create the Sample Tables
  - Next Step
- Step 5: Run the COPY Commands
  - COPY Command Syntax
  - Loading the SSB Tables
- Step 6: Vacuum and Analyze the Database
  - Next Step
- Step 7: Clean Up Your Resources
  - Next
- Summary
  - Next Step

Tutorial: Configuring WLM Queues to Improve Query Processing
- Overview
  - Prerequisites
  - Sections
- Section 1: Understanding the Default Queue Processing Behavior
  - Step 1: Create the WLM_QUEUE_STATE_VW View
  - Step 2: Create the WLM_QUERY_STATE_VW View
  - Step 3: Run Test Queries
- Section 2: Modifying the WLM Query Queue Configuration
  - Step 1: Create a Parameter Group
  - Step 2: Configure WLM
  - Step 3: Associate the Parameter Group with Your Cluster
- Section 3: Routing Queries to Queues Based on User Groups and Query Groups
  - Step 1: View Query Queue Configuration in the Database
  - Step 2: Run a Query Using the Query Group Queue
  - Step 3: Create a Database User and Group
  - Step 4: Run a Query Using the User Group Queue
- Section 4: Using wlm_query_slot_count to Temporarily Override Concurrency Level in a Queue
  - Step 1: Override the Concurrency Level Using wlm_query_slot_count
  - Step 2: Run Queries from Different Sessions
- Section 5: Cleaning Up Your Resources

Tutorial: Querying Nested Data with Amazon Redshift Spectrum
- Overview
  - Prerequisites
- Step 1: Create an External Table That Contains Nested Data
- Step 2: Query Your Nested Data in Amazon S3 with SQL Extensions
  - Extension 1: Access to Columns of Structs
  - Extension 2: Ranging Over Arrays in a FROM Clause
  - Extension 3: Accessing an Array of Scalars Directly Using an Alias
  - Extension 4: Accessing Elements of Maps
- Nested Data Use Cases
  - Ingesting Nested Data
  - Aggregating Nested Data with Subqueries
  - Joining Amazon Redshift and Nested Data
- Nested Data Limitations

Managing Database Security
- Amazon Redshift Security Overview
- Default Database User Privileges
- Superusers
- Users
  - Creating, Altering, and Deleting Users
- Groups
  - Creating, Altering, and Deleting Groups
- Schemas
  - Creating, Altering, and Deleting Schemas
  - Search Path
  - Schema-Based Privileges
- Example for Controlling User and Group Access

Designing Tables ............................................................................................................................ 118

Choosing a Column Compression Type ...................................................................................... 118

Compression Encodings ................................................................................................... 119

Testing Compression Encodings ........................................................................................ 125

Example: Choosing Compression Encodings for the CUSTOMER Table .................................... 127

Choosing a Data Distribution Style ........................................................................................... 129

Data Distribution Concepts .............................................................................................. 129

Distribution Styles .......................................................................................................... 130

Viewing Distribution Styles .............................................................................................. 131

Evaluating Query Patterns ............................................................................................... 132

Designating Distribution Styles ......................................................................................... 132

Evaluating the Query Plan ............................................................................................... 133

Query Plan Example ....................................................................................................... 134

Distribution Examples ..................................................................................................... 138

Choosing Sort Keys ................................................................................................................. 140

Compound Sort Key ........................................................................................................ 141

Interleaved Sort Key ....................................................................................................... 141

Comparing Sort Styles .................................................................................................... 142

Defining Constraints ............................................................................................................... 145

Analyzing Table Design ........................................................................................................... 146

Using Amazon Redshift Spectrum to Query External Data ................................................................... 148

Amazon Redshift Spectrum Overview ....................................................................................... 148

Amazon Redshift Spectrum Regions .................................................................................. 149

Amazon Redshift Spectrum Considerations ........................................................................ 149

Getting Started With Amazon Redshift Spectrum ....................................................................... 150

Prerequisites .................................................................................................................. 150

Steps ............................................................................................................................ 150

Step 1: Create an IAM Role .............................................................................................. 150

Step 2: Associate the IAM Role with Your Cluster ................................................................ 151

Step 3: Create an External Schema and an External Table .................................................... 152

Step 4: Query Your Data in Amazon S3 ............................................................................. 152

IAM Policies for Amazon Redshift Spectrum ............................................................................... 154

Amazon S3 Permissions ................................................................................................... 155

Cross-Account Amazon S3 Permissions .............................................................................. 156

Grant or Restrict Access Using Redshift Spectrum ............................................................... 156

Minimum Permissions ..................................................................................................... 157

Chaining IAM Roles ......................................................................................................... 158

Access AWS Glue Data .................................................................................................... 158

Creating Data Files for Queries in Amazon Redshift Spectrum ...................................................... 164

Creating External Schemas ...................................................................................................... 165

Working with External Catalogs ........................................................................................ 167

Creating External Tables .......................................................................................................... 171

Pseudocolumns .............................................................................................................. 172

Partitioning Redshift Spectrum External Tables .................................................................. 173

Mapping to ORC Columns ............................................................................................... 177

Improving Amazon Redshift Spectrum Query Performance .......................................................... 179

Monitoring Metrics .................................................................................................................. 181

Troubleshooting Queries .......................................................................................................... 181

Retries Exceeded ............................................................................................................ 182

No Rows Returned for a Partitioned Table ......................................................................... 182

Not Authorized Error ....................................................................................................... 182

Incompatible Data Formats .............................................................................................. 182

Syntax Error When Using Hive DDL in Amazon Redshift ....................................................... 183

Permission to Create Temporary Tables ............................................................................. 183

Loading Data ................................................................................................................................. 184

Using COPY to Load Data ........................................................................................................ 184

Credentials and Access Permissions ................................................................................... 185

Preparing Your Input Data ............................................................................................... 186

Loading Data from Amazon S3 ........................................................................................ 187

Loading Data from Amazon EMR ...................................................................................... 196

Loading Data from Remote Hosts ..................................................................................... 200

Loading from Amazon DynamoDB .................................................................................... 206

Verifying That the Data Was Loaded Correctly ................................................................... 208

Validating Input Data ...................................................................................................... 208

Automatic Compression ................................................................................................... 209

Optimizing for Narrow Tables .......................................................................................... 211

Default Values ................................................................................................................ 211

Troubleshooting ............................................................................................................. 211

Updating with DML ................................................................................................................ 216

Updating and Inserting ........................................................................................................... 216

Merge Method 1: Replacing Existing Rows ......................................................................... 216

Merge Method 2: Specifying a Column List ........................................................................ 217

Creating a Temporary Staging Table ................................................................................. 217

Performing a Merge Operation by Replacing Existing Rows .................................................. 217

Performing a Merge Operation by Specifying a Column List ................................................. 218

Merge Examples ............................................................................................................. 219

Performing a Deep Copy ......................................................................................................... 221

Analyzing Tables .................................................................................................................... 223

Analyzing Tables ............................................................................................................ 223

Analysis of New Table Data ............................................................................................. 224

ANALYZE Command History ............................................................................................. 227

Vacuuming Tables ................................................................................................................... 228

VACUUM Frequency ........................................................................................................ 228

Sort Stage and Merge Stage ............................................................................................ 229

Vacuum Threshold .......................................................................................................... 229

Vacuum Types ................................................................................................................ 229

Managing Vacuum Times ................................................................................................. 230

Vacuum Column Limit Exceeded Error ............................................................................... 236

Managing Concurrent Write Operations ..................................................................................... 238

Serializable Isolation ....................................................................................................... 238

Write and Read-Write Operations ..................................................................................... 239

Concurrent Write Examples .............................................................................................. 240

Unloading Data .............................................................................................................................. 242

Unloading Data to Amazon S3 ................................................................................................. 242

Unloading Encrypted Data Files ................................................................................................ 245

Unloading Data in Delimited or Fixed-Width Format ................................................................... 246

Reloading Unloaded Data ........................................................................................................ 247

Creating User-Defined Functions ...................................................................................................... 248

UDF Security and Privileges ..................................................................................................... 248

Creating a Scalar SQL UDF ...................................................................................................... 248

Scalar SQL Function Example ........................................................................................... 249

Creating a Scalar Python UDF .................................................................................................. 249

Scalar Python UDF Example ............................................................................................. 250

Python UDF Data Types .................................................................................................. 250

ANYELEMENT Data Type ................................................................................................. 251

Python Language Support ............................................................................................... 251

UDF Constraints ............................................................................................................. 254

Naming UDFs ......................................................................................................................... 254

Overloading Function Names ........................................................................................... 255

Preventing UDF Naming Conflicts ..................................................................................... 255

Logging Errors and Warnings ................................................................................................... 255

Tuning Query Performance .............................................................................................................. 257

Query Processing .................................................................................................................... 257

Query Planning And Execution Workflow ........................................................................... 257

Reviewing Query Plan Steps ............................................................................................ 259

Query Plan .................................................................................................................... 260

Factors Affecting Query Performance ................................................................................ 266

Analyzing and Improving Queries ............................................................................................. 267

Query Analysis Workflow ................................................................................................. 267

Reviewing Query Alerts ................................................................................................... 268

Analyzing the Query Plan ................................................................................................ 269

Analyzing the Query Summary ......................................................................................... 270

Improving Query Performance ......................................................................................... 275

Diagnostic Queries for Query Tuning ................................................................................. 277

Troubleshooting Queries .......................................................................................................... 280

Connection Fails ............................................................................................................. 281

Query Hangs .................................................................................................................. 281

Query Takes Too Long .................................................................................................... 282

Load Fails ...................................................................................................................... 283

Load Takes Too Long ...................................................................................................... 283

Load Data Is Incorrect ..................................................................................................... 283

Setting the JDBC Fetch Size Parameter ............................................................................. 284

Implementing Workload Management ............................................................................................... 285

Defining Query Queues ........................................................................................................... 285

Concurrency Level .......................................................................................................... 286

User Groups ................................................................................................................... 287

Query Groups ................................................................................................................ 287

Wildcards ....................................................................................................................... 287

WLM Memory Percent to Use ........................................................................................... 288

WLM Timeout ................................................................................................................ 288

Query Monitoring Rules .................................................................................................. 288

WLM Query Queue Hopping .................................................................................................... 288

WLM Timeout Queue Hopping ......................................................................................... 289

WLM Timeout Reassigned and Restarted Queries ................................................................ 289

QMR Hop Action Queue Hopping ..................................................................................... 289

QMR Hop Action Reassigned and Restarted Queries ............................................................ 290

WLM Query Queue Hopping Summary .............................................................................. 290

Short Query Acceleration ........................................................................................................ 291

Maximum SQA Run Time ................................................................................................. 292

Monitoring SQA .............................................................................................................. 292

Modifying the WLM Configuration ............................................................................................ 293

WLM Queue Assignment Rules ................................................................................................. 293

Queue Assignments Example ........................................................................................... 295

Assigning Queries to Queues ................................................................................................... 296

Assigning Queries to Queues Based on User Groups ............................................................ 296

Assigning a Query to a Query Group ................................................................................. 296

Assigning Queries to the Superuser Queue ........................................................................ 297

Dynamic and Static Properties ................................................................................................. 297

WLM Dynamic Memory Allocation .................................................................................... 298

Dynamic WLM Example ................................................................................................... 298

Query Monitoring Rules .......................................................................................................... 299

Defining a Query Monitor Rule ......................................................................................... 300

Query Monitoring Metrics ................................................................................................ 301

Query Monitoring Rules Templates ................................................................................... 302

System Tables and Views for Query Monitoring Rules ......................................................... 303

WLM System Tables and Views ................................................................................................ 304

SQL Reference ............................................................................................................................... 306

Amazon Redshift SQL ............................................................................................................. 306

SQL Functions Supported on the Leader Node ................................................................... 306

Amazon Redshift and PostgreSQL .................................................................................... 307

Using SQL ............................................................................................................................. 312

SQL Reference Conventions ............................................................................................. 312

Basic Elements ............................................................................................................... 313

Expressions .................................................................................................................... 337

Conditions ..................................................................................................................... 340

SQL Commands ...................................................................................................................... 357

ABORT .......................................................................................................................... 359

ALTER DATABASE ........................................................................................................... 360

ALTER DEFAULT PRIVILEGES ............................................................................................ 361

ALTER GROUP ................................................................................................................ 363

ALTER SCHEMA .............................................................................................................. 364

ALTER TABLE ................................................................................................................. 365

ALTER TABLE APPEND ..................................................................................................... 374

ALTER USER ................................................................................................................... 377

ANALYZE ....................................................................................................................... 380

ANALYZE COMPRESSION ................................................................................................. 382

BEGIN ........................................................................................................................... 384

CANCEL ......................................................................................................................... 385

CLOSE ........................................................................................................................... 387

COMMENT ..................................................................................................................... 388

COMMIT ........................................................................................................................ 389

COPY ............................................................................................................................ 390

CREATE DATABASE .......................................................................................................... 448

CREATE EXTERNAL SCHEMA ............................................................................................ 449

CREATE EXTERNAL TABLE ................................................................................................ 452

CREATE FUNCTION ......................................................................................................... 463

CREATE GROUP .............................................................................................................. 467

CREATE LIBRARY ............................................................................................................ 468

CREATE SCHEMA ............................................................................................................ 470

CREATE TABLE ............................................................................................................... 471

CREATE TABLE AS ........................................................................................................... 483

CREATE USER ................................................................................................................. 490

CREATE VIEW ................................................................................................................. 493

DEALLOCATE .................................................................................................................. 496

DECLARE ....................................................................................................................... 496

DELETE ......................................................................................................................... 499

DROP DATABASE ............................................................................................................ 500

DROP FUNCTION ............................................................................................................ 501

DROP GROUP ................................................................................................................ 502

DROP LIBRARY ............................................................................................................... 502

DROP SCHEMA ............................................................................................................... 503

DROP TABLE .................................................................................................................. 504

DROP USER ................................................................................................................... 507

DROP VIEW ................................................................................................................... 508

END .............................................................................................................................. 509

EXECUTE ....................................................................................................................... 510

EXPLAIN ........................................................................................................................ 511

FETCH ........................................................................................................................... 515

GRANT .......................................................................................................................... 516

INSERT .......................................................................................................................... 520

LOCK ............................................................................................................................ 524

PREPARE ....................................................................................................................... 525

RESET ........................................................................................................................... 527

REVOKE ......................................................................................................................... 527

ROLLBACK ..................................................................................................................... 531

SELECT .......................................................................................................................... 532

SELECT INTO .................................................................................................................. 560

SET ............................................................................................................................... 560

SET SESSION AUTHORIZATION ........................................................................................ 563

SET SESSION CHARACTERISTICS ....................................................................................... 564

SHOW ........................................................................................................................... 564

START TRANSACTION ...................................................................................................... 565

TRUNCATE ..................................................................................................................... 565

UNLOAD ........................................................................................................................ 566

UPDATE ......................................................................................................................... 580

VACUUM ........................................................................................................................ 584

SQL Functions Reference ......................................................................................................... 588

Leader Node–Only Functions ........................................................................................... 588

  - Compute Node–Only Functions
  - Aggregate Functions
  - Bit-Wise Aggregate Functions
  - Window Functions
  - Conditional Expressions
  - Date and Time Functions
  - Math Functions
  - String Functions
  - JSON Functions
  - Data Type Formatting Functions
  - System Administration Functions
  - System Information Functions
- Reserved Words
System Tables Reference
- System Tables and Views
- Types of System Tables and Views
- Visibility of Data in System Tables and Views
  - Filtering System-Generated Queries
- STL Tables for Logging
  - STL_AGGR
  - STL_ALERT_EVENT_LOG
  - STL_ANALYZE
  - STL_BCAST
  - STL_COMMIT_STATS
  - STL_CONNECTION_LOG
  - STL_DDLTEXT
  - STL_DELETE
  - STL_DISK_FULL_DIAG
  - STL_DIST
  - STL_ERROR
  - STL_EXPLAIN
  - STL_FILE_SCAN
  - STL_HASH
  - STL_HASHJOIN
  - STL_INSERT
  - STL_LIMIT
  - STL_LOAD_COMMITS
  - STL_LOAD_ERRORS
  - STL_LOADERROR_DETAIL
  - STL_MERGE
  - STL_MERGEJOIN
  - STL_NESTLOOP
  - STL_PARSE
  - STL_PLAN_INFO
  - STL_PROJECT
  - STL_QUERY
  - STL_QUERY_METRICS
  - STL_QUERYTEXT
  - STL_REPLACEMENTS
  - STL_RESTARTED_SESSIONS
  - STL_RETURN
  - STL_S3CLIENT
  - STL_S3CLIENT_ERROR
  - STL_SAVE
  - STL_SCAN
  - STL_SESSIONS
  - STL_SORT
  - STL_SSHCLIENT_ERROR
  - STL_STREAM_SEGS
  - STL_TR_CONFLICT
  - STL_UNDONE
  - STL_UNIQUE
  - STL_UNLOAD_LOG
  - STL_USERLOG
  - STL_UTILITYTEXT
  - STL_VACUUM
  - STL_WINDOW
  - STL_WLM_ERROR
  - STL_WLM_RULE_ACTION
  - STL_WLM_QUERY
- STV Tables for Snapshot Data
  - STV_ACTIVE_CURSORS
  - STV_BLOCKLIST
  - STV_CURSOR_CONFIGURATION
  - STV_EXEC_STATE
  - STV_INFLIGHT
  - STV_LOAD_STATE
  - STV_LOCKS
  - STV_PARTITIONS
  - STV_QUERY_METRICS
  - STV_RECENTS
  - STV_SESSIONS
  - STV_SLICES
  - STV_STARTUP_RECOVERY_STATE
  - STV_TBL_PERM
  - STV_TBL_TRANS
  - STV_WLM_QMR_CONFIG
  - STV_WLM_CLASSIFICATION_CONFIG
  - STV_WLM_QUERY_QUEUE_STATE
  - STV_WLM_QUERY_STATE
  - STV_WLM_QUERY_TASK_STATE
  - STV_WLM_SERVICE_CLASS_CONFIG
  - STV_WLM_SERVICE_CLASS_STATE
- System Views
  - SVV_COLUMNS
  - SVL_COMPILE
  - SVV_DISKUSAGE
  - SVV_EXTERNAL_COLUMNS
  - SVV_EXTERNAL_DATABASES
  - SVV_EXTERNAL_PARTITIONS
  - SVV_EXTERNAL_SCHEMAS
  - SVV_EXTERNAL_TABLES
  - SVV_INTERLEAVED_COLUMNS
  - SVL_QERROR
  - SVL_QLOG
  - SVV_QUERY_INFLIGHT
  - SVL_QUERY_QUEUE_INFO
  - SVL_QUERY_METRICS
  - SVL_QUERY_METRICS_SUMMARY
  - SVL_QUERY_REPORT
  - SVV_QUERY_STATE
  - SVL_QUERY_SUMMARY
  - SVL_S3LOG
  - SVL_S3PARTITION
  - SVL_S3QUERY
  - SVL_S3QUERY_SUMMARY
  - SVL_S3RETRIES
  - SVL_STATEMENTTEXT
  - SVV_TABLES
  - SVV_TABLE_INFO
  - SVV_TRANSACTIONS
  - SVL_USER_INFO
  - SVL_UDF_LOG
  - SVV_VACUUM_PROGRESS
  - SVV_VACUUM_SUMMARY
  - SVL_VACUUM_PERCENTAGE
- System Catalog Tables
  - PG_CLASS_INFO
  - PG_DEFAULT_ACL
  - PG_EXTERNAL_SCHEMA
  - PG_LIBRARY
  - PG_STATISTIC_INDICATOR
  - PG_TABLE_DEF
  - Querying the Catalog Tables
Configuration Reference
- Modifying the Server Configuration
- analyze_threshold_percent
  - Values (Default in Bold)
  - Description
  - Examples
- datestyle
  - Values (Default in Bold)
  - Description
  - Example
- describe_field_name_in_uppercase
  - Values (Default in Bold)
  - Description
  - Example
- enable_result_cache_for_session
  - Values (Default in Bold)
  - Description
- extra_float_digits
  - Values (Default in Bold)
  - Description
- max_cursor_result_set_size
  - Values (Default in Bold)
  - Description
- query_group
  - Values (Default in Bold)
  - Description
- search_path
  - Values (Default in Bold)
  - Description
  - Example
- statement_timeout
  - Values (Default in Bold)
  - Description
  - Example
- timezone
  - Values (Default in Bold)
  - Syntax
  - Description
  - Time Zone Formats
  - Examples
- wlm_query_slot_count
  - Values (Default in Bold)
  - Description
  - Examples
Sample Database
- CATEGORY Table
- DATE Table
- EVENT Table
- VENUE Table
- USERS Table
- LISTING Table
- SALES Table
Time Zone Names and Abbreviations
- Time Zone Names
- Time Zone Abbreviations
Document History
- Earlier Updates

Welcome

Topics

• Are You a First-Time Amazon Redshift User? (p. 1)
• Are You a Database Developer? (p. 2)
• Prerequisites (p. 3)

This is the Amazon Redshift Database Developer Guide.

Amazon Redshift is an enterprise-level, petabyte-scale, fully managed data warehousing service.

This guide focuses on using Amazon Redshift to create and manage a data warehouse. If you work with databases as a designer, software developer, or administrator, it gives you the information you need to design, build, query, and maintain your data warehouse.

Are You a First-Time Amazon Redshift User?

If you are a first-time user of Amazon Redshift, we recommend that you begin by reading the following sections.

• Service Highlights and Pricing – The product detail page provides the Amazon Redshift value proposition, service highlights, and pricing.
• Getting Started – Amazon Redshift Getting Started includes an example that walks you through the process of creating an Amazon Redshift data warehouse cluster, creating database tables, uploading data, and testing queries.

After you complete the Getting Started guide, we recommend that you explore one of the following guides:

• Amazon Redshift Cluster Management Guide – The Cluster Management Guide shows you how to create and manage Amazon Redshift clusters.

  If you are an application developer, you can use the Amazon Redshift Query API to manage clusters programmatically. Additionally, the AWS SDK libraries that wrap the underlying Amazon Redshift API can help simplify your programming tasks. If you prefer a more interactive way of managing clusters, you can use the Amazon Redshift console and the AWS command line interface (AWS CLI). For information about the API and CLI, go to the following manuals:

  • API Reference
  • CLI Reference

• Amazon Redshift Database Developer Guide (this document) – If you are a database developer, the Database Developer Guide explains how to design, build, query, and maintain the databases that make up your data warehouse.

If you are transitioning to Amazon Redshift from another relational database system or data warehouse application, you should be aware of important differences in how Amazon Redshift is implemented. For a summary of the most important considerations for designing tables and loading data, see Amazon Redshift Best Practices for Designing Tables (p. 26) and Amazon Redshift Best Practices for Loading Data (p. 29). Amazon Redshift is based on PostgreSQL 8.0.2. For a detailed list of the differences between Amazon Redshift and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 307).

Are You a Database Developer?

If you are a database user, database designer, database developer, or database administrator, the following table will help you find what you're looking for.

| If you want to ... | We recommend |
| --- | --- |
| Quickly start using Amazon Redshift | Begin by following the steps in Amazon Redshift Getting Started to quickly deploy a cluster, connect to a database, and try out some queries. When you are ready to build your database, load data into tables, and write queries to manipulate data in the data warehouse, return here to the Database Developer Guide. |
| Learn about the internal architecture of the Amazon Redshift data warehouse. | The Amazon Redshift System Overview (p. 4) gives a high-level overview of Amazon Redshift's internal architecture. If you want a broader overview of the Amazon Redshift web service, go to the Amazon Redshift product detail page. |
| Create databases, tables, users, and other database objects. | Getting Started Using Databases (p. 13) is a quick introduction to the basics of SQL development. Amazon Redshift SQL (p. 306) has the syntax and examples for Amazon Redshift SQL commands and functions and other SQL elements. Amazon Redshift Best Practices for Designing Tables (p. 26) provides a summary of our recommendations for choosing sort keys, distribution keys, and compression encodings. |
| Learn how to design tables for optimum performance. | Designing Tables (p. 118) details considerations for applying compression to the data in table columns and choosing distribution and sort keys. |
| Load data. | Loading Data (p. 184) explains the procedures for loading large datasets from Amazon DynamoDB tables or from flat files stored in Amazon S3 buckets. Amazon Redshift Best Practices for Loading Data (p. 29) provides tips for loading your data quickly and effectively. |
| Manage users, groups, and database security. | Managing Database Security (p. 112) covers database security topics. |
| Monitor and optimize system performance. | The System Tables Reference (p. 797) details system tables and views that you can query for the status of the database and monitor queries and processes. You should also consult the Amazon Redshift Cluster Management Guide to learn how to use the AWS Management Console to check the system health, monitor metrics, and back up and restore clusters. |
| Analyze and report information from very large datasets. | Many popular software vendors are certifying Amazon Redshift with their offerings to enable you to continue to use the tools you use today. For more information, see the Amazon Redshift partner page. The SQL Reference (p. 306) has all the details for the SQL expressions, commands, and functions Amazon Redshift supports. |

Prerequisites

Before you use this guide, you should complete these tasks.

• Install a SQL client.
• Launch an Amazon Redshift cluster.
• Connect your SQL client to the cluster master database.

For step-by-step instructions, see Amazon Redshift Getting Started.

You should also know how to use your SQL client and should have a fundamental understanding of the SQL language.

Amazon Redshift System Overview

Topics

• Data Warehouse System Architecture (p. 4)
• Performance (p. 6)
• Columnar Storage (p. 8)
• Internal Architecture and System Operation (p. 10)
• Workload Management (p. 11)
• Using Amazon Redshift with Other Services (p. 11)

An Amazon Redshift data warehouse is an enterprise-class relational database query and management system.

Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools.

When you execute analytic queries, you are retrieving, comparing, and evaluating large amounts of data in multiple-stage operations to produce a final result.

Amazon Redshift achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and very efficient, targeted data compression encoding schemes. This section presents an introduction to the Amazon Redshift system architecture.

Data Warehouse System Architecture

This section introduces the elements of the Amazon Redshift data warehouse architecture as shown in the following figure.

Client applications

Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools and business intelligence (BI) reporting, data mining, and analytics tools. Amazon Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. For information about important differences between Amazon Redshift SQL and PostgreSQL, see Amazon Redshift and PostgreSQL (p. 307).

Connections

Amazon Redshift communicates with client applications by using industry-standard JDBC and ODBC drivers for PostgreSQL. For more information, see Amazon Redshift and PostgreSQL JDBC and ODBC (p. 308).

Clusters

The core infrastructure component of an Amazon Redshift data warehouse is a cluster.

A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The compute nodes are transparent to external applications.

Leader node

The leader node manages communications with client programs and all communication with compute nodes. It parses queries and develops execution plans to carry out database operations, in particular the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.

The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node. Amazon Redshift is designed to implement certain SQL functions only on the leader node. A query that uses any of these functions will return an error if it references tables that reside on the compute nodes. For more information, see SQL Functions Supported on the Leader Node (p. 306).

Compute nodes

The leader node compiles code for individual elements of the execution plan and assigns the code to

individual compute nodes. The compute nodes execute the compiled code and send intermediate results

back to the leader node for final aggregation.

Each compute node has its own dedicated CPU, memory, and attached disk storage, which are

determined by the node type. As your workload grows, you can increase the compute capacity and

storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

Amazon Redshift provides two node types: dense storage nodes and dense compute nodes. Each node

provides two storage choices. You can start with a single 160 GB node and scale up to multiple 16 TB

nodes to support a petabyte of data or more.

For a more detailed explanation of data warehouse clusters and nodes, see Internal Architecture and

System Operation (p. 10).

Node slices

A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and

disk space, where it processes a portion of the workload assigned to the node. The leader node manages

distributing data to the slices and apportions the workload for any queries or other database operations

to the slices. The slices then work in parallel to complete the operation.

The number of slices per node is determined by the node size of the cluster. For more information about

the number of slices for each node size, go to About Clusters and Nodes in the Amazon Redshift Cluster

Management Guide.

When you create a table, you can optionally specify one column as the distribution key. When the table

is loaded with data, the rows are distributed to the node slices according to the distribution key that is

defined for a table. Choosing a good distribution key enables Amazon Redshift to use parallel processing

to load data and execute queries efficiently. For information about choosing a distribution key, see

Choose the Best Distribution Style (p. 27).
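The effect of a distribution key can be pictured with a small simulation. The following sketch is only an illustration: the slice count, the sample rows, and the use of CRC32 as the hash are assumptions made for this example and do not describe how Amazon Redshift hashes keys internally. It shows rows being assigned to slices by key, each slice computing a partial result, and a final step that combines the partial results, as the leader node does.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # assumed slice count, chosen only for this illustration


def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution key value to a slice (toy hash, not Redshift's)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def distribute(rows, key_column, num_slices=NUM_SLICES):
    """Group rows by the slice that their distribution key hashes to."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[key_column], num_slices)].append(row)
    return slices


def parallel_sum(slices, value_column):
    """Each slice computes a partial sum; the final step adds the partials."""
    partials = {s: sum(r[value_column] for r in rows) for s, rows in slices.items()}
    return partials, sum(partials.values())


# Hypothetical sales rows keyed by seller.
rows = [{"sellerid": i % 50, "qtysold": 1 + i % 3} for i in range(1000)]
slices = distribute(rows, "sellerid")
partials, total = parallel_sum(slices, "qtysold")

for s in sorted(slices):
    print(f"slice {s}: {len(slices[s])} rows, partial sum {partials[s]}")
print("total:", total)
```

Because all rows with the same key value land on the same slice, a key with few distinct values or a heavily skewed key would leave some slices with far more rows than others in this model.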

Internal network

Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom

communication protocols to provide private, very high-speed network communication between the

leader node and compute nodes. The compute nodes run on a separate, isolated network that client

applications never access directly.

Databases

A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client

communicates with the leader node, which in turn coordinates query execution with the compute nodes.

Amazon Redshift is a relational database management system (RDBMS), so it is compatible with

other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including

online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is

optimized for high-performance analysis and reporting of very large datasets.

Amazon Redshift is based on PostgreSQL 8.0.2. Amazon Redshift and PostgreSQL have a number of

very important differences that you need to take into account as you design and develop your data

warehouse applications. For information about how Amazon Redshift SQL differs from PostgreSQL, see

Amazon Redshift and PostgreSQL (p. 307).

Performance

Amazon Redshift achieves extremely fast query execution by employing these performance features.

Topics

• Massively Parallel Processing (p. 6)

• Columnar Data Storage (p. 7)

• Data Compression (p. 7)

• Query Optimizer (p. 7)

• Result Caching (p. 7)

• Compiled Code (p. 8)

Massively Parallel Processing

Massively parallel processing (MPP) enables fast execution of the most complex queries operating on

large amounts of data. Multiple compute nodes handle all query processing leading up to final result

aggregation, with each core of each node executing the same compiled query segments on portions of

the entire data.

Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed

in parallel. By selecting an appropriate distribution key for each table, you can optimize the distribution

of data to balance the workload and minimize movement of data from node to node. For more

information, see Choose the Best Distribution Style (p. 27).

Loading data from flat files takes advantage of parallel processing by spreading the workload across

multiple nodes while simultaneously reading from multiple files. For more information about how to

load data into tables, see Amazon Redshift Best Practices for Loading Data (p. 29).

Columnar Data Storage

Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an

important factor in optimizing analytic query performance. Storing database table information in a

columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to

load from disk. Loading less data into memory enables Amazon Redshift to perform more in-memory

processing when executing queries. See Columnar Storage (p. 8) for a more detailed explanation.

When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of

data blocks. For more information, see Choose the Best Sort Key (p. 27).

Data Compression

Data compression reduces storage requirements, thereby reducing disk I/O, which improves query

performance. When you execute a query, the compressed data is read into memory, then uncompressed

during query execution. Loading less data into memory enables Amazon Redshift to allocate more

memory to analyzing the data. Because columnar storage stores similar data sequentially, Amazon

Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. The

best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal

compression encodings when you load the table with data. To learn more about using automatic data

compression, see Loading Tables with Automatic Compression (p. 209).

Query Optimizer

The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and

also takes advantage of the columnar-oriented data storage. The Amazon Redshift query optimizer

implements significant enhancements and extensions for processing complex analytic queries that often

include multi-table joins, subqueries, and aggregation. To learn more about optimizing queries, see

Tuning Query Performance (p. 257).

Result Caching

To reduce query execution time and improve system performance, Amazon Redshift caches the results

of certain types of queries in memory on the leader node. When a user submits a query, Amazon

Redshift checks the result cache for a valid, cached copy of the query results. If a match is found in the

result cache, Amazon Redshift uses the cached results and doesn’t execute the query. Result caching is

transparent to the user.

Result caching is enabled by default. To disable result caching for the current session, set the

enable_result_cache_for_session (p. 949) parameter to off.

Amazon Redshift uses cached results for a new query when all of the following are true:

• The user submitting the query has access privilege to the objects used in the query.

• The tables or views in the query haven't been modified.

• The query doesn't use a function that must be evaluated each time it's run, such as GETDATE.

• The query doesn't reference Amazon Redshift Spectrum external tables.

• Configuration parameters that might affect query results are unchanged.

• The query syntactically matches the cached query.

To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn't cache some

large query result sets. Amazon Redshift determines whether to cache query results based on a number

of factors. These factors include the number of entries in the cache and the instance type of your

Amazon Redshift cluster.

To determine whether a query used the result cache, query the SVL_QLOG (p. 906) system view. If a

query used the result cache, the source_query column returns the query ID of the source query. If result

caching wasn't used, the source_query column value is NULL.

The following example shows that queries submitted by userid 104 and userid 102 use the result cache

from queries run by userid 100.

select userid, query, elapsed, source_query from svl_qlog

where userid > 1

order by query desc;

userid | query  | elapsed  | source_query
-------+--------+----------+-------------
   104 | 629035 |       27 |       628919
   104 | 629034 |       60 |       628900
   104 | 629033 |       23 |       628891
   102 | 629017 |  1229393 |
   102 | 628942 |       28 |       628919
   102 | 628941 |       57 |       628900
   102 | 628940 |       26 |       628891
   100 | 628919 | 84295686 |
   100 | 628900 | 87015637 |
   100 | 628891 | 58808694 |

For details about the queries used to create the results shown in the previous example, see Step 2: Test

System Performance to Establish a Baseline (p. 49) in the Tuning Table Design (p. 45) tutorial.
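To get a feel for how much time the result cache saved in the sample output above, you can compare each cached query's elapsed time with that of its source query. The following sketch simply copies the sample rows into a Python list; the variable and function names are illustrative, and the ratio is meaningful on the assumption that both elapsed values use the same unit (SVL_QLOG reports elapsed time in microseconds).

```python
# Rows copied from the sample output: (userid, query, elapsed, source_query).
# source_query is None when the result cache was not used.
rows = [
    (104, 629035, 27, 628919),
    (104, 629034, 60, 628900),
    (104, 629033, 23, 628891),
    (102, 629017, 1229393, None),
    (102, 628942, 28, 628919),
    (102, 628941, 57, 628900),
    (102, 628940, 26, 628891),
    (100, 628919, 84295686, None),
    (100, 628900, 87015637, None),
    (100, 628891, 58808694, None),
]

# Look up the elapsed time of any query by its query ID.
elapsed_by_query = {query: elapsed for _, query, elapsed, _ in rows}


def cache_hits(rows):
    """Yield (query, source_query, ratio) for rows served from the result cache."""
    for _, query, elapsed, source in rows:
        if source is not None:
            yield query, source, elapsed_by_query[source] / elapsed


for query, source, ratio in cache_hits(rows):
    print(f"query {query} reused {source}: about {ratio:,.0f}x less elapsed time")
```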

Compiled Code

The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling

the query eliminates the overhead associated with an interpreter and therefore increases the execution

speed, especially for complex queries. The compiled code is cached and shared across sessions on the

same cluster, so subsequent executions of the same query will be faster, often even with different

parameters.

The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql

(libpq) connection protocols, so two clients using different protocols will each incur the first-time cost

of compiling the code. Other clients that use the same protocol, however, will benefit from sharing the

cached code.

Columnar Storage

Columnar storage for database tables is an important factor in optimizing analytic query performance

because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need

to load from disk.

The following series of illustrations describes how columnar data storage implements efficiencies and

how that translates into efficiencies when retrieving data into memory.

This first illustration shows how records from database tables are typically stored into disk blocks by row.

In a typical relational database table, each row contains field values for a single record. In row-wise

database storage, data blocks store values sequentially for each consecutive column making up the

entire row. If block size is smaller than the size of a record, storage for an entire record may take more

than one block. If block size is larger than the size of a record, storage for an entire record may take

less than one block, resulting in an inefficient use of disk space. In online transaction processing (OLTP)

applications, most transactions involve frequently reading and writing all of the values for entire records,

typically one record or a small number of records at a time. As a result, row-wise storage is optimal for

OLTP databases.

The next illustration shows how with columnar storage, the values for each column are stored

sequentially into disk blocks.

Using columnar storage, each data block stores values of a single column for multiple rows. As records

enter the system, Amazon Redshift transparently converts the data to columnar storage for each of the

columns.

In this simplified example, using columnar storage, each data block holds column field values for up to three times as many records as row-based storage. This means that reading the same number of column field values for the same number of records requires a third of the I/O operations compared to row-wise storage. In practice, using tables with very large numbers of columns and very large row counts, storage efficiency is even greater.

An added advantage is that, since each block holds the same type of data, block data can use a

compression scheme selected specifically for the column data type, further reducing disk space and

I/O. For more information about compression encodings based on data types, see Compression

Encodings (p. 119).

The savings in space for storing data on disk also carries over to retrieving and then storing that data in

memory. Since many database operations only need to access or operate on one or a small number of

columns at a time, you can save memory space by only retrieving blocks for columns you actually need

for a query. Where OLTP transactions typically involve most or all of the columns in a row for a small

number of records, data warehouse queries commonly read only a few columns for a very large number

of rows. This means that reading the same number of column field values for the same number of rows

requires a fraction of the I/O operations and uses a fraction of the memory that would be required for

processing row-wise blocks. In practice, using tables with very large numbers of columns and very large

row counts, the efficiency gains are proportionally greater. For example, suppose a table contains 100

columns. A query that uses five columns will only need to read about five percent of the data contained

in the table. This savings is repeated for possibly billions or even trillions of records for large databases.

In contrast, a row-wise database would read the blocks that contain the 95 unneeded columns as well.

Typical database block sizes range from 2 KB to 32 KB. Amazon Redshift uses a block size of 1 MB,

which is more efficient and further reduces the number of I/O requests needed to perform any database

loading or other operations that are part of query execution.
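The arithmetic behind these two claims can be checked with a short calculation. The sketch below assumes equal-width, uncompressed columns and a hypothetical 1 GB of column data; the function names are illustrative only.

```python
import math

KB = 1024
MB = 1024 * KB


def fraction_read(total_columns, columns_used):
    """Share of table data a columnar scan reads, assuming equal-width columns."""
    return columns_used / total_columns


def blocks_needed(data_bytes, block_bytes):
    """Number of blocks needed to hold data_bytes."""
    return math.ceil(data_bytes / block_bytes)


# A query that uses 5 of 100 columns reads about 5 percent of the data.
print(fraction_read(100, 5))            # 0.05

# Blocks needed for a hypothetical 1 GB of column data.
one_gb = 1024 * MB
print(blocks_needed(one_gb, 32 * KB))   # 32768
print(blocks_needed(one_gb, 1 * MB))    # 1024
```

In this simplified model, moving from 32 KB blocks to 1 MB blocks cuts the number of blocks to read by a factor of 32.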

Internal Architecture and System Operation

The following diagram shows a high-level view of internal components and functionality of the Amazon

Redshift data warehouse.

Workload Management

Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within

workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

Amazon Redshift WLM creates query queues at runtime according to service classes, which define the

configuration parameters for various types of queues, including internal system queues and user-

accessible queues. From a user perspective, a user-accessible service class and a queue are functionally

equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service

class as well as a runtime queue.

When you run a query, WLM assigns the query to a queue according to the user's user group or by

matching a query group that is listed in the queue configuration with a query group label that the user

sets at runtime.

By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to

five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one.

You can define up to eight queues. Each queue can be configured with a maximum concurrency level

of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser

queue) is 50.
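The limits described above can be expressed as a simple check on a planned configuration. The following sketch is a hypothetical helper, not part of any Amazon Redshift tool or API: it takes a list of concurrency levels, one per user-defined queue, and reports which of the stated limits would be exceeded. The lower bound of 1 per queue is an assumption made for this example.

```python
MAX_QUEUES = 8                # user-defined queues, per the limits above
MAX_QUEUE_CONCURRENCY = 50    # per-queue maximum
MAX_TOTAL_CONCURRENCY = 50    # across all user-defined queues


def wlm_problems(queue_levels):
    """Return a list of limit violations for per-queue concurrency levels.

    queue_levels excludes the Superuser queue. The lower bound of 1 is an
    assumption made for this sketch.
    """
    problems = []
    if len(queue_levels) > MAX_QUEUES:
        problems.append(f"{len(queue_levels)} queues defined; the maximum is {MAX_QUEUES}")
    for number, level in enumerate(queue_levels, start=1):
        if not 1 <= level <= MAX_QUEUE_CONCURRENCY:
            problems.append(
                f"queue {number}: concurrency {level} is outside 1-{MAX_QUEUE_CONCURRENCY}"
            )
    total = sum(queue_levels)
    if total > MAX_TOTAL_CONCURRENCY:
        problems.append(f"total concurrency {total} exceeds {MAX_TOTAL_CONCURRENCY}")
    return problems


print(wlm_problems([5]))           # one queue at level five: no problems
print(wlm_problems([30, 15, 10]))  # each queue is valid, but the total is 55
```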

The easiest way to modify the WLM configuration is by using the Amazon Redshift Management Console.

You can also use the Amazon Redshift command line interface (CLI) or the Amazon Redshift API.

For more information about implementing and using workload management, see Implementing

Workload Management (p. 285).

Using Amazon Redshift with Other Services

Amazon Redshift integrates with other AWS services to enable you to move, transform, and load your

data quickly and reliably, using data security features.

Moving Data Between Amazon Redshift and Amazon S3

Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon

Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3

buckets. For more information, see Loading Data from Amazon S3 (p. 187).

You can also use parallel processing to export data from your Amazon Redshift data warehouse to

multiple data files on Amazon S3. For more information, see Unloading Data (p. 242).

Using Amazon Redshift with Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. You can use the COPY command to load

an Amazon Redshift table with data from a single Amazon DynamoDB table. For more information, see

Loading Data from an Amazon DynamoDB Table (p. 206).

Importing Data from Remote Hosts over SSH

You can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such

as Amazon EMR clusters, Amazon EC2 instances, or other computers. COPY connects to the remote hosts

using SSH and executes commands on the remote hosts to generate data. Amazon Redshift supports

multiple simultaneous connections. The COPY command reads and loads the output from multiple host

sources in parallel. For more information, see Loading Data from Remote Hosts (p. 200).

Automating Data Loads Using AWS Data Pipeline

You can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon

Redshift. By using the built-in scheduling capabilities of AWS Data Pipeline, you can schedule and

execute recurring jobs without having to write your own complex data transfer or transformation

logic. For example, you can set up a recurring job to automatically copy data from Amazon DynamoDB

into Amazon Redshift. For a tutorial that walks you through the process of creating a pipeline that

periodically moves data from Amazon S3 to Amazon Redshift, see Copy Data to Amazon Redshift Using

AWS Data Pipeline in the AWS Data Pipeline Developer Guide.

Migrating Data Using AWS Database Migration Service (AWS DMS)

You can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can

migrate your data to and from most widely used commercial and open-source databases such as Oracle,

PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and

MySQL. For more information, see Using an Amazon Redshift Database as a Target for AWS Database

Migration Service.

Getting Started Using Databases

Topics

• Step 1: Create a Database (p. 13)

• Step 2: Create a Database User (p. 14)

• Step 3: Create a Database Table (p. 14)

• Step 4: Load Sample Data (p. 15)

• Step 5: Query the System Tables (p. 16)

• Step 6: Cancel a Query (p. 18)

• Step 7: Clean Up Your Resources (p. 20)

This section describes the basic steps to begin using the Amazon Redshift database.

The examples in this section assume you have signed up for the Amazon Redshift data warehouse

service, created a cluster, and established a connection to the cluster from your SQL query tool. For

information about these tasks, see Amazon Redshift Getting Started.

Important

The cluster that you deployed for this exercise will be running in a live environment. As long as

it is running, it will accrue charges to your AWS account. For more pricing information, go to the

Amazon Redshift pricing page.

To avoid unnecessary charges, you should delete your cluster when you are done with it. The

final step of the exercise explains how to do so.

Step 1: Create a Database

After you have verified that your cluster is up and running, you can create your first database. This

database is where you will actually create tables, load data, and run queries. A single cluster can host

multiple databases. For example, you can have a TICKIT database and an ORDERS database on the same

cluster.

After you connect to the initial cluster database (the database you created when you launched the

cluster), you use it as the base for creating a new database.

For example, to create a database named tickit, issue the following command:

create database tickit;

For this exercise, we'll accept the defaults. For information about more command options, see CREATE

DATABASE (p. 448) in the SQL Command Reference.

After you have created the TICKIT database, you can connect to the new database from your SQL client.

Use the same connection parameters as you used for your current connection, but change the database

name to tickit.

You do not need to change the database to complete the remainder of this tutorial. If you prefer not to

connect to the TICKIT database, you can try the rest of the examples in this section using the default

database.

Step 2: Create a Database User

By default, only the master user that you created when you launched the cluster has access to the

initial database in the cluster. To grant other users access, you must create one or more user accounts.

Database user accounts are global across all the databases in a cluster; they do not belong to individual

databases.

Use the CREATE USER command to create a new database user. When you create a new user, you specify

the name of the new user and a password. A password is required. It must have between 8 and 64

characters, and it must include at least one uppercase letter, one lowercase letter, and one numeral.

For example, to create a user named GUEST with password ABCd4321, issue the following command:

create user guest password 'ABCd4321';
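If you generate passwords in a script, you can check them against the rules stated above before issuing CREATE USER. The following sketch is a hypothetical client-side helper: it checks only the length and character rules described in this section, treats letters and numerals as ASCII (an assumption), and doesn't model any other restrictions that Amazon Redshift might apply.

```python
import string


def password_problems(password):
    """Return the stated password rules that the given password breaks."""
    problems = []
    if not 8 <= len(password) <= 64:
        problems.append("must have between 8 and 64 characters")
    if not any(ch in string.ascii_uppercase for ch in password):
        problems.append("must include an uppercase letter")
    if not any(ch in string.ascii_lowercase for ch in password):
        problems.append("must include a lowercase letter")
    if not any(ch in string.digits for ch in password):
        problems.append("must include a numeral")
    return problems


print(password_problems("ABCd4321"))  # []
print(password_problems("abc"))       # too short, no uppercase letter, no numeral
```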

For information about other command options, see CREATE USER (p. 490) in the SQL Command

Reference.

Delete a Database User

You won't need the GUEST user account for this tutorial, so you can delete it. If you delete a database

user account, the user will no longer be able to access any of the cluster databases.

Issue the following command to drop the GUEST user:

drop user guest;

The master user you created when you launched your cluster continues to have access to the database.

Important

Amazon Redshift strongly recommends that you do not delete the master user.

For information about command options, see DROP USER (p. 507) in the SQL Reference.

Step 3: Create a Database Table

After you create your new database, you create tables to hold your database data. You specify any

column information for the table when you create the table.

For example, to create a table named testtable with a single column named testcol for an integer

data type, issue the following command:

create table testtable (testcol int);

The PG_TABLE_DEF system table contains information about all the tables in the cluster. To verify the

result, issue the following SELECT command to query the PG_TABLE_DEF system table.

select * from pg_table_def where tablename = 'testtable';

The query result should look something like this:

schemaname|tablename|column |type   |encoding|distkey|sortkey |notnull
----------+---------+-------+-------+--------+-------+--------+---------

(1 row)

By default, new database objects, such as tables, are created in a schema named "public". For more

information about schemas, see Schemas (p. 115) in the Managing Database Security section.

The encoding, distkey, and sortkey columns are used by Amazon Redshift for parallel processing.

For more information about designing tables that incorporate these elements, see Amazon Redshift Best

Practices for Designing Tables (p. 26).

Insert Data Rows into a Table

After you create a table, you can insert rows of data into that table.

Note

The INSERT (p. 520) command inserts individual rows into a database table. For standard bulk

loads, use the COPY (p. 390) command. For more information, see Use a COPY Command to

Load Data (p. 30).

For example, to insert a value of 100 into the testtable table (which contains a single column), issue

the following command:

insert into testtable values (100);

Select Data from a Table

After you create a table and populate it with data, use a SELECT statement to display the data contained

in the table. The SELECT * statement returns all the column names and row values for all of the data in a

table and is a good way to verify that recently added data was correctly inserted into the table.

To view the data that you entered in the testtable table, issue the following command:

select * from testtable;

The result will look like this:

testcol

---------

100

(1 row)

For more information about using the SELECT statement to query tables, see SELECT (p. 532) in the

SQL Command Reference.

Step 4: Load Sample Data

Most of the examples in this guide use the TICKIT sample database. If you want to follow the examples

using your SQL query tool, you will need to load the sample data for the TICKIT database.

The sample data for this tutorial is provided in Amazon S3 buckets that give read access to all

authenticated AWS users, so any valid AWS credentials that permit access to Amazon S3 will work.

To load the sample data for the TICKIT database, you will first create the tables, then use the COPY

command to load the tables with sample data that is stored in an Amazon S3 bucket. For steps to create

tables and load sample data, see Amazon Redshift Getting Started Guide.

Step 5: Query the System Tables

In addition to the tables that you create, your database contains a number of system tables. These

system tables contain information about your installation and about the various queries and processes

that are running on the system. You can query these system tables to collect information about your

database.

Note

The description for each table in the System Tables Reference indicates whether a table is visible

to all users or visible only to superusers. You must be logged in as a superuser to query tables

that are visible only to superusers.

Amazon Redshift provides access to the following types of system tables:

• STL Tables for Logging (p. 798)

These system tables are generated from Amazon Redshift log files to provide a history of the system.

Logging tables have an STL prefix.

• STV Tables for Snapshot Data (p. 868)

These tables are virtual system tables that contain snapshots of the current system data. Snapshot

tables have an STV prefix.

• System Views (p. 896)

System views contain a subset of data found in several of the STL and STV system tables. System

views have an SVV or SVL prefix.

• System Catalog Tables (p. 935)

The system catalog tables store schema metadata, such as information about tables and columns.

System catalog tables have a PG prefix.
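The naming convention in the list above makes it easy to tell what kind of system object a name refers to. The following sketch is a hypothetical helper that classifies a name by its prefix; it assumes that each prefix is followed by an underscore, as in the names used in this guide (for example, SVL_QLOG and PG_TABLE_DEF).

```python
# Prefixes taken from the list above; the trailing underscore is an assumption
# based on the system table names that appear in this guide.
PREFIXES = [
    ("stl_", "STL table for logging"),
    ("stv_", "STV table for snapshot data"),
    ("svv_", "system view"),
    ("svl_", "system view"),
    ("pg_", "system catalog table"),
]


def classify(name):
    """Return the kind of system object that a name's prefix suggests."""
    lowered = name.lower()
    for prefix, kind in PREFIXES:
        if lowered.startswith(prefix):
            return kind
    return "no system prefix"


for name in ("STL_QUERY", "STV_RECENTS", "SVL_QLOG", "PG_TABLE_DEF", "testtable"):
    print(f"{name}: {classify(name)}")
```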

You may need to specify the process ID associated with a query to retrieve system table information

about that query. For information, see Determine the Process ID of a Running Query (p. 18).

View a List of Table Names

For example, to view a list of all tables in the public schema, you can query the PG_TABLE_DEF system

catalog table.

select distinct(tablename) from pg_table_def where schemaname = 'public';

The result will look something like this:

tablename
---------
category
date
event
listing
sales
testtable
users
venue

View Database Users

You can query the PG_USER catalog to view a list of all database users, along with the user ID

(USESYSID) and user privileges.

select * from pg_user;

  usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 rdsdb      |        1 | t           | t        | t         | ******** |          |
 masteruser |      100 | t           | t        | f         | ******** |          |
 dwuser     |      101 | f           | f        | f         | ******** |          |
 simpleuser |      102 | f           | f        | f         | ******** |          |
 poweruser  |      103 | f           | t        | f         | ******** |          |
 dbuser     |      104 | t           | f        | f         | ******** |          |
(6 rows)

The user name rdsdb is used internally by Amazon Redshift to perform routine administrative and

maintenance tasks. You can filter your query to show only user-defined user names by adding where

usesysid > 1 to your select statement.

select * from pg_user

where usesysid > 1;

  usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 masteruser |      100 | t           | t        | f         | ******** |          |
 dwuser     |      101 | f           | f        | f         | ******** |          |
 simpleuser |      102 | f           | f        | f         | ******** |          |
 poweruser  |      103 | f           | t        | f         | ******** |          |
 dbuser     |      104 | t           | f        | f         | ******** |          |
(5 rows)

View Recent Queries

In the previous example, you found that the user ID (USESYSID) for masteruser is 100. To list the five

most recent queries executed by masteruser, you can query the SVL_QLOG view. The SVL_QLOG view

is a friendlier subset of information from the STL_QUERY table. You can use this view to find the query

ID (QUERY) or process ID (PID) for a recently run query or to see how long it took a query to complete.

SVL_QLOG includes the first 60 characters of the query string (SUBSTRING) to help you locate a specific

query. Use the LIMIT clause with your SELECT statement to limit the results to five rows.

select query, pid, elapsed, substring from svl_qlog

where userid = 100

order by starttime desc

limit 5;

The result will look something like this:

 query  |  pid  | elapsed  | substring
--------+-------+----------+--------------------------------------------------------------
 187752 | 18921 | 18465685 | select query, elapsed, substring from svl_qlog order by query
 204168 |  5117 |    59603 | insert into testtable values (100);
 187561 | 17046 |  1003052 | select * from pg_table_def where tablename = 'testtable';
 187549 | 17046 |  1108584 | select * from STV_WLM_SERVICE_CLASS_CONFIG
 187468 | 17046 |  5670661 | select * from pg_table_def where schemaname = 'public';
(5 rows)

Determine the Process ID of a Running Query

In the previous example you learned how to obtain the query ID and process ID (PID) for a completed

query from the SVL_QLOG view.

You might need to find the PID for a query that is still running. For example, you will need the PID if you

need to cancel a query that is taking too long to run. You can query the STV_RECENTS system table to

obtain a list of process IDs for running queries, along with the corresponding query string. If your query

returns multiple PIDs, you can look at the query text to determine which PID you need.

To determine the PID of a running query, issue the following SELECT statement:

select pid, user_name, starttime, query

from stv_recents

where status='Running';

Step 6: Cancel a Query

If a user issues a query that is taking too long or is consuming excessive cluster resources, you might

need to cancel the query. For example, a user might want to create a list of ticket sellers that includes the

seller's name and quantity of tickets sold. The following query selects data from the SALES table and the USERS

table and joins the two tables by matching SELLERID and USERID in the WHERE clause.

select sellerid, firstname, lastname, sum(qtysold)

from sales, users

where sales.sellerid = users.userid

group by sellerid, firstname, lastname

order by 4 desc;

Note

This is a complex query. For this tutorial, you don't need to worry about how this query is

constructed.

The previous query runs in seconds and returns 2,102 rows.

Suppose the user forgets to put in the WHERE clause.

select sellerid, firstname, lastname, sum(qtysold)

from sales, users

group by sellerid, firstname, lastname

order by 4 desc;

The result set will include all of the rows in the SALES table multiplied by all the rows in the USERS table

(49989*3766). This is called a Cartesian join, and it is not recommended. The result is over 188 million

rows and takes a long time to run.
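Using the row counts quoted above, the arithmetic is 49989 * 3766 = 188,258,574. Before you run a join without a filter, you can estimate its size from the row counts of the two tables. The following sketch shows one way to do that; the aliases are illustrative, and the counts on your cluster depend on the data you loaded.

```sql
-- Estimate how many rows an unfiltered (Cartesian) join would produce.
-- Each subquery returns a single row, so joining them is inexpensive.
select s.sales_rows,
       u.users_rows,
       s.sales_rows * u.users_rows as cartesian_rows
from (select count(*) as sales_rows from sales) s,
     (select count(*) as users_rows from users) u;
```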

To cancel a running query, use the CANCEL command with the query's PID.


To find the process ID, query the STV_RECENTS table, as shown in the previous step. The following

example shows how you can make the results more readable by using the TRIM function to trim trailing

spaces and by showing only the first 20 characters of the query string.

select pid, trim(user_name), starttime, substring(query,1,20)

from stv_recents

where status='Running';

The result looks something like this:

  pid  |   btrim    |         starttime          |      substring
-------+------------+----------------------------+----------------------
 18764 | masteruser | 2013-03-28 18:39:49.355918 | select sellerid, fir
(1 row)

To cancel the query with PID 18764, issue the following command:

cancel 18764;

Note

The CANCEL command will not abort a transaction. To abort or roll back a transaction, you must

use the ABORT or ROLLBACK command. To cancel a query associated with a transaction, first

cancel the query, then abort the transaction.

If the query that you canceled is associated with a transaction, use the ABORT or ROLLBACK command

to cancel the transaction and discard any changes made to the data:

abort;

Unless you are signed on as a superuser, you can cancel only your own queries. A superuser can cancel all

queries.
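The CANCEL command also accepts an optional message that is shown to the user whose query is canceled. The following sketch reuses PID 18764 from the example above; the message text is only an illustration.

```sql
-- Cancel the query running under PID 18764 and include an explanatory message.
cancel 18764 'Long-running query';

-- In the session that owns the open transaction, discard its changes.
rollback;
```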

Cancel a Query from Another Session

If your query tool does not support running queries concurrently, you will need to start another session

to cancel the query. For example, SQLWorkbench, which is the query tool we use in the Amazon

Redshift Getting Started, does not support multiple concurrent queries. To start another session using

SQLWorkbench, select File, New Window and connect using the same connection parameters. Then you

can find the PID and cancel the query.

Cancel a Query Using the Superuser Queue

If your current session has too many queries running concurrently, you might not be able to run the

CANCEL command until another query finishes. In that case, you will need to issue the CANCEL command

using a different workload management query queue.

Workload management enables you to execute queries in different query queues so that you don't

need to wait for another query to complete. The workload manager creates a separate queue, called

the Superuser queue, that you can use for troubleshooting. To use the Superuser queue, you must be

logged on as a superuser and set the query group to 'superuser' using the SET command. After running

your commands, reset the query group using the RESET command.

To cancel a query using the Superuser queue, issue these commands:

set query_group to 'superuser';


cancel 18764;

reset query_group;

For information about managing query queues, see Implementing Workload Management (p. 285).
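To see which queue each running query occupies before you cancel anything, you can look at the WLM state table. This is a sketch under two assumptions: that STV_WLM_QUERY_STATE exposes the columns shown, and that service class 5 corresponds to the Superuser queue on your cluster.

```sql
-- Show the queries that WLM is tracking and the service class (queue) of each.
-- Assumption: service class 5 is the Superuser queue.
select query,
       service_class,
       slot_count,
       trim(state) as state,
       queue_time,
       exec_time
from stv_wlm_query_state
order by service_class, wlm_start_time;
```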

Step 7: Clean Up Your Resources

If you deployed a cluster in order to complete this exercise, when you are finished with the exercise, you

should delete the cluster so that it will stop accruing charges to your AWS account.

To delete the cluster, follow the steps in Deleting a Cluster in the Amazon Redshift Cluster Management

Guide.

If you want to keep the cluster, you might want to keep the sample data for reference. Most of the

examples in this guide use the tables you created in this exercise. The size of the data will not have any

significant effect on your available storage.

If you want to keep the cluster, but want to clean up the sample data, you can run the following

command to drop the TICKIT database:

drop database tickit;

If you didn't create a TICKIT database, or if you don't want to drop the database, run the following

commands to drop just the tables:

drop table testtable;

drop table users;

drop table venue;

drop table category;

drop table date;

drop table event;

drop table listing;

drop table sales;
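As an alternative to the individual statements above, DROP TABLE accepts a list of tables and an IF EXISTS clause. The following sketch drops the same tutorial tables in one statement and does not fail if some of them were never created.

```sql
-- Drop all of the tutorial tables in a single statement.
-- IF EXISTS skips any table that is not present instead of raising an error.
drop table if exists testtable, users, venue, category, date, event, listing, sales;
```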


Building a Proof of Concept for Amazon Redshift

Amazon Redshift is a fast, scalable data warehouse that makes it simple and cost-effective to analyze

all your data using standard SQL and your existing business intelligence tools. Amazon Redshift delivers

10 times faster performance than other data warehouses. It does so by using sophisticated query

optimization, columnar storage on high-performance local disks, machine learning, and massively

parallel query execution.

In the following sections, you can find a framework for building a proof of concept with Amazon

Redshift. The framework gives you architectural best practices for designing and operating a secure,

high-performing, and cost-effective Amazon Redshift data warehouse. This guidance is based on

reviewing designs of thousands of customers’ architectures across a wide variety of business types and

use cases. We compiled customer experiences to develop this set of best practices to help you identify

criteria for evaluating your data warehouse workload.

If you are a first-time user of Amazon Redshift, we recommend that you read Getting Started with

Amazon Redshift. This guide provides a tutorial for using Amazon Redshift to create a sample cluster and

work with sample data. To get insights into the benefits of using Amazon Redshift and into pricing, see

Service Highlights and Pricing Information on the marketing webpage.

Identifying the Goals of the Proof of Concept

Identifying the goals of the proof of concept plays a critical role in determining what you want to

measure as part of the evaluation process. The evaluation criteria should include the current challenges,

enhancements you want to make to improve customer experience, and methods of addressing your

current operational pain points. You can use the following questions to identify the goals of the proof of

concept:

• Do you have specific service level agreements whose terms you want to improve?

• What are your goals for scaling your Amazon Redshift data warehouse?

• What new datasets do you or your customers need to include in your data warehouse?

• What are the business-critical SQL queries you need to benchmark? Make sure to include the full range

of SQL complexities, such as the different types of queries (for example, ingest, update, and delete).

• What are the general types of workloads you plan to test? Examples might be extract, transform, and load

(ETL) workloads, reporting queries, and batch extracts.

After you have answered these questions, you should be able to establish a SMART goal for building your

proof of concept.

Setting Up Your Proof of Concept

You set up your Amazon Redshift proof of concept environment in two steps. First, you set up the AWS

resources. Second, you convert the schema and datasets for evaluation.


Designing and Setting Up Your Cluster

You can set up your cluster with either of the following two node types:

• Dense Storage, which enables you to create very large data warehouses using hard disk drives (HDDs)

for a very low price.

• Dense Compute, which enables you to create high-performance data warehouses using fast CPUs,

large amounts of RAM, and solid-state disks (SSDs).

The goals of your workload and your overall budget should help you determine which type of node to

select. Resizing your cluster or switching to a different type of node is simply a button click in the AWS

Management Console. The following additional considerations can help guide you in setting up your

cluster:

• Select a cluster size that is large enough to handle your production workload. Generally, you need at

least two compute nodes (a multinode cluster). The leader node is included at no additional cost.

• Create your cluster in a virtual private cloud (VPC), which provides better performance than an EC2-Classic installation.

• Plan to maintain at least 20 percent free space, or three times as much memory as needed by your

largest table. This extra space is needed to provide these:

• Scratch space for usage and rewriting tables

• Free space required for vacuum operations and for re-sorting tables

• Temporary tables used for storing intermediate query results
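To check how much free space a running cluster has, a superuser can query the STV_PARTITIONS system table. The following is a sketch that assumes CAPACITY and USED are reported in 1 MB blocks; the PCT_FREE alias and the PART_BEGIN filter are assumptions made for illustration.

```sql
-- Report total, used, and free disk space in gigabytes, plus the percentage free.
-- Assumption: capacity and used are counted in 1 MB blocks, and part_begin = 0
-- limits the result to one partition per disk.
select sum(capacity) / 1024 as capacity_gbytes,
       sum(used) / 1024 as used_gbytes,
       (sum(capacity) - sum(used)) / 1024 as free_gbytes,
       round(100.0 * (sum(capacity) - sum(used)) / sum(capacity), 1) as pct_free
from stv_partitions
where part_begin = 0;
```

A PCT_FREE value below 20 suggests that the cluster is under the free-space target described above.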

Converting Your Schema and Setting Up the Datasets

You can convert your schema, code, and data with either the AWS Schema Conversion Tool (AWS SCT) or

the AWS Database Migration Service (AWS DMS). Your best choice of tool depends on the source of your

data.

The following can help you set up your data in Amazon Redshift:

• Migrate from Oracle to Amazon Redshift – This project uses an AWS CloudFormation template, AWS

DMS, and AWS SCT to migrate your data with only a few clicks.

• Migrate Your Data Warehouse to Amazon Redshift Using the AWS SCT – This blog provides an

overview of how you can use the AWS SCT data extractors to migrate your data warehouse to Amazon

Redshift.

Cluster Design Considerations

Keep the following five attributes in mind when designing your cluster. The SET DW acronym is an easy

way to remember them:

• S – The S is for sort key. Query filters access sort key columns frequently. Follow these best practices to

select sort keys:

• Choose up to three columns to be the sort key columns

• Order the sort keys in increasing degree of specificity, but balance this with the frequency of use

For more guidance on selecting sort keys, see Choose the Best Sort Key and the AWS Big Data Blog

post The Advanced Table Design Playbook.


• E – The E is for encoding. Encoding sets the compression algorithm used for each column in each table.

You can either set encoding yourself, or have Amazon Redshift set this for you. For more information

on how to let Amazon Redshift choose the best compression algorithm, see Loading Tables with

Automatic Compression.

• T – The T is for table maintenance. The Amazon Redshift query optimizer creates more efficient

execution plans when table statistics are up to date. Use the ANALYZE command to gather statistics

after loading, updating, or deleting data from tables. Similarly, you can minimize the number of blocks

scanned with the VACUUM command. VACUUM improves performance by doing the following:

• Removing the rows that have been logically deleted from the block, resulting in fewer blocks to scan

• Keeping the data in sort key order, which helps target the specific blocks for scanning

• D – The D is for table distribution. You have three options for table distribution:

• KEY – You designate a column for distribution.

• EVEN – Amazon Redshift distributes rows across the compute nodes in a round-robin pattern.

• ALL – Amazon Redshift puts a complete copy of the table in the database slice of each compute

node.

• The following guidelines can help you select the best distribution pattern:

• If users frequently join a Customers table using the customer id value and doing so distributes

the rows evenly across the database slices, then customer id is a good choice for a distribution

key.

• If a table is approximately 5 million rows and contains dimension data, then choose the ALL

distribution style.

• EVEN is a safe choice for a distribution pattern, but always results in data distribution across all

compute nodes.

• W – The W is for Amazon Redshift Workload Management (WLM). If you use WLM, you control the flow

of SQL statements through the compute clusters and how much system memory to allocate. For more

information on setting up WLM, see Implementing Workload Management (p. 285).
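The S, E, T, and D attributes above can be sketched in a single table definition followed by maintenance commands. The table name, column names, and encodings below are hypothetical and chosen only for illustration; the right choices depend on your own data and queries.

```sql
-- Hypothetical table that illustrates the S, E, and D choices in one definition.
create table customer_orders (
  order_id    bigint        not null,
  customer_id integer       not null,
  order_date  date          not null encode raw,  -- E: leading sort key column left uncompressed
  region      varchar(16)   encode zstd,          -- E: encoding set explicitly
  amount      decimal(12,2) encode zstd
)
diststyle key
distkey (customer_id)                              -- D: distribute on the common join column
compound sortkey (order_date, region, customer_id); -- S: up to three sort key columns

-- T: refresh statistics and re-sort rows after loading or changing data.
analyze customer_orders;
vacuum customer_orders;
```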

Amazon Redshift Evaluation Checklist

For best evaluation results, check the following list of items to determine if they apply to your Amazon

Redshift evaluation:

• Data load time – Using the COPY command is a common way to test how long it takes to load data.

For more information, see Best Practices for Loading Data.

• Throughput of the cluster – Measuring queries per hour is a common way to determine throughput.

To do so, set up a test to run typical queries for your workload.

• Data security – You can easily encrypt data at rest and in transit with Amazon Redshift. You also have

a number of options for managing keys, and Amazon Redshift also supports single sign-on (SSO)

integration.

• Third-party tools integration – You can use either a JDBC or ODBC connection to integrate with

business intelligence and other external tools.

• Interoperability with other AWS services – Amazon Redshift integrates with other AWS services, such

as Amazon EMR, Amazon QuickSight, AWS Glue, Amazon S3, and Kinesis. You can use this integration

in setting up and managing your data warehouse.

• Backing up and making snapshots – Amazon Redshift automatically backs up your cluster at every 5

GB of changed data, or 8 hours (whichever occurs first). You can also create a snapshot at any time.

• Using snapshots – Try using a snapshot and creating a second cluster as part of your evaluation.

Evaluate if your development and testing organizations can use the cluster.

• Resizing – Your evaluation should include increasing the number or types of Amazon Redshift nodes.

Your cluster remains fully accessible during the resize, although it is in a read-only mode. Evaluate if

your users can detect that the resize is under way.


• Support – We strongly recommend that you evaluate AWS Support as part of your evaluation.

• Offloading queries and accessing infrequently used data – You can offload your queries to a separate

compute layer with Amazon Redshift Spectrum. You can also easily access infrequently used data

directly from S3 without ingesting it into your Amazon Redshift cluster.

• Operating costs – Compare the overall cost of operating your data warehouse with other options.

Amazon Redshift is fully managed, and you can perform unlimited analysis of a terabyte of your data

for approximately $1000 per year.
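The load time and throughput items in this checklist can both be measured from the STL_QUERY system table. The following sketch assumes that your COPY statements begin with the word COPY and uses an example date; adjust both to match your own test run.

```sql
-- Data load time: duration of the ten most recent COPY commands.
select query,
       datediff(seconds, starttime, endtime) as load_seconds,
       substring(querytxt, 1, 60) as statement
from stl_query
where querytxt ilike 'copy%'
order by starttime desc
limit 10;

-- Throughput: number of queries that completed without aborting, per hour.
select date_trunc('hour', starttime) as hour_start,
       count(*) as queries
from stl_query
where starttime >= '2018-04-15 00:00'
  and starttime <  '2018-04-16 00:00'
  and aborted = 0
group by 1
order by 1;
```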

Benchmarking Your Amazon Redshift Evaluation

The following list of possible benchmarks might apply to your Amazon Redshift evaluation:

• Assemble a list of queries for each runtime category. Having a sufficient number (for example, 30 per

category) helps assure that your evaluation reflects a real-world data warehouse implementation.

• Add a unique identifier to associate each query that you include in your evaluation with one of the

categories you establish for your evaluation. You can then use these unique identifiers to determine

throughput from the system tables. You can also create a query_group to organize your evaluation

queries.

For example, if you have established a "Reporting" category for your evaluation, you might create a

coding system to tag your evaluation queries with the word "Report." You can then identify individual

queries within reporting as R1, R2, and so on. The following example demonstrates this approach.

select 'Reporting' as query_category, 'R1' as query_id, *
from customers;

When you have associated a query with an evaluation category, you can then use a unique identifier to

determine throughput from the system tables for each category. The following example demonstrates

how to do this.

select query, datediff(seconds, starttime, endtime)
from stl_query
where querytxt like '%Reporting%'
and starttime >= '2018-04-15 00:00'
and endtime < '2018-04-15 23:59';

• Test throughput with historical user or ETL queries that have a variety of run times in your existing

data warehouse. Keep the following items in mind when testing throughput:

• If you are using a load testing utility (for example an open-source utility like JMeter, or a custom

utility), make sure that the tool can take the network transmission time into account.

• Make sure that the load testing utility is evaluating execution time based on throughput of the

internal system tables in Amazon Redshift.

• Identify all the various permutations that you plan to test during your evaluation. The following list

provides some common variables:

• Cluster size

• Instance type

• Load testing duration

• Concurrency settings

• WLM configuration
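The query_group approach mentioned in the list above can be sketched as follows. The group name 'reporting' and the date range are hypothetical, and the sketch assumes that the LABEL column of STL_QUERY records the query group that was active when each query ran.

```sql
-- Tag the evaluation queries that belong to the reporting category.
set query_group to 'reporting';
select count(*) from customers;
reset query_group;

-- Summarize the tagged queries for one day of testing.
select trim(label) as group_label,
       count(*) as query_count,
       sum(datediff(seconds, starttime, endtime)) as total_seconds
from stl_query
where label = 'reporting'
  and starttime >= '2018-04-15 00:00'
  and starttime <  '2018-04-16 00:00'
group by 1;
```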


Need more help? See Request Support for your Amazon Redshift Proof-of-Concept.

Additional Resources

To help your Amazon Redshift evaluation, see the following:

• Top 10 Performance Tuning Techniques for Amazon Redshift on the Big Data Blog

• Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift on the Big Data

Blog

• Amazon Redshift Management Overview in the Amazon Redshift Cluster Management Guide

• Amazon Redshift Spectrum Getting Started


Amazon Redshift Best Practices

Following, you can find best practices for designing tables, loading data into tables, and writing queries

for Amazon Redshift, and also a discussion of working with Amazon Redshift Advisor.

Amazon Redshift is not the same as other SQL database systems. To fully realize the benefits of the

Amazon Redshift architecture, you must specifically design, build, and load your tables to use massively

parallel processing, columnar data storage, and columnar data compression. If your data loading and

query execution times are longer than you expect, or longer than you want, you might be overlooking

key information.

If you are an experienced SQL database developer, we strongly recommend that you review this topic

before you begin developing your Amazon Redshift data warehouse.

If you are new to developing SQL databases, this topic is not the best place to start. We recommend that

you begin by reading Getting Started Using Databases (p. 13) and trying the examples yourself.

In this topic, you can find an overview of the most important development principles, along with

specific tips, examples, and best practices for implementing those principles. No single practice

can apply to every application. You should evaluate all of your options before finalizing a database

design. For more information, see Designing Tables (p. 118), Loading Data (p. 184), Tuning Query

Performance (p. 257), and the reference chapters.

Topics

• Amazon Redshift Best Practices for Designing Tables

• Amazon Redshift Best Practices for Loading Data

• Amazon Redshift Best Practices for Designing Queries

• Working with Recommendations from Amazon Redshift Advisor

Amazon Redshift Best Practices for Designing Tables

As you plan your database, certain key table design decisions heavily influence overall query

performance. These design choices also have a significant effect on storage requirements, which in

turn affects query performance by reducing the number of I/O operations and minimizing the memory

required to process queries.

In this section, you can find a summary of the most important design decisions and best

practices for optimizing query performance. Designing Tables (p. 118) provides more detailed

explanations and examples of table design options.

Topics

• Take the Tuning Table Design Tutorial

• Choose the Best Sort Key

• Choose the Best Distribution Style

• Let COPY Choose Compression Encodings

• Define Primary Key and Foreign Key Constraints

• Use the Smallest Possible Column Size

• Use Date/Time Data Types for Date Columns

Take the Tuning Table Design Tutorial

Tutorial: Tuning Table Design (p. 45) walks you step by step through the process of choosing sort keys,

distribution styles, and compression encodings, and shows you how to compare system performance

before and after tuning.

Choose the Best Sort Key

Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift

query optimizer uses sort order when it determines optimal query plans.

• If recent data is queried most frequently, specify the timestamp column as the leading column for

the sort key.

Queries are more efficient because they can skip entire blocks that fall outside the time range.

• If you do frequent range filtering or equality filtering on one column, specify that column as the

sort key.

Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks

the minimum and maximum column values stored on each block and can skip blocks that don't apply

to the predicate range.

• If you frequently join a table, specify the join column as both the sort key and the distribution key.

Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join.

Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of

the sort merge join.

For more information about choosing and specifying sort keys, see Tutorial: Tuning Table

Design (p. 45) and Choosing Sort Keys (p. 140).
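The first and third guidelines above can be sketched as follows. The table and column names are hypothetical; the first table sorts on a timestamp so that a time-range filter can skip blocks, and the second uses the join column as both the sort key and the distribution key.

```sql
-- Recent data is queried most often: lead the sort key with the timestamp.
create table page_views (
  view_time timestamp not null,
  user_id   integer   not null,
  url       varchar(2048)
)
sortkey (view_time);

-- A range-restricted predicate on the sort key lets the scan skip blocks
-- whose minimum and maximum values fall outside the requested day.
select count(*)
from page_views
where view_time >= '2018-04-15'
  and view_time <  '2018-04-16';

-- Frequently joined table: use the join column as both sort key and distribution key.
create table user_sessions (
  user_id       integer   not null,
  session_start timestamp not null
)
distkey (user_id)
sortkey (user_id);
```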

Choose the Best Distribution Style

When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed

to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the

impact of the redistribution step by locating the data where it needs to be before the query is executed.

1. Distribute the fact table and one dimension table on their common columns.

Your fact table can have only one distribution key. Any tables that join on another key aren't

collocated with the fact table. Choose one dimension to collocate based on how frequently it is joined

and the size of the joining rows. Designate both the dimension table's primary key and the fact table's

corresponding foreign key as the DISTKEY.

2. Choose the largest dimension based on the size of the filtered dataset.

Only the rows that are used in the join need to be distributed, so consider the size of the dataset after

filtering, not the size of the table.

3. Choose a column with high cardinality in the filtered result set.

If you distribute a sales table on a date column, for example, you should probably get fairly even data

distribution, unless most of your sales are seasonal. However, if you commonly use a range-restricted


predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices

and the query workload is skewed.

4. Change some dimension tables to use ALL distribution.

If a dimension table cannot be collocated with the fact table or other important joining tables, you

can improve query performance significantly by distributing the entire table to all of the nodes. Using

ALL distribution multiplies storage space requirements and increases load times and maintenance

operations, so you should weigh all factors before choosing ALL distribution.

To let Amazon Redshift choose the appropriate distribution style, don't specify DISTSTYLE.

For more information about choosing distribution styles, see Tutorial: Tuning Table Design (p. 45) and

Choosing a Data Distribution Style (p. 129).
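The guidelines above can be sketched with a small, hypothetical star schema. All table and column names are illustrative. The final query assumes that the SKEW_ROWS column of SVV_TABLE_INFO reports the ratio between the fullest and emptiest slices, and it returns rows only after the tables contain data.

```sql
-- Dimension that is joined most often: distribute on its primary key.
create table dim_customer (
  customer_id integer not null,
  name        varchar(64)
)
diststyle key
distkey (customer_id);

-- Fact table: distribute on the matching foreign key so joined rows are collocated.
create table fact_orders (
  order_id    bigint   not null,
  customer_id integer  not null,
  region_id   smallint not null,
  amount      decimal(12,2)
)
diststyle key
distkey (customer_id);

-- Dimension that cannot be collocated: copy it to every node.
create table dim_region (
  region_id smallint not null,
  name      varchar(32)
)
diststyle all;

-- After loading, check for skew; a value far above 1 suggests uneven distribution.
select "table", diststyle, skew_rows
from svv_table_info
where "table" in ('dim_customer', 'fact_orders', 'dim_region');
```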

Let COPY Choose Compression Encodings

You can specify compression encodings when you create a table, but in most cases, automatic

compression produces the best results.

The COPY command analyzes your data and applies compression encodings to an empty table

automatically as part of the load operation.

Automatic compression balances overall performance when choosing compression encodings. Range-restricted

scans might perform poorly if sort key columns are compressed much more highly than other

columns in the same query. As a result, automatic compression chooses a less efficient compression

encoding to keep the sort key columns balanced with other columns.

Suppose that your table's sort key is a date or timestamp and the table uses many large varchar columns.

In this case, you might get better performance by not compressing the sort key column at all. Run

the ANALYZE COMPRESSION (p. 382) command on the table, then use the encodings to create a new

table, but leave out the compression encoding for the sort key.

There is a performance cost for automatic compression encoding, but only if the table is empty

and does not already have compression encoding. For short-lived tables and tables that you create

frequently, such as staging tables, load the table once with automatic compression or run the ANALYZE

COMPRESSION command. Then use those encodings to create new tables. You can add the encodings to

the CREATE TABLE statement, or use CREATE TABLE LIKE to create a new table with the same encoding.

For more information, see Tutorial: Tuning Table Design (p. 45) and Loading Tables with Automatic

Compression (p. 209).
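The workflow described in this section can be sketched as follows. The LISTING table comes from the sample data; the LISTING_STAGE table, the Amazon S3 path, and the IAM role are placeholders that you would replace with your own values.

```sql
-- Report suggested encodings for an existing, populated table.
analyze compression listing;

-- Create an empty staging table that reuses the column definitions of LISTING.
create table listing_stage (like listing);

-- Alternatively, let COPY choose the encodings while it loads an empty table.
copy listing_stage
from 's3://mybucket/data/listing/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
compupdate on;
```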

Define Primary Key and Foreign Key Constraints

Define primary key and foreign key constraints between tables wherever appropriate. Even though they

are informational only, the query optimizer uses those constraints to generate more efficient query

plans.

Do not define primary key and foreign key constraints unless your application enforces the constraints.

Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.

See Defining Constraints (p. 145) for additional information about how Amazon Redshift uses

constraints.
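Because the constraints are not enforced, it is worth checking that the data actually satisfies them. The following sketch uses two hypothetical tables to show the constraint syntax along with queries that look for duplicate keys and orphaned foreign key values.

```sql
-- Hypothetical dimension and fact tables with informational constraints.
create table dim_venue (
  venue_id   smallint not null,
  venue_name varchar(100),
  primary key (venue_id)
);

create table fact_event (
  event_id integer  not null,
  venue_id smallint not null,
  primary key (event_id),
  foreign key (venue_id) references dim_venue (venue_id)
);

-- Duplicate check: any row returned here violates the declared primary key.
select venue_id, count(*) as copies
from dim_venue
group by venue_id
having count(*) > 1;

-- Orphan check: counts fact rows whose venue_id has no matching dimension row.
select count(*) as orphan_rows
from fact_event e
left join dim_venue v on e.venue_id = v.venue_id
where v.venue_id is null;
```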

Use the Smallest Possible Column Size

Don’t make it a practice to use the maximum column size for convenience.


Instead, consider the largest values you are likely to store in a VARCHAR column, for example, and size

your columns accordingly. Because Amazon Redshift compresses column data very effectively, creating

columns much larger than necessary has minimal impact on the size of data tables. During processing

for complex queries, however, intermediate query results might need to be stored in temporary tables.

Because temporary tables are not compressed, unnecessarily large columns consume excessive memory

and temporary disk space, which can affect query performance.
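To size a VARCHAR column, you can measure the longest value that is currently stored. The following sketch uses the FIRSTNAME and LASTNAME columns of the USERS sample table, and assumes that the MAX_VARCHAR column of SVV_TABLE_INFO reports the widest declared VARCHAR column for each table.

```sql
-- Longest stored values, in bytes, for two character columns.
select max(octet_length(firstname)) as max_firstname_bytes,
       max(octet_length(lastname))  as max_lastname_bytes
from users;

-- Tables with the widest declared VARCHAR columns, as candidates for review.
select "table", max_varchar
from svv_table_info
order by max_varchar desc;
```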

Use Date/Time Data Types for Date Columns

Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which

results in better query performance. Use the DATE or TIMESTAMP data type, depending on the resolution

you need, rather than a character type when storing date/time information. For more information, see

Datetime Types (p. 326).
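As an illustration, the following sketch declares date/time columns with native types and converts a character value during a query. The EVENT_LOG table and its columns are hypothetical.

```sql
-- Use DATE when day resolution is enough and TIMESTAMP when you need the time of day.
create table event_log (
  event_date date      not null,
  event_time timestamp not null,
  detail     varchar(256)
);

-- Range predicates compare native date values directly.
select count(*)
from event_log
where event_date between '2018-04-01' and '2018-04-30';

-- Convert a character value to a DATE when the source data arrives as text.
select to_date('20180415', 'YYYYMMDD') as converted_date;
```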

Amazon Redshift Best Practices for Loading Data

Topics

• Take the Loading Data Tutorial

• Take the Tuning Table Design Tutorial

• Use a COPY Command to Load Data

• Use a Single COPY Command to Load from Multiple Files

• Split Your Load Data into Multiple Files

• Compress Your Data Files

• Use a Manifest File

• Verify Data Files Before and After a Load

• Use a Multi-Row Insert

• Use a Bulk Insert

• Load Data in Sort Key Order

• Load Data in Sequential Blocks

• Use Time-Series Tables

• Use a Staging Table to Perform a Merge (Upsert)

• Schedule Around Maintenance Windows

Loading very large datasets can take a long time and consume a lot of computing resources. How your

data is loaded can also affect query performance. This section presents best practices for loading data

efficiently using COPY commands, bulk inserts, and staging tables.

Take the Loading Data Tutorial

Tutorial: Loading Data from Amazon S3 (p. 70) walks you beginning to end through the steps to

upload data to an Amazon S3 bucket and then use the COPY command to load the data into your tables.

The tutorial includes help with troubleshooting load errors and compares the performance difference

between loading from a single file and loading from multiple files.

Take the Tuning Table Design Tutorial

Data loads are heavily influenced by table design, especially compression encodings and distribution

styles. Tutorial: Tuning Table Design (p. 45) walks you step-by-step through the process of choosing


sort keys, distribution styles, and compression encodings, and shows you how to compare system

performance before and after tuning.

Use a COPY Command to Load Data

The COPY command loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or

multiple data sources on remote hosts. COPY loads large amounts of data much more efficiently than

using INSERT statements, and stores the data more effectively as well.

For more information about using the COPY command, see Loading Data from Amazon S3 (p. 187) and

Loading Data from an Amazon DynamoDB Table (p. 206).
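The following sketch shows the general shape of a COPY command for two of the sources named above. The bucket, key prefix, IAM role, and the FAVORITEMOVIES and Movies table names are placeholders, and the READRATIO value of 50 is only an example.

```sql
-- Load the SALES table from all files that share an Amazon S3 key prefix.
copy sales
from 's3://mybucket/data/sales/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

-- Load a table from an Amazon DynamoDB table, using up to 50 percent
-- of the DynamoDB table's provisioned read throughput.
copy favoritemovies
from 'dynamodb://Movies'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
readratio 50;
```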

Use a Single COPY Command to Load from Multiple Files

Amazon Redshift automatically loads in parallel from multiple data files.

If you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift

is forced to perform a serialized load. This type of load is much slower and requires a VACUUM process

at the end if the table has a sort column defined. For more information about using COPY to load data in

parallel, see Loading Data from Amazon S3 (p. 187).

Split Your Load Data into Multiple Files

The COPY command loads the data in parallel from multiple files, dividing the workload among the

nodes in your cluster. When you load all the data from a single large file, Amazon Redshift is forced to

perform a serialized load, which is much slower. Split your load data files so that the files are about equal

size, between 1 MB and 1 GB after compression. For optimum parallelism, the ideal size is between 1 MB

and 125 MB after compression. The number of files should be a multiple of the number of slices in your

cluster. For more information about how to split your data into files and examples of using COPY to load

data, see Loading Data from Amazon S3 (p. 187).
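To choose the number of files, you first need the number of slices in the cluster. The following sketch counts the slices and then loads a set of split files through a shared key prefix. The file names, bucket, and IAM role are placeholders.

```sql
-- Count the slices in the cluster; the number of load files should be a multiple of this value.
select count(*) as slice_count
from stv_slices;

-- Assumed layout: the data was split into venue.txt.1, venue.txt.2, and so on.
-- A single COPY with the common prefix loads all of the parts in parallel.
copy venue
from 's3://mybucket/data/venue.txt'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';
```

For example, on a cluster that reports 4 slices, splitting the data into 4, 8, or 12 files of similar size keeps every slice busy.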

Compress Your Data Files

We strongly recommend that you individually compress your load files using gzip, lzop, or bzip2 when

you have large datasets.

Specify the GZIP, LZOP, or BZIP2 option with the COPY command. This example loads the TIME table

from a pipe-delimited lzop file.

copy time

from 's3://mybucket/data/timerows.lzo'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

lzop

delimiter '|';

Use a Manifest File

Amazon S3 provides eventual consistency for some operations. Thus, it's possible that new data won't

be available immediately after the upload, which can result in an incomplete data load or loading stale

data. You can manage data consistency by using a manifest ﬁle to load data. For more information, see

Managing Data Consistency (p. 189).
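As a sketch of this approach, the following COPY command reads a manifest instead of a key prefix, so that only the listed files are loaded and the load fails if a mandatory file is missing. The CUSTOMER table, file names, bucket, and IAM role are placeholders.

```sql
-- Assumed contents of s3://mybucket/cust.manifest (JSON):
-- {
--   "entries": [
--     {"url": "s3://mybucket/custdata.1", "mandatory": true},
--     {"url": "s3://mybucket/custdata.2", "mandatory": true}
--   ]
-- }

-- The MANIFEST option tells COPY to treat the FROM object as a list of files.
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
```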


Verify Data Files Before and After a Load

When you load data from Amazon S3, first upload your files to your Amazon S3 bucket, then verify that

the bucket contains all the correct files, and only those files. For more information, see Verifying That the

Correct Files Are Present in Your Bucket (p. 191).

After the load operation is complete, query the STL_LOAD_COMMITS (p. 823) system table to verify

that the expected files were loaded. For more information, see Verifying That the Data Was Loaded

Correctly (p. 208).

Use a Multi-Row Insert

If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever

possible. Data compression is ineﬃcient when you add data only one row or a few rows at a time.

Multi-row inserts improve performance by batching up a series of inserts. The following example inserts

three rows into a four-column table using a single INSERT statement. This is still a small insert, shown

simply to illustrate the syntax of a multi-row insert.

insert into category_stage values

(default, default, default, default),

(20, default, 'Country', default),

(21, 'Concerts', 'Rock', default);

See INSERT (p. 520) for more details and examples.

Use a Bulk Insert

Use a bulk insert operation with a SELECT clause for high-performance data insertion.

Use the INSERT (p. 520) and CREATE TABLE AS (p. 483) commands when you need to move data or a

subset of data from one table into another.

For example, the following INSERT statement selects all of the rows from the CATEGORY table and

inserts them into the CATEGORY_STAGE table.

insert into category_stage

(select * from category);

The following example creates CATEGORY_STAGE as a copy of CATEGORY and inserts all of the rows in

CATEGORY into CATEGORY_STAGE.

create table category_stage as

select * from category;

Load Data in Sort Key Order

Load your data in sort key order to avoid needing to vacuum.

If each batch of new data follows the existing rows in your table, your data is properly stored in sort

order, and you don't need to run a vacuum. You don't need to presort the rows in each load because

COPY sorts each batch of incoming data as it loads.

For example, suppose that you load data every day based on the current day's activity. If your sort key is

a timestamp column, your data is stored in sort order. This order occurs because the current day's data is

always appended at the end of the previous day's data. For more information, see Loading Your Data in

Sort Key Order (p. 235).

Load Data in Sequential Blocks

If you need to add a large quantity of data, load the data in sequential blocks according to sort order to

eliminate the need to vacuum.

For example, suppose that you need to load a table with events from January 2017 to December

2017. Load the rows for January, then February, and so on. Your table is completely sorted when

your load completes, and you don't need to run a vacuum. For more information, see Use Time-Series

Tables (p. 32).

When loading very large datasets, the space required to sort might exceed the total available space. By

loading data in smaller blocks, you use much less intermediate sort space during each load. In addition,

loading smaller blocks makes it easier to restart if the COPY fails and is rolled back.

Use Time-Series Tables

If your data has a ﬁxed retention period, we strongly recommend that you organize your data as a

sequence of time-series tables. In this sequence, each table should be identical but contain data for

diﬀerent time ranges.

You can easily remove old data simply by executing a DROP TABLE on the corresponding tables.

This approach is much faster than running a large-scale DELETE and saves you from having to run a

subsequent VACUUM process to reclaim space. You can create a UNION ALL view to hide the fact that

the data is stored in diﬀerent tables. When you delete old data, simply reﬁne your UNION ALL view to

remove the dropped tables. Similarly, as you load new time periods into new tables, add the new tables

to the view.

If you use time-series tables with a timestamp column for the sort key, you eﬀectively load your data in

sort key order. Doing this eliminates the need to vacuum to re-sort the data. For more information, see

Load Data in Sort Key Order (p. 31).
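
Maintaining the UNION ALL view by hand is error-prone when tables are added and dropped on a schedule. The following Python is a minimal sketch that regenerates the view definition from the current list of tables. The function name, the view name, and the monthly table names are assumptions for illustration.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def union_all_view_sql(view_name, table_names):
    """Return a CREATE OR REPLACE VIEW statement over the given tables.

    Hypothetical helper: the tables are assumed to be identical in
    structure, as described above, so SELECT * lines up across them.
    """
    names = [view_name, *table_names]
    if not table_names or not all(_IDENTIFIER.fullmatch(n) for n in names):
        raise ValueError("need at least one table and plain identifiers")
    selects = "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in table_names)
    return f"CREATE OR REPLACE VIEW {view_name} AS\n{selects};"


# After dropping the oldest table and adding the newest, rebuild the view.
sql = union_all_view_sql("sales_all", ["sales_2017_02", "sales_2017_03"])
```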

Use a Staging Table to Perform a Merge (Upsert)

You can eﬃciently update and insert new data by loading your data into a staging table ﬁrst.

Amazon Redshift doesn't support a single merge statement (update or insert, also known as an upsert)

to insert and update data from a single data source. However, you can eﬀectively perform a merge

operation. To do so, load your data into a staging table and then join the staging table with your target

table for an UPDATE statement and an INSERT statement. For instructions, see Updating and Inserting

New Data (p. 216).
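
The update-then-insert pattern can be tried locally before you run it against a cluster. The following Python is a sketch that uses the standard sqlite3 module as a stand-in database, so the SQL dialect is SQLite rather than Amazon Redshift, and the table and column names are assumptions. It shows the two statements running in one transaction.

```python
import sqlite3


def merge_from_stage(conn):
    """Apply staged rows to the target: update matches, then insert new rows."""
    with conn:  # both statements commit together, or roll back together
        conn.execute("""
            UPDATE target
            SET name = (SELECT s.name FROM stage s WHERE s.id = target.id)
            WHERE id IN (SELECT id FROM stage)
        """)
        conn.execute("""
            INSERT INTO target (id, name)
            SELECT s.id, s.name FROM stage s
            WHERE s.id NOT IN (SELECT id FROM target)
        """)


def demo_merge():
    """Build two small tables, merge them, and return the target rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE stage  (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO target VALUES (1, 'old-a'), (2, 'old-b');
        INSERT INTO stage  VALUES (2, 'new-b'), (3, 'new-c');
    """)
    merge_from_stage(conn)
    return conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
```

In this sketch, row 1 is untouched, row 2 is updated from the staging table, and row 3 is inserted.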

Schedule Around Maintenance Windows

If scheduled maintenance occurs while a query is running, the query is terminated and rolled back

and you need to restart it. Schedule long-running operations, such as large data loads or VACUUM

operations, to avoid maintenance windows. You can also minimize the risk, and make restarts easier

when they are needed, by performing data loads in smaller increments and managing the size of your

VACUUM operations. For more information, see Load Data in Sequential Blocks (p. 32) and Vacuuming

Tables (p. 228).

Amazon Redshift Best Practices for Designing Queries

To maximize query performance, follow these recommendations when creating queries.

• Design tables according to best practices to provide a solid foundation for query performance. For

more information, see Amazon Redshift Best Practices for Designing Tables (p. 26).

• Avoid using select *. Include only the columns you speciﬁcally need.

• Use a CASE Expression (p. 655) to perform complex aggregations instead of selecting from the same

table multiple times.

• Don’t use cross-joins unless absolutely necessary. These joins without a join condition result in the

Cartesian product of two tables. Cross-joins are typically executed as nested-loop joins, which are the

slowest of the possible join types.

• Use subqueries in cases where one table in the query is used only for predicate conditions and

the subquery returns a small number of rows (less than about 200). The following example uses a

subquery to avoid joining the LISTING table.

select sum(sales.qtysold)

from sales

where listid in (select listid from listing where listtime > '2008-12-26');

• Use predicates to restrict the dataset as much as possible.

• In the predicate, use the least expensive operators that you can. Comparison Condition (p. 341)

operators are preferable to LIKE (p. 346) operators. LIKE operators are still preferable to SIMILAR

TO (p. 348) or POSIX Operators (p. 351).

• Avoid using functions in query predicates. Using them can drive up the cost of the query by requiring

large numbers of rows to resolve the intermediate steps of the query.

• If possible, use a WHERE clause to restrict the dataset. The query planner can then use row order to

help determine which records match the criteria, so it can skip scanning large numbers of disk blocks.

Without this, the query execution engine must scan participating columns entirely.

• Add predicates to ﬁlter tables that participate in joins, even if the predicates apply the same ﬁlters.

The query returns the same result set, but Amazon Redshift is able to ﬁlter the join tables before the

scan step and can then eﬃciently skip scanning blocks from those tables. Redundant ﬁlters aren't

needed if you ﬁlter on a column that's used in the join condition.

For example, suppose that you want to join SALES and LISTING to ﬁnd ticket sales for tickets listed

after December 1, grouped by seller. Both tables are sorted by date. The following query joins the tables

on their common key and ﬁlters for listing.listtime values greater than December 1.

select listing.sellerid, sum(sales.qtysold)

from sales, listing

where sales.listid = listing.listid

and listing.listtime > '2008-12-01'

group by 1 order by 1;

The WHERE clause doesn't include a predicate for sales.saletime, so the execution engine is forced

to scan the entire SALES table. If you know the filter would result in fewer rows participating in the

join, then add that filter as well. The following example cuts execution time significantly.

select listing.sellerid, sum(sales.qtysold)

from sales, listing

where sales.listid = listing.listid

and listing.listtime > '2008-12-01'

and sales.saletime > '2008-12-01'

group by 1 order by 1;

• Use sort keys in the GROUP BY clause so the query planner can use more eﬃcient aggregation. A

query might qualify for one-phase aggregation when its GROUP BY list contains only sort key columns,

one of which is also the distribution key. The sort key columns in the GROUP BY list must include the

ﬁrst sort key, then other sort keys that you want to use in sort key order. For example, it is valid to use

the ﬁrst sort key, the ﬁrst and second sort keys, the ﬁrst, second, and third sort keys, and so on. It is

not valid to use the ﬁrst and third sort keys.

You can conﬁrm the use of one-phase aggregation by running the EXPLAIN (p. 511) command and

looking for XN GroupAggregate in the aggregation step of the query.

• If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in the same

order in both. That is, use the following approach.

group by a, b, c

order by a, b, c

Don't use the following approach.

group by b, c, a

order by a, b, c
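
The sort key rule for GROUP BY in the list above can be expressed as a quick check. The following Python is a sketch under stated assumptions: the function name and arguments are hypothetical, and a True result means only that the GROUP BY list has the shape described above. Confirm the actual plan with EXPLAIN.

```python
def may_qualify_for_one_phase(group_by, sort_keys, dist_key):
    """Return True if the GROUP BY list has the shape described above.

    group_by  -- column names in the GROUP BY list
    sort_keys -- the table's sort key columns, in sort key order
    dist_key  -- the table's distribution key column
    """
    count = len(group_by)
    if count == 0 or count > len(sort_keys):
        return False
    # The list must be the first N sort key columns, with no gaps.
    is_leading_prefix = list(group_by) == list(sort_keys[:count])
    # One of the grouped columns must also be the distribution key.
    return is_leading_prefix and dist_key in group_by
```

For sort keys (a, b, c) with distribution key a, grouping by a, b passes the check, and grouping by a, c does not, because it skips the second sort key.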

Working with Recommendations from Amazon Redshift Advisor

To help you improve the performance and decrease the operating costs for your Amazon Redshift

cluster, Amazon Redshift Advisor oﬀers you speciﬁc recommendations about changes to make. Advisor

develops its customized recommendations by analyzing performance and usage metrics for your cluster.

These tailored recommendations relate to operations and cluster settings. To help you prioritize your

optimizations, Advisor ranks recommendations by order of impact.

Advisor bases its recommendations on observations regarding performance statistics or operations data.

Advisor develops observations by running tests on your clusters to determine if a test value is within

a speciﬁed range. If the test result is outside of that range, Advisor generates an observation for your

cluster. At the same time, Advisor creates a recommendation about how to bring the observed value

back into the best-practice range. Advisor only displays recommendations that should have a signiﬁcant

impact on performance and operations. When Advisor determines that a recommendation has been

addressed, it removes it from your recommendation list.

For example, suppose that your data warehouse contains a large number of uncompressed table

columns. In this case, you can save on cluster storage costs by rebuilding tables using the ENCODE

parameter to specify column compression. In another example, suppose that Advisor observes that your

cluster contains a significant amount of uncompressed table data. In this case, it provides you

with the SQL code block to ﬁnd the table columns that are candidates for compression and resources

that describe how to compress those columns.

Topics

• Viewing Amazon Redshift Advisor Recommendations in the Console (p. 34)

• Amazon Redshift Advisor Recommendations (p. 35)

Viewing Amazon Redshift Advisor Recommendations in the Console

You can view Amazon Redshift Advisor analysis results and recommendations in the AWS Management

Console.

To view Amazon Redshift Advisor recommendations in the console

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://console.aws.amazon.com/redshift/.

2. In the navigation pane, choose Advisor.

3. Choose the cluster that you want to get recommendations for.

4. Expand each recommendation to see more details.

Amazon Redshift Advisor Recommendations

Amazon Redshift Advisor oﬀers recommendations about how to optimize your Amazon Redshift

cluster to increase performance and save on operating costs. You can ﬁnd explanations for each

recommendation in the console, as described previously. You can find further details on these

recommendations in the following sections.

Topics

• Compress Table Data (p. 36)

• Compress Amazon S3 File Objects Loaded by COPY (p. 37)

• Isolate Multiple Active Databases (p. 38)

• Reallocate Workload Management (WLM) Memory (p. 38)

• Skip Compression Analysis During COPY (p. 40)

• Split Amazon S3 Objects Loaded by COPY (p. 41)

• Update Table Statistics (p. 42)

• Enable Short Query Acceleration (p. 43)

• Replace Single-Column Interleaved Sort Keys (p. 44)

Compress Table Data

Amazon Redshift is optimized to reduce your storage footprint and improve query performance by using

compression encodings. When you don't use compression, data consumes additional space and requires

additional disk I/O. Applying compression to large uncompressed columns can have a big impact on your

cluster.

Analysis

The compression analysis in Advisor tracks uncompressed storage allocated to permanent user tables.

It reviews storage metadata associated with large uncompressed columns that aren't sort key columns.

Advisor oﬀers a recommendation to rebuild tables with uncompressed columns when the total amount

of uncompressed storage exceeds 15 percent of total storage space, or at the following node-speciﬁc

thresholds.

Cluster Size     Threshold
DC2.LARGE        480 GB
DC2.8XLARGE      2.56 TB
DS2.XLARGE       4 TB
DS2.8XLARGE      16 TB

Recommendation

Addressing uncompressed storage for a single table is a one-time optimization that requires the table to

be rebuilt. We recommend that you rebuild any tables that contain uncompressed columns that are both

large and frequently accessed. To identify which tables contain the most uncompressed storage, run the

following SQL command as a superuser.

SELECT

ti.schema||'.'||ti."table" tablename,

raw_size.size uncompressed_mb,

ti.size total_mb

FROM svv_table_info ti

LEFT JOIN (

SELECT tbl table_id, COUNT(*) size

FROM stv_blocklist

WHERE (tbl,col) IN (

SELECT attrelid, attnum-1

FROM pg_attribute

WHERE attencodingtype IN (0,128)

AND attnum>0 AND attsortkeyord != 1)

GROUP BY tbl) raw_size USING (table_id)

WHERE raw_size.size IS NOT NULL

ORDER BY raw_size.size DESC;

The data returned in the uncompressed_mb column represents the total number of uncompressed 1-MB blocks for all columns in the table.

When you rebuild the tables, use the ENCODE parameter to explicitly set column compression.

Implementation Tips

• Leave any columns that are the ﬁrst column in a compound sort key uncompressed. The Advisor

analysis doesn't count the storage consumed by those columns.

• Compressing large columns has a higher impact on performance and storage than compressing small

columns.

• If you are unsure which compression is best, use the ANALYZE COMPRESSION (p. 382) command to

suggest a compression.

• To generate the data deﬁnition language (DDL) statements for existing tables, you can use the AWS

Generate Table DDL utility, found on GitHub.

• To simplify the compression suggestions and the process of rebuilding tables, you can use the Amazon

Redshift Column Encoding Utility, found on GitHub.

Compress Amazon S3 File Objects Loaded by COPY

The COPY command takes advantage of the massively parallel processing (MPP) architecture in Amazon

Redshift to read and load data in parallel. It can read ﬁles from Amazon S3, DynamoDB tables, and text

output from one or more remote hosts.

When loading large amounts of data, we strongly recommend using the COPY command to load

compressed data ﬁles from S3. Compressing large datasets saves time uploading the ﬁles to S3. COPY

can also speed up the load process by uncompressing the ﬁles as they are read.

Analysis

Long-running COPY commands that load large uncompressed datasets often have an opportunity for

considerable performance improvement. The Advisor analysis identiﬁes COPY commands that load large

uncompressed datasets. In such a case, Advisor generates a recommendation to implement compression

on the source ﬁles in S3.

Recommendation

Ensure that each COPY that loads a signiﬁcant amount of data, or runs for a signiﬁcant duration, ingests

compressed data objects from S3. You can identify the COPY commands that load large uncompressed

datasets from S3 by running the following SQL command as a superuser.

SELECT

wq.userid, query, exec_start_time AS starttime, COUNT(*) num_files,

ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,

ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,

SUBSTRING(querytxt,1,60) copy_sql

FROM stl_s3client s

JOIN stl_query q USING (query)

JOIN stl_wlm_query wq USING (query)

WHERE s.userid>1 AND http_method = 'GET'

AND POSITION('COPY ANALYZE' IN querytxt) = 0

AND aborted = 0 AND final_state='Completed'

GROUP BY 1, 2, 3, 7

HAVING SUM(transfer_size) = SUM(data_size)

AND SUM(transfer_size)/(1024*1024) >= 5

ORDER BY 6 DESC, 5 DESC;

If the staged data remains in S3 after you load it, which is common in data lake architectures, storing this

data in a compressed form can reduce your storage costs.

Implementation Tips

• The ideal object size is 1–128 MB after compression.

• You can compress ﬁles with gzip, lzop, or bzip2 format.

Isolate Multiple Active Databases

As a best practice, we recommend isolating databases in Amazon Redshift from one another. Queries

run in a speciﬁc database and can't access data from any other database on the cluster. However, the

queries that you run in all databases of a cluster share the same underlying cluster storage space and

compute resources. When a single cluster contains multiple active databases, their workloads are usually

unrelated.

Analysis

The Advisor analysis reviews all databases on the cluster for active workloads running at the same time.

If there are active workloads running at the same time, Advisor generates a recommendation to consider

migrating databases to separate Amazon Redshift clusters.

Recommendation

Consider moving each actively queried database to a separate dedicated cluster. Using a separate cluster

can reduce resource contention and improve query performance. It can do so because it enables you

to set the size for each cluster for the storage, cost, and performance needs of each workload. Also,

unrelated workloads often beneﬁt from diﬀerent workload management conﬁgurations.

To identify which databases are actively used, you can run this SQL command as a superuser.

SELECT database,

COUNT(*) as num_queries,

AVG(DATEDIFF(sec,starttime,endtime)) avg_duration,

MIN(starttime) as oldest_ts,

MAX(endtime) as latest_ts

FROM stl_query

WHERE userid > 1

GROUP BY database;

Implementation Tips

• Because a user must connect to each database speciﬁcally, and queries can only access a single

database, moving databases to separate clusters has minimal impact for users.

• One option to move a database is to take the following steps:

1. Temporarily restore a snapshot of the current cluster to a cluster of the same size.

2. Delete all databases from the new cluster except the target database to be moved.

3. Resize the cluster to an appropriate node type and count for the database's workload.

Reallocate Workload Management (WLM) Memory

Amazon Redshift routes user queries to queues for processing; see Defining Query Queues (p. 285). Workload

management (WLM) deﬁnes how those queries are routed to the queues. Amazon Redshift allocates each

queue a portion of the cluster's available memory. A queue's memory is divided among the queue's query

slots.

When a queue is conﬁgured with more slots than the workload requires, the memory allocated to these

unused slots goes underutilized. Reducing the conﬁgured slots to match the peak workload requirements

redistributes the underutilized memory to active slots, and can result in improved query performance.

Analysis

The Advisor analysis reviews workload concurrency requirements to identify query queues with unused

slots. Advisor generates a recommendation to reduce the number of slots in a queue when it ﬁnds the

following:

• A queue with slots that are completely inactive throughout the analysis

• A queue with more than four slots that had at least two inactive slots throughout the analysis

Recommendation

Reducing the conﬁgured slots to match peak workload requirements redistributes underutilized memory

to active slots. Consider reducing the conﬁgured slot count for queues where the slots have never been

fully utilized. To identify these queues, you can compare the peak hourly slot requirements for each

queue by running the following SQL command as a superuser.

WITH
generate_dt_series AS (
  select sysdate - (n * interval '5 second') as dt
  from (select row_number() over () as n from stl_scan limit 17280)),
apex AS (
  SELECT iq.dt, iq.service_class, iq.num_query_tasks,
    count(iq.slot_count) as service_class_queries,
    sum(iq.slot_count) as service_class_slots
  FROM
    (select gds.dt, wq.service_class, wscc.num_query_tasks, wq.slot_count
     FROM stl_wlm_query wq
     JOIN stv_wlm_service_class_config wscc
       ON (wscc.service_class = wq.service_class AND wscc.service_class > 5)
     JOIN generate_dt_series gds
       ON (wq.service_class_start_time <= gds.dt AND wq.service_class_end_time > gds.dt)
     WHERE wq.userid > 1 AND wq.service_class > 5) iq
  GROUP BY iq.dt, iq.service_class, iq.num_query_tasks),
maxes as (
  SELECT apex.service_class, trunc(apex.dt) as d, date_part(h,apex.dt) as dt_h,
    max(service_class_slots) max_service_class_slots
  from apex
  group by apex.service_class, apex.dt, date_part(h,apex.dt))
SELECT apex.service_class - 5 AS queue, apex.service_class,
  apex.num_query_tasks AS max_wlm_concurrency, maxes.d AS day,
  maxes.dt_h || ':00 - ' || maxes.dt_h || ':59' as hour,
  MAX(apex.service_class_slots) as max_service_class_slots
FROM apex
JOIN maxes ON (apex.service_class = maxes.service_class
  AND apex.service_class_slots = maxes.max_service_class_slots)
GROUP BY apex.service_class, apex.num_query_tasks, maxes.d, maxes.dt_h
ORDER BY apex.service_class, maxes.d, maxes.dt_h;

The max_service_class_slots column represents the maximum number of WLM query slots in the

query queue for that hour. If underutilized queues exist, implement the slot reduction optimization by

modifying a parameter group, as described in the Amazon Redshift Cluster Management Guide.
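
One way to read the query output is to compare each queue's configured concurrency with the highest slot usage seen in any hour. The following Python is a minimal sketch of that comparison. The function name and the row layout, (queue, max_wlm_concurrency, max_service_class_slots), are assumptions based on the column names in the preceding query.

```python
def queues_with_unused_slots(rows):
    """Map each underutilized queue to (configured_slots, peak_slots_used).

    rows -- iterable of (queue, max_wlm_concurrency, max_service_class_slots),
            one row for each queue and hour, as assumed from the query output.
    """
    configured = {}
    peak = {}
    for queue, max_wlm_concurrency, max_slots in rows:
        configured[queue] = max_wlm_concurrency
        # Track the busiest hour observed for this queue.
        peak[queue] = max(peak.get(queue, 0), max_slots)
    # Keep only queues whose busiest hour never used every configured slot.
    return {q: (configured[q], peak[q]) for q in peak if peak[q] < configured[q]}
```

A queue that appears in the result is a candidate for a lower slot count, provided that the sampled period included a genuine peak in the workload.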

Implementation Tips

• If your workload is highly variable in volume, make sure that the analysis captured a peak utilization

period. If it didn't, run the preceding SQL repeatedly to monitor peak concurrency requirements.

• For more details on interpreting the query results from the preceding SQL code, see the

wlm_apex_hourly.sql script on GitHub.

Skip Compression Analysis During COPY

When you use the COPY command to load data into an empty table that has no compression encoding

declared, Amazon Redshift applies storage compression automatically. This optimization ensures that data in your cluster is

stored eﬃciently even when loaded by end users. The analysis required to apply compression can require

signiﬁcant time.

Analysis

The Advisor analysis checks for COPY operations that were delayed by automatic compression analysis.

The analysis determines the compression encodings by sampling the data while it's being loaded. This

sampling is similar to that performed by the ANALYZE COMPRESSION (p. 382) command.

When you load data as part of a structured process, such as in an overnight extract, transform, load

(ETL) batch, you can deﬁne the compression beforehand. You can also optimize your table deﬁnitions to

permanently skip this phase without any negative impacts.

Recommendation

To improve COPY responsiveness by skipping the compression analysis phase, implement either of the

following two options:

• Use the column ENCODE parameter when creating any tables that you load using the COPY command.

• Disable compression altogether by supplying the COMPUPDATE OFF parameter in the COPY command.

The best solution is generally to use column encoding during table creation, because this approach also

maintains the beneﬁt of storing compressed data on disk. You can use the ANALYZE COMPRESSION

command to suggest compression encodings, but you must recreate the table to apply these encodings.

To automate this process, you can use the AWS ColumnEncodingUtility, found on GitHub.

To identify recent COPY operations that triggered automatic compression analysis, run the following SQL

command.

WITH xids AS (

SELECT xid FROM stl_query WHERE userid>1 AND aborted=0

AND querytxt = 'analyze compression phase 1' GROUP BY xid

INTERSECT SELECT xid FROM stl_commit_stats WHERE node=-1)

SELECT a.userid, a.query, a.xid, a.starttime, b.complyze_sec,

a.copy_sec, a.copy_sql

FROM (SELECT q.userid, q.query, q.xid, date_trunc('s',q.starttime)

starttime, substring(querytxt,1,100) as copy_sql,

ROUND(datediff(ms,starttime,endtime)::numeric / 1000.0, 2) copy_sec

FROM stl_query q JOIN xids USING (xid)

WHERE (querytxt ilike 'copy %from%' OR querytxt ilike '% copy %from%')

AND querytxt not like 'COPY ANALYZE %') a

LEFT JOIN (SELECT xid,

ROUND(sum(datediff(ms,starttime,endtime))::numeric / 1000.0,2) complyze_sec

FROM stl_query q JOIN xids USING (xid)

WHERE (querytxt like 'COPY ANALYZE %'

OR querytxt like 'analyze compression phase %')

GROUP BY xid ) b ON a.xid = b.xid

WHERE b.complyze_sec IS NOT NULL ORDER BY a.copy_sql, a.starttime;

Implementation Tips

• Ensure that all tables of signiﬁcant size created during your ETL processes (for example, staging tables

and temporary tables) declare a compression encoding for all columns except the ﬁrst sort key.

• Estimate the expected lifetime size of the table being loaded for each of the COPY commands

identified by the preceding SQL command. If you are confident that the table will remain extremely

small, disable compression altogether with the COMPUPDATE OFF parameter. Otherwise, create the

table with explicit compression before loading it with the COPY command.

Split Amazon S3 Objects Loaded by COPY

The COPY command takes advantage of the massively parallel processing (MPP) architecture in

Amazon Redshift to read and load data from ﬁles on Amazon S3. The COPY command loads the data in

parallel from multiple ﬁles, dividing the workload among the nodes in your cluster. To achieve optimal

throughput, we strongly recommend that you divide your data into multiple ﬁles to take advantage of

parallel processing.

Analysis

The Advisor analysis identiﬁes COPY commands that load large datasets contained in a small number of

ﬁles staged in S3. Long-running COPY commands that load large datasets from a few ﬁles often have

an opportunity for considerable performance improvement. When Advisor identiﬁes that these COPY

commands are taking a signiﬁcant amount of time, it creates a recommendation to increase parallelism

by splitting the data into additional ﬁles in S3.

Recommendation

In this case, we recommend the following actions, listed in priority order:

1. Optimize COPY commands that load fewer ﬁles than the number of cluster nodes.

2. Optimize COPY commands that load fewer ﬁles than the number of cluster slices.

3. Optimize COPY commands where the number of ﬁles is not a multiple of the number of cluster slices.
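
The priority order above can be sketched as a simple classification. In the following Python, the function name and arguments are hypothetical. The node and slice counts would come from your cluster, for example from the STV_SLICES table used in the query later in this section.

```python
def copy_split_priority(num_files, num_nodes, num_slices):
    """Return 1, 2, or 3 for the priority levels listed above, else None.

    None means that the file count is already a multiple of the number
    of slices, so none of the three cases applies.
    """
    if num_files <= 0 or num_nodes <= 0 or num_slices <= 0:
        raise ValueError("counts must be positive")
    if num_files < num_nodes:
        return 1  # fewer files than cluster nodes
    if num_files < num_slices:
        return 2  # fewer files than cluster slices
    if num_files % num_slices != 0:
        return 3  # file count is not a multiple of the slice count
    return None
```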

Certain COPY commands load a signiﬁcant amount of data or run for a signiﬁcant duration. For these

commands, we recommend that you load a number of data objects from S3 that is equivalent to a

multiple of the number of slices in the cluster. To identify how many S3 objects each COPY command has

loaded, run the following SQL code as a superuser.

SELECT

query, COUNT(*) num_files,

ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,

ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,

SUBSTRING(querytxt,1,60) copy_sql

FROM stl_s3client s

JOIN stl_query q USING (query)

JOIN stl_wlm_query wq USING (query)

WHERE s.userid>1 AND http_method = 'GET'

AND POSITION('COPY ANALYZE' IN querytxt) = 0

AND aborted = 0 AND final_state='Completed'

GROUP BY query, querytxt

HAVING (SUM(transfer_size)/(1024*1024))/COUNT(*) >= 2

ORDER BY CASE

WHEN COUNT(*) < (SELECT max(node)+1 FROM stv_slices) THEN 1

WHEN COUNT(*) < (SELECT COUNT(*) FROM stv_slices WHERE node=0) THEN 2

ELSE 2+((COUNT(*) % (SELECT COUNT(*) FROM stv_slices))/(SELECT COUNT(*)::DECIMAL FROM

stv_slices))

END, (SUM(transfer_size)/(1024.0*1024.0))/COUNT(*) DESC;

Implementation Tips

• The number of slices in a node depends on the node size of the cluster. For more information about

the number of slices in the various node types, see Clusters and Nodes in Amazon Redshift in the

Amazon Redshift Cluster Management Guide.

• You can load multiple ﬁles by specifying a common preﬁx, or preﬁx key, for the set, or by explicitly

listing the ﬁles in a manifest ﬁle. For more information about loading ﬁles, see Splitting Your Data into

Multiple Files (p. 187).

• Amazon Redshift doesn't take ﬁle size into account when dividing the workload. Split your load data

ﬁles so that the ﬁles are about equal size, between 1 MB and 1 GB after compression. For optimum

parallelism, the ideal size is between 1 MB and 125 MB after compression.

Update Table Statistics

Amazon Redshift uses a cost-based query optimizer to choose the optimum execution plan for queries.

The cost estimates are based on table statistics gathered using the ANALYZE command. When statistics

are out of date or missing, the database might choose a less eﬃcient plan for query execution, especially

for complex queries. Maintaining current statistics helps complex queries run in the shortest possible

time.

Analysis

The Advisor analysis tracks tables whose statistics are out-of-date or missing. It reviews table access

metadata associated with complex queries. If tables that are frequently accessed with complex patterns

are missing statistics, Advisor creates a critical recommendation to run ANALYZE. If tables that are

frequently accessed with complex patterns have out-of-date statistics, Advisor creates a suggested

recommendation to run ANALYZE.

Recommendation

Whenever table content changes signiﬁcantly, update statistics with ANALYZE. We recommend running

ANALYZE whenever a signiﬁcant number of new data rows are loaded into an existing table with COPY

or INSERT commands. We also recommend running ANALYZE whenever a signiﬁcant number of rows are

modiﬁed using UPDATE or DELETE commands. To identify tables with missing or out-of-date statistics,

run the following SQL command as a superuser. The results are ordered from largest to smallest table.

SELECT

ti.schema||'.'||ti."table" tablename,

ti.size table_size_mb,

ti.stats_off statistics_accuracy

FROM svv_table_info ti

WHERE ti.stats_off > 5.00

ORDER BY ti.size DESC;

Implementation Tips

The default ANALYZE threshold is 10 percent. This default means that the ANALYZE command skips a

given table if fewer than 10 percent of the table's rows have changed since the last ANALYZE. As a result,

you might choose to issue ANALYZE commands at the end of each ETL process. Taking this approach

means that ANALYZE is often skipped but also ensures that ANALYZE runs when needed.
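The threshold rule described above can be sketched as a simple comparison. This is an illustration of the stated rule only, not Amazon Redshift's implementation. The function name `analyze_would_run` and the handling of an empty table are assumptions.

```python
def analyze_would_run(changed_rows, total_rows, threshold_percent=10.0):
    """Return True when the share of changed rows reaches the threshold.

    Mirrors the rule in the text: ANALYZE skips a table if fewer than
    `threshold_percent` percent of its rows changed since the last ANALYZE.
    """
    if total_rows <= 0:
        # Assumption for this sketch: treat an empty table as needing ANALYZE.
        return True
    return 100.0 * changed_rows / total_rows >= threshold_percent

print(analyze_would_run(50, 1000))   # 5 percent changed
print(analyze_would_run(100, 1000))  # 10 percent changed
```

Under this rule, issuing ANALYZE after every ETL run costs little when few rows changed, because the command is skipped.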

ANALYZE statistics have the most impact for columns that are used in joins (for example, JOIN tbl_a

ON col_b) or as predicates (for example, WHERE col_b = 'xyz'). By default, ANALYZE collects

statistics for all columns in the table speciﬁed. If needed, you can reduce the time required to run

ANALYZE by running ANALYZE only for the columns where it has the most impact. You can run the

following SQL command to identify columns used as predicates. You can also let Amazon Redshift

choose which columns to analyze by specifying ANALYZE PREDICATE COLUMNS.

WITH predicate_column_info as (

SELECT ns.nspname AS schema_name, c.relname AS table_name, a.attnum as col_num, a.attname

as col_name,

CASE

WHEN 10002 = s.stakind1 THEN array_to_string(stavalues1, '||')

WHEN 10002 = s.stakind2 THEN array_to_string(stavalues2, '||')

WHEN 10002 = s.stakind3 THEN array_to_string(stavalues3, '||')

WHEN 10002 = s.stakind4 THEN array_to_string(stavalues4, '||')

ELSE NULL::varchar

END AS pred_ts

FROM pg_statistic s

JOIN pg_class c ON c.oid = s.starelid

JOIN pg_namespace ns ON c.relnamespace = ns.oid

JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum)

SELECT schema_name, table_name, col_num, col_name,

pred_ts NOT LIKE '2000-01-01%' AS is_predicate,

CASE WHEN pred_ts NOT LIKE '2000-01-01%' THEN (split_part(pred_ts,

'||',1))::timestamp ELSE NULL::timestamp END as first_predicate_use,

CASE WHEN pred_ts NOT LIKE '%||2000-01-01%' THEN (split_part(pred_ts,

'||',2))::timestamp ELSE NULL::timestamp END as last_analyze

FROM predicate_column_info;

For more information, see Analyzing Tables (p. 223).

Enable Short Query Acceleration

Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running

queries. SQA executes short-running queries in a dedicated space, so that SQA queries aren't forced to

wait in queues behind longer queries. SQA only prioritizes queries that are short-running and are in a

user-deﬁned queue. With SQA, short-running queries begin running more quickly and users see results

sooner.

If you enable SQA, you can reduce or eliminate workload management (WLM) queues that are dedicated

to running short queries. In addition, long-running queries don't need to contend with short queries

for slots in a queue, so you can conﬁgure your WLM queues to use fewer query slots. When you use

lower concurrency, query throughput is increased and overall system performance is improved for most

workloads. For more information, see Short Query Acceleration (p. 291).

Analysis

Advisor checks for workload patterns and reports the number of recent queries where SQA would reduce

latency and the daily queue time for SQA-eligible queries.

Recommendation

Modify the WLM conﬁguration to enable SQA. Amazon Redshift uses a machine learning algorithm

to analyze each eligible query. Predictions improve as SQA learns from your query patterns. For more

information, see Conﬁguring Workload Management.

When you enable SQA, WLM sets the maximum run time for short queries to dynamic by default. We

recommend keeping the dynamic setting for SQA maximum run time.

Implementation Tips

To check whether SQA is enabled, run the following query. If the query returns a row, then SQA is

enabled.

select * from stv_wlm_service_class_config

where service_class = 14;

For more information, see Monitoring SQA (p. 292).

Replace Single-Column Interleaved Sort Keys

Some tables use an interleaved sort key on a single column. In general, such a table is less eﬃcient and

consumes more resources than a table that uses a compound sort key on a single column.

Interleaved sorting improves performance in certain cases where multiple columns are used by diﬀerent

queries for ﬁltering. Using an interleaved sort key on a single column is eﬀective only in a particular case.

That case is when queries often ﬁlter on CHAR or VARCHAR column values that have a long common

preﬁx in the ﬁrst 8 bytes. For example, URL strings are often preﬁxed with "https://". For single-

column keys, a compound sort is better than an interleaved sort for any other ﬁltering operations. A

compound sort speeds up joins, GROUP BY and ORDER BY operations, and window functions that use

PARTITION BY and ORDER BY on the sorted column. An interleaved sort doesn't beneﬁt any of those

operations. For more information, see Choosing Sort Keys (p. 140).
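To make the "long common prefix in the first 8 bytes" condition concrete, the following Python sketch counts how many distinct 8-byte prefixes a set of values has. It only illustrates that property of the data and does not model how Amazon Redshift builds or uses sort keys. The sample values and helper names are hypothetical.

```python
def first_8_bytes(value):
    """Return the first 8 bytes of the value's UTF-8 encoding."""
    return value.encode("utf-8")[:8]

def distinct_prefixes(values):
    """Count how many different 8-byte prefixes appear in `values`."""
    return len({first_8_bytes(v) for v in values})

# Hypothetical sample data.
urls = ["https://example.com/a", "https://example.org/b", "https://example.net/c"]
names = ["alice", "bob", "carol"]

print(distinct_prefixes(urls))   # every URL starts with "https://"
print(distinct_prefixes(names))
```

When the count is close to 1, the leading bytes carry almost no information for filtering, which is the particular case the text describes.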

Using compound sort signiﬁcantly reduces maintenance overhead. Tables with compound sort keys don't

need the expensive VACUUM REINDEX operations that are necessary for interleaved sorts. In practice,

compound sort keys are more eﬀective than interleaved sort keys for the vast majority of Amazon

Redshift workloads.

Analysis

Advisor tracks tables that use an interleaved sort key on a single column.

Recommendation

If a table uses interleaved sorting on a single column, recreate the table to use a compound sort key.

When you create new tables, use a compound sort key for single-column sorts. To ﬁnd interleaved tables

that use a single-column sort key, run the following command.

SELECT schema AS schemaname, "table" AS tablename

FROM svv_table_info

WHERE table_id IN (

SELECT attrelid

FROM pg_attribute

WHERE attrelid IN (

SELECT attrelid

FROM pg_attribute

WHERE attsortkeyord <> 0

GROUP BY attrelid

HAVING MAX(attsortkeyord) = -1

)

AND NOT (atttypid IN (1042, 1043) AND atttypmod > 12)

AND attsortkeyord = -1);

For additional information about choosing the best sort style, see the AWS Big Data Blog post Amazon

Redshift Engineering's Advanced Table Design Playbook: Compound and Interleaved Sort Keys.

Tutorial: Tuning Table Design

In this tutorial, you will learn how to optimize the design of your tables. You will start by creating

tables based on the Star Schema Benchmark (SSB) schema without sort keys, distribution styles, and

compression encodings. You will load the tables with test data and test system performance. Next, you

will apply best practices to recreate the tables using sort keys and distribution styles. You will load the

tables with test data using automatic compression and then you will test performance again so that you

can compare the performance beneﬁts of well-designed tables.

Estimated time: 60 minutes

Estimated cost: $1.00 per hour for the cluster

Prerequisites

You will need your AWS credentials (access key ID and secret access key) to load test data from Amazon

S3. If you need to create new access keys, go to Administering Access Keys for IAM Users.

Steps

• Step 1: Create a Test Data Set (p. 45)
• Step 2: Test System Performance to Establish a Baseline (p. 49)
• Step 3: Select Sort Keys (p. 52)
• Step 4: Select Distribution Styles (p. 53)
• Step 5: Review Compression Encodings (p. 57)
• Step 6: Recreate the Test Data Set (p. 59)
• Step 7: Retest System Performance After Tuning (p. 62)
• Step 8: Evaluate the Results (p. 66)
• Step 9: Clean Up Your Resources (p. 68)
• Summary (p. 68)

Step 1: Create a Test Data Set

Data warehouse databases commonly use a star schema design, in which a central fact table contains

the core data for the database and several dimension tables provide descriptive attribute information for

the fact table. The fact table joins each dimension table on a foreign key that matches the dimension's

primary key.

Star Schema Benchmark (SSB)

For this tutorial, you will use a set of ﬁve tables based on the Star Schema Benchmark (SSB) schema. The

following diagram shows the SSB data model.

To Create a Test Data Set

You will create a set of tables without sort keys, distribution styles, or compression encodings. Then you

will load the tables with data from the SSB data set.

1. (Optional) Launch a cluster.

If you already have a cluster that you want to use, you can skip this step. Your cluster should have at

least two nodes. For the exercises in this tutorial, you will use a four-node cluster.

To launch a dc1.large cluster with four nodes, follow the steps in Amazon Redshift Getting Started,

but select Multi Node for Cluster Type and set Number of Compute Nodes to 4.

Follow the steps to connect to your cluster from a SQL client and test a connection. You do not need

to complete the remaining steps to create tables, upload data, and try example queries.

2. Create the SSB test tables using minimum attributes.

Note

If the SSB tables already exist in the current database, you will need to drop the tables ﬁrst.

See Step 6: Recreate the Test Data Set (p. 59) for the DROP TABLE commands.

For the purposes of this tutorial, the ﬁrst time you create the tables, they will not have sort keys,

distribution styles, or compression encodings.

Execute the following CREATE TABLE commands.

CREATE TABLE part

(

p_partkey INTEGER NOT NULL,

p_name VARCHAR(22) NOT NULL,

p_mfgr VARCHAR(6) NOT NULL,

p_category VARCHAR(7) NOT NULL,

p_brand1 VARCHAR(9) NOT NULL,

p_color VARCHAR(11) NOT NULL,

p_type VARCHAR(25) NOT NULL,

p_size INTEGER NOT NULL,

p_container VARCHAR(10) NOT NULL

);

CREATE TABLE supplier

(

s_suppkey INTEGER NOT NULL,

s_name VARCHAR(25) NOT NULL,

s_address VARCHAR(25) NOT NULL,

s_city VARCHAR(10) NOT NULL,

s_nation VARCHAR(15) NOT NULL,

s_region VARCHAR(12) NOT NULL,

s_phone VARCHAR(15) NOT NULL

);

CREATE TABLE customer

(

c_custkey INTEGER NOT NULL,

c_name VARCHAR(25) NOT NULL,

c_address VARCHAR(25) NOT NULL,

c_city VARCHAR(10) NOT NULL,

c_nation VARCHAR(15) NOT NULL,

c_region VARCHAR(12) NOT NULL,

c_phone VARCHAR(15) NOT NULL,

c_mktsegment VARCHAR(10) NOT NULL

);

CREATE TABLE dwdate

(

d_datekey INTEGER NOT NULL,

d_date VARCHAR(19) NOT NULL,

d_dayofweek VARCHAR(10) NOT NULL,

d_month VARCHAR(10) NOT NULL,

d_year INTEGER NOT NULL,

d_yearmonthnum INTEGER NOT NULL,

d_yearmonth VARCHAR(8) NOT NULL,

d_daynuminweek INTEGER NOT NULL,

d_daynuminmonth INTEGER NOT NULL,

d_daynuminyear INTEGER NOT NULL,

d_monthnuminyear INTEGER NOT NULL,

d_weeknuminyear INTEGER NOT NULL,

d_sellingseason VARCHAR(13) NOT NULL,

d_lastdayinweekfl VARCHAR(1) NOT NULL,

d_lastdayinmonthfl VARCHAR(1) NOT NULL,

d_holidayfl VARCHAR(1) NOT NULL,

d_weekdayfl VARCHAR(1) NOT NULL

);

CREATE TABLE lineorder

(

lo_orderkey INTEGER NOT NULL,

lo_linenumber INTEGER NOT NULL,

lo_custkey INTEGER NOT NULL,

lo_partkey INTEGER NOT NULL,

lo_suppkey INTEGER NOT NULL,

lo_orderdate INTEGER NOT NULL,

lo_orderpriority VARCHAR(15) NOT NULL,

lo_shippriority VARCHAR(1) NOT NULL,

lo_quantity INTEGER NOT NULL,

lo_extendedprice INTEGER NOT NULL,

lo_ordertotalprice INTEGER NOT NULL,

lo_discount INTEGER NOT NULL,

lo_revenue INTEGER NOT NULL,

lo_supplycost INTEGER NOT NULL,

lo_tax INTEGER NOT NULL,

lo_commitdate INTEGER NOT NULL,

lo_shipmode VARCHAR(10) NOT NULL

);

3. Load the tables using SSB sample data.

The sample data for this tutorial is provided in an Amazon S3 bucket that gives read access to all

authenticated AWS users, so any valid AWS credentials that permit access to Amazon S3 will work.

a. Create a new text ﬁle named loadssb.sql containing the following SQL.

copy customer from 's3://awssampledbuswest2/ssbgz/customer'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy lineorder from 's3://awssampledbuswest2/ssbgz/lineorder'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy part from 's3://awssampledbuswest2/ssbgz/part'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip compupdate off region 'us-west-2';

b. Replace <Your-Access-Key-ID> and <Your-Secret-Access-Key> with your own AWS

account credentials. The segment of the credentials string that is enclosed in single quotes must

not contain any spaces or line breaks.

c. Execute the COPY commands either by running the SQL script or by copying and pasting the

commands into your SQL client.

Note

The load operation will take about 10 to 15 minutes for all ﬁve tables.

Your results should look similar to the following.

Load into table 'customer' completed, 3000000 record(s) loaded successfully.

0 row(s) affected.

copy executed successfully

Execution time: 10.28s

(Statement 1 of 5 finished)

...

Script execution finished

Total script execution time: 9m 51s

4. Sum the execution time for all ﬁve tables, or else note the total script execution time. You’ll record

that number as the load time in the benchmarks table in Step 2, following.

5. To verify that each table loaded correctly, execute the following commands.

select count(*) from LINEORDER;

select count(*) from PART;

select count(*) from CUSTOMER;

select count(*) from SUPPLIER;

select count(*) from DWDATE;

The following results table shows the number of rows for each SSB table.

Table Name | Rows
-----------+-------------
LINEORDER  | 600,037,902
PART       | 1,400,000
CUSTOMER   | 3,000,000
SUPPLIER   | 1,000,000
DWDATE     | 2,556

Next Step

Step 2: Test System Performance to Establish a Baseline (p. 49)

Step 2: Test System Performance to Establish a Baseline

As you test system performance before and after tuning your tables, you will record the following

details:

• Load time

• Storage use

• Query performance

The examples in this tutorial are based on using a four-node dc1.large cluster. Your results will be

diﬀerent, even if you use the same cluster conﬁguration. System performance is inﬂuenced by many

factors, and no two systems will perform exactly the same.

You will record your results using the following benchmarks table.

Benchmark                      | Before | After
-------------------------------+--------+-------
Load time (five tables)        |        |
Storage use (1 MB blocks)      |        |
  LINEORDER                    |        |
  PART                         |        |
  CUSTOMER                     |        |
  DWDATE                       |        |
  SUPPLIER                     |        |
  Total storage                |        |
Query execution time (seconds) |        |
  Query 1                      |        |
  Query 2                      |        |
  Query 3                      |        |
  Total execution time         |        |

To Test System Performance to Establish a Baseline

1. Note the cumulative load time for all ﬁve tables and enter it in the benchmarks table in the Before

column.

This is the value you noted in the previous step.

2. Record storage use.

Determine how many 1 MB blocks of disk space are used for each table by querying the

STV_BLOCKLIST table and record the results in your benchmarks table.

select stv_tbl_perm.name as table, count(*) as mb

from stv_blocklist, stv_tbl_perm

where stv_blocklist.tbl = stv_tbl_perm.id

and stv_blocklist.slice = stv_tbl_perm.slice

and stv_tbl_perm.name in ('lineorder','part','customer','dwdate','supplier')

group by stv_tbl_perm.name

order by 1 asc;

Your results should look similar to this:

table | mb

----------+------

customer | 384

dwdate | 160

lineorder | 51024

part | 200

supplier | 152

3. Test query performance.

The ﬁrst time you run a query, Amazon Redshift compiles the code, and then sends compiled code

to the compute nodes. When you compare the execution times for queries, you should not use the

results for the ﬁrst time you execute the query. Instead, compare the times for the second execution

of each query. For more information, see Factors Aﬀecting Query Performance (p. 266).

Note

To reduce query execution time and improve system performance, Amazon Redshift caches

the results of certain types of queries in memory on the leader node. When result caching is

enabled, subsequent queries run much faster, which invalidates performance comparisons.

To disable result caching for the current session, set the enable_result_cache_for_session (p. 949)

parameter to off, as shown following.

set enable_result_cache_for_session to off;

Run the following queries twice to eliminate compile time. Record the second time for each query in

the benchmarks table.

-- Query 1

-- Restrictions on only one dimension.

select sum(lo_extendedprice*lo_discount) as revenue

from lineorder, dwdate

where lo_orderdate = d_datekey

and d_year = 1997

and lo_discount between 1 and 3

and lo_quantity < 24;

-- Query 2

-- Restrictions on two dimensions

select sum(lo_revenue), d_year, p_brand1

from lineorder, dwdate, part, supplier

where lo_orderdate = d_datekey

and lo_partkey = p_partkey

and lo_suppkey = s_suppkey

and p_category = 'MFGR#12'

and s_region = 'AMERICA'

group by d_year, p_brand1

order by d_year, p_brand1;

-- Query 3

-- Drill down in time to just one month

select c_city, s_city, d_year, sum(lo_revenue) as revenue

from customer, lineorder, supplier, dwdate

where lo_custkey = c_custkey

and lo_suppkey = s_suppkey

and lo_orderdate = d_datekey

and (c_city='UNITED KI1' or

c_city='UNITED KI5')

and (s_city='UNITED KI1' or

s_city='UNITED KI5')

and d_yearmonth = 'Dec1997'

group by c_city, s_city, d_year

order by d_year asc, revenue desc;

Your results for the second time will look something like this:

SELECT executed successfully

Execution time: 6.97s

(Statement 1 of 3 finished)

SELECT executed successfully

Execution time: 12.81s

(Statement 2 of 3 finished)

SELECT executed successfully

Execution time: 13.39s

(Statement 3 of 3 finished)

Script execution finished

Total script execution time: 33.17s

The following benchmarks table shows the example results for the cluster used in this tutorial.

Benchmark                      | Before  | After
-------------------------------+---------+-------
Load time (five tables)        | 10m 23s |
Storage use (1 MB blocks)      |         |
  LINEORDER                    | 51024   |
  PART                         | 200     |
  CUSTOMER                     | 384     |
  DWDATE                       | 160     |
  SUPPLIER                     | 152     |
  Total storage                | 51920   |
Query execution time (seconds) |         |
  Query 1                      | 6.97    |
  Query 2                      | 12.81   |
  Query 3                      | 13.39   |
  Total execution time         | 33.17   |
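The two total rows in the benchmarks table are plain sums of the rows above them. If you keep your measurements in a script rather than on paper, a minimal Python sketch such as the following can compute them. The dictionary layout and the `totals` helper are assumptions for illustration. The numbers are the example values from this tutorial, so yours will differ.

```python
# Example "Before" values from the tutorial cluster (1 MB blocks and seconds).
storage_mb = {"LINEORDER": 51024, "PART": 200, "CUSTOMER": 384,
              "DWDATE": 160, "SUPPLIER": 152}
query_seconds = {"Query 1": 6.97, "Query 2": 12.81, "Query 3": 13.39}

def totals(storage, timings):
    """Return (total storage blocks, total query seconds rounded to 0.01)."""
    return sum(storage.values()), round(sum(timings.values()), 2)

total_storage, total_time = totals(storage_mb, query_seconds)
print(total_storage, total_time)
```

Recording the same two dictionaries again after tuning gives a direct before-and-after comparison.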

Next Step

Step 3: Select Sort Keys (p. 52)

Step 3: Select Sort Keys

When you create a table, you can specify one or more columns as the sort key. Amazon Redshift stores

your data on disk in sorted order according to the sort key. How your data is sorted has an important

eﬀect on disk I/O, columnar compression, and query performance.
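One way to see why sort order affects disk I/O is a simplified model in which each block records the minimum and maximum value it holds, so a range filter can skip any block whose range does not overlap the filter. The Python sketch below is that toy model only. The block size, the data, and the `blocks_scanned` helper are assumptions, and it does not describe Amazon Redshift's storage internals.

```python
def blocks_scanned(values, block_size, low, high):
    """Count blocks whose [min, max] range overlaps the filter low..high."""
    scanned = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        if min(block) <= high and max(block) >= low:
            scanned += 1
    return scanned

# The same 1,000 values stored in sorted order and in a scattered order.
sorted_values = list(range(1000))
scattered_values = [v for r in range(100) for v in range(r, 1000, 100)]

# Range filter: 250 <= value <= 349, with 100 values per block.
print(blocks_scanned(sorted_values, 100, 250, 349))
print(blocks_scanned(scattered_values, 100, 250, 349))
```

In this model the sorted layout reads 2 of 10 blocks for the filter, while the scattered layout has to read all 10.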

In this step, you choose sort keys for the SSB tables based on these best practices:

• If recent data is queried most frequently, specify the timestamp column as the leading column for the

sort key.

• If you do frequent range ﬁltering or equality ﬁltering on one column, specify that column as the sort

key.

• If you frequently join a (dimension) table, specify the join column as the sort key.

To Select Sort Keys

1. Evaluate your queries to ﬁnd timestamp columns that are used to ﬁlter the results.

For example, LINEORDER frequently uses equality ﬁlters using lo_orderdate.

where lo_orderdate = d_datekey and d_year = 1997

2. Look for columns that are used in range ﬁlters and equality ﬁlters. For example, LINEORDER also

uses lo_orderdate for range ﬁltering.

where lo_orderdate = d_datekey and d_year >= 1992 and d_year <= 1997

3. Based on the ﬁrst two best practices, lo_orderdate is a good choice for sort key.

In the tuning table, specify lo_orderdate as the sort key for LINEORDER.

4. The remaining tables are dimensions, so, based on the third best practice, specify their primary keys

as sort keys.

The following tuning table shows the chosen sort keys. You ﬁll in the Distribution Style column in Step 4:

Select Distribution Styles (p. 53).

Table name | Sort Key     | Distribution Style
-----------+--------------+--------------------
LINEORDER  | lo_orderdate |
PART       | p_partkey    |
CUSTOMER   | c_custkey    |
SUPPLIER   | s_suppkey    |
DWDATE     | d_datekey    |

Next Step

Step 4: Select Distribution Styles (p. 53)

Step 4: Select Distribution Styles

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the node

slices according to the table's distribution style. The number of slices per node depends on the node

size of the cluster. For example, the dc1.large cluster that you are using in this tutorial has four nodes

with two slices each, so the cluster has a total of eight slices. The nodes all participate in parallel query

execution, working on data that is distributed across the slices.

When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed

to perform any joins and aggregations. Redistribution might involve either sending speciﬁc rows to

nodes for joining or broadcasting an entire table to all of the nodes.

You should assign distribution styles to achieve these goals.

• Collocate the rows from joining tables

When the rows for joining columns are on the same slices, less data needs to be moved during query

execution.

• Distribute data evenly among the slices in a cluster.

If data is distributed evenly, workload can be allocated evenly to all the slices.

These goals may conﬂict in some cases, and you will need to evaluate which strategy is the best choice

for overall system performance. For example, even distribution might place all matching values for a

column on the same slice. If a query uses an equality ﬁlter on that column, the slice with those values

will carry a disproportionate share of the workload. If tables are collocated based on a distribution key,

the rows might be distributed unevenly to the slices because the keys are distributed unevenly through

the table.

In this step, you evaluate the distribution of the SSB tables with respect to the goals of data distribution,

and then select the optimum distribution styles for the tables.

Distribution Styles

When you create a table, you designate one of three distribution styles: KEY, ALL, or EVEN.

KEY distribution

The rows are distributed according to the values in one column. The leader node will attempt to place

matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader

node collocates the rows on the slices according to the values in the joining columns so that matching

values from the common columns are physically stored together.

ALL distribution

A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place

only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for

every join that the table participates in.

EVEN distribution

The rows are distributed across the slices in a round-robin fashion, regardless of the values in any

particular column. EVEN distribution is appropriate when a table does not participate in joins or when

there is not a clear choice between KEY distribution and ALL distribution. EVEN distribution is the default

distribution style.

For more information, see Distribution Styles (p. 130).
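The three styles can be compared with a small simulation. The Python sketch below is a toy model: it uses integer modulo as a stand-in for the real hash function, the sample rows are made up, and all function names are hypothetical. It shows that distributing two tables on their joining key places matching rows on the same slice, while round-robin placement generally does not.

```python
from collections import defaultdict

def distribute_even(rows, n_slices):
    """EVEN: round-robin placement, ignoring column values."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % n_slices].append(row)
    return slices

def distribute_key(rows, key, n_slices):
    """KEY: place each row by a stand-in hash of its distribution key."""
    slices = defaultdict(list)
    for row in rows:
        slices[row[key] % n_slices].append(row)  # toy hash: integer modulo
    return slices

def distribute_all(rows, n_nodes):
    """ALL: copy the whole table to every node."""
    return {node: list(rows) for node in range(n_nodes)}

def collocated(fact_slices, dim_slices, fact_key, dim_key):
    """True when every fact row finds its matching dimension row on its own slice."""
    for s, rows in fact_slices.items():
        local_keys = {d[dim_key] for d in dim_slices.get(s, [])}
        if any(f[fact_key] not in local_keys for f in rows):
            return False
    return True

N_SLICES = 4 * 2  # four nodes with two slices each, as in this tutorial
part = [{"p_partkey": k} for k in range(1, 41)]
lineorder = [{"lo_orderkey": i, "lo_partkey": (i * 7) % 40 + 1} for i in range(200)]

by_key = collocated(distribute_key(lineorder, "lo_partkey", N_SLICES),
                    distribute_key(part, "p_partkey", N_SLICES),
                    "lo_partkey", "p_partkey")
by_even = collocated(distribute_even(lineorder, N_SLICES),
                     distribute_even(part, N_SLICES),
                     "lo_partkey", "p_partkey")
print(by_key, by_even)
```

With ALL distribution the question does not arise, because each node holds a full copy of the table, at the cost of storing it once per node.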

To Select Distribution Styles

When you execute a query, the query optimizer redistributes the rows to the compute nodes as needed

to perform any joins and aggregations. By locating the data where it needs to be before the query is

executed, you can minimize the impact of the redistribution step.

The ﬁrst goal is to distribute the data so that the matching rows from joining tables are collocated, which

means that the matching rows from joining tables are located on the same node slice.

1. To look for redistribution steps in the query plan, execute an EXPLAIN command followed by the

query. This example uses Query 2 from our set of test queries.

explain

select sum(lo_revenue), d_year, p_brand1

from lineorder, dwdate, part, supplier

where lo_orderdate = d_datekey

and lo_partkey = p_partkey

and lo_suppkey = s_suppkey

and p_category = 'MFGR#12'

and s_region = 'AMERICA'

group by d_year, p_brand1

order by d_year, p_brand1;

The following shows a portion of the query plan. Look for labels that begin with DS_BCAST or

DS_DIST.

QUERY PLAN
XN Merge (cost=1038007224737.84..1038007224738.54 rows=280 width=20)
  Merge Key: dwdate.d_year, part.p_brand1
  -> XN Network (cost=1038007224737.84..1038007224738.54 rows=280 width=20)
       Send to leader
       -> XN Sort (cost=1038007224737.84..1038007224738.54 rows=280 width=20)
            Sort Key: dwdate.d_year, part.p_brand1
            -> XN HashAggregate (cost=38007224725.76..38007224726.46 rows=280
                 -> XN Hash Join DS_BCAST_INNER (cost=30674.95..38007188507.46
                      Hash Cond: ("outer".lo_orderdate = "inner".d_datekey)
                      -> XN Hash Join DS_BCAST_INNER (cost=30643.00..37598119820.65
                           Hash Cond: ("outer".lo_suppkey = "inner".s_suppkey)
                           -> XN Hash Join DS_BCAST_INNER
                                Hash Cond: ("outer".lo_partkey = "inner".p_partkey)
                                -> XN Seq Scan on lineorder
                                -> XN Hash (cost=17500.00..17500.00 rows=56000
                                     -> XN Seq Scan on part (cost=0.00..17500.00
                                          Filter: ((p_category)::text =
                           -> XN Hash (cost=12500.00..12500.00 rows=201200
                                -> XN Seq Scan on supplier (cost=0.00..12500.00
                                     Filter: ((s_region)::text = 'AMERICA'::text)
                      -> XN Hash (cost=25.56..25.56 rows=2556 width=8)
                           -> XN Seq Scan on dwdate (cost=0.00..25.56 rows=2556

DS_BCAST_INNER indicates that the inner join table was broadcast to every slice. A DS_DIST_BOTH

label, if present, would indicate that both the outer join table and the inner join table were

redistributed across the slices. Broadcasting and redistribution can be expensive steps in terms of

query performance. You want to select distribution strategies that reduce or eliminate broadcast

and distribution steps. For more information about evaluating the EXPLAIN plan, see Evaluating

Query Patterns (p. 132).

2. Distribute the fact table and one dimension table on their common columns.

The following diagram shows the relationships between the fact table, LINEORDER, and the

dimension tables in the SSB schema.

Each table can have only one distribution key, which means that only one pair of tables in the

schema can be collocated on their common columns. The central fact table is the clear ﬁrst choice.

For the second table in the pair, choose the largest dimension that commonly joins the fact table. In

this design, LINEORDER is the fact table, and PART is the largest dimension. PART joins LINEORDER

on its primary key, p_partkey.

Designate lo_partkey as the distribution key for LINEORDER and p_partkey as the distribution

key for PART so that the matching values for the joining keys will be collocated on the same slices

when the data is loaded.

3. Change some dimension tables to use ALL distribution.

If a dimension table cannot be collocated with the fact table or other important joining tables,

you can often improve query performance signiﬁcantly by distributing the entire table to all of the

nodes. ALL distribution guarantees that the joining rows will be collocated on every slice. You should

weigh all factors before choosing ALL distribution. Using ALL distribution multiplies storage space

requirements and increases load times and maintenance operations.

CUSTOMER, SUPPLIER, and DWDATE also join the LINEORDER table on their primary keys; however,

LINEORDER will be collocated with PART, so you will set the remaining tables to use DISTSTYLE ALL.

Because the tables are relatively small and are not updated frequently, using ALL distribution will

have minimal impact on storage and load times.

4. Use EVEN distribution for the remaining tables.

All of the tables have been assigned either a distribution key or ALL distribution, so you won't assign

EVEN to any tables. After evaluating your performance results, you might decide to change some

tables from ALL to EVEN distribution.

The following tuning table shows the chosen distribution styles.

Table name | Sort Key     | Distribution Style
-----------+--------------+--------------------
LINEORDER  | lo_orderdate | lo_partkey
PART       | p_partkey    | p_partkey
CUSTOMER   | c_custkey    | ALL
SUPPLIER   | s_suppkey    | ALL
DWDATE     | d_datekey    | ALL

You can ﬁnd the steps for setting the distribution style in Step 6: Recreate the Test Data Set (p. 59).

For more information, see Choose the Best Distribution Style (p. 27).
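Step 1 of the preceding procedure looks for DS_BCAST and DS_DIST labels in EXPLAIN output. When a plan is long, a small script can list them for you. The following Python sketch is a convenience only: the function name `find_ds_labels` and the regular expression are assumptions, and the sample text is an excerpt of the plan shown earlier. It reports every matching label and leaves the interpretation of each one to you.

```python
import re

# Matches labels such as DS_BCAST_INNER or DS_DIST_BOTH.
DS_LABEL = re.compile(r"\bDS_(?:BCAST|DIST)_[A-Z_]+\b")

def find_ds_labels(plan_text):
    """Return the DS_BCAST_* and DS_DIST_* labels in plan order."""
    return DS_LABEL.findall(plan_text)

sample_plan = """\
-> XN Hash Join DS_BCAST_INNER (cost=30674.95..38007188507.46
   Hash Cond: ("outer".lo_orderdate = "inner".d_datekey)
   -> XN Hash Join DS_BCAST_INNER (cost=30643.00..37598119820.65
      -> XN Seq Scan on lineorder
"""

print(find_ds_labels(sample_plan))
```

Running the same check on the plan after you recreate the tables shows whether the broadcast steps were reduced.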

Next Step

Step 5: Review Compression Encodings (p. 57)

Step 5: Review Compression Encodings

Compression is a column-level operation that reduces the size of data when it is stored. Compression

conserves storage space and reduces the size of data that is read from storage, which reduces the

amount of disk I/O and therefore improves query performance.

By default, Amazon Redshift stores data in its raw, uncompressed format. When you create tables in an

Amazon Redshift database, you can deﬁne a compression type, or encoding, for the columns. For more

information, see Compression Encodings (p. 119).

You can apply compression encodings to columns in tables manually when you create the tables, or you

can use the COPY command to analyze the load data and apply compression encodings automatically.
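To give a feel for how an encoding can shrink a column, the following Python sketch implements a toy byte-dictionary scheme: each distinct value is stored once, and every row keeps only a one-byte index into that dictionary. This is a simplified illustration under stated assumptions, not Amazon Redshift's on-disk format. The function names and the sample ship modes are made up for the example.

```python
def bytedict_encode(values):
    """Toy byte-dictionary encoding: one dictionary plus a 1-byte index per value."""
    dictionary = []
    index_of = {}
    encoded = bytearray()
    for v in values:
        if v not in index_of:
            if len(dictionary) == 256:
                raise ValueError("more than 256 distinct values; one byte is not enough")
            index_of[v] = len(dictionary)
            dictionary.append(v)
        encoded.append(index_of[v])
    return dictionary, bytes(encoded)

def bytedict_decode(dictionary, encoded):
    """Rebuild the original list of values from the dictionary and indexes."""
    return [dictionary[i] for i in encoded]

# Hypothetical low-cardinality column with many repeated values.
ship_modes = ["AIR", "TRUCK", "RAIL", "AIR", "SHIP", "TRUCK"] * 1000
dictionary, encoded = bytedict_encode(ship_modes)

raw_bytes = sum(len(v) for v in ship_modes)
packed_bytes = len(encoded) + sum(len(v) for v in dictionary)
print(raw_bytes, packed_bytes)
```

A scheme like this pays off only when a column has few distinct values, which is why the best encoding depends on the data in each column.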

To Review Compression Encodings

1. Find how much space each column uses.

Query the STV_BLOCKLIST system table to find the number of 1 MB blocks each column uses. The

MAX aggregate function returns the highest block number for each column. This example uses col

< 17 in the WHERE clause to exclude system-generated columns.

Execute the following command.

select col, max(blocknum)

from stv_blocklist b, stv_tbl_perm p

where (b.tbl=p.id) and name ='lineorder'

and col < 17

group by name, col

order by col;

Your results will look similar to the following.

col | max

----+-----

0 | 572

1 | 572

2 | 572

3 | 572

4 | 572

5 | 572

6 | 1659

7 | 715

8 | 572

9 | 572

10 | 572

11 | 572

12 | 572

13 | 572

14 | 572

15 | 572

16 | 1185

(17 rows)

2. Experiment with the diﬀerent encoding methods.

In this step, you create a table with identical columns, except that each column uses a different

compression encoding. Then you insert a large number of rows, using data from the lo_shipmode

column in the LINEORDER table, so that every column has the same data. Finally, you examine the

table to compare the effects of the different encodings on column sizes.

a. Create a table with the encodings that you want to compare.

create table encodingshipmode (

moderaw varchar(22) encode raw,

modebytedict varchar(22) encode bytedict,

modelzo varchar(22) encode lzo,

moderunlength varchar(22) encode runlength,

modetext255 varchar(22) encode text255,

modetext32k varchar(22) encode text32k);

b. Insert the same data into all of the columns using an INSERT statement with a SELECT clause. The command takes a couple of minutes to execute.

insert into encodingshipmode

select lo_shipmode as moderaw, lo_shipmode as modebytedict, lo_shipmode as modelzo,

lo_shipmode as moderunlength, lo_shipmode as modetext255,

lo_shipmode as modetext32k

from lineorder where lo_orderkey < 200000000;

c. Query the STV_BLOCKLIST system table to compare the number of 1 MB disk blocks used by

each column.

select col, max(blocknum)

from stv_blocklist b, stv_tbl_perm p

where (b.tbl=p.id) and name = 'encodingshipmode'

and col < 6

group by name, col

order by col;

The query returns results similar to the following. Depending on how your cluster is configured, your results will be different, but the relative sizes should be similar.

col | max

----+-----

0 | 221

1 | 26

2 | 61

3 | 192

4 | 54

5 | 105

(6 rows)

The columns show the results for the following encodings:

• Raw

• Bytedict

• LZO

• Runlength

• Text255

• Text32K

You can see that Bytedict encoding on the second column produced the best results for this data set, with a compression ratio of better than 8:1 (26 blocks compared with 221 for raw encoding). Different data sets will produce different results.
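To make the comparison concrete, you can work out the ratios from the block numbers in the sample output. The following Python sketch is illustrative only: the dictionary restates the example results shown earlier in this step, and the function name is an assumption made for this example. Your own block counts will differ.

```python
# Block numbers (1 MB blocks) copied from the sample output in this step,
# keyed by the encoding used for each column of encodingshipmode.
blocks = {
    "raw": 221,
    "bytedict": 26,
    "lzo": 61,
    "runlength": 192,
    "text255": 54,
    "text32k": 105,
}


def compression_ratios(block_counts, baseline="raw"):
    """Return {encoding: baseline_blocks / encoded_blocks} for each encoding."""
    base = block_counts[baseline]
    return {name: base / count for name, count in block_counts.items()}


ratios = compression_ratios(blocks)
best = max(ratios, key=ratios.get)  # encoding with the highest ratio

# Print the encodings from most to least effective.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10} {ratio:.1f}:1")
```

Dividing the raw block count by each encoded block count gives a ratio that is easy to compare across encodings; a ratio of 1.0 means no saving.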

3. Use the ANALYZE COMPRESSION command to view the suggested encodings for an existing table.

Execute the following command.

analyze compression lineorder;

Your results should look similar to the following.

Table     | Column             | Encoding
----------+--------------------+-----------
lineorder | lo_orderkey        | delta
lineorder | lo_linenumber      | delta
lineorder | lo_custkey         | raw
lineorder | lo_partkey         | raw
lineorder | lo_suppkey         | raw
lineorder | lo_orderdate       | delta32k
lineorder | lo_orderpriority   | bytedict
lineorder | lo_shippriority    | runlength
lineorder | lo_quantity        | delta
lineorder | lo_extendedprice   | lzo
lineorder | lo_ordertotalprice | lzo
lineorder | lo_discount        | delta
lineorder | lo_revenue         | lzo
lineorder | lo_supplycost      | delta32k
lineorder | lo_tax             | delta
lineorder | lo_commitdate      | delta32k
lineorder | lo_shipmode        | bytedict

Notice that ANALYZE COMPRESSION chose BYTEDICT encoding for the lo_shipmode column.

For an example that walks through choosing manually applied compression encodings, see Example:

Choosing Compression Encodings for the CUSTOMER Table (p. 127).

4. Apply automatic compression to the SSB tables.

By default, the COPY command automatically applies compression encodings when you load data

into an empty table that has no compression encodings other than RAW encoding. For this tutorial,

you will let the COPY command automatically select and apply optimal encodings for the tables as

part of the next step, Recreate the test data set.

For more information, see Loading Tables with Automatic Compression (p. 209).

Next Step

Step 6: Recreate the Test Data Set (p. 59)

Step 6: Recreate the Test Data Set

Now that you have chosen the sort keys and distribution styles for each of the tables, you can create the

tables using those attributes and reload the data. You will allow the COPY command to analyze the load

data and apply compression encodings automatically.

To Recreate the Test Data Set

1. You need to drop the SSB tables before you run the CREATE TABLE commands.

Execute the following commands.

drop table part cascade;

drop table supplier cascade;

drop table customer cascade;

drop table dwdate cascade;

drop table lineorder cascade;

2. Create the tables with sort keys and distribution styles.

Execute the following set of SQL CREATE TABLE commands.

CREATE TABLE part (

p_partkey integer not null sortkey distkey,

p_name varchar(22) not null,

p_mfgr varchar(6) not null,

p_category varchar(7) not null,

p_brand1 varchar(9) not null,

p_color varchar(11) not null,

p_type varchar(25) not null,

p_size integer not null,

p_container varchar(10) not null

);

CREATE TABLE supplier (

s_suppkey integer not null sortkey,

s_name varchar(25) not null,

s_address varchar(25) not null,

s_city varchar(10) not null,

s_nation varchar(15) not null,

s_region varchar(12) not null,

s_phone varchar(15) not null)

diststyle all;

CREATE TABLE customer (

c_custkey integer not null sortkey,

c_name varchar(25) not null,

c_address varchar(25) not null,

c_city varchar(10) not null,

c_nation varchar(15) not null,

c_region varchar(12) not null,

c_phone varchar(15) not null,

c_mktsegment varchar(10) not null)

diststyle all;

CREATE TABLE dwdate (

d_datekey integer not null sortkey,

d_date varchar(19) not null,

d_dayofweek varchar(10) not null,

d_month varchar(10) not null,

d_year integer not null,

d_yearmonthnum integer not null,

d_yearmonth varchar(8) not null,

d_daynuminweek integer not null,

d_daynuminmonth integer not null,

d_daynuminyear integer not null,

d_monthnuminyear integer not null,

d_weeknuminyear integer not null,

d_sellingseason varchar(13) not null,

d_lastdayinweekfl varchar(1) not null,

d_lastdayinmonthfl varchar(1) not null,

d_holidayfl varchar(1) not null,

d_weekdayfl varchar(1) not null)

diststyle all;

CREATE TABLE lineorder (

lo_orderkey integer not null,

lo_linenumber integer not null,

lo_custkey integer not null,

lo_partkey integer not null distkey,

lo_suppkey integer not null,

lo_orderdate integer not null sortkey,

lo_orderpriority varchar(15) not null,

lo_shippriority varchar(1) not null,

lo_quantity integer not null,

lo_extendedprice integer not null,

lo_ordertotalprice integer not null,

lo_discount integer not null,

lo_revenue integer not null,

lo_supplycost integer not null,

lo_tax integer not null,

lo_commitdate integer not null,

lo_shipmode varchar(10) not null

);

3. Load the tables using the same sample data.

a. Open the loadssb.sql script that you created in the first step.

b. Delete compupdate off from each COPY statement. This time, you will allow COPY to apply

compression encodings.

For reference, the edited script should look like the following:

copy customer from 's3://awssampledbuswest2/ssbgz/customer'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip region 'us-west-2';

copy dwdate from 's3://awssampledbuswest2/ssbgz/dwdate'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip region 'us-west-2';

copy lineorder from 's3://awssampledbuswest2/ssbgz/lineorder'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip region 'us-west-2';

copy part from 's3://awssampledbuswest2/ssbgz/part'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip region 'us-west-2';

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip region 'us-west-2';

c. Save the file.

d. Execute the COPY commands either by running the SQL script or by copying and pasting the

commands into your SQL client.

Note

The load operation will take about 10 to 15 minutes. This might be a good time to get another cup of tea or feed the fish.

Your results should look similar to the following.

Warnings:

Load into table 'customer' completed, 3000000 record(s) loaded successfully.

...

Script execution finished

Total script execution time: 12m 15s

e. Record the load time in the benchmarks table.

Benchmark               | Before  | After
------------------------+---------+---------
Load time (five tables) | 10m 23s | 12m 15s
Storage use (1 MB blocks)
LINEORDER               | 51024   |
PART                    | 200     |
CUSTOMER                | 384     |
DWDATE                  | 160     |
SUPPLIER                | 152     |
Total storage           | 51920   |
Query execution time
Query 1                 | 6.97    |
Query 2                 | 12.81   |
Query 3                 | 13.39   |
Total execution time    | 33.17   |

Next Step

Step 7: Retest System Performance After Tuning (p. 62)

Step 7: Retest System Performance After Tuning

After recreating the test data set with the selected sort keys, distribution styles, and compression encodings, you will retest the system performance.

To Retest System Performance After Tuning

1. Record storage use.

Determine how many 1 MB blocks of disk space are used for each table by querying the

STV_BLOCKLIST table and record the results in your benchmarks table.

select stv_tbl_perm.name as "table", count(*) as "blocks (mb)"

from stv_blocklist, stv_tbl_perm

where stv_blocklist.tbl = stv_tbl_perm.id

and stv_blocklist.slice = stv_tbl_perm.slice

and stv_tbl_perm.name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder')

group by stv_tbl_perm.name

order by 1 asc;

Your results will look similar to this:

table     | blocks (mb)
----------+-------------
customer  | 604
dwdate    | 160
lineorder | 27152
part      | 200
supplier  | 236

2. Check for distribution skew.

Uneven distribution, or data distribution skew, forces some nodes to do more work than others,

which limits query performance.

To check for distribution skew, query the SVV_DISKUSAGE system view. Each row in SVV_DISKUSAGE

records the statistics for one disk block. The num_values column gives the number of rows in that

disk block, so sum(num_values) returns the number of rows on each slice.

Execute the following query to see the distribution for all of the tables in the SSB database.

select trim(name) as table, slice, sum(num_values) as rows, min(minvalue),

max(maxvalue)

from svv_diskusage

where name in ('customer', 'part', 'supplier', 'dwdate', 'lineorder')

and col =0

group by name, slice

order by name, slice;

Your results will look something like this:

table | slice | rows | min | max

-----------+-------+----------+----------+-----------

customer | 0 | 3000000 | 1 | 3000000

customer | 2 | 3000000 | 1 | 3000000

customer | 4 | 3000000 | 1 | 3000000

customer | 6 | 3000000 | 1 | 3000000

dwdate | 0 | 2556 | 19920101 | 19981230

dwdate | 2 | 2556 | 19920101 | 19981230

dwdate | 4 | 2556 | 19920101 | 19981230

dwdate | 6 | 2556 | 19920101 | 19981230

lineorder | 0 | 75029991 | 3 | 599999975

lineorder | 1 | 75059242 | 7 | 600000000

lineorder | 2 | 75238172 | 1 | 599999975

lineorder | 3 | 75065416 | 1 | 599999973

lineorder | 4 | 74801845 | 3 | 599999975

lineorder | 5 | 75177053 | 1 | 599999975

lineorder | 6 | 74631775 | 1 | 600000000

lineorder | 7 | 75034408 | 1 | 599999974

part | 0 | 175006 | 15 | 1399997

part | 1 | 175199 | 1 | 1399999

part | 2 | 175441 | 4 | 1399989

part | 3 | 175000 | 3 | 1399995

part | 4 | 175018 | 5 | 1399979

part | 5 | 175091 | 11 | 1400000

part | 6 | 174253 | 2 | 1399969

part | 7 | 174992 | 13 | 1399996

supplier | 0 | 1000000 | 1 | 1000000

supplier | 2 | 1000000 | 1 | 1000000

supplier | 4 | 1000000 | 1 | 1000000

supplier | 6 | 1000000 | 1 | 1000000

(28 rows)

The following chart illustrates the distribution of the three largest tables. (The columns are not to

scale.) Notice that because CUSTOMER uses ALL distribution, it was distributed to only one slice per

node.

The distribution is relatively even, so you don't need to adjust for distribution skew.
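One simple way to put a number on "relatively even" is to compare the fullest slice with the emptiest one. The following Python sketch does this for LINEORDER, using the row counts from the sample output in this step. The function name and the choice of a max/min ratio are assumptions made for this illustration, not a measure defined by Amazon Redshift.

```python
# Rows per slice for LINEORDER, copied from the sample SVV_DISKUSAGE output.
lineorder_rows = [
    75029991, 75059242, 75238172, 75065416,
    74801845, 75177053, 74631775, 75034408,
]


def skew_ratio(rows_per_slice):
    """Ratio of the fullest slice to the emptiest slice (1.0 is perfectly even)."""
    return max(rows_per_slice) / min(rows_per_slice)


ratio = skew_ratio(lineorder_rows)
print(f"LINEORDER skew ratio: {ratio:.4f}")
```

For the sample data the ratio stays below 1.01, which means the fullest slice holds less than 1 percent more rows than the emptiest slice.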

3. Run an EXPLAIN command with each query to view the query plans.

The following example shows the EXPLAIN command with Query 2.

explain

select sum(lo_revenue), d_year, p_brand1

from lineorder, dwdate, part, supplier

where lo_orderdate = d_datekey

and lo_partkey = p_partkey

and lo_suppkey = s_suppkey

and p_category = 'MFGR#12'

and s_region = 'AMERICA'

group by d_year, p_brand1

order by d_year, p_brand1;

In the EXPLAIN plan for Query 2, notice that the DS_BCAST_INNER labels have been replaced by

DS_DIST_ALL_NONE and DS_DIST_NONE, which means that no redistribution was required for those

steps, and the query should run much more quickly.

QUERY PLAN
XN Merge (cost=1000014243538.45..1000014243539.15 rows=280 width=20)
  Merge Key: dwdate.d_year, part.p_brand1
  -> XN Network (cost=1000014243538.45..1000014243539.15 rows=280 width=20)
       Send to leader
       -> XN Sort (cost=1000014243538.45..1000014243539.15 rows=280 width=20)
            Sort Key: dwdate.d_year, part.p_brand1
            -> XN HashAggregate (cost=14243526.37..14243527.07 rows=280 width=20)
                 -> XN Hash Join DS_DIST_ALL_NONE (cost=30643.30..14211277.03 rows=4299912
                      Hash Cond: ("outer".lo_orderdate = "inner".d_datekey)
                      -> XN Hash Join DS_DIST_ALL_NONE (cost=30611.35..14114497.06
                           Hash Cond: ("outer".lo_suppkey = "inner".s_suppkey)
                           -> XN Hash Join DS_DIST_NONE (cost=17640.00..13758507.64
                                Hash Cond: ("outer".lo_partkey = "inner".p_partkey)
                                -> XN Seq Scan on lineorder (cost=0.00..6000378.88
                                -> XN Hash (cost=17500.00..17500.00 rows=56000 width=16)
                                     -> XN Seq Scan on part (cost=0.00..17500.00
                                          Filter: ((p_category)::text = 'MFGR#12'::text)
                           -> XN Hash (cost=12500.00..12500.00 rows=188541 width=4)
                                -> XN Seq Scan on supplier (cost=0.00..12500.00
                                     Filter: ((s_region)::text = 'AMERICA'::text)
                      -> XN Hash (cost=25.56..25.56 rows=2556 width=8)
                           -> XN Seq Scan on dwdate (cost=0.00..25.56 rows=2556 width=8)
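If you save EXPLAIN output to a file, you can scan it for join labels instead of reading each plan by eye. The following Python sketch is one possible approach. The function names are hypothetical, and the set of labels treated as "redistributing" is an assumption based on the DS_ labels that Amazon Redshift query plans use; only DS_BCAST_INNER, DS_DIST_ALL_NONE, and DS_DIST_NONE appear in this section.

```python
import re

# Join labels assumed to mean rows are moved between nodes at query time.
REDISTRIBUTING = {
    "DS_DIST_INNER", "DS_DIST_OUTER", "DS_DIST_ALL_INNER",
    "DS_DIST_BOTH", "DS_BCAST_INNER",
}


def join_labels(plan_text):
    """Return the DS_* labels found in an EXPLAIN plan, in order of appearance."""
    return re.findall(r"\bDS_[A-Z_]+\b", plan_text)


def needs_redistribution(plan_text):
    """True if any join step in the plan redistributes or broadcasts rows."""
    return any(label in REDISTRIBUTING for label in join_labels(plan_text))


# Abbreviated join lines from the plan for Query 2.
plan = """
XN Hash Join DS_DIST_ALL_NONE (cost=30643.30..14211277.03
XN Hash Join DS_DIST_ALL_NONE (cost=30611.35..14114497.06
XN Hash Join DS_DIST_NONE (cost=17640.00..13758507.64
"""

print(join_labels(plan))
print(needs_redistribution(plan))
```

For the tuned tables, none of the three join steps carries a redistributing label, which matches the observation that no redistribution was required.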

4. Run the same test queries again.

If you reconnected to the database since your first set of tests, disable result caching for this session.

To disable result caching for the current session, set the enable_result_cache_for_session (p. 949)

parameter to off, as shown following.

set enable_result_cache_for_session to off;

As you did earlier, run the following queries twice to eliminate compile time. Record the second time

for each query in the benchmarks table.

-- Query 1

-- Restrictions on only one dimension.

select sum(lo_extendedprice*lo_discount) as revenue

from lineorder, dwdate

where lo_orderdate = d_datekey

and d_year = 1997

and lo_discount between 1 and 3

and lo_quantity < 24;

-- Query 2

-- Restrictions on two dimensions

select sum(lo_revenue), d_year, p_brand1

from lineorder, dwdate, part, supplier

where lo_orderdate = d_datekey

and lo_partkey = p_partkey

and lo_suppkey = s_suppkey

and p_category = 'MFGR#12'

and s_region = 'AMERICA'

group by d_year, p_brand1

order by d_year, p_brand1;

-- Query 3

-- Drill down in time to just one month

select c_city, s_city, d_year, sum(lo_revenue) as revenue

from customer, lineorder, supplier, dwdate

where lo_custkey = c_custkey

and lo_suppkey = s_suppkey

and lo_orderdate = d_datekey

and (c_city='UNITED KI1' or

c_city='UNITED KI5')

and (s_city='UNITED KI1' or

s_city='UNITED KI5')

and d_yearmonth = 'Dec1997'

group by c_city, s_city, d_year

order by d_year asc, revenue desc;

The following benchmarks table shows the results based on the cluster used in this example. Your results

will vary based on a number of factors, but the relative results should be similar.

Benchmark               | Before  | After
------------------------+---------+---------
Load time (five tables) | 10m 23s | 12m 15s
Storage use (1 MB blocks)
LINEORDER               | 51024   | 27152
PART                    | 200     | 200
CUSTOMER                | 384     | 604
DWDATE                  | 160     | 160
SUPPLIER                | 152     | 236
Total storage           | 51920   | 28352
Query execution time
Query 1                 | 6.97    | 3.19
Query 2                 | 12.81   | 9.02
Query 3                 | 13.39   | 10.54
Total execution time    | 33.17   | 22.75

Next Step

Step 8: Evaluate the Results (p. 66)

Step 8: Evaluate the Results

You tested load times, storage requirements, and query execution times before and after tuning the

tables, and recorded the results.

The following table shows the example results for the cluster that was used for this tutorial. Your results will be different, but should show similar improvements.

Benchmark                        | Before | After | Change | %
---------------------------------+--------+-------+--------+--------
Load time (five tables), seconds | 623    | 732   | 109    | 17.5%
Storage use (1 MB blocks)
LINEORDER                        | 51024  | 27152 | -23872 | -46.8%
PART                             | 200    | 200   | 0      | 0%
CUSTOMER                         | 384    | 604   | 220    | 57.3%
DWDATE                           | 160    | 160   | 0      | 0%
SUPPLIER                         | 152    | 236   | 84     | 55.3%
Total storage                    | 51920  | 28352 | -23568 | -45.4%
Query execution time
Query 1                          | 6.97   | 3.19  | -3.78  | -54.2%
Query 2                          | 12.81  | 9.02  | -3.79  | -29.6%
Query 3                          | 13.39  | 10.54 | -2.85  | -21.3%
Total execution time             | 33.17  | 22.75 | -10.42 | -31.4%
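When you fill in the table with your own measurements, each percentage is the change divided by the Before value. The following Python sketch reproduces three of the percentages from the example table. The function name and dictionary labels are assumptions made for this illustration; the numbers are the Before and After values shown in the table.

```python
# (before, after) pairs copied from the example results table.
results = {
    "Load time": (623, 732),
    "Total storage": (51920, 28352),
    "Total execution time": (33.17, 22.75),
}


def percent_change(before, after):
    """Relative change from before to after, as a percentage with one decimal."""
    return round((after - before) / before * 100, 1)


changes = {name: percent_change(b, a) for name, (b, a) in results.items()}

for name, pct in changes.items():
    print(f"{name:<22} {pct:+.1f}%")
```

A positive result means the value went up after tuning (as with load time); a negative result means it went down (as with storage and query execution time).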

Load time

Load time increased by 17.5%.

Sorting, compression, and distribution increase load time. In particular, in this case, you used automatic

compression, which increases the load time for empty tables that don't already have compression

encodings. Subsequent loads to the same tables would be faster. You also increased load time by using

ALL distribution. You could reduce load time by using EVEN or DISTKEY distribution instead for some of

the tables, but that decision needs to be weighed against query performance.

Storage requirements

Storage requirements were reduced by 45.4%.

Some of the storage improvement from using columnar compression was offset by using ALL distribution on some of the tables. Again, you could improve storage use by using EVEN or DISTKEY distribution instead for some of the tables, but that decision needs to be weighed against query performance.

Distribution

You verified that there is no distribution skew as a result of your distribution choices.

By checking the EXPLAIN plan, you saw that data redistribution was eliminated for the test queries.

Query execution time

Total query execution time was reduced by 31.4%.

The improvement in query performance was due to a combination of optimizing sort keys, distribution styles, and compression. Often, query performance can be improved even further by rewriting queries and configuring workload management (WLM). For more information, see Tuning Query Performance (p. 257).

Next Step

Step 9: Clean Up Your Resources (p. 68)

Step 9: Clean Up Your Resources

Your cluster continues to accrue charges as long as it is running. When you have completed this tutorial,

you should return your environment to the previous state by following the steps in Step 5: Revoke Access

and Delete Your Sample Cluster in the Amazon Redshift Getting Started.

If you want to keep the cluster, but recover the storage used by the SSB tables, execute the following

commands.

drop table part cascade;

drop table supplier cascade;

drop table customer cascade;

drop table dwdate cascade;

drop table lineorder cascade;

Next Step

Summary (p. 68)

Summary

In this tutorial, you learned how to optimize the design of your tables by applying table design best

practices.

You chose sort keys for the SSB tables based on these best practices:

• If recent data is queried most frequently, specify the timestamp column as the leading column for the

sort key.

• If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.

• If you frequently join a (dimension) table, specify the join column as the sort key.

You applied the following best practices to improve the distribution of the tables.

• Distribute the fact table and one dimension table on their common columns.

• Change some dimension tables to use ALL distribution.

You evaluated the effects of compression on a table and determined that using automatic compression usually produces the best results.

For more information, see the following links:

• Amazon Redshift Best Practices for Designing Tables (p. 26)

• Choose the Best Sort Key (p. 27)

• Choosing a Data Distribution Style (p. 129)

• Choosing a Column Compression Type (p. 118)

• Analyzing Table Design (p. 146)

Next Step

For your next step, if you haven't done so already, we recommend taking Tutorial: Loading Data from

Amazon S3 (p. 70).

Tutorial: Loading Data from Amazon S3

In this tutorial, you will walk through the process of loading data into your Amazon Redshift database tables from data files in an Amazon Simple Storage Service (Amazon S3) bucket from beginning to end.

In this tutorial, you will:

• Download data files that use CSV, character-delimited, and fixed width formats.

• Create an Amazon S3 bucket and then upload the data files to the bucket.

• Launch an Amazon Redshift cluster and create database tables.

• Use COPY commands to load the tables from the data files on Amazon S3.

• Troubleshoot load errors and modify your COPY commands to correct the errors.

Estimated time: 60 minutes

Estimated cost: $1.00 per hour for the cluster

Prerequisites

You will need the following prerequisites:

• An AWS account to launch an Amazon Redshift cluster and to create a bucket in Amazon S3.

• Your AWS credentials (an access key ID and secret access key) to load test data from Amazon S3. If you

need to create new access keys, go to Administering Access Keys for IAM Users.

This tutorial is designed so that it can be taken by itself. In addition to this tutorial, we recommend

completing the following tutorials to gain a more complete understanding of how to design and use

Amazon Redshift databases:

• Amazon Redshift Getting Started walks you through the process of creating an Amazon Redshift cluster and loading sample data.

• Tutorial: Tuning Table Design (p. 45) walks you step by step through the process of designing and tuning tables, including choosing sort keys, distribution styles, and compression encodings, and evaluating system performance before and after tuning.

Overview

You can add data to your Amazon Redshift tables either by using an INSERT command or by using a COPY command. At the scale and speed of an Amazon Redshift data warehouse, the COPY command is many times faster and more efficient than INSERT commands.

The COPY command uses the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from multiple data sources. You can load from data files on Amazon S3, Amazon EMR, or any remote host accessible through a Secure Shell (SSH) connection, or you can load directly from an Amazon DynamoDB table.

In this tutorial, you will use the COPY command to load data from Amazon S3. Many of the principles

presented here apply to loading from other data sources as well.

To learn more about using the COPY command, see these resources:

• Amazon Redshift Best Practices for Loading Data (p. 29)

• Loading Data from Amazon EMR (p. 196)

• Loading Data from Remote Hosts (p. 200)

• Loading Data from an Amazon DynamoDB Table (p. 206)

Steps

• Step 1: Launch a Cluster (p. 71)

• Step 2: Download the Data Files (p. 72)

• Step 3: Upload the Files to an Amazon S3 Bucket (p. 72)

• Step 4: Create the Sample Tables (p. 74)

• Step 5: Run the COPY Commands (p. 76)

• Step 6: Vacuum and Analyze the Database (p. 87)

• Step 7: Clean Up Your Resources (p. 88)

Step 1: Launch a Cluster

If you already have a cluster that you want to use, you can skip this step.

For the exercises in this tutorial, you will use a four-node cluster. Follow the steps in Amazon Redshift

Getting Started, but select Multi Node for Cluster Type and set Number of Compute Nodes to 4.

Follow the Getting Started steps to connect to your cluster from a SQL client and test a connection.

You do not need to complete the remaining Getting Started steps to create tables, upload data, and try

example queries.

Next Step

Step 2: Download the Data Files (p. 72)

Step 2: Download the Data Files

In this step, you will download a set of sample data files to your computer. In the next step, you will upload the files to an Amazon S3 bucket.

To download the data files

1. Download the zipped file from the following link: LoadingDataSampleFiles.zip

2. Extract the files to a folder on your computer.

3. Verify that your folder contains the following files.

customer-fw-manifest

customer-fw.tbl-000

customer-fw.tbl-000.bak

customer-fw.tbl-001

customer-fw.tbl-002

customer-fw.tbl-003

customer-fw.tbl-004

customer-fw.tbl-005

customer-fw.tbl-006

customer-fw.tbl-007

customer-fw.tbl.log

dwdate-tab.tbl-000

dwdate-tab.tbl-001

dwdate-tab.tbl-002

dwdate-tab.tbl-003

dwdate-tab.tbl-004

dwdate-tab.tbl-005

dwdate-tab.tbl-006

dwdate-tab.tbl-007

part-csv.tbl-000

part-csv.tbl-001

part-csv.tbl-002

part-csv.tbl-003

part-csv.tbl-004

part-csv.tbl-005

part-csv.tbl-006

part-csv.tbl-007

Next Step

Step 3: Upload the Files to an Amazon S3 Bucket (p. 72)

Step 3: Upload the Files to an Amazon S3 Bucket

In this step, you create an Amazon S3 bucket and upload the data files to the bucket.

To upload the files to an Amazon S3 bucket

1. Create a bucket in Amazon S3.

a. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

b. Click Create Bucket.

c. In the Bucket Name box of the Create a Bucket dialog box, type a bucket name.

The bucket name you choose must be unique among all existing bucket names in Amazon S3. One way to help ensure uniqueness is to prefix your bucket names with the name of your organization. Bucket names must comply with certain rules. For more information, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.

d. Select a region.

Create the bucket in the same region as your cluster. If your cluster is in the Oregon region, click

Oregon.

e. Click Create.

When Amazon S3 successfully creates your bucket, the console displays your empty bucket in

the Buckets panel.

2. Create a folder.

a. Click the name of the new bucket.

b. Click the Actions button, and click Create Folder in the drop-down list.

c. Name the new folder load.

Note

The bucket that you created is not in a sandbox. In this exercise, you will add objects

to a real bucket, and you will be charged a nominal amount for the time that you store

the objects in the bucket. For more information about Amazon S3 pricing, go to the

Amazon S3 Pricing page.

3. Upload the data ﬁles to the new Amazon S3 bucket.

a. Click the name of the data folder (load).

b. In the Upload - Select Files wizard, click Add Files.

A file selection dialog box opens.

c. Select all of the files you downloaded and extracted, and then click Open.

d. Click Start Upload.

User Credentials

The Amazon Redshift COPY command must have access to read the file objects in the Amazon S3 bucket. If you use the same user credentials to create the Amazon S3 bucket and to run the Amazon Redshift COPY command, the COPY command will have all necessary permissions. If you want to use different user credentials, you can grant access by using the Amazon S3 access controls. The Amazon Redshift COPY command requires at least ListBucket and GetObject permissions to access the file objects in the Amazon S3 bucket. For more information about controlling access to Amazon S3 resources, go to Managing Access Permissions to Your Amazon S3 Resources.
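If you grant access to a different user, one common way to express the two permissions is an IAM policy document. The following Python sketch builds such a document. Treat it as an illustration under stated assumptions rather than the policy this tutorial requires: the function name and the bucket name are hypothetical, and your organization may prefer bucket policies or narrower resource paths.

```python
import json


def copy_read_policy(bucket):
    """Build a policy document granting the two permissions COPY needs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # GetObject applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "example-org-redshift-load" is a hypothetical bucket name.
policy = copy_read_policy("example-org-redshift-load")
print(json.dumps(policy, indent=2))
```

Note that the two actions use different resource forms: the list permission names the bucket, and the read permission names the objects under it.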

Next Step

Step 4: Create the Sample Tables (p. 74)

Step 4: Create the Sample Tables

For this tutorial, you will use a set of five tables based on the Star Schema Benchmark (SSB) schema. The following diagram shows the SSB data model.

If the SSB tables already exist in the current database, you will need to drop the tables to remove them from the database before you create them using the CREATE TABLE commands in the next step. The tables used in this tutorial might have different attributes than the existing tables.

To create the sample tables

1. To drop the SSB tables, execute the following commands.

drop table part cascade;

drop table supplier;

drop table customer;

drop table dwdate;

drop table lineorder;

2. Execute the following CREATE TABLE commands.

CREATE TABLE part

(

p_partkey INTEGER NOT NULL,

p_name VARCHAR(22) NOT NULL,

p_mfgr VARCHAR(6),

p_category VARCHAR(7) NOT NULL,

p_brand1 VARCHAR(9) NOT NULL,

p_color VARCHAR(11) NOT NULL,

p_type VARCHAR(25) NOT NULL,

p_size INTEGER NOT NULL,

p_container VARCHAR(10) NOT NULL

);

CREATE TABLE supplier

(

s_suppkey INTEGER NOT NULL,

s_name VARCHAR(25) NOT NULL,

s_address VARCHAR(25) NOT NULL,

s_city VARCHAR(10) NOT NULL,

s_nation VARCHAR(15) NOT NULL,

s_region VARCHAR(12) NOT NULL,

s_phone VARCHAR(15) NOT NULL

);

CREATE TABLE customer

(

c_custkey INTEGER NOT NULL,

c_name VARCHAR(25) NOT NULL,

c_address VARCHAR(25) NOT NULL,

c_city VARCHAR(10) NOT NULL,

c_nation VARCHAR(15) NOT NULL,

c_region VARCHAR(12) NOT NULL,

c_phone VARCHAR(15) NOT NULL,

c_mktsegment VARCHAR(10) NOT NULL

);

CREATE TABLE dwdate

(

d_datekey INTEGER NOT NULL,

d_date VARCHAR(19) NOT NULL,

d_dayofweek VARCHAR(10) NOT NULL,

d_month VARCHAR(10) NOT NULL,

d_year INTEGER NOT NULL,

d_yearmonthnum INTEGER NOT NULL,

d_yearmonth VARCHAR(8) NOT NULL,

d_daynuminweek INTEGER NOT NULL,

d_daynuminmonth INTEGER NOT NULL,

d_daynuminyear INTEGER NOT NULL,

d_monthnuminyear INTEGER NOT NULL,

d_weeknuminyear INTEGER NOT NULL,

d_sellingseason VARCHAR(13) NOT NULL,

d_lastdayinweekfl VARCHAR(1) NOT NULL,

d_lastdayinmonthfl VARCHAR(1) NOT NULL,

d_holidayfl VARCHAR(1) NOT NULL,

d_weekdayfl VARCHAR(1) NOT NULL

);

CREATE TABLE lineorder

(

lo_orderkey INTEGER NOT NULL,

lo_linenumber INTEGER NOT NULL,

lo_custkey INTEGER NOT NULL,

lo_partkey INTEGER NOT NULL,

lo_suppkey INTEGER NOT NULL,

lo_orderdate INTEGER NOT NULL,

lo_orderpriority VARCHAR(15) NOT NULL,

lo_shippriority VARCHAR(1) NOT NULL,

lo_quantity INTEGER NOT NULL,

lo_extendedprice INTEGER NOT NULL,

lo_ordertotalprice INTEGER NOT NULL,

lo_discount INTEGER NOT NULL,

lo_revenue INTEGER NOT NULL,

lo_supplycost INTEGER NOT NULL,

lo_tax INTEGER NOT NULL,

lo_commitdate INTEGER NOT NULL,

lo_shipmode VARCHAR(10) NOT NULL

);

Next Step

Step 5: Run the COPY Commands (p. 76)

Step 5: Run the COPY Commands

You will run COPY commands to load each of the tables in the SSB schema. The COPY command examples demonstrate loading from different file formats, using several COPY command options, and troubleshooting load errors.

Topics

• COPY Command Syntax (p. 76)

• Loading the SSB Tables (p. 77)

COPY Command Syntax

The basic COPY (p. 390) command syntax is as follows.

COPY table_name [ column_list ] FROM data_source CREDENTIALS access_credentials [options]

To execute a COPY command, you provide the following values.

Table name

The target table for the COPY command. The table must already exist in the database. The table can be

temporary or persistent. The COPY command appends the new input data to any existing rows in the

table.

Column list

By default, COPY loads fields from the source data to the table columns in order. You can optionally specify a column list, which is a comma-separated list of column names, to map data fields to specific columns. You will not use column lists in this tutorial. For more information, see Column List (p. 407) in the COPY command reference.

Data source

You can use the COPY command to load data from an Amazon S3 bucket, an Amazon EMR cluster, a

remote host using an SSH connection, or an Amazon DynamoDB table. For this tutorial, you will load

from data ﬁles in an Amazon S3 bucket. When loading from Amazon S3, you must provide the name of

the bucket and the location of the data ﬁles, by providing either an object path for the data ﬁles or the

location of a manifest ﬁle that explicitly lists each data ﬁle and its location.

• Key prefix

An object stored in Amazon S3 is uniquely identified by an object key, which includes the bucket name, folder names, if any, and the object name. A key prefix refers to a set of objects with the same prefix. The object path is a key prefix that the COPY command uses to load all objects that share the key prefix. For example, the key prefix custdata.txt can refer to a single file or to a set of files, including custdata.txt.001, custdata.txt.002, and so on.

• Manifest file

If you need to load files with different prefixes, for example, from multiple buckets or folders, or if you need to exclude files that share a prefix, you can use a manifest file. A manifest file explicitly lists each load file and its unique object key. You will use a manifest file to load the CUSTOMER table later in this tutorial.
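The key prefix behavior described above can be sketched in a few lines of Python. This is only an illustration of prefix matching, not how COPY is implemented; the function name and the folder listing are hypothetical.

```python
def keys_matching_prefix(object_keys, key_prefix):
    """Return the object keys that start with key_prefix, in sorted order."""
    return sorted(k for k in object_keys if k.startswith(key_prefix))


# Hypothetical folder listing, for illustration only.
listing = [
    "load/custdata.txt.001",
    "load/custdata.txt.002",
    "load/custdata.txt.bak",
    "load/orders.txt",
]

# Every key that shares the prefix is selected, including the .bak file.
matched = keys_matching_prefix(listing, "load/custdata.txt")
print(matched)
```

Because the prefix also selects the .bak object, a manifest file is the safer choice whenever a folder might contain files you do not want to load.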

Credentials

To access the AWS resources that contain the data to load, you must provide AWS access credentials (that is, an access key ID and a secret access key) for an AWS user or an IAM user with sufficient privileges. To load data from Amazon S3, the credentials must include ListBucket and GetObject permissions. Additional credentials are required if your data is encrypted or if you are using temporary access credentials. For more information, see Authorization Parameters (p. 404) in the COPY command reference. For more information about managing access, go to Managing Access Permissions to Your Amazon S3 Resources. If you do not have an access key ID and secret access key, you will need to get them. For more information, go to Administering Access Keys for IAM Users.

Options

You can use a number of parameters with the COPY command to specify file formats, manage data formats, manage errors, and control other features. In this tutorial, you will use the following COPY command options and features:

•Key Prefix (p. 78)

•CSV Format (p. 78)

•NULL AS (p. 79)

•REGION (p. 80)

•Fixed-Width Format (p. 81)

•MAXERROR (p. 82)

•ACCEPTINVCHARS (p. 83)

•MANIFEST (p. 84)

•DATEFORMAT (p. 85)

•GZIP, LZOP and BZIP2 (p. 85)

•COMPUPDATE (p. 85)

•Multiple Files (p. 86)

Loading the SSB Tables

You will use the following COPY commands to load each of the tables in the SSB schema. The command for each table demonstrates different COPY options and troubleshooting techniques.

To load the SSB tables, follow these steps:

1. Replace the Bucket Name and AWS Credentials (p. 77)

2. Load the PART Table Using NULL AS (p. 78)

3. Load the SUPPLIER table Using REGION (p. 80)

4. Load the CUSTOMER Table Using MANIFEST (p. 81)

5. Load the DWDATE Table Using DATEFORMAT (p. 85)

6. Load the LINEORDER Table Using Multiple Files (p. 85)

Replace the Bucket Name and AWS Credentials

The COPY commands in this tutorial are presented in the following format.

copy table from 's3://<your-bucket-name>/load/key_prefix'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
options;

For each COPY command, do the following:

1. Replace <your-bucket-name> with the name of a bucket in the same region as your cluster.

This step assumes the bucket and the cluster are in the same region. Alternatively, you can specify the

region using the REGION (p. 397) option with the COPY command.

2. Replace <Your-Access-Key-ID> and <Your-Secret-Access-Key> with your own AWS IAM

account credentials. The segment of the credentials string that is enclosed in single quotation marks

must not contain any spaces or line breaks.
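If you generate the COPY statements from a script, a small helper can enforce the rule that the credentials string contains no spaces or line breaks. The following is a minimal sketch; the build_copy function and its parameters are hypothetical and are not part of any AWS library.

```python
def build_copy(table, bucket, key_prefix, access_key_id, secret_access_key, options=""):
    """Assemble a COPY statement in the format used by this tutorial."""
    credentials = (
        f"aws_access_key_id={access_key_id};"
        f"aws_secret_access_key={secret_access_key}"
    )
    # The quoted credentials segment must not contain spaces or line breaks.
    if any(ch.isspace() for ch in credentials):
        raise ValueError("credentials string must not contain spaces or line breaks")
    statement = (
        f"copy {table} from 's3://{bucket}/load/{key_prefix}'\n"
        f"credentials '{credentials}'"
    )
    if options:
        statement += "\n" + options
    return statement + ";"


sql = build_copy(
    "part", "mybucket", "part-csv.tbl",
    "<Your-Access-Key-ID>", "<Your-Secret-Access-Key>", "csv",
)
print(sql)
```

The helper only formats text. It does not validate the table name or escape its inputs, so it is suitable for trusted values only.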

Load the PART Table Using NULL AS

In this step, you will use the CSV and NULL AS options to load the PART table.

The COPY command can load data from multiple files in parallel, which is much faster than loading from a single file. To demonstrate this principle, the data for each table in this tutorial is split into eight files, even though the files are very small. In a later step, you will compare the time difference between loading from a single file and loading from multiple files. For more information, see Split Your Load Data into Multiple Files (p. 30).

Key Prefix

You can load from multiple files by specifying a key prefix for the file set, or by explicitly listing the files in a manifest file. In this step, you will use a key prefix. In a later step, you will use a manifest file. The key prefix 's3://mybucket/load/part-csv.tbl' loads the following set of files in the load folder.

part-csv.tbl-000

part-csv.tbl-001

part-csv.tbl-002

part-csv.tbl-003

part-csv.tbl-004

part-csv.tbl-005

part-csv.tbl-006

part-csv.tbl-007

CSV Format

CSV, which stands for comma-separated values, is a common format used for importing and exporting spreadsheet data. CSV is more flexible than comma-delimited format because it enables you to include quoted strings within fields. The default quote character for COPY from CSV format is a double quotation mark ( " ), but you can specify another quote character by using the QUOTE AS option. When you use the quote character within the field, escape the character with an additional quote character.

The following excerpt from a CSV-formatted data file for the PART table shows strings enclosed in double quotation marks ("LARGE ANODIZED BRASS") and a string enclosed in two double quotation marks within a quoted string ("MEDIUM ""BURNISHED"" TIN").

15,dark sky,MFGR#3,MFGR#47,MFGR#3438,indigo,"LARGE ANODIZED BRASS",45,LG CASE

22,floral beige,MFGR#4,MFGR#44,MFGR#4421,medium,"PROMO, POLISHED BRASS",19,LG DRUM

23,bisque slate,MFGR#4,MFGR#41,MFGR#4137,firebrick,"MEDIUM ""BURNISHED"" TIN",42,JUMBO JAR
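To see how the quoting rules apply to the excerpt, you can parse it with a general-purpose CSV reader. The sketch below uses Python's standard csv module, which follows the same conventions (double quotation mark as the quote character, doubled quotes as the escape). It illustrates the format only and says nothing about how COPY parses files internally.

```python
import csv
import io

excerpt = '''15,dark sky,MFGR#3,MFGR#47,MFGR#3438,indigo,"LARGE ANODIZED BRASS",45,LG CASE
22,floral beige,MFGR#4,MFGR#44,MFGR#4421,medium,"PROMO, POLISHED BRASS",19,LG DRUM
23,bisque slate,MFGR#4,MFGR#41,MFGR#4137,firebrick,"MEDIUM ""BURNISHED"" TIN",42,JUMBO JAR
'''

rows = list(csv.reader(io.StringIO(excerpt)))
for row in rows:
    # Field 7 (index 6) is the quoted field in each row.
    print(len(row), row[6])
```

Each row yields nine fields. The comma inside "PROMO, POLISHED BRASS" stays within one field, and the doubled quotation marks collapse to single ones.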

The data for the PART table contains characters that will cause COPY to fail. In this exercise, you will

troubleshoot the errors and correct them.

To load data that is in CSV format, add csv to your COPY command. Execute the following command to

load the PART table.

copy part from 's3://<your-bucket-name>/load/part-csv.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
csv;

You should get an error message similar to the following.

An error occurred when executing the SQL command:

copy part from 's3://mybucket/load/part-csv.tbl'

credentials' ...

ERROR: Load into table 'part' failed. Check 'stl_load_errors' system table for details.

[SQL State=XX000]

Execution time: 1.46s

1 statement(s) failed.

To get more information about the error, query the STL_LOAD_ERRORS table. The following query uses

the SUBSTRING function to shorten columns for readability and uses LIMIT 10 to reduce the number of

rows returned. You can adjust the values in substring(filename,22,25) to allow for the length of

your bucket name.

select query, substring(filename,22,25) as filename, line_number as line,
substring(colname,0,12) as column, type, position as pos,
substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as reason
from stl_load_errors
order by query desc
limit 10;

 query  | filename         | line | column | type | pos |
--------+------------------+------+--------+------+-----+
 333765 | part-csv.tbl-000 |    1 |        |      |   0 |

 line_text        | field_text | reason
------------------+------------+----------------------------------------------
 15,NUL next,     |            | Missing newline: Unexpected character 0x2c f

NULL AS

The part-csv.tbl data files use the NUL terminator character (\x000 or \x0) to indicate NULL values.

Note

Despite very similar spelling, NUL and NULL are not the same. NUL is a UTF-8 character with

codepoint x000 that is often used to indicate end of record (EOR). NULL is a SQL value that

represents an absence of data.

By default, COPY treats a NUL terminator character as an EOR character and terminates the record, which often produces unexpected results or an error. Because there is no single standard method of indicating NULL in text data, the NULL AS COPY command option enables you to specify which character to substitute with NULL when loading the table. In this example, you want COPY to treat the NUL terminator character as a NULL value.

Note

The table column that receives the NULL value must be configured as nullable. That is, it must not include the NOT NULL constraint in the CREATE TABLE specification.

To load PART using the NULL AS option, execute the following COPY command.

copy part from 's3://<your-bucket-name>/load/part-csv.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
csv
null as '\000';

To verify that COPY loaded NULL values, execute the following command to select only the rows that

contain NULL.

select p_partkey, p_name, p_mfgr, p_category from part where p_mfgr is null;

p_partkey | p_name | p_mfgr | p_category

-----------+----------+--------+------------

15 | NUL next | | MFGR#47

81 | NUL next | | MFGR#23

133 | NUL next | | MFGR#44

(3 rows)
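The substitution that NULL AS performs can be pictured as a mapping from a marker string to an absent value. The following Python sketch assumes the record has already been split into fields; the apply_null_as function and the sample record are hypothetical.

```python
def apply_null_as(fields, null_marker="\x00"):
    """Map any field equal to null_marker to None (SQL NULL); keep the rest."""
    return [None if field == null_marker else field for field in fields]


# Hypothetical record modeled on the query result above: p_mfgr holds a NUL character.
raw_fields = ["15", "NUL next", "\x00", "MFGR#47"]
print(apply_null_as(raw_fields))
```

In Python, "\x00" and the octal escape "\000" denote the same single character, which corresponds to the '\000' argument in the COPY command.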

Load the SUPPLIER table Using REGION

In this step you will use the DELIMITER and REGION options to load the SUPPLIER table.

Note

The files for loading the SUPPLIER table are provided in an AWS sample bucket. You don't need to upload files for this step.

Character-Delimited Format

The fields in a character-delimited file are separated by a specific character, such as a pipe character ( | ), a comma ( , ) or a tab ( \t ). Character-delimited files can use any single ASCII character, including one of the nonprinting ASCII characters, as the delimiter. You specify the delimiter character by using the DELIMITER option. The default delimiter is a pipe character ( | ).

The following excerpt shows data in pipe-delimited format. The rows shown follow the 17-column LINEORDER layout rather than the SUPPLIER layout; the SUPPLIER data files use the same delimiter.

1|1|257368|465569|41365|19950218|2-HIGH|0|17|2608718|9783671|4|2504369|92072|2|19950331|TRUCK
1|2|257368|201928|8146|19950218|2-HIGH|0|36|6587676|9783671|9|5994785|109794|6|19950416|MAIL
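As a quick illustration of character-delimited data, the following Python sketch splits the first row of the excerpt on the pipe character. The split_delimited function is a hypothetical helper. It does a plain split and does not handle escaped delimiters.

```python
def split_delimited(line, delimiter="|"):
    """Split one record into fields on a single-character delimiter."""
    return line.rstrip("\n").split(delimiter)


line = "1|1|257368|465569|41365|19950218|2-HIGH|0|17|2608718|9783671|4|2504369|92072|2|19950331|TRUCK"
fields = split_delimited(line)
print(len(fields), fields[6], fields[-1])
```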

REGION

Whenever possible, you should locate your load data in the same AWS region as your Amazon Redshift cluster. If your data and your cluster are in the same region, you reduce latency, minimize eventual consistency issues, and avoid cross-region data transfer costs. For more information, see Amazon Redshift Best Practices for Loading Data (p. 29).

If you must load data from a different AWS region, use the REGION option to specify the AWS region in which the load data is located. If you specify a region, all of the load data, including manifest files, must be in the named region. For more information, see REGION (p. 397).

If your cluster is in the US East (N. Virginia) region, execute the following command to load the SUPPLIER

table from pipe-delimited data in an Amazon S3 bucket located in the US West (Oregon) region. For this

example, do not change the bucket name.

copy supplier from 's3://awssampledbuswest2/ssbgz/supplier.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|'
gzip
region 'us-west-2';

If your cluster is not in the US East (N. Virginia) region, execute the following command to load the

SUPPLIER table from pipe-delimited data in an Amazon S3 bucket located in the US East (N. Virginia)

region. For this example, do not change the bucket name.

copy supplier from 's3://awssampledb/ssbgz/supplier.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|'
gzip
region 'us-east-1';

Load the CUSTOMER Table Using MANIFEST

In this step, you will use the FIXEDWIDTH, MAXERROR, ACCEPTINVCHARS, and MANIFEST options to load the CUSTOMER table.

The sample data for this exercise contains characters that will cause errors when COPY attempts to load them. You will use the MAXERROR option and the STL_LOAD_ERRORS system table to troubleshoot the load errors and then use the ACCEPTINVCHARS and MANIFEST options to eliminate the errors.

Fixed-Width Format

Fixed-width format defines each field as a fixed number of characters, rather than separating fields with a delimiter. The following excerpt from the data for the CUSTOMER table uses fixed-width format.

1 Customer#000000001 IVhzIApeRb MOROCCO 0MOROCCO AFRICA 25-705

2 Customer#000000002 XSTf4,NCwDVaWNe6tE JORDAN 6JORDAN MIDDLE EAST 23-453

3 Customer#000000003 MG9kdTD ARGENTINA5ARGENTINAAMERICA 11-783

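You describe the layout to COPY with a specification string made up of label:width pairs, one per field. The order of the label/width pairs must match the order of the table columns exactly. For more information, see FIXEDWIDTH (p. 408).

The fixed-width specification string for the CUSTOMER table data is as follows.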

fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'
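To make the label/width pairs concrete, the following Python sketch parses the specification string and uses the widths to slice a record into fields. Both functions are hypothetical helpers. The sample record is built by padding illustrative values to the column widths, because the excerpt above does not preserve the original spacing, and the c_mktsegment value is invented for the example.

```python
def parse_fixedwidth_spec(spec):
    """Turn 'label:width, label:width, ...' into a list of (label, width) pairs."""
    pairs = []
    for item in spec.split(","):
        label, width = item.split(":")
        pairs.append((label.strip(), int(width)))
    return pairs


def split_fixedwidth(line, pairs):
    """Slice a fixed-width record into a dict keyed by column label."""
    record, pos = {}, 0
    for label, width in pairs:
        record[label] = line[pos:pos + width].strip()
        pos += width
    return record


spec = ("c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, "
        "c_region :12, c_phone:15,c_mktsegment:10")
pairs = parse_fixedwidth_spec(spec)

# Hypothetical record: each value is padded with spaces to its column width.
values = ["1", "Customer#000000001", "IVhzIApeRb", "MOROCCO  0",
          "MOROCCO", "AFRICA", "25-705", "BUILDING"]
line = "".join(value.ljust(width) for value, (_, width) in zip(values, pairs))

record = split_fixedwidth(line, pairs)
print(len(line), record["c_name"], record["c_region"])
```

The widths add up to 122 characters, so every record in a file with this layout is expected to have that length.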

To load the CUSTOMER table from fixed-width data, execute the following command.

copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10';

You should get an error message, similar to the following.

An error occurred when executing the SQL command:

copy customer

from 's3://mybucket/load/customer-fw.tbl'

credentials'aws_access_key_id=...

ERROR: Load into table 'customer' failed. Check 'stl_load_errors' system table for details. [SQL State=XX000]

Execution time: 2.95s

1 statement(s) failed.

MAXERROR

By default, the first time COPY encounters an error, the command fails and returns an error message. To save time during testing, you can use the MAXERROR option to instruct COPY to skip a specified number of errors before it fails. Because we expect errors the first time we test loading the CUSTOMER table data, add maxerror 10 to the COPY command.

To test using the FIXEDWIDTH and MAXERROR options, execute the following command.

copy customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'
maxerror 10;

This time, instead of an error message, you get a warning message similar to the following.

Warnings:

Load into table 'customer' completed, 112497 record(s) loaded successfully.

Load into table 'customer' completed, 7 record(s) could not be loaded. Check 'stl_load_errors' system table for details.

The warning indicates that COPY encountered seven errors. To check the errors, query the

STL_LOAD_ERRORS table, as shown in the following example.

select query, substring(filename,22,25) as filename, line_number as line,
substring(colname,0,12) as column, type, position as pos,
substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text,
substring(err_reason,0,45) as error_reason
from stl_load_errors
order by query desc, filename
limit 7;

The results of the STL_LOAD_ERRORS query should look similar to the following.

 query  | filename            | line | column    | type    | pos | line_text            | field_text | error_reason
--------+---------------------+------+-----------+---------+-----+----------------------+------------+----------------------------------------------
 334489 | customer-fw.tbl.log |    2 | c_custkey | int4    |  -1 | customer-fw.tbl      | customer-f | Invalid digit, Value 'c', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log |    6 | c_custkey | int4    |  -1 | Complete             | Complete   | Invalid digit, Value 'C', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log |    3 | c_custkey | int4    |  -1 | #Total rows          | #Total row | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log |    5 | c_custkey | int4    |  -1 | #Status              | #Status    | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl.log |    1 | c_custkey | int4    |  -1 | #Load file           | #Load file | Invalid digit, Value '#', Pos 0, Type: Integ
 334489 | customer-fw.tbl000  |    1 | c_address | varchar |  34 | 1 Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8
 334489 | customer-fw.tbl000  |    1 | c_address | varchar |  34 | 1 Customer#000000001 | .Mayag.ezR | String contains invalid or unsupported UTF8
(7 rows)

By examining the results, you can see that there are two messages in the error_reason column:

•Invalid digit, Value '#', Pos 0, Type: Integ

These errors are caused by the customer-fw.tbl.log file. The problem is that it is a log file, not a data file, and should not be loaded. You can use a manifest file to avoid loading the wrong file.

•String contains invalid or unsupported UTF8

The VARCHAR data type supports multibyte UTF-8 characters up to three bytes. If the load data contains unsupported or invalid characters, you can use the ACCEPTINVCHARS option to replace each invalid character with a specified alternative character.

Another problem with the load is more difficult to detect: the load produced unexpected results. To investigate this problem, execute the following command to query the CUSTOMER table.

select c_custkey, c_name, c_address

from customer

order by c_custkey

limit 10;

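 c_custkey | c_name             | c_address
-----------+--------------------+------------------------
         2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE
         2 | Customer#000000002 | XSTf4,NCwDVaWNe6tE
         3 | Customer#000000003 | MG9kdTD
         3 | Customer#000000003 | MG9kdTD
         4 | Customer#000000004 | XxVSJsL
         4 | Customer#000000004 | XxVSJsL
         5 | Customer#000000005 | KvpyuHCplrB84WgAi
         5 | Customer#000000005 | KvpyuHCplrB84WgAi
         6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx
         6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx
(10 rows)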

The rows should be unique, but there are duplicates.

Another way to check for unexpected results is to verify the number of rows that were loaded. In our case, 100000 rows should have been loaded, but the load message reported loading 112497 records. The extra rows were loaded because COPY loaded an extraneous file, customer-fw.tbl0000.bak.

In this exercise, you will use a manifest file to avoid loading the wrong files.
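Both checks described above, comparing row counts and looking for repeated keys, are easy to script once you have the numbers in hand. The following Python sketch uses the counts from this exercise. The function names and the sample key list are hypothetical.

```python
from collections import Counter


def extra_rows(expected, loaded):
    """Return how many more rows were loaded than expected."""
    return loaded - expected


def duplicated_keys(keys):
    """Return the key values that occur more than once, in sorted order."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


print(extra_rows(100000, 112497))  # 12497 unexpected rows

# Hypothetical sample of c_custkey values, for illustration only.
sample_keys = [2, 2, 3, 3, 4, 5]
print(duplicated_keys(sample_keys))
```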

ACCEPTINVCHARS

By default, when COPY encounters a character that is not supported by the column's data type, it skips

the row and returns an error. For information about invalid UTF-8 characters, see Multibyte Character

Load Errors (p. 214).

You could use the MAXERROR option to ignore errors and continue loading, then query STL_LOAD_ERRORS to locate the invalid characters, and then fix the data files. However, MAXERROR is best used for troubleshooting load problems and should generally not be used in a production environment.

The ACCEPTINVCHARS option is usually a better choice for managing invalid characters. ACCEPTINVCHARS instructs COPY to replace each invalid character with a specified valid character and continue with the load operation. You can specify any valid ASCII character, except NULL, as the replacement character. The default replacement character is a question mark ( ? ). COPY replaces multibyte characters with a replacement string of equal length. For example, a 4-byte character would be replaced with '????'.

COPY returns the number of rows that contained invalid UTF-8 characters, and it adds an entry to the STL_REPLACEMENTS system table for each affected row, up to a maximum of 100 rows per node slice. Additional invalid UTF-8 characters are also replaced, but those replacement events are not recorded.

ACCEPTINVCHARS is valid only for VARCHAR columns.

For this step, you will add the ACCEPTINVCHARS option with the replacement character '^'.
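The equal-length replacement rule can be sketched in Python. This is an approximation written for illustration, not Redshift's implementation: it replaces every undecodable byte with the replacement character, and it treats any character longer than an assumed max_bytes limit (three, following the VARCHAR description earlier in this step) as unsupported. The function names and the sample byte strings are hypothetical.

```python
import codecs


def _make_handler(repl):
    """Build a decode-error handler that emits one repl per invalid byte."""
    def handler(exc):
        if isinstance(exc, UnicodeDecodeError):
            return (repl * (exc.end - exc.start), exc.end)
        raise exc
    return handler


def accept_inv_chars(raw, repl="?", max_bytes=3):
    """Decode UTF-8 bytes, replacing invalid or over-long characters with repl."""
    name = "acceptinvchars_%d" % ord(repl)
    codecs.register_error(name, _make_handler(repl))
    text = raw.decode("utf-8", errors=name)
    out = []
    for ch in text:
        width = len(ch.encode("utf-8"))
        # Characters wider than max_bytes become a replacement string of equal length.
        out.append(ch if width <= max_bytes else repl * width)
    return "".join(out)


# Hypothetical input: 0xFC is a Latin-1 byte that is not valid UTF-8.
print(accept_inv_chars(b"Mayag\xfcez", "^"))
# A 4-byte UTF-8 character is replaced with four replacement characters.
print(accept_inv_chars("\U0001F600".encode("utf-8")))
```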

MANIFEST

When you COPY from Amazon S3 using a key prefix, there is a risk that you will load unwanted files. For example, the 's3://mybucket/load/' folder contains eight data files that share the key prefix customer-fw.tbl: customer-fw.tbl0000, customer-fw.tbl0001, and so on. However, the same folder also contains the extraneous files customer-fw.tbl.log and customer-fw.tbl-0001.bak.

To ensure that you load all of the correct files, and only the correct files, use a manifest file. The manifest is a text file in JSON format that explicitly lists the unique object key for each source file to be loaded. The file objects can be in different folders or different buckets, but they must be in the same region. For more information, see MANIFEST (p. 396).

The following shows the customer-fw-manifest text.

{

"entries": [

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-000"},

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-001"},

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-002"},

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-003"},

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-004"},

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-005"},

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-006"},

{"url":"s3://<your-bucket-name>/load/customer-fw.tbl-007"}

]

}
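A manifest with this shape can also be generated rather than edited by hand. The following Python sketch builds the same JSON structure with the standard json module. The build_manifest function is a hypothetical helper, and the file-name pattern (a hyphen followed by a three-digit sequence number) is an assumption taken from the listing above.

```python
import json


def build_manifest(bucket, key_prefix, file_count):
    """Return manifest JSON text listing file_count numbered load files."""
    entries = [
        {"url": f"s3://{bucket}/load/{key_prefix}-{i:03d}"}
        for i in range(file_count)
    ]
    return json.dumps({"entries": entries}, indent=2)


manifest_text = build_manifest("<your-bucket-name>", "customer-fw.tbl", 8)
print(manifest_text)
```

Generating the file keeps the entry list in step with the number of data files, but you still need to confirm that each listed object exists in the bucket.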

To load the data for the CUSTOMER table using the manifest file

1. Open the file customer-fw-manifest in a text editor.

2. Replace <your-bucket-name> with the name of your bucket.

3. Save the file.

4. Upload the file to the load folder on your bucket.

5. Execute the following COPY command.

copy customer from 's3://<your-bucket-name>/load/customer-fw-manifest'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region :12, c_phone:15,c_mktsegment:10'
maxerror 10
acceptinvchars as '^'
manifest;

Load the DWDATE Table Using DATEFORMAT

In this step, you will use the DELIMITER and DATEFORMAT options to load the DWDATE table.

When loading DATE and TIMESTAMP columns, COPY expects the default format, which is YYYY-MM-DD

for dates and YYYY-MM-DD HH:MI:SS for time stamps. If the load data does not use a default format,

you can use DATEFORMAT and TIMEFORMAT to specify the format.

The following excerpt shows date formats in the DWDATE table. Notice that the date formats in column

two are inconsistent.

19920104 1992-01-04 Sunday January 1992 199201 Jan1992 1 4 4 1...

19920112 January 12, 1992 Monday January 1992 199201 Jan1992 2 12 12 1...

19920120 January 20, 1992 Tuesday January 1992 199201 Jan1992 3 20 20 1...

DATEFORMAT

You can specify only one date format. If the load data contains inconsistent formats, possibly in different columns, or if the format is not known at load time, use DATEFORMAT with the 'auto' argument. When 'auto' is specified, COPY will recognize any valid date or time format and convert it to the default format. The 'auto' option recognizes several formats that are not supported when using a DATEFORMAT and TIMEFORMAT string. For more information, see Using Automatic Recognition with DATEFORMAT and TIMEFORMAT (p. 433).
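The idea behind automatic recognition can be illustrated by trying a list of candidate formats in turn. The following Python sketch handles only the two formats that appear in the DWDATE excerpt. It is an analogy, not the algorithm that COPY uses, and the function name and format list are assumptions.

```python
from datetime import date, datetime

# Candidate formats: the default YYYY-MM-DD and the "Month DD, YYYY" style.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y")


def parse_date_auto(text):
    """Return a date parsed with the first candidate format that matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


print(parse_date_auto("1992-01-04"))
print(parse_date_auto("January 12, 1992"))
```

Both inputs normalize to the same YYYY-MM-DD representation, which mirrors how 'auto' converts recognized values to the default format.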

To load the DWDATE table, execute the following COPY command.

copy dwdate from 's3://<your-bucket-name>/load/dwdate-tab.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '\t'
dateformat 'auto';

Load the LINEORDER Table Using Multiple Files

This step uses the GZIP and COMPUPDATE options to load the LINEORDER table.

In this exercise, you will load the LINEORDER table from a single data file, and then load it again from multiple files in order to compare the load times for the two methods.

Note

The files for loading the LINEORDER table are provided in an AWS sample bucket. You don't need to upload files for this step.

GZIP, LZOP and BZIP2

You can compress your files using gzip, lzop, or bzip2 compression formats. When loading from compressed files, COPY uncompresses the files during the load process. Compressing your files saves storage space and shortens upload times.
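To get a feel for the space savings, you can compress a sample of delimited data locally. The following Python sketch uses the standard gzip module on synthetic, repetitive rows. The data is hypothetical, and the compression ratio you see on real load files will differ.

```python
import gzip

# Hypothetical pipe-delimited rows; repetitive text compresses well.
rows = "\n".join(f"{i}|{i % 7}|2-HIGH|TRUCK" for i in range(1, 5001))
raw = rows.encode("utf-8")

compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)

print(len(raw), len(compressed))
```

The round trip returns the original bytes, which is what COPY relies on when it uncompresses gzip files during the load.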

COMPUPDATE

When COPY loads an empty table with no compression encodings, it analyzes the load data to determine the optimal encodings. It then alters the table to use those encodings before beginning the load. This analysis process takes time, but it occurs, at most, once per table. To save time, you can skip this step by turning COMPUPDATE off. To enable an accurate evaluation of COPY times, you will turn COMPUPDATE off for this step.

Multiple Files

The COPY command can load data very efficiently when it loads from multiple files in parallel instead of loading from a single file. If you split your data into files so that the number of files is a multiple of the number of slices in your cluster, Amazon Redshift divides the workload and distributes the data evenly among the slices. The number of slices per node depends on the node size of the cluster. For more information about the number of slices that each node size has, go to About Clusters and Nodes in the Amazon Redshift Cluster Management Guide.

For example, the dc1.large compute nodes used in this tutorial have two slices each, so the four-node cluster has eight slices. In previous steps, the load data was contained in eight files, even though the files are very small. In this step, you will compare the time difference between loading from a single large file and loading from multiple files.
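The arithmetic behind the file count can be sketched as follows. The Python example computes the slice count for the tutorial cluster and splits a list of records into that many near-equal parts. Both functions are hypothetical helpers, and the record list is synthetic.

```python
def slice_count(nodes, slices_per_node):
    """Total number of slices in the cluster."""
    return nodes * slices_per_node


def split_evenly(records, parts):
    """Split records into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


slices = slice_count(nodes=4, slices_per_node=2)  # four dc1.large nodes, two slices each
chunks = split_evenly(list(range(100)), slices)
print(slices, [len(chunk) for chunk in chunks])
```

Writing each chunk to its own file gives one file per slice. Any multiple of the slice count, such as 16 or 24 files, keeps the same even distribution.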

The files you will use for this tutorial contain about 15 million records and occupy about 1.2 GB. These files are very small by Amazon Redshift standards, but sufficient to demonstrate the performance advantage of loading from multiple files. The files are large enough that the time required to download them and then upload them to Amazon S3 is excessive for this tutorial, so you will load the files directly from an AWS sample bucket.

The following screenshot shows the data files for LINEORDER.

To evaluate the performance of COPY with multiple files

1. Execute the following command to COPY from a single file. Do not change the bucket name.

copy lineorder from 's3://awssampledb/load/lo/lineorder-single.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip
compupdate off
region 'us-east-1';

2. Your results should be similar to the following. Note the execution time.

Warnings:

Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.

0 row(s) affected.

copy executed successfully

Execution time: 51.56s

3. Execute the following command to COPY from multiple files. Do not change the bucket name.

copy lineorder from 's3://awssampledb/load/lo/lineorder-multi.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip
compupdate off
region 'us-east-1';

4. Your results should be similar to the following. Note the execution time.

Warnings:

Load into table 'lineorder' completed, 14996734 record(s) loaded successfully.

0 row(s) affected.

copy executed successfully

Execution time: 17.7s

5. Compare execution times.

In our example, the time to load 15 million records decreased from 51.56 seconds to 17.7 seconds, a

reduction of 65.7 percent.

These results are based on using a four-node cluster. If your cluster has more nodes, the time savings is multiplied. For typical Amazon Redshift clusters, with tens to hundreds of nodes, the difference is even more dramatic. If you have a single-node cluster, there is little difference between the execution times.
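The percentage quoted in step 5 follows directly from the two execution times. The following Python sketch reproduces the calculation; the function name is hypothetical.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


# Execution times from steps 2 and 4, in seconds.
print(round(percent_reduction(51.56, 17.7), 1))  # 65.7
```

You can substitute your own execution times to compare your cluster against the tutorial's four-node results.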

Next Step

Step 6: Vacuum and Analyze the Database (p. 87)

Step 6: Vacuum and Analyze the Database

Whenever you add, delete, or modify a significant number of rows, you should run a VACUUM command and then an ANALYZE command. A vacuum recovers the space from deleted rows and restores the sort order. The ANALYZE command updates the statistics metadata, which enables the query optimizer to generate more accurate query plans. For more information, see Vacuuming Tables (p. 228).

If you load the data in sort key order, a vacuum is fast. In this tutorial, you added a significant number of rows, but you added them to empty tables. That being the case, there is no need to re-sort, and you didn't delete any rows. COPY automatically updates statistics after loading an empty table, so your statistics should be up-to-date. However, as a matter of good housekeeping, you will complete this tutorial by vacuuming and analyzing your database.

To vacuum and analyze the database, execute the following commands.

vacuum;

analyze;

Next Step

Step 7: Clean Up Your Resources (p. 88)

Step 7: Clean Up Your Resources

Your cluster continues to accrue charges as long as it is running. When you have completed this tutorial,

you should return your environment to the previous state by following the steps in Step 5: Revoke Access

and Delete Your Sample Cluster in the Amazon Redshift Getting Started.

If you want to keep the cluster, but recover the storage used by the SSB tables, execute the following

commands.

drop table part;

drop table supplier;

drop table customer;

drop table dwdate;

drop table lineorder;

Summary (p. 88)

Summary

In this tutorial, you uploaded data files to Amazon S3 and then used COPY commands to load the data from the files into Amazon Redshift tables.

You loaded data using the following formats:

• Character-delimited

• CSV

• Fixed-width

You used the STL_LOAD_ERRORS system table to troubleshoot load errors, and then used the REGION,

MANIFEST, MAXERROR, ACCEPTINVCHARS, DATEFORMAT, and NULL AS options to resolve the errors.

You applied the following best practices for loading data:

•Use a COPY Command to Load Data (p. 30)

•Split Your Load Data into Multiple Files (p. 30)

•Use a Single COPY Command to Load from Multiple Files (p. 30)

•Compress Your Data Files (p. 30)

•Use a Manifest File (p. 30)

•Verify Data Files Before and After a Load (p. 31)

For more information about Amazon Redshift best practices, see the following links:

•Amazon Redshift Best Practices for Loading Data (p. 29)

•Amazon Redshift Best Practices for Designing Tables (p. 26)

•Amazon Redshift Best Practices for Designing Queries (p. 32)

Next Step

For your next step, if you haven't done so already, we recommend taking Tutorial: Tuning Table

Design (p. 45).

Tutorial: Configuring Workload Management (WLM) Queues to Improve Query Processing

Overview

This tutorial walks you through the process of configuring workload management (WLM) in Amazon Redshift. By configuring WLM, you can improve query performance and resource allocation in your cluster.

Amazon Redshift routes user queries to queues for processing. WLM defines how those queries are routed to the queues. By default, Amazon Redshift has two queues available for queries: one for superusers, and one for users. The superuser queue cannot be configured and can only process one query at a time. You should reserve this queue for troubleshooting purposes only. The user queue can process up to five queries at a time, but you can configure this by changing the concurrency level of the queue if needed.

When you have several users running queries against the database, you might find another configuration to be more efficient. For example, if some users run resource-intensive operations, such as VACUUM, these might have a negative impact on less-intensive queries, such as reports. You might consider adding additional queues and configuring them for different workloads.

Estimated time: 75 minutes

Estimated cost: 50 cents

Prerequisites

You will need an Amazon Redshift cluster, the sample TICKIT database, and the psql client tool. If you do

not already have these set up, go to Amazon Redshift Getting Started and Connect to Your Cluster by

Using the psql Tool.

Sections

• Section 1: Understanding the Default Queue Processing Behavior (p. 90)
• Section 2: Modifying the WLM Query Queue Configuration (p. 94)
• Section 3: Routing Queries to Queues Based on User Groups and Query Groups (p. 98)
• Section 4: Using wlm_query_slot_count to Temporarily Override Concurrency Level in a Queue (p. 101)
• Section 5: Cleaning Up Your Resources (p. 103)

Section 1: Understanding the Default Queue Processing Behavior

Before you start to configure WLM, it’s useful to understand the default behavior of queue processing in Amazon Redshift. In this section, you’ll create two database views that return information from several system tables. Then you’ll run some test queries to see how queries are routed by default. For more information about system tables, see System Tables Reference (p. 797).

Step 1: Create the WLM_QUEUE_STATE_VW View

In this step, you’ll create a view called WLM_QUEUE_STATE_VW. This view returns information from the

following system tables.

• STV_WLM_CLASSIFICATION_CONFIG (p. 890)
• STV_WLM_SERVICE_CLASS_CONFIG (p. 894)
• STV_WLM_SERVICE_CLASS_STATE (p. 896)

You’ll use this view throughout the tutorial to monitor what happens to queues after you change the WLM configuration. The following table describes the data that the WLM_QUEUE_STATE_VW view returns.

Column | Description
-------|------------
queue | The number associated with the row that represents a queue. Queue number determines the order of the queues in the database.
description | A value that describes whether the queue is available only to certain user groups, to certain query groups, or to all types of queries.
slots | The number of slots allocated to the queue.
mem | The amount of memory, in MB per slot, allocated to the queue.
max_time | The amount of time a query is allowed to run before it is terminated (the max_execution_time value from the system table).
user_* | A value that indicates whether wildcard characters are allowed in the WLM configuration to match user groups.
query_* | A value that indicates whether wildcard characters are allowed in the WLM configuration to match query groups.
queued | The number of queries that are waiting in the queue to be processed.
executing | The number of queries that are currently executing.
executed | The number of queries that have executed.

To Create the WLM_QUEUE_STATE_VW View

1. Open psql and connect to your TICKIT sample database. If you do not have this database, see

Prerequisites (p. 90).

2. Run the following query to create the WLM_QUEUE_STATE_VW view.

create view WLM_QUEUE_STATE_VW as
select (config.service_class-5) as queue
, trim (class.condition) as description
, config.num_query_tasks as slots
, config.query_working_mem as mem
, config.max_execution_time as max_time
, config.user_group_wild_card as "user_*"
, config.query_group_wild_card as "query_*"
, state.num_queued_queries as queued
, state.num_executing_queries as executing
, state.num_executed_queries as executed
from
STV_WLM_CLASSIFICATION_CONFIG class,
STV_WLM_SERVICE_CLASS_CONFIG config,
STV_WLM_SERVICE_CLASS_STATE state
where
class.action_service_class = config.service_class
and class.action_service_class = state.service_class
and config.service_class > 4
order by config.service_class;

3. Run the following query to see the information that the view contains.

select * from wlm_queue_state_vw;

The following is an example result.

Step 2: Create the WLM_QUERY_STATE_VW View

In this step, you’ll create a view called WLM_QUERY_STATE_VW. This view returns information from the

STV_WLM_QUERY_STATE (p. 892) system table.

You’ll use this view throughout the tutorial to monitor the queries that are running. The following table

describes the data that the WLM_QUERY_STATE_VW view returns.

Column | Description
-------|------------
query | The query ID.
queue | The queue number.
slot_count | The number of slots allocated to the query.
start_time | The time that the query started.
state | The state of the query, such as executing.
queue_time | The number of microseconds that the query has spent in the queue.
exec_time | The number of microseconds that the query has been executing.

To Create the WLM_QUERY_STATE_VW View

1. In psql, run the following query to create the WLM_QUERY_STATE_VW view.

create view WLM_QUERY_STATE_VW as
select query, (service_class-5) as queue, slot_count, trim(wlm_start_time) as start_time,
trim(state) as state, trim(queue_time) as queue_time, trim(exec_time) as exec_time
from stv_wlm_query_state;

2. Run the following query to see the information that the view contains.

select * from wlm_query_state_vw;

The following is an example result.

Step 3: Run Test Queries

In this step, you’ll run queries from multiple connections in psql and review the system tables to

determine how the queries were routed for processing.

For this step, you will need two psql windows open:

• In psql window 1, you’ll run queries that monitor the state of the queues and queries using the views

you already created in this tutorial.

• In psql window 2, you’ll run long-running queries to change the results you find in psql window 1.

To Run the Test Queries

1. Open two psql windows. If you already have one window open, you only need to open a second

window. You can use the same user account for both of these connections.

2. In psql window 1, run the following query.

select * from wlm_query_state_vw;

The following is an example result.

This query returns a self-referential result. The query that is currently executing is the SELECT

statement from this view. A query on this view will always return at least one result. You’ll compare

this result with the result that occurs after starting the long-running query in the next step.

3. In psql window 2, you'll run a query from the TICKIT sample database. This query should run for approximately a minute so that you have time to explore the results of the WLM_QUEUE_STATE_VW view and the WLM_QUERY_STATE_VW view that you created earlier. If you find that the query does not run long enough for you to query both views, you can increase the value of the filter on l.listid to make it run longer.

Note

To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. When result caching is enabled, subsequent queries run much faster. To prevent the query from running too quickly, disable result caching for the current session.

To disable result caching for the current session, set the enable_result_cache_for_session (p. 949)

parameter to off, as shown following.

set enable_result_cache_for_session to off;

In psql window 2, run the following query.

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid < 100000;

4. In psql window 1, query WLM_QUEUE_STATE_VW and WLM_QUERY_STATE_VW and compare the

results to your earlier results.

select * from wlm_queue_state_vw;

select * from wlm_query_state_vw;

The following are example results.

Note the following differences between your earlier results and the results in this step:

• There are two rows now in WLM_QUERY_STATE_VW. One result is the self-referential query for

running a SELECT operation on this view. The second result is the long-running query from the

previous step.

• The executing column in WLM_QUEUE_STATE_VW has increased from 1 to 2. This column entry means

that there are two queries running in the queue.

• The executed column is incremented each time you run a query in the queue.

The WLM_QUEUE_STATE_VW view is useful for getting an overall view of the queues and how many

queries are being processed in each queue. The WLM_QUERY_STATE_VW view is useful for getting a

more detailed view of the individual queries that are currently running.

Section 2: Modifying the WLM Query Queue Configuration

Now that you understand how queues work by default, you'll learn how to configure query queues in WLM. In this section, you’ll create and configure a new parameter group for your cluster. You’ll create two additional user queues and configure them to accept queries based on the queries’ user group or query group labels. Any queries that do not get routed to one of these two queues will be routed to the default queue at run time.

Step 1: Create a Parameter Group

In this step, you’ll create a new parameter group that you’ll use to configure WLM for this tutorial.

To Create a Parameter Group

1. Sign in to the AWS Management Console and open the Amazon Redshift console at https://console.aws.amazon.com/redshift/.

2. In the navigation pane, choose Parameter Groups.

3. Choose Create Cluster Parameter Group.

4. In the Create Cluster Parameter Group dialog box, type wlmtutorial in the Parameter Group Name field and type WLM tutorial in the Description field. You can leave the Parameter Group Family setting as is. Then choose Create.

Step 2: Configure WLM

In this step, you’ll modify the default settings of your new parameter group. You’ll add two new query queues to the WLM configuration and specify different settings for each queue.

To Modify Parameter Group Settings

1. On the Parameter Groups page of the Amazon Redshift console, click the magnifying glass icon next

to wlmtutorial. Doing this opens up the Parameters page for wlmtutorial.

2. Choose the WLM tab. Click Add New Queue twice to add two new queues to this WLM configuration. Configure the queues with the following values.

• For queue 1, type 2 in the Concurrency box, test in the Query Groups box, and 30 in the % Memory box. Leave the other boxes empty.
• For queue 2, type 3 in the Concurrency box, admin in the User Groups box, and 40 in the % Memory box. Leave the other boxes empty.
• Don't make any changes to the default queue. WLM automatically assigns unallocated memory to the default queue.

3. Click Save Changes.
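As a rough check of the values entered in this step, the following Python sketch works out how much memory each queue and each slot would receive. It is an illustration only. It assumes that a queue's memory is divided evenly among its slots, that the default queue keeps its default concurrency of five, and that all unallocated memory goes to the default queue. The function and variable names are hypothetical and are not part of Amazon Redshift.

```python
def per_slot_memory(queues, default_concurrency=5):
    """Return {queue_name: (memory_percent, percent_per_slot)}.

    queues maps a queue name to (concurrency, memory_percent).
    Assumption: the default queue receives whatever memory is left over,
    and each queue splits its memory evenly across its slots.
    """
    allocated = sum(mem for _, mem in queues.values())
    if allocated > 100:
        raise ValueError("memory percentages exceed 100")
    result = {
        name: (mem, mem / slots) for name, (slots, mem) in queues.items()
    }
    leftover = 100 - allocated
    result["default"] = (leftover, leftover / default_concurrency)
    return result


# Values typed in this step: queue 1 (test query group) and queue 2 (admin user group).
tutorial_queues = {"queue 1": (2, 30), "queue 2": (3, 40)}
shares = per_slot_memory(tutorial_queues)
for name, (mem, per_slot) in shares.items():
    print(f"{name}: {mem}% of memory, {per_slot:.1f}% per slot")
```

Under these assumptions, the default queue ends up with the remaining 30 percent of memory, spread across five slots.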

Step 3: Associate the Parameter Group with Your Cluster

In this step, you’ll open your sample cluster and associate it with the new parameter group. After you do

this, you’ll reboot the cluster so that Amazon Redshift can apply the new settings to the database.

To Associate the Parameter Group with Your Cluster

1. In the navigation pane, click Clusters, and then click your cluster to open it. If you are using the same

cluster from Amazon Redshift Getting Started, your cluster will be named examplecluster.

2. On the Configuration tab, click Modify in the Cluster menu.

3. In the Modify Cluster dialog box, select wlmtutorial from the Cluster Parameter Group menu, and

then click Modify.

The statuses shown in the Cluster Parameter Group and Parameter Group Apply Status will change

from in-sync to applying as shown in the following.

After the new parameter group is applied to the cluster, the Cluster Properties and Cluster Status

show the new parameter group that you associated with the cluster. You need to reboot the cluster so

that these settings can be applied to the database also.

4. In the Cluster menu, click Reboot. The status shown in Cluster Status will change from available to

rebooting. After the cluster is rebooted, the status will return to available.

Section 3: Routing Queries to Queues Based on User Groups and Query Groups

Now that you have your cluster associated with a new parameter group, and you have configured WLM, you’ll run some queries to see how Amazon Redshift routes queries into queues for processing.

Step 1: View Query Queue Configuration in the Database

First, verify that the database has the WLM configuration that you expect.

To View the Query Queue Configuration

1. Open psql and run the following query. The query uses the WLM_QUEUE_STATE_VW view you created

in Step 1: Create the WLM_QUEUE_STATE_VW View (p. 91). If you already had a session connected

to the database prior to the cluster reboot, you’ll need to reconnect.

select * from wlm_queue_state_vw;

The following is an example result.

Compare these results to the results you received in Step 1: Create the WLM_QUEUE_STATE_VW

View (p. 91). Notice that there are now two additional queues. Queue 1 is now the queue for the

test query group, and queue 2 is the queue for the admin user group.

Queue 3 is now the default queue. The last queue in the list is always the default queue, and that is the queue to which queries are routed by default if no user group or query group is specified in a query.

2. Run the following query to confirm that your query now runs in queue 3.

select * from wlm_query_state_vw;

The following is an example result.

Step 2: Run a Query Using the Query Group Queue

To Run a Query Using the Query Group Queue

1. Run the following query to route it to the test query group.

set query_group to test;

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

2. From the other psql window, run the following query.

select * from wlm_query_state_vw;

The following is an example result.

The query was routed to the test query group, which is queue 1 now.

3. Select all from the queue state view.

select * from wlm_queue_state_vw;

You'll see a result similar to the following.

4. Now, reset the query group and run the long query again:

reset query_group;

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

5. Run the queries against the views to see the results.

select * from wlm_queue_state_vw;

select * from wlm_query_state_vw;

The following are example results.

The result should be that the query is now running in queue 3 again.

Step 3: Create a Database User and Group

In Step 2: Configure WLM (p. 95), you configured one of your query queues with a user group named admin. Before you can run any queries in this queue, you need to create the user group in the database and add a user to the group. Then you’ll log on with psql using the new user’s credentials and run queries. You need to run queries as a superuser, such as the masteruser, to create database users.

To Create a New Database User and User Group

1. In the database, create a new database user named adminwlm by running the following command in a

psql window.

create user adminwlm createuser password '123Admin';

2. Then, run the following commands to create the new user group and add your new adminwlm user to

it.

create group admin;

alter group admin add user adminwlm;

Step 4: Run a Query Using the User Group Queue

Next you’ll run a query and route it to the user group queue. You do this when you want to route your

query to a queue that is conﬁgured to handle the type of query you want to run.

To Run a Query Using the User Group Queue

1. In psql window 2, run the following queries to switch to the adminwlm account and run a query as

that user.

set session authorization 'adminwlm';

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

2. In psql window 1, run the following query to see the query queue that the queries are routed to.

select * from wlm_query_state_vw;

select * from wlm_queue_state_vw;

The following are example results.

Note that the queue this query ran in is queue 2, the admin user queue. Any time you run queries logged in as this user, they will run in queue 2 unless you specify a different query group to use.

3. Now run the following query from psql window 2.

set query_group to test;

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

4. In psql window 1, run the following query to see the query queue that the queries are routed to.

select * from wlm_queue_state_vw;

select * from wlm_query_state_vw;

The following are example results.

5. When you’re done, reset the query group.

reset query_group;
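The routing behavior exercised in this section can be summarized with a small Python model. This is a sketch under stated assumptions, not Amazon Redshift's implementation: it assumes that queues are checked in order, that the first queue whose user group or query group matches wins, and that anything unmatched falls through to the last (default) queue. It does not model the superuser queue. The route function and the queue dictionaries are hypothetical names used only for illustration.

```python
def route(user_groups, query_group, queues):
    """Return the 1-based number of the queue a query would run in.

    queues is an ordered list of dicts with optional 'user_groups' and
    'query_groups' sets. The last entry is treated as the default queue.
    Assumption: the first matching queue in list order wins.
    """
    for number, queue in enumerate(queues[:-1], start=1):
        if set(user_groups) & queue.get("user_groups", set()):
            return number
        if query_group is not None and query_group in queue.get("query_groups", set()):
            return number
    return len(queues)  # nothing matched, so use the default queue


# The configuration created in Section 2 of this tutorial.
tutorial_queues = [
    {"query_groups": {"test"}},   # queue 1
    {"user_groups": {"admin"}},   # queue 2
    {},                           # queue 3, the default queue
]

print(route([], None, tutorial_queues))           # masteruser, no query group
print(route([], "test", tutorial_queues))         # set query_group to test
print(route(["admin"], None, tutorial_queues))    # adminwlm, no query group
print(route(["admin"], "test", tutorial_queues))  # adminwlm with query group test
```

In this model the adminwlm user lands in queue 2 by default and in queue 1 after setting the test query group, because queue 1 is checked first.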

Section 4: Using wlm_query_slot_count to Temporarily Override Concurrency Level in a Queue

Sometimes, users might temporarily need more resources for a particular query. If so, they can use the wlm_query_slot_count configuration setting to temporarily override the way slots are allocated in a query queue. Slots are units of memory and CPU that are used to process queries. You might override the slot count when you have occasional queries that take a lot of resources in the cluster, such as when you perform a VACUUM operation in the database.

If you find that users often need to set wlm_query_slot_count for certain types of queries, consider adjusting the WLM configuration and giving users a queue that better suits the needs of their queries. For more information about temporarily overriding the concurrency level by using slot count, see wlm_query_slot_count (p. 955).

Step 1: Override the Concurrency Level Using wlm_query_slot_count

For the purposes of this tutorial, we’ll run the same long-running SELECT query. We’ll run it as the

adminwlm user using wlm_query_slot_count to increase the number of slots available for the query.

To Override the Concurrency Level Using wlm_query_slot_count

1. Increase the limit on the query to make sure that you have enough time to query the

WLM_QUERY_STATE_VW view and see a result.

set wlm_query_slot_count to 3;

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

2. Now, query WLM_QUERY_STATE_VW using the masteruser account to see how the query is running.

select * from wlm_query_state_vw;

The following is an example result.

Notice that the slot count for the query is 3. This count means that the query is using all three slots to

process the query, allocating all of the resources in the queue to that query.

3. Now, run the following query.

select * from WLM_QUEUE_STATE_VW;

The following is an example result.

The wlm_query_slot_count configuration setting is valid for the current session only. If that session expires, or another user runs a query, the WLM configuration is used.

4. Reset the slot count and rerun the test.

reset wlm_query_slot_count;

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

The following are example results.

Step 2: Run Queries from Different Sessions

Next, run queries from diﬀerent sessions.

To Run Queries from Diﬀerent Sessions

1. In psql windows 1 and 2, run the following command to use the test query group.

set query_group to test;

2. In psql window 1, run the following long-running query.

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

3. While the long-running query is still running in psql window 1, run the following in psql window 2 to increase the slot count so that it uses all the slots for the queue, and then start the long-running query.

set wlm_query_slot_count to 2;

select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;

4. Open a third psql window and query the views to see the results.

select * from wlm_queue_state_vw;

select * from wlm_query_state_vw;

The following are example results.

Notice that the first query is using one of the slots allocated to queue 1, and that one query is waiting in the queue (where queued is 1 and state is QueuedWaiting). After the first query completes, the second one begins executing. The second query has to wait because both queries are routed to the test query group, and the second query requests both slots in the queue while the first query still holds one of them.
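The waiting behavior in this step can be reproduced with a toy slot-accounting model in Python. This is a sketch, not Amazon Redshift's scheduler: it assumes that waiting queries start in arrival order as soon as enough slots are free, and that a request for more slots than the queue has is rejected. The SlotQueue class and the query names are hypothetical.

```python
from collections import deque


class SlotQueue:
    """Toy model of one WLM queue with a fixed number of slots."""

    def __init__(self, slots):
        self.slots = slots
        self.free = slots
        self.executing = {}      # query name -> slots held
        self.waiting = deque()   # (query name, slots requested), oldest first

    def submit(self, name, slot_count=1):
        # Assumption: a request larger than the whole queue is rejected.
        if slot_count > self.slots:
            raise ValueError("slot count exceeds the queue's concurrency level")
        self.waiting.append((name, slot_count))
        self._admit()

    def finish(self, name):
        self.free += self.executing.pop(name)
        self._admit()

    def _admit(self):
        # Start waiting queries in arrival order while enough slots are free.
        while self.waiting and self.waiting[0][1] <= self.free:
            name, slot_count = self.waiting.popleft()
            self.free -= slot_count
            self.executing[name] = slot_count


queue1 = SlotQueue(slots=2)             # the test query group queue
queue1.submit("window 1 query")         # default slot count of 1
queue1.submit("window 2 query", 2)      # wlm_query_slot_count set to 2
before = (dict(queue1.executing), list(queue1.waiting))
queue1.finish("window 1 query")
after = (dict(queue1.executing), list(queue1.waiting))
print("while window 1 runs:", before)
print("after window 1 ends:", after)
```

While the first query holds one of the two slots, the second query cannot obtain two slots and waits. It starts only after the first query releases its slot.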

Section 5: Cleaning Up Your Resources

Your cluster continues to accrue charges as long as it is running. When you have completed this tutorial,

you should return your environment to the previous state by following the steps in Step 6: Find

Additional Resources and Reset Your Environment in Amazon Redshift Getting Started.

For more information about WLM, see Implementing Workload Management (p. 285).

Tutorial: Querying Nested Data with Amazon Redshift Spectrum

Overview

Amazon Redshift Spectrum supports querying nested data in Parquet, ORC, JSON, and Ion file formats. Redshift Spectrum accesses the data using external tables. You can create external tables that use the complex data types struct, array, and map.

For example, suppose that your data file contains the following data in Amazon S3 in a folder named customers.

{ Id: 1,
  Name: {Given:"John", Family:"Smith"},
  Phones: ["123-457789"],
  Orders: [ {Date: "Mar 1,2018 11:59:59", Price: 100.50},
            {Date: "Mar 1,2018 09:10:00", Price: 99.12} ]
}
{ Id: 2,
  Name: {Given:"Jenny", Family:"Doe"},
  Phones: ["858-8675309", "415-9876543"],
  Orders: [ ]
}
{ Id: 3,
  Name: {Given:"Andy", Family:"Jones"},
  Phones: [ ],
  Orders: [ {Date: "Mar 2,2018 08:02:15", Price: 13.50} ]
}

You can use Redshift Spectrum to query this data. The following tutorial shows you how to do so.

For tutorial prerequisites, steps, and nested data use cases, see the following topics:

• Prerequisites (p. 104)
• Step 1: Create an External Table That Contains Nested Data (p. 105)
• Step 2: Query Your Nested Data in Amazon S3 with SQL Extensions (p. 105)
• Nested Data Use Cases (p. 109)
• Nested Data Limitations (p. 111)

Prerequisites

If you are not using Redshift Spectrum yet, follow the steps in the Getting Started with Amazon Redshift

Spectrum (p. 150) tutorial before continuing.

Step 1: Create an External Table That Contains Nested Data

To create the external table for this tutorial, run the following command.

CREATE EXTERNAL TABLE spectrum.customers (
  id int,
  name struct<given:varchar(20), family:varchar(20)>,
  phones array<varchar(20)>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)
STORED AS PARQUET
LOCATION 's3://awssampledbuswest2/nested_example/customers/';

In the preceding example, the external table spectrum.customers uses the struct and array data types to define columns with nested data. Amazon Redshift Spectrum supports querying nested data in Parquet, ORC, JSON, and Ion file formats. The LOCATION parameter has to refer to the Amazon S3 folder that contains the nested data or files.

Note

Amazon Redshift doesn't support complex data types in an Amazon Redshift database table.

You can use complex data types only with Redshift Spectrum external tables.

You can nest array and struct types at any level. For example, you can define a column named toparray as shown in the following example.

toparray array<struct<nestedarray:
         array<struct<morenestedarray:
         array<string>>>>>

You can also nest struct types as shown for column x in the following example.

x struct<a: string,
         b: struct<c: integer,
                   d: struct<e: string>>>

Step 2: Query Your Nested Data in Amazon S3 with SQL Extensions

Redshift Spectrum supports querying array, map, and struct complex types through extensions to the

Amazon Redshift SQL syntax.

Extension 1: Access to Columns of Structs

You can extract data from struct columns using a dot notation that concatenates field names into paths. For example, the following query returns given and family names for customers. The given name is accessed by the long path c.name.given. The family name is accessed by the long path c.name.family.

SELECT c.id, c.name.given, c.name.family

FROM spectrum.customers c;

The preceding query returns the following data.

id | given | family
---|-------|-------
1 | John | Smith
2 | Jenny | Doe
3 | Andy | Jones
(3 rows)

A struct can be a column of another struct, which can be a column of another struct, at any level. The paths that access columns in such deeply nested structs can be arbitrarily long. For example, see the definition for the column x in the following example.

x struct<a: string,
         b: struct<c: integer,
                   d: struct<e: string>>>

You can access the data in e as x.b.d.e.

Note

You use structs only to describe the path to the fields that they contain. You can't access them directly in a query or return them from a query.
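The dot-path rule can be pictured with a short Python sketch that walks a nested dictionary one field at a time. It is only an analogy for how a path such as x.b.d.e resolves: the get_path helper and the sample values are hypothetical, and rejecting a path that stops at a struct simply mirrors the note above.

```python
def get_path(row, path):
    """Follow a dotted path through nested dicts and return the scalar at its end."""
    value = row
    for field in path.split("."):
        value = value[field]
    if isinstance(value, dict):
        # Mirrors the note: a struct itself can't be returned, only its fields.
        raise TypeError(f"{path} is a struct; select one of its fields instead")
    return value


# A row shaped like the column x defined above, with made-up values.
row = {"x": {"a": "alpha", "b": {"c": 1, "d": {"e": "deep"}}}}

print(get_path(row, "x.b.d.e"))
print(get_path(row, "x.b.c"))
```

Asking for get_path(row, "x.b") raises TypeError in this sketch, because the path ends at a struct rather than at a scalar field.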

Extension 2: Ranging Over Arrays in a FROM Clause

You can extract data from array columns (and, by extension, map columns) by specifying the array

columns in a FROM clause in place of table names. The extension applies to the FROM clause of the main

query, and also the FROM clauses of subqueries. You can't reference array elements by position, such as

c.orders[0].

By combining ranging over arrays with joins, you can achieve various kinds of unnesting, as explained

in the following use cases.

Unnesting Using Inner Joins

The following query selects customer IDs and order ship dates for customers that have orders. The SQL

extension in the FROM clause c.orders o depends on the alias c.

SELECT c.id, o.shipdate

FROM spectrum.customers c, c.orders o

For each customer c that has orders, the FROM clause returns one row for each order o of the customer

c. That row combines the customer row c and the order row o. Then the SELECT clause keeps only the

c.id and o.shipdate. The result is the following.

id | shipdate
---|--------------------
1 | 2018-03-01 11:59:59
1 | 2018-03-01 09:10:00
3 | 2018-03-02 08:02:15
(3 rows)

The alias c provides access to the customer fields, and the alias o provides access to the order fields.

The semantics are similar to standard SQL. You can think of the FROM clause as executing the following nested loop, which is followed by SELECT choosing the fields to output.

for each customer c in spectrum.customers
  for each order o in c.orders
    output c.id and o.shipdate

Therefore, if a customer doesn't have an order, the customer doesn't appear in the result.
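The nested loop can be run directly in Python to check the result. The following sketch uses the three sample customers from the overview, held in plain Python lists and dictionaries. It only models the semantics of the FROM clause and does not show how Redshift Spectrum executes the query.

```python
# Sample data from the overview, reduced to the fields this query uses.
customers = [
    {"id": 1, "orders": [{"shipdate": "2018-03-01 11:59:59", "price": 100.50},
                         {"shipdate": "2018-03-01 09:10:00", "price": 99.12}]},
    {"id": 2, "orders": []},
    {"id": 3, "orders": [{"shipdate": "2018-03-02 08:02:15", "price": 13.50}]},
]

# FROM spectrum.customers c, c.orders o  ->  one row per (customer, order) pair.
rows = [(c["id"], o["shipdate"]) for c in customers for o in c["orders"]]

for row in rows:
    print(row)
```

Customer 2 has an empty orders array, so the inner loop never runs for that customer and no row is produced.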

You can also think of this as the FROM clause performing a JOIN with the customers table and the

orders array. In fact, you can also write the query as shown in the following example.

SELECT c.id, o.shipdate

FROM spectrum.customers c INNER JOIN c.orders o ON true

Note

If a schema named c exists with a table named orders, then c.orders refers to the table

orders, and not the array column of customers.

Unnesting Using Left Joins

The following query outputs all customer names and their orders. If a customer hasn't placed an order,

the customer's name is still returned. However, in this case the order columns are NULL, as shown in the

following example for Jenny Doe.

SELECT c.id, c.name.given, c.name.family, o.shipdate, o.price

FROM spectrum.customers c LEFT JOIN c.orders o ON true

The preceding query returns the following data.

id | given | family | shipdate | price
---|-------|--------|---------------------|-------
1 | John | Smith | 2018-03-01 11:59:59 | 100.5
1 | John | Smith | 2018-03-01 09:10:00 | 99.12
2 | Jenny | Doe | |
3 | Andy | Jones | 2018-03-02 08:02:15 | 13.5
(4 rows)
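The left-join form of unnesting differs from the inner-join form in one respect: a parent row with an empty array still produces one output row, with NULL in the nested columns. The following Python sketch models that rule on the sample data. The left_unnest helper is a hypothetical name, and the struct column name is flattened into given and family keys to keep the example short.

```python
def left_unnest(parents, array_field):
    """Yield (parent, child) pairs; a parent with an empty array yields (parent, None)."""
    for parent in parents:
        children = parent[array_field] or [None]
        for child in children:
            yield parent, child


customers = [
    {"id": 1, "given": "John", "family": "Smith",
     "orders": [{"shipdate": "2018-03-01 11:59:59", "price": 100.50},
                {"shipdate": "2018-03-01 09:10:00", "price": 99.12}]},
    {"id": 2, "given": "Jenny", "family": "Doe", "orders": []},
    {"id": 3, "given": "Andy", "family": "Jones",
     "orders": [{"shipdate": "2018-03-02 08:02:15", "price": 13.50}]},
]

rows = []
for c, o in left_unnest(customers, "orders"):
    shipdate = o["shipdate"] if o is not None else None
    price = o["price"] if o is not None else None
    rows.append((c["id"], c["given"], c["family"], shipdate, price))

for row in rows:
    print(row)
```

Both of John Smith's rows carry customer ID 1, and Jenny Doe appears once with None standing in for the NULL order columns.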

Extension 3: Accessing an Array of Scalars Directly Using an Alias

When an alias p in a FROM clause ranges over an array of scalars, the query refers to the values of p

simply as p. For example, the following query produces pairs of customer names and phone numbers.

SELECT c.name.given, c.name.family, p AS phone

FROM spectrum.customers c LEFT JOIN c.phones p ON true

The preceding query returns the following data.

given | family | phone
------|--------|------------
John | Smith | 123-457789
Jenny | Doe | 858-8675309
Jenny | Doe | 415-9876543
Andy | Jones |
(4 rows)

Extension 4: Accessing Elements of Maps

Redshift Spectrum treats the map data type as an array type that contains struct types with a key

column and a value column. The key must be a scalar; the value can be any data type.

For example, the following code creates an external table with a map for storing phone numbers.

CREATE EXTERNAL TABLE spectrum.customers (
  id int,
  name struct<given:varchar(20), family:varchar(20)>,
  phones map<varchar(20), varchar(20)>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)

Because a map type behaves like an array type with columns key and value, you can think of the preceding schema as if it were the following.

CREATE EXTERNAL TABLE spectrum.customers (
  id int,
  name struct<given:varchar(20), family:varchar(20)>,
  phones array<struct<key:varchar(20), value:varchar(20)>>,
  orders array<struct<shipdate:timestamp, price:double precision>>
)

The following query returns the names of customers with a mobile phone number and returns the

number for each name. The map query is treated as the equivalent of querying a nested array of

struct types. The following query only returns data if you have created the external table as described

previously.

SELECT c.name.given, c.name.family, p.value

FROM spectrum.customers c, c.phones p

WHERE p.key = 'mobile'

Note

The key for a map is a string for Ion and JSON file types.
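The map-as-array view can be illustrated in Python by turning each map into a list of key and value pairs and then ranging over that list, as the query above does. The customer records and phone numbers in this sketch are invented for illustration, and map_as_array is a hypothetical helper, not a Redshift Spectrum function.

```python
def map_as_array(mapping):
    """Present a map as an array of structs with 'key' and 'value' fields."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


# Hypothetical data: phone numbers keyed by phone type.
customers = [
    {"given": "John", "family": "Smith",
     "phones": {"mobile": "555-0100", "home": "555-0101"}},
    {"given": "Jenny", "family": "Doe",
     "phones": {"work": "555-0102"}},
]

# FROM spectrum.customers c, c.phones p WHERE p.key = 'mobile'
rows = [
    (c["given"], c["family"], p["value"])
    for c in customers
    for p in map_as_array(c["phones"])
    if p["key"] == "mobile"
]

print(rows)
```

Only customers whose map contains a mobile entry appear in the result, which matches the inner-join behavior of ranging over an array in the FROM clause.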

Nested Data Use Cases

You can combine the extensions described previously with the usual SQL features. The following use

cases illustrate some common combinations. These examples help demonstrate how you can use nested

data. They aren't part of the tutorial.

Topics

• Ingesting Nested Data (p. 109)
• Aggregating Nested Data with Subqueries (p. 109)
• Joining Amazon Redshift and Nested Data (p. 110)

Ingesting Nested Data

You can use a CREATE TABLE AS statement to ingest data from an external table that contains complex

data types. The following query extracts all customers and their phone numbers from the external table,

using LEFT JOIN, and stores them in the Amazon Redshift table CustomerPhones.

CREATE TABLE CustomerPhones AS

SELECT c.name.given, c.name.family, p AS phone

FROM spectrum.customers c LEFT JOIN c.phones p ON true

Aggregating Nested Data with Subqueries

You can use a subquery to aggregate nested data. The following example illustrates this approach.

SELECT c.name.given, c.name.family, (SELECT COUNT(*) FROM c.orders o) AS ordercount

FROM spectrum.customers c

The following data is returned.

given   | family   | ordercount
--------|----------|--------------
Jenny   | Doe      | 0
John    | Smith    | 2
Andy    | Jones    | 1
(3 rows)

Note

When you aggregate nested data by grouping by the parent row, the most efficient way is the one shown in the previous example. In that example, the nested rows of c.orders are grouped by their parent row c. Alternatively, if you know that id is unique for each customer and o.shipdate is never null, you can aggregate as shown in the following example. However, this approach generally isn't as efficient as the previous example.

SELECT c.name.given, c.name.family, COUNT(o.shipdate) AS ordercount

FROM spectrum.customers c LEFT JOIN c.orders o ON true

GROUP BY c.id, c.name.given, c.name.family
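To see why the subquery form and the LEFT JOIN form return the same counts, the following Python sketch models both on in-memory data. It is only an illustration: the list-of-dictionaries layout and the ship dates are hypothetical stand-ins for the external table, and only the customer names and order counts come from the result shown earlier.

```python
# Hypothetical in-memory stand-in for spectrum.customers (ship dates are invented).
customers = [
    {"id": 1, "given": "Jenny", "family": "Doe", "orders": []},
    {"id": 2, "given": "John", "family": "Smith",
     "orders": [{"shipdate": "2018-03-01"}, {"shipdate": "2018-03-03"}]},
    {"id": 3, "given": "Andy", "family": "Jones",
     "orders": [{"shipdate": "2018-03-02"}]},
]


def count_with_subquery(rows):
    """Model of (SELECT COUNT(*) FROM c.orders o): count nested rows per parent."""
    return [(c["given"], c["family"], len(c["orders"])) for c in rows]


def count_with_left_join(rows):
    """Model of LEFT JOIN c.orders o ON true, then GROUP BY and COUNT(o.shipdate)."""
    joined = []
    for c in rows:
        # A LEFT JOIN keeps a parent with no orders as one row whose o is NULL.
        for o in (c["orders"] or [None]):
            joined.append((c["id"], c["given"], c["family"], o))

    counts = {}
    for cid, given, family, o in joined:
        shipdate = o["shipdate"] if o is not None else None
        key = (cid, given, family)
        # COUNT(expr) skips NULLs, so the padded row contributes 0.
        counts[key] = counts.get(key, 0) + (1 if shipdate is not None else 0)
    return [(given, family, n) for (_, given, family), n in counts.items()]


subquery_result = count_with_subquery(customers)
left_join_result = count_with_left_join(customers)
```

In this model the LEFT JOIN variant gives the right answer only because id identifies each customer and every real order has a non-null shipdate. A null shipdate on a real order would be skipped by the count, which is why the note states both conditions.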

You can also write the query by using a subquery in the FROM clause that refers to an alias (c) of the

ancestor query and extracts array data. The following example demonstrates this approach.

SELECT c.name.given, c.name.family, s.count AS ordercount

FROM spectrum.customers c, (SELECT count(*) AS count FROM c.orders o) s

Joining Amazon Redshift and Nested Data

You can also join Amazon Redshift data with nested data in an external table. For example, suppose that

you have the following nested data in Amazon S3.

CREATE EXTERNAL TABLE spectrum.customers2 (

id int,

name struct<given:varchar(20), family:varchar(20)>,

phones array<varchar(20)>,

orders array<struct<shipdate:timestamp, item:int>>

)

Suppose also that you have the following table in Amazon Redshift.

CREATE TABLE prices (

id int,

price double precision

)

The following query finds the total number and amount of each customer's purchases based on the preceding tables. This example is only an illustration. It returns data only if you have created the tables described previously.

SELECT c.name.given, c.name.family, COUNT(o.shipdate) AS ordercount, SUM(p.price) AS ordersum
FROM spectrum.customers2 c, c.orders o, prices p
WHERE o.item = p.id
GROUP BY c.id, c.name.given, c.name.family

Nested Data Limitations

The following limitations apply to nested data:

• An array can only contain scalars or struct types. Array types can't contain array or map types.

• Redshift Spectrum supports complex data types only as external tables.

• Query and subquery result columns must be scalar.

• If an OUTER JOIN expression refers to a nested table, it can refer only to that table and its nested

arrays (and maps). If an OUTER JOIN expression doesn't refer to a nested table, it can refer to any

number of non-nested tables.

• If a FROM clause in a subquery refers to a nested table, it can't refer to any other table.

• If a subquery depends on a nested table that refers to a parent, the subquery can use the parent only in its FROM clause. It can't use the parent in any other clause, such as a SELECT or WHERE clause. For example, the following query isn't executed.

SELECT c.name.given

FROM spectrum.customers c

WHERE (SELECT COUNT(c.id) FROM c.phones p WHERE p LIKE '858%') > 1

The following query works because the parent c is used only in the FROM clause of the subquery.

SELECT c.name.given

FROM spectrum.customers c

WHERE (SELECT COUNT(*) FROM c.phones p WHERE p LIKE '858%') > 1

• A subquery that accesses nested data anywhere other than the FROM clause must return a single value.

The only exceptions are (NOT) EXISTS operators in a WHERE clause.

• (NOT) IN is not supported.

• The maximum nesting depth for all nested types is 100. This restriction applies to all file formats

(Parquet, ORC, Ion, and JSON).

• Aggregation subqueries that access nested data can only refer to arrays and maps in their FROM

clause, not to an external table.

Managing Database Security

Topics

• Amazon Redshift Security Overview

• Default Database User Privileges

• Superusers

• Users

• Groups

• Schemas

• Example for Controlling User and Group Access

You manage database security by controlling which users have access to which database objects.

Access to database objects depends on the privileges that you grant to user accounts or groups. The

following guidelines summarize how database security works:

• By default, privileges are granted only to the object owner.

• Amazon Redshift database users are named user accounts that can connect to a database. A user

account is granted privileges explicitly, by having those privileges assigned directly to the account, or

implicitly, by being a member of a group that is granted privileges.

• Groups are collections of users that can be collectively assigned privileges for easier security

maintenance.

• Schemas are collections of database tables and other database objects. Schemas are similar to

operating system directories, except that schemas cannot be nested. Users can be granted access to a

single schema or to multiple schemas.

For examples of security implementation, see Example for Controlling User and Group Access (p. 116).

Amazon Redshift Security Overview

Amazon Redshift database security is distinct from other types of Amazon Redshift security. In addition

to database security, which is described in this section, Amazon Redshift provides these features to

manage security:

• Sign-in credentials: Access to your Amazon Redshift Management Console is controlled by your AWS account privileges. For more information, see Sign-In Credentials.

• Access management: To control access to specific Amazon Redshift resources, you define AWS Identity and Access Management (IAM) accounts. For more information, see Controlling Access to Amazon Redshift Resources.

• Cluster security groups: To grant other users inbound access to an Amazon Redshift cluster, you define a cluster security group and associate it with a cluster. For more information, see Amazon Redshift Cluster Security Groups.

• VPC: To protect access to your cluster by using a virtual networking environment, you can launch your cluster in an Amazon Virtual Private Cloud (VPC). For more information, see Managing Clusters in Virtual Private Cloud (VPC).

• Cluster encryption: To encrypt the data in all your user-created tables, you can enable cluster encryption when you launch the cluster. For more information, see Amazon Redshift Clusters.

• SSL connections: To encrypt the connection between your SQL client and your cluster, you can use secure sockets layer (SSL) encryption. For more information, see Connect to Your Cluster Using SSL.

• Load data encryption: To encrypt your table load data files when you upload them to Amazon S3, you can use either server-side encryption or client-side encryption. When you load from server-side encrypted data, Amazon S3 handles decryption transparently. When you load from client-side encrypted data, the Amazon Redshift COPY command decrypts the data as it loads the table. For more information, see Uploading Encrypted Data to Amazon S3 (p. 189).

• Data in transit: To protect your data in transit within the AWS cloud, Amazon Redshift uses hardware accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup, and restore operations.

Default Database User Privileges

When you create a database object, you are its owner. By default, only a superuser or the owner of an

object can query, modify, or grant privileges on the object. For users to use an object, you must grant the

necessary privileges to the user or the group that contains the user. Database superusers have the same

privileges as database owners.

Amazon Redshift supports the following privileges: SELECT, INSERT, UPDATE, DELETE, REFERENCES,

CREATE, TEMPORARY, and USAGE. Different privileges are associated with different object types. For

information on database object privileges supported by Amazon Redshift, see the GRANT (p. 516)

command.

The right to modify or destroy an object is always the privilege of the owner only.

To revoke a privilege that was previously granted, use the REVOKE (p. 527) command. The privileges

of the object owner, such as DROP, GRANT, and REVOKE privileges, are implicit and cannot be granted

or revoked. Object owners can revoke their own ordinary privileges, for example, to make a table read-

only for themselves as well as others. Superusers retain all privileges regardless of GRANT and REVOKE

commands.

Superusers

Database superusers have the same privileges as database owners for all databases.

The masteruser, which is the user you created when you launched the cluster, is a superuser.

You must be a superuser to create a superuser.

Amazon Redshift system tables and system views are either visible only to superusers or visible to

all users. Only superusers can query system tables and system views that are designated "visible to

superusers." For information, see System Tables and Views (p. 797).

Superusers can view all PostgreSQL catalog tables. For information, see System Catalog Tables (p. 935).

A database superuser bypasses all permission checks. Be very careful when using a superuser role.

We recommend that you do most of your work as a role that is not a superuser. Superusers retain all

privileges regardless of GRANT and REVOKE commands.

To create a new database superuser, log on to the database as a superuser and issue a CREATE USER

command or an ALTER USER command with the CREATEUSER privilege.

create user adminuser createuser password '1234Admin';

alter user adminuser createuser;

Users

You can create and manage database users using the Amazon Redshift SQL commands CREATE USER

and ALTER USER, or you can configure your SQL client with custom Amazon Redshift JDBC or ODBC

drivers that manage the process of creating database users and temporary passwords as part of the

database logon process.

The drivers authenticate database users based on AWS Identity and Access Management (IAM)

authentication. If you already manage user identities outside of AWS, you can use a SAML 2.0-compliant

identity provider (IdP) to manage access to Amazon Redshift resources. You use an IAM role to configure

your IdP and AWS to permit your federated users to generate temporary database credentials and log

on to Amazon Redshift databases. For more information, see Using IAM Authentication to Generate

Database User Credentials.

Amazon Redshift user accounts can only be created and dropped by a database superuser. Users are

authenticated when they log in to Amazon Redshift. They can own databases and database objects (for

example, tables) and can grant privileges on those objects to users, groups, and schemas to control

who has access to which object. Users with CREATE DATABASE rights can create databases and grant

privileges to those databases. Superusers have database ownership privileges for all databases.

Creating, Altering, and Deleting Users

Database user accounts are global across a data warehouse cluster (and not per individual database).

• To create a user, use the CREATE USER (p. 490) command.

• To create a superuser, use the CREATE USER (p. 490) command with the CREATEUSER option.

• To remove an existing user, use the DROP USER (p. 507) command.

• To make changes to a user account, such as changing a password, use the ALTER USER (p. 377)

command.

• To view a list of users, query the PG_USER catalog table:

select * from pg_user;

  usename   | usesysid | usecreatedb | usesuper | usecatupd |  passwd  | valuntil | useconfig
------------+----------+-------------+----------+-----------+----------+----------+-----------
 rdsdb      |        1 | t           | t        | t         | ******** |          |
 masteruser |      100 | t           | t        | f         | ******** |          |
 dwuser     |      101 | f           | f        | f         | ******** |          |
 simpleuser |      102 | f           | f        | f         | ******** |          |
 poweruser  |      103 | f           | t        | f         | ******** |          |
 dbuser     |      104 | t           | f        | f         | ******** |          |
(6 rows)

Groups

Groups are collections of users who are all granted whatever privileges are associated with the group.

You can use groups to assign privileges by role. For example, you can create different groups for sales,

administration, and support and give the users in each group the appropriate access to the data they

require for their work. You can grant or revoke privileges at the group level, and those changes will apply

to all members of the group, except for superusers.

To view all user groups, query the PG_GROUP system catalog table:

select * from pg_group;

Creating, Altering, and Deleting Groups

Only a superuser can create, alter, or drop groups.

You can perform the following actions:

• To create a group, use the CREATE GROUP (p. 467) command.

• To add users to or remove users from an existing group, use the ALTER GROUP (p. 363) command.

• To delete a group, use the DROP GROUP (p. 502) command. This command only drops the group, not

its member users.

Schemas

A database contains one or more named schemas. Each schema in a database contains tables and other

kinds of named objects. By default, a database has a single schema, which is named PUBLIC. You can use

schemas to group database objects under a common name. Schemas are similar to operating system

directories, except that schemas cannot be nested.

Identical database object names can be used in different schemas in the same database without conflict.

For example, both MY_SCHEMA and YOUR_SCHEMA can contain a table named MYTABLE. Users with the

necessary privileges can access objects across multiple schemas in a database.

By default, an object is created within the first schema in the search path of the database. For

information, see Search Path (p. 116) later in this section.

Schemas can help with organization and concurrency issues in a multi-user environment in the following

ways:

• To allow many developers to work in the same database without interfering with each other.

• To organize database objects into logical groups to make them more manageable.

• To give applications the ability to put their objects into separate schemas so that their names will not

collide with the names of objects used by other applications.

Creating, Altering, and Deleting Schemas

Any user can create schemas and alter or drop schemas they own.

You can perform the following actions:

• To create a schema, use the CREATE SCHEMA (p. 470) command.

• To change the owner of a schema, use the ALTER SCHEMA (p. 364) command.

• To delete a schema and its objects, use the DROP SCHEMA (p. 503) command.

• To create a table within a schema, create the table with the format schema_name.table_name.

To view a list of all schemas, query the PG_NAMESPACE system catalog table:

select * from pg_namespace;

To view a list of tables that belong to a schema, query the PG_TABLE_DEF system catalog table. For

example, the following query returns a list of tables in the PG_CATALOG schema.

select distinct(tablename) from pg_table_def

where schemaname = 'pg_catalog';

Search Path

The search path is defined in the search_path parameter with a comma-separated list of schema names. The search path specifies the order in which schemas are searched when an object, such as a table or function, is referenced by a simple name that does not include a schema qualifier.

If an object is created without specifying a target schema, the object is added to the first schema that is listed in the search path. When objects with identical names exist in different schemas, an object name that does not specify a schema refers to the first schema in the search path that contains an object with that name.
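The two rules above can be sketched in a few lines of Python. This is a simplified model, not how Amazon Redshift implements name resolution: the catalog dictionary, the schema names, and both function names are hypothetical, and the sketch ignores privileges and any schemas that the system searches implicitly.

```python
def resolve(name, search_path, catalog):
    """Return the schema-qualified name that an unqualified reference resolves to.

    catalog maps a schema name to the set of object names it contains.
    The first schema in search_path that contains the name wins.
    """
    for schema in search_path:
        if name in catalog.get(schema, set()):
            return f"{schema}.{name}"
    return None  # no schema in the search path contains the object


def creation_schema(search_path):
    """Return the schema that receives an object created without a schema qualifier."""
    return search_path[0]


# Hypothetical catalog: both schemas contain a table named "orders".
catalog = {
    "webapp": {"orders"},
    "public": {"orders", "users"},
}
search_path = ["webapp", "public"]

orders_target = resolve("orders", search_path, catalog)  # found first in webapp
users_target = resolve("users", search_path, catalog)    # only exists in public
new_object_schema = creation_schema(search_path)
```

Reversing the list to public, webapp would make the same unqualified reference to orders resolve to public.orders, which is why the order of the schema names matters.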

To change the default schema for the current session, use the SET (p. 560) command.

For more information, see the search_path (p. 951) description in the Conﬁguration Reference.

Schema-Based Privileges

Schema-based privileges are determined by the owner of the schema:

• By default, all users have CREATE and USAGE privileges on the PUBLIC schema of a database. To

disallow users from creating objects in the PUBLIC schema of a database, use the REVOKE (p. 527)

command to remove that privilege.

• Unless they are granted the USAGE privilege by the object owner, users cannot access any objects in

schemas they do not own.

• If users have been granted the CREATE privilege to a schema that was created by another user, those

users can create objects in that schema.

Example for Controlling User and Group Access

This example creates user groups and user accounts and then grants them various privileges for an

Amazon Redshift database that connects to a web application client. This example assumes three groups

of users: regular users of a web application, power users of a web application, and web developers.

1. Create the groups where the user accounts will be assigned. The following set of commands creates

three different user groups:

create group webappusers;

create group webpowerusers;

create group webdevusers;

2. Create several database user accounts with different privileges and add them to the groups.

a. Create two users and add them to the WEBAPPUSERS group:

create user webappuser1 password 'webAppuser1pass'

in group webappusers;

create user webappuser2 password 'webAppuser2pass'

in group webappusers;

b. Create an account for a web developer and add it to the WEBDEVUSERS group:

create user webdevuser1 password 'webDevuser2pass'

in group webdevusers;

c. Create a superuser account. This user will have administrative rights to create other users:

create user webappadmin password 'webAppadminpass1'

createuser;

3. Create a schema to be associated with the database tables used by the web application, and grant the

various user groups access to this schema:

a. Create the WEBAPP schema:

create schema webapp;

b. Grant USAGE privileges to the WEBAPPUSERS group:

grant usage on schema webapp to group webappusers;

c. Grant USAGE privileges to the WEBPOWERUSERS group:

grant usage on schema webapp to group webpowerusers;

d. Grant ALL privileges to the WEBDEVUSERS group:

grant all on schema webapp to group webdevusers;

The basic users and groups are now set up. You can now make changes to alter the users and groups.

4. For example, the following command alters the search_path parameter for the WEBAPPUSER1.

alter user webappuser1 set search_path to webapp, public;

The SEARCH_PATH specifies the schema search order for database objects, such as tables and functions, when the object is referenced by a simple name with no schema specified.

5. You can also add users to a group after creating the group, such as adding WEBAPPUSER2 to the

WEBPOWERUSERS group:

alter group webpowerusers add user webappuser2;

Designing Tables

Topics

• Choosing a Column Compression Type

• Choosing a Data Distribution Style

• Choosing Sort Keys

• Defining Constraints

• Analyzing Table Design

A data warehouse system has very different design goals from a typical transaction-oriented relational database system. An online transaction processing (OLTP) application is focused primarily on single row transactions, inserts, and updates. Amazon Redshift is optimized for very fast execution of complex analytic queries against very large data sets. Because of the massive amount of data involved in data warehousing, you must specifically design your database to take full advantage of every available performance optimization.

This section explains how to choose and implement compression encodings, data distribution keys, sort

keys, and table constraints, and it presents best practices for making these design decisions.

Choosing a Column Compression Type

Topics

• Compression Encodings

• Testing Compression Encodings

• Example: Choosing Compression Encodings for the CUSTOMER Table

Compression is a column-level operation that reduces the size of data when it is stored. Compression

conserves storage space and reduces the size of data that is read from storage, which reduces the

amount of disk I/O and therefore improves query performance.

You can apply a compression type, or encoding, to the columns in a table manually when you create the

table, or you can use the COPY command to analyze and apply compression automatically. For details

about applying automatic compression, see Loading Tables with Automatic Compression (p. 209).

Note

We strongly recommend using the COPY command to apply automatic compression.

You might choose to apply compression encodings manually if the new table shares the same data

characteristics as another table, or if in testing you discover that the compression encodings that

are applied during automatic compression are not the best fit for your data. If you choose to apply

compression encodings manually, you can run the ANALYZE COMPRESSION (p. 382) command against

an already populated table and use the results to choose compression encodings.

To apply compression manually, you specify compression encodings for individual columns as part of the

CREATE TABLE statement. The syntax is as follows:

CREATE TABLE table_name (column_name data_type ENCODE encoding-type)[, ...]

Where encoding-type is taken from the keyword table in the following section.

For example, the following statement creates a two-column table, PRODUCT. When data is loaded into

the table, the PRODUCT_ID column is not compressed, but the PRODUCT_NAME column is compressed,

using the byte dictionary encoding (BYTEDICT).

create table product(

product_id int encode raw,

product_name char(20) encode bytedict);

You cannot change the compression encoding for a column after the table is created. You can specify the

encoding for a column when it is added to a table using the ALTER TABLE command.

ALTER TABLE table-name ADD [ COLUMN ] column_name column_type ENCODE encoding-type

Compression Encodings

Topics

• Raw Encoding

• Byte-Dictionary Encoding

• Delta Encoding

• LZO Encoding

• Mostly Encoding

• Runlength Encoding

• Text255 and Text32k Encodings

• Zstandard Encoding

A compression encoding specifies the type of compression that is applied to a column of data values as rows are added to a table.

If no compression is specified in a CREATE TABLE or ALTER TABLE statement, Amazon Redshift automatically assigns compression encoding as follows:

• Columns that are defined as sort keys are assigned RAW compression.

• Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.

• All other columns are assigned LZO compression.

The following table identifies the supported compression encodings and the data types that support the encoding.

Encoding type        | Keyword in CREATE TABLE and ALTER TABLE | Data types
---------------------+-----------------------------------------+-------------------------------------------------
Raw (no compression) | RAW                                     | All
Byte dictionary      | BYTEDICT                                | All except BOOLEAN
Delta                | DELTA                                   | SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL
                     | DELTA32K                                | INT, BIGINT, DATE, TIMESTAMP, DECIMAL
LZO                  | LZO                                     | All except BOOLEAN, REAL, and DOUBLE PRECISION
Mostly n             | MOSTLY8                                 | SMALLINT, INT, BIGINT, DECIMAL
                     | MOSTLY16                                | INT, BIGINT, DECIMAL
                     | MOSTLY32                                | BIGINT, DECIMAL
Run-length           | RUNLENGTH                               | All
Text                 | TEXT255                                 | VARCHAR only
                     | TEXT32K                                 | VARCHAR only
Zstandard            | ZSTD                                    | All

Raw Encoding

Raw encoding is the default encoding for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types. With raw encoding, data is stored in raw, uncompressed form.

Byte-Dictionary Encoding

In byte dictionary encoding, a separate dictionary of unique values is created for each block of column

values on disk. (An Amazon Redshift disk block occupies 1 MB.) The dictionary contains up to 256 one-

byte values that are stored as indexes to the original data values. If more than 256 values are stored in a

single block, the extra values are written into the block in raw, uncompressed form. The process repeats

for each disk block.

This encoding is very effective when a column contains a limited number of unique values. This encoding is optimal when the data domain of a column is fewer than 256 unique values. Byte-dictionary encoding is especially space-efficient if a CHAR column holds long character strings.

Note

Byte-dictionary encoding is not always effective when used with VARCHAR columns. Using BYTEDICT with large VARCHAR columns might cause excessive disk usage. We strongly recommend using a different encoding, such as LZO, for VARCHAR columns.

Suppose a table has a COUNTRY column with a CHAR(30) data type. As data is loaded, Amazon Redshift

creates the dictionary and populates the COUNTRY column with the index value. The dictionary

contains the indexed unique values, and the table itself contains only the one-byte subscripts of the

corresponding values.

Note

Trailing blanks are stored for fixed-length character columns. Therefore, in a CHAR(30) column, every compressed value saves 29 bytes of storage when you use the byte-dictionary encoding.

The following table represents the dictionary for the COUNTRY column:

Unique data value        | Dictionary index | Size (fixed length, 30 bytes per value)
-------------------------+------------------+-----------------------------------------
England                  | 0                | 30
United States of America | 1                | 30
Venezuela                | 2                | 30
Sri Lanka                | 3                | 30
Argentina                | 4                | 30
Japan                    | 5                | 30
Total                    |                  | 180

The following table represents the values in the COUNTRY column:

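Original data value      | Original size (fixed length, 30 bytes per value) | Compressed value (index) | New size (bytes)
-------------------------+--------------------------------------------------+--------------------------+------------------
England                  | 30                                               | 0                        | 1
United States of America | 30                                               | 1                        | 1
United States of America | 30                                               | 1                        | 1
Venezuela                | 30                                               | 2                        | 1
Sri Lanka                | 30                                               | 3                        | 1
Argentina                | 30                                               | 4                        | 1
Japan                    | 30                                               | 5                        | 1
Sri Lanka                | 30                                               | 3                        | 1
Argentina                | 30                                               | 4                        | 1
Totals                   | 300                                              |                          | 10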

The total compressed size in this example is calculated as follows: 6 different entries are stored in the dictionary (6 * 30 = 180 bytes), and the table contains ten 1-byte compressed values, for a total of 190 bytes.
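The arithmetic above can be reproduced with a small Python model. This is a sketch of the sizing rule as the text describes it, not the actual on-disk format: the function name is hypothetical, the ten-row column is an illustrative list with six distinct countries, and values that overflow the 256-entry dictionary are assumed to cost exactly the raw column width.

```python
def bytedict_block_size(values, width, max_entries=256):
    """Estimate the bytes needed to store one block of fixed-width values.

    Each distinct value (up to max_entries) is stored once in the dictionary
    at full width; each row then stores a 1-byte index. Values that arrive
    after the dictionary is full are assumed to be stored raw at full width.
    """
    dictionary = {}
    row_bytes = 0
    for value in values:
        if value not in dictionary and len(dictionary) < max_entries:
            dictionary[value] = len(dictionary)  # assign the next index
        row_bytes += 1 if value in dictionary else width
    return len(dictionary) * width + row_bytes


# Illustrative ten-row COUNTRY column with six distinct values, CHAR(30).
country_column = [
    "England", "United States of America", "United States of America",
    "Venezuela", "Sri Lanka", "Argentina", "Japan", "Sri Lanka",
    "Argentina", "England",
]

compressed_bytes = bytedict_block_size(country_column, width=30)
raw_bytes = len(country_column) * 30
```

With six distinct values the model gives 6 * 30 + 10 = 190 bytes against 300 bytes raw. The saving grows with the number of rows per block, because the dictionary cost is paid once per block while each additional row costs one byte.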

Delta Encoding

Delta encodings are very useful for datetime columns.

Delta encoding compresses data by recording the difference between values that follow each other in the column. This difference is recorded in a separate dictionary for each block of column values on disk. (An Amazon Redshift disk block occupies 1 MB.) For example, if the column contains 10 integers in sequence from 1 to 10, the first is stored as a 4-byte integer (plus a 1-byte flag), and the next 9 are each stored as a byte with the value 1, indicating that it is one greater than the previous value.

Delta encoding comes in two variations:

• DELTA records the differences as 1-byte values (8-bit integers)

• DELTA32K records differences as 2-byte values (16-bit integers)

If most of the values in the column could be compressed by using a single byte, the 1-byte variation is very effective; however, if the deltas are larger, this encoding, in the worst case, is somewhat less effective than storing the uncompressed data. Similar logic applies to the 16-bit version.

If the difference between two values exceeds the 1-byte range (DELTA) or 2-byte range (DELTA32K), the full original value is stored, with a leading 1-byte flag. The 1-byte range is from -127 to 127, and the 2-byte range is from -32K to 32K.

The following table shows how a delta encoding works for a numeric column:

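Original data value | Original size (bytes) | Difference (delta) | Compressed value | Compressed size (bytes)
--------------------+-----------------------+--------------------+------------------+---------------------------
1                   | 4                     |                    | 1                | 1+4 (flag + actual value)
5                   | 4                     | 4                  | 4                | 1
50                  | 4                     | 45                 | 45               | 1
200                 | 4                     | 150                | 150              | 1+4 (flag + actual value)
185                 | 4                     | -15                | -15              | 1
220                 | 4                     | 35                 | 35               | 1
221                 | 4                     | 1                  | 1                | 1
Totals              | 28                    |                    |                  | 15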
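The per-value sizes in the table can be derived with a short Python model of the rule described above. It is a sketch under stated assumptions rather than the storage engine's code: the function name and parameters are hypothetical, the default limit of 127 is the 1-byte range given in the text, and for DELTA32K the caller would pass a 2-byte delta size and the corresponding limit.

```python
def delta_sizes(values, raw_bytes=4, delta_bytes=1, limit=127):
    """Return the stored size in bytes of each value under delta encoding.

    The first value, and any value whose difference from its predecessor
    falls outside -limit..limit, is stored in full with a 1-byte flag.
    Every other value is stored as a delta of delta_bytes bytes.
    """
    sizes = []
    previous = None
    for value in values:
        if previous is not None and abs(value - previous) <= limit:
            sizes.append(delta_bytes)
        else:
            sizes.append(1 + raw_bytes)  # flag + actual value
        previous = value
    return sizes


table_values = [1, 5, 50, 200, 185, 220, 221]
table_sizes = delta_sizes(table_values)

sequence_sizes = delta_sizes(list(range(1, 11)))  # the integers 1 through 10
```

For the table values the model stores two values in full (the first value, and 200, whose delta of 150 exceeds the 1-byte range) and five values as single-byte deltas, for 15 bytes against 28 bytes raw. A column whose deltas mostly exceed the range would cost 5 bytes per value, which is the worst case the text warns about.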

LZO Encoding

LZO encoding provides a very high compression ratio with good performance. LZO encoding works

especially well for CHAR and VARCHAR columns that store very long character strings, especially free

form text, such as product descriptions, user comments, or JSON strings. LZO is the default encoding

except for columns that are designated as sort keys and columns that are defined as BOOLEAN, REAL, or

DOUBLE PRECISION data types.

Mostly Encoding

Mostly encodings are useful when the data type for a column is larger than most of the stored values

require. By specifying a mostly encoding for this type of column, you can compress the majority of the

values in the column to a smaller standard storage size. The remaining values that cannot be compressed

are stored in their raw form. For example, you can compress a 16-bit column, such as an INT2 column, to

8-bit storage.

In general, the mostly encodings work with the following data types:

• SMALLINT/INT2 (16-bit)

• INTEGER/INT (32-bit)

• BIGINT/INT8 (64-bit)

• DECIMAL/NUMERIC (64-bit)

Choose the appropriate variation of the mostly encoding to suit the size of the data type for the

column. For example, apply MOSTLY8 to a column that is defined as a 16-bit integer column. Applying

MOSTLY16 to a column with a 16-bit data type or MOSTLY32 to a column with a 32-bit data type is

disallowed.

Mostly encodings might be less effective than no compression when a relatively high number of the values in the column cannot be compressed. Before applying one of these encodings to a column, check that most of the values that you are going to load now (and are likely to load in the future) fit into the ranges shown in the following table.

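Encoding | Compressed storage size | Range of values that can be compressed (values outside the range are stored raw)
---------+-------------------------+-----------------------------------------------------------------------------------
MOSTLY8  | 1 byte (8 bits)         | -128 to 127
MOSTLY16 | 2 bytes (16 bits)       | -32768 to 32767
MOSTLY32 | 4 bytes (32 bits)       | -2147483648 to +2147483647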

Note

For decimal values, ignore the decimal point to determine whether the value fits into the range.

For example, 1,234.56 is treated as 123,456 and can be compressed in a MOSTLY32 column.

For example, the VENUEID column in the VENUE table is defined as a raw integer column, which means

that its values consume 4 bytes of storage. However, the current range of values in the column is 0 to

309. Therefore, re-creating and reloading this table with MOSTLY16 encoding for VENUEID would reduce

the storage of every value in that column to 2 bytes.

If the VENUEID values referenced in another table were mostly in the range of 0 to 127, it might make

sense to encode that foreign-key column as MOSTLY8. Before making the choice, you would have to run

some queries against the referencing table data to find out whether the values mostly fall into the 8-bit,

16-bit, or 32-bit range.

The following table shows compressed sizes for specific numeric values when the MOSTLY8, MOSTLY16,

and MOSTLY32 encodings are used:

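Original value | Original INT or BIGINT size (bytes) | MOSTLY8 compressed size (bytes) | MOSTLY16 compressed size (bytes) | MOSTLY32 compressed size (bytes)
---------------+-------------------------------------+---------------------------------+----------------------------------+----------------------------------
1              | 4                                   | 1                               | 2                                | 4
10             | 4                                   | 1                               | 2                                | 4
100            | 4                                   | 1                               | 2                                | 4
1000           | 4                                   | Same as raw data size           | 2                                | 4
10000          | 4                                   | Same as raw data size           | 2                                | 4
20000          | 4                                   | Same as raw data size           | 2                                | 4
40000          | 8                                   | Same as raw data size           | Same as raw data size            | 4
100000         | 8                                   | Same as raw data size           | Same as raw data size            | 4
2000000000     | 8                                   | Same as raw data size           | Same as raw data size            | 4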


Runlength Encoding

Runlength encoding replaces a value that is repeated consecutively with a token that consists of

the value and a count of the number of consecutive occurrences (the length of the run). A separate

dictionary of unique values is created for each block of column values on disk. (An Amazon Redshift disk

block occupies 1 MB.) This encoding is best suited to a table in which data values are often repeated

consecutively, for example, when the table is sorted by those values.

For example, if a column in a large dimension table has a predictably small domain, such as a COLOR

column with fewer than 10 possible values, these values are likely to fall in long sequences throughout

the table, even if the data is not sorted.

We do not recommend applying runlength encoding on any column that is designated as a sort key.

Range-restricted scans perform better when blocks contain similar numbers of rows. If sort key columns

are compressed much more highly than other columns in the same query, range-restricted scans might

perform poorly.

The following table uses the COLOR column example to show how the runlength encoding works:

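Original data value | Original size (bytes) | Compressed value (token) | Compressed size (bytes)
--------------------+-----------------------+--------------------------+------------------------
Blue                | 4                     | {2,Blue}                 | 5
Blue                | 4                     |                          | 0
Green               | 5                     | {3,Green}                | 6
Green               | 5                     |                          | 0
Green               | 5                     |                          | 0
Blue                | 4                     | {1,Blue}                 | 5
Yellow              | 6                     | {4,Yellow}               | 7
Yellow              | 6                     |                          | 0
Yellow              | 6                     |                          | 0
Yellow              | 6                     |                          | 0
Totals              | 51                    |                          | 23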
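The tokens and totals in this table can be reproduced with a short Python sketch. The cost model is an assumption inferred from the table (each token costs the length of the value plus one byte for the run count), and the function names are hypothetical. It does not describe the on-disk format.

```python
from itertools import groupby


def runlength_encode(values):
    """Collapse consecutive repeats into (count, value) tokens."""
    return [(len(list(run)), value) for value, run in groupby(values)]


def token_size(token):
    """Assumed cost model: the value's bytes plus one byte for the run count."""
    _count, value = token
    return len(value.encode("utf-8")) + 1


# The COLOR column values from the table, in order.
colors = ["Blue"] * 2 + ["Green"] * 3 + ["Blue"] + ["Yellow"] * 4

tokens = runlength_encode(colors)
raw_bytes = sum(len(c.encode("utf-8")) for c in colors)
compressed_bytes = sum(token_size(t) for t in tokens)

print(tokens)                       # [(2, 'Blue'), (3, 'Green'), (1, 'Blue'), (4, 'Yellow')]
print(raw_bytes, compressed_bytes)  # 51 23
```

Note that the single Blue value between the Green and Yellow runs costs more encoded (5 bytes) than raw (4 bytes), which is why the encoding suits long runs.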

Text255 and Text32k Encodings

Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words

recur often. A separate dictionary of unique words is created for each block of column values on disk.

(An Amazon Redshift disk block occupies 1 MB.) The dictionary contains the ﬁrst 245 unique words in the

column. Those words are replaced on disk by a one-byte index value representing one of the 245 values,

and any words that are not represented in the dictionary are stored uncompressed. The process repeats

for each 1 MB disk block. If the indexed words occur frequently in the column, the column will yield a

high compression ratio.

For the text32k encoding, the principle is the same, but the dictionary for each block does not capture a

speciﬁc number of words. Instead, the dictionary indexes each unique word it ﬁnds until the combined

entries reach a length of 32K, minus some overhead. The index values are stored in two bytes.


For example, consider the VENUENAME column in the VENUE table. Words such as Arena, Center, and

Theatre recur in this column and are likely to be among the ﬁrst 245 words encountered in each block

if text255 compression is applied. If so, this column will beneﬁt from compression because every time

those words appear, they will occupy only 1 byte of storage (instead of 5, 6, or 7 bytes, respectively).
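The following Python sketch models the text255 idea for a single block. It is a simplification under stated assumptions: words are split on whitespace, separators and the dictionary's own storage are not counted, and a dictionary hit costs exactly one byte. The function name `text255_sizes` and the sample venue names are hypothetical.

```python
def text255_sizes(values, dictionary_limit=245):
    """Return (raw_bytes, encoded_bytes) for one block of string values.

    Assumptions for this sketch: words are split on whitespace, separators
    are not counted, and a dictionary hit costs exactly one byte.
    """
    dictionary = set()
    raw_bytes = 0
    encoded_bytes = 0
    for value in values:
        for word in value.split():
            word_bytes = len(word.encode("utf-8"))
            raw_bytes += word_bytes
            # The dictionary holds only the first unique words encountered.
            if word not in dictionary and len(dictionary) < dictionary_limit:
                dictionary.add(word)
            if word in dictionary:
                encoded_bytes += 1           # replaced by a one-byte index
            else:
                encoded_bytes += word_bytes  # stored uncompressed
    return raw_bytes, encoded_bytes


# Illustrative venue names, not taken from the VENUE table.
venue_names = ["Oracle Arena", "Staples Center", "Fox Theatre", "Toyota Center"]
print(text255_sizes(venue_names))                      # (46, 8)
print(text255_sizes(venue_names, dictionary_limit=2))  # (46, 37)
```

The second call shrinks the dictionary to two entries to show what happens once it is full: later words are stored at their full length.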

Zstandard Encoding

Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse

data sets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long

and short strings, such as product descriptions, user comments, logs, and JSON strings. Where some

algorithms, such as Delta (p. 121) encoding or Mostly (p. 122) encoding, can potentially use more

storage space than no compression, ZSTD is very unlikely to increase disk usage. ZSTD supports all

Amazon Redshift data types.

Testing Compression Encodings

If you decide to manually specify column encodings, you might want to test diﬀerent encodings with

your data.

Note

We recommend that you use the COPY command to load data whenever possible, and allow the

COPY command to choose the optimal encodings based on your data. Alternatively, you can use

the ANALYZE COMPRESSION (p. 382) command to view the suggested encodings for existing

data. For details about applying automatic compression, see Loading Tables with Automatic

Compression (p. 209).

To perform a meaningful test of data compression, you need a large number of rows. For this example,

we will create a table and insert rows by using a statement that selects from two tables: VENUE and

LISTING. We will leave out the WHERE clause that would normally join the two tables; the result is that

each row in the VENUE table is joined to all of the rows in the LISTING table, for a total of over 38 million

rows. This is known as a Cartesian join and normally is not recommended, but for this purpose, it is a

convenient method of creating a lot of rows. If you have an existing table with data that you want to

test, you can skip this step.
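To see why omitting the join condition multiplies the row counts, consider this small Python sketch. The toy lists are hypothetical. The counts of 202 and 192497 are the row estimates that the query plans later in this section report for VENUE and LISTING, used here on the assumption that they match the actual table sizes.

```python
from itertools import product

# Toy stand-ins for VENUE and LISTING rows (hypothetical values).
venue = ["v1", "v2", "v3"]
listing = ["l1", "l2"]

# A Cartesian join pairs every row of one table with every row of the other.
pairs = list(product(venue, listing))
print(len(pairs))  # 3 * 2 = 6

# Row counts taken from the query plan estimates shown later in this section.
VENUE_ROWS = 202
LISTING_ROWS = 192497
print(VENUE_ROWS * LISTING_ROWS)  # 38884394
```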

After we have a table with sample data, we create a table with seven columns, each with a diﬀerent

compression encoding: raw, bytedict, lzo, runlength, text255, text32k, and zstd. We populate each

column with exactly the same data by executing an INSERT command that selects the data from the ﬁrst

table.

To test compression encodings:

1. (Optional) First, we'll use a Cartesian join to create a table with a large number of rows. Skip this step

if you want to test an existing table.

create table cartesian_venue(
    venueid smallint not null distkey sortkey,
    venuename varchar(100),
    venuecity varchar(30),
    venuestate char(2),
    venueseats integer);

insert into cartesian_venue
select venueid, venuename, venuecity, venuestate, venueseats
from venue, listing;

2. Next, create a table with the encodings that you want to compare.

create table encodingvenue (
    venueraw varchar(100) encode raw,
    venuebytedict varchar(100) encode bytedict,
    venuelzo varchar(100) encode lzo,
    venuerunlength varchar(100) encode runlength,
    venuetext255 varchar(100) encode text255,
    venuetext32k varchar(100) encode text32k,
    venuezstd varchar(100) encode zstd);

3. Insert the same data into all of the columns using an INSERT statement with a SELECT clause.

insert into encodingvenue
select venuename as venueraw, venuename as venuebytedict, venuename as venuelzo,
    venuename as venuerunlength, venuename as venuetext255, venuename as venuetext32k,
    venuename as venuezstd
from cartesian_venue;

4. Verify the number of rows in the new table.

select count(*) from encodingvenue;

  count
----------
 38884394
(1 row)

5. Query the STV_BLOCKLIST (p. 869) system table to compare the number of 1 MB disk blocks used

by each column.

The MAX aggregate function returns the highest block number for each column. The STV_BLOCKLIST

table includes details for three system-generated columns. This example uses col < 7 in the WHERE

clause to exclude the system-generated columns.

select col, max(blocknum)
from stv_blocklist b, stv_tbl_perm p
where (b.tbl=p.id) and name ='encodingvenue'
and col < 7
group by name, col
order by col;

The query returns the following results. The columns are numbered beginning with zero. Depending

on how your cluster is conﬁgured, your result might have diﬀerent numbers, but the relative sizes

should be similar. You can see that BYTEDICT encoding on the second column produced the best

results for this data set, with a compression ratio of better than 20:1. LZO and ZSTD encoding also

produced excellent results. Diﬀerent data sets will produce diﬀerent results, of course. When a column

contains longer text strings, LZO often produces the best compression results.

 col | max
-----+-----
   0 | 203
   1 |  10
   2 |  22
   3 | 204
   4 |  56
   5 |  72
   6 |  20
(7 rows)
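The compression ratios discussed in this step can be derived from the query output with a few lines of Python. This sketch uses the highest block number per column as a proxy for blocks used, as the text does. The column-to-encoding mapping assumes the column order of the CREATE TABLE statement for ENCODINGVENUE.

```python
# Highest block number per column, copied from the query output above.
max_block = {0: 203, 1: 10, 2: 22, 3: 204, 4: 56, 5: 72, 6: 20}

# Column order follows the CREATE TABLE statement for ENCODINGVENUE.
encodings = ["raw", "bytedict", "lzo", "runlength", "text255", "text32k", "zstd"]

raw_blocks = max_block[0]
ratios = {encodings[col]: raw_blocks / blocks for col, blocks in max_block.items()}

# List encodings from best to worst ratio relative to the raw column.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<10} {ratio:5.1f}:1")
```

For this data set the sketch ranks BYTEDICT first at roughly 20:1, followed by ZSTD and LZO, while runlength comes out slightly below 1:1, that is, marginally larger than raw.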

If you have data in an existing table, you can use the ANALYZE COMPRESSION (p. 382) command

to view the suggested encodings for the table. The following example shows the

recommended encoding for a copy of the VENUE table, CARTESIAN_VENUE, that contains 38 million

rows. Notice that ANALYZE COMPRESSION recommends LZO encoding for the VENUENAME column.


ANALYZE COMPRESSION chooses optimal compression based on multiple factors, which include percent

of reduction. In this speciﬁc case, BYTEDICT provides better compression, but LZO also produces greater

than 90 percent compression.

analyze compression cartesian_venue;

Table           | Column     | Encoding | Est_reduction_pct
----------------+------------+----------+------------------
cartesian_venue | venueid    | lzo      | 97.54
cartesian_venue | venuename  | lzo      | 91.71
cartesian_venue | venuecity  | lzo      | 96.01
cartesian_venue | venuestate | lzo      | 97.68
cartesian_venue | venueseats | lzo      | 98.21

Example: Choosing Compression Encodings for the CUSTOMER Table

The following statement creates a CUSTOMER table that has columns with various data types. This

CREATE TABLE statement shows one of many possible combinations of compression encodings for these

columns.

create table customer(
    custkey int encode delta,
    custname varchar(30) encode raw,
    gender varchar(7) encode text255,
    address varchar(200) encode text255,
    city varchar(30) encode text255,
    state char(2) encode raw,
    zipcode char(5) encode bytedict,
    start_date date encode delta32k);

The following table shows the column encodings that were chosen for the CUSTOMER table and gives an

explanation for the choices:

Column | Data Type | Encoding | Explanation
CUSTKEY | int | delta | CUSTKEY consists of unique, consecutive integer values. Since the differences will be one byte, DELTA is a good choice.
CUSTNAME | varchar(30) | raw | CUSTNAME has a large domain with few repeated values. Any compression encoding would probably be ineffective.
GENDER | varchar(7) | text255 | GENDER is a very small domain with many repeated values. Text255 works well with VARCHAR columns in which the same words recur.
ADDRESS | varchar(200) | text255 | ADDRESS is a large domain, but contains many repeated words, such as Street, Avenue, North, South, and so on. Text255 and text32k are useful for compressing VARCHAR columns in which the same words recur. The column length is short, so text255 is a good choice.
CITY | varchar(30) | text255 | CITY is a large domain, with some repeated values. Certain city names are used much more commonly than others. Text255 is a good choice for the same reasons as ADDRESS.
STATE | char(2) | raw | In the United States, STATE is a precise domain of 50 two-character values. Bytedict encoding would yield some compression, but because the column size is only two characters, compression might not be worth the overhead of uncompressing the data.
ZIPCODE | char(5) | bytedict | ZIPCODE is a known domain of fewer than 50,000 unique values. Certain zip codes occur much more commonly than others. Bytedict encoding is very effective when a column contains a limited number of unique values.
START_DATE | date | delta32k | Delta encodings are very useful for datetime columns, especially if the rows are loaded in date order.

Choosing a Data Distribution Style

Topics

• Data Distribution Concepts
• Distribution Styles
• Viewing Distribution Styles
• Evaluating Query Patterns
• Designating Distribution Styles
• Evaluating the Query Plan
• Query Plan Example
• Distribution Examples

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the

compute nodes according to the table's distribution style. When you run a query, the query optimizer

redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal

in selecting a table distribution style is to minimize the impact of the redistribution step by locating the

data where it needs to be before the query is executed.

This section will introduce you to the principles of data distribution in an Amazon Redshift database and

give you a methodology to choose the best distribution style for each of your tables.

Data Distribution Concepts

Nodes and slices

An Amazon Redshift cluster is a set of nodes. Each node in the cluster has its own operating system,

dedicated memory, and dedicated disk storage. One node is the leader node, which manages the

distribution of data and query processing tasks to the compute nodes.

The disk storage for a compute node is divided into a number of slices. The number of slices per node

depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and

each DS1.8XL compute node has 16 slices. The nodes all participate in parallel query execution, working

on data that is distributed as evenly as possible across the slices. For more information about the

number of slices that each node size has, go to About Clusters and Nodes in the Amazon Redshift Cluster

Management Guide.

Data redistribution

When you load data into a table, Amazon Redshift distributes the rows of the table to each of the node

slices according to the table's distribution style. As part of a query plan, the optimizer determines where

blocks of data need to be located to best execute the query. The data is then physically moved, or

redistributed, during execution. Redistribution might involve either sending speciﬁc rows to nodes for

joining or broadcasting an entire table to all of the nodes.

Data redistribution can account for a substantial portion of the cost of a query plan, and the network

traﬃc it generates can aﬀect other database operations and slow overall system performance. To the


extent that you anticipate where best to locate data initially, you can minimize the impact of data

redistribution.

Data distribution goals

When you load data into a table, Amazon Redshift distributes the table's rows to the compute nodes and

slices according to the distribution style that you chose when you created the table. Data distribution has

two primary goals:

• To distribute the workload uniformly among the nodes in the cluster. Uneven distribution, or data

distribution skew, forces some nodes to do more work than others, which impairs query performance.

• To minimize data movement during query execution. If the rows that participate in joins or aggregates

are already collocated on the nodes with their joining rows in other tables, the optimizer does not

need to redistribute as much data during query execution.
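One simple way to put a number on the first goal is to compare the busiest slice with the average slice. The following Python sketch uses that measure. The function name `skew_ratio`, the metric itself, and the row counts are assumptions chosen for illustration, not a definition that Amazon Redshift uses.

```python
def skew_ratio(rows_per_slice):
    """Largest slice divided by the average slice; 1.0 means perfectly even."""
    average = sum(rows_per_slice) / len(rows_per_slice)
    return max(rows_per_slice) / average


# Hypothetical row counts on four slices.
balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

print(skew_ratio(balanced))  # 1.0
print(skew_ratio(skewed))    # 2.8
```

In the skewed case the busiest slice holds 2.8 times the average, so a query that scans the table can finish no sooner than that slice does.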

The distribution strategy that you choose for your database has important consequences for query

performance, storage requirements, data loading, and maintenance. By choosing the best distribution

style for each table, you can balance your data distribution and signiﬁcantly improve overall system

performance.

Distribution Styles

When you create a table, you can designate one of three distribution styles: EVEN, KEY, or ALL.

If you don't specify a distribution style, Amazon Redshift uses automatic distribution.

Automatic distribution

If you don't specify a distribution style with the CREATE TABLE statement, Amazon Redshift applies

automatic distribution.

With automatic distribution, Amazon Redshift assigns an optimal distribution style based on the size

of the table data. For example, Amazon Redshift initially assigns ALL distribution to a small table,

then changes to EVEN distribution when the table grows larger. When a table is changed from ALL to

EVEN distribution, storage utilization might change slightly. The change in distribution occurs in the

background, in a few seconds. Amazon Redshift never changes the distribution style from EVEN to ALL.

To view the distribution style applied to a table, query the PG_CLASS_INFO system catalog view. For

more information, see Viewing Distribution Styles (p. 131).

Even distribution

The leader node distributes the rows across the slices in a round-robin fashion, regardless of the values

in any particular column. EVEN distribution is appropriate when a table does not participate in joins or

when there is not a clear choice between KEY distribution and ALL distribution.

Key distribution

The rows are distributed according to the values in one column. The leader node places matching values

on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates

the rows on the slices according to the values in the joining columns so that matching values from the

common columns are physically stored together.

ALL distribution

A copy of the entire table is distributed to every node. Where EVEN distribution or KEY distribution place

only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for

every join that the table participates in.

ALL distribution multiplies the storage required by the number of nodes in the cluster, and so it takes

much longer to load, update, or insert data into multiple tables. ALL distribution is appropriate only


for relatively slow moving tables; that is, tables that are not updated frequently or extensively. Small

dimension tables do not beneﬁt signiﬁcantly from ALL distribution, because the cost of redistribution is

low.
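The three styles described above can be contrasted with a small Python model. This is a sketch under assumptions: `zlib.crc32` is a stand-in hash (the hash function that Amazon Redshift actually uses is not described here), the function names are hypothetical, and the sample rows are invented.

```python
import zlib
from itertools import cycle


def distribute_even(rows, slice_count):
    """EVEN: round-robin assignment, regardless of column values."""
    slices = [[] for _ in range(slice_count)]
    for row, target in zip(rows, cycle(range(slice_count))):
        slices[target].append(row)
    return slices


def slice_for_key(key, slice_count):
    """Stand-in hash for this sketch; not Redshift's actual hash function."""
    return zlib.crc32(str(key).encode("utf-8")) % slice_count


def distribute_key(rows, key_column, slice_count):
    """KEY: rows with the same key value always land on the same slice."""
    slices = [[] for _ in range(slice_count)]
    for row in rows:
        slices[slice_for_key(row[key_column], slice_count)].append(row)
    return slices


def distribute_all(rows, node_count):
    """ALL: every node receives a full copy of the table."""
    return [list(rows) for _ in range(node_count)]


# Invented sample rows: SALES references LISTING through listid.
sales = [{"salesid": i, "listid": i % 5} for i in range(10)]
listing = [{"listid": i} for i in range(5)]

sales_slices = distribute_key(sales, "listid", 4)
listing_slices = distribute_key(listing, "listid", 4)

# With matching distribution keys, each slice can join its rows locally.
for number, (s, l) in enumerate(zip(sales_slices, listing_slices)):
    print(number, sorted({r["listid"] for r in s}), sorted(r["listid"] for r in l))
```

Because both tables are distributed on listid with the same function, every SALES row sits on the slice that holds its LISTING row, which is the collocation that KEY distribution is meant to provide.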

Note

After you have specified a distribution style for a table, Amazon Redshift handles data

distribution at the cluster level. Amazon Redshift does not require or support the concept of

partitioning data within database objects. You do not need to create table spaces or deﬁne

partitioning schemes for tables.

You can't change the distribution style of a table after it's created. To use a diﬀerent distribution style,

you can recreate the table and populate the new table with a deep copy. For more information, see

Performing a Deep Copy (p. 221).

Viewing Distribution Styles

To view the distribution style of a table, query the PG_CLASS_INFO view or the SVV_TABLE_INFO view.

The RELEFFECTIVEDISTSTYLE column in PG_CLASS_INFO indicates the current distribution style for

the table. If the table uses automatic distribution, RELEFFECTIVEDISTSTYLE is 10 or 11, which indicates

whether the eﬀective distribution style is AUTO (ALL) or AUTO (EVEN). If the table uses automatic

distribution, the distribution style might initially show AUTO (ALL), then change to AUTO (EVEN) when

the table grows.

The following table gives the distribution style for each value in the RELEFFECTIVEDISTSTYLE column:

RELEFFECTIVEDISTSTYLE | Current distribution style
----------------------+---------------------------
0                     | EVEN
1                     | KEY
8                     | ALL
10                    | AUTO (ALL)
11                    | AUTO (EVEN)
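If you read RELEFFECTIVEDISTSTYLE values in client code, a lookup such as the following Python sketch can translate them. The codes and names come from the table above. The names `DISTSTYLE_BY_CODE`, `describe_diststyle`, and `is_automatic` are hypothetical.

```python
# Codes and names copied from the RELEFFECTIVEDISTSTYLE table above.
DISTSTYLE_BY_CODE = {
    0: "EVEN",
    1: "KEY",
    8: "ALL",
    10: "AUTO (ALL)",
    11: "AUTO (EVEN)",
}


def describe_diststyle(code):
    """Translate a RELEFFECTIVEDISTSTYLE value into a readable name."""
    return DISTSTYLE_BY_CODE.get(code, f"unknown ({code})")


def is_automatic(code):
    """True when the table uses automatic distribution (codes 10 and 11)."""
    return code in (10, 11)


print(describe_diststyle(10))  # AUTO (ALL)
print(is_automatic(8))         # False
```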

The DISTSTYLE column in SVV_TABLE_INFO indicates the current distribution style for the table. If the

table uses automatic distribution, DISTSTYLE is AUTO (ALL) or AUTO (EVEN).

The following example creates four tables using the three distribution styles and automatic distribution,

then queries SVV_TABLE_INFO to view the distribution styles.

create table dist_key (col1 int)
diststyle key distkey (col1);

create table dist_even (col1 int)
diststyle even;

create table dist_all (col1 int)
diststyle all;

create table dist_auto (col1 int);

select "schema", "table", diststyle from SVV_TABLE_INFO
where "table" like 'dist%';

   schema   |      table      | diststyle
------------+-----------------+------------
 public     | dist_key        | KEY(col1)
 public     | dist_even       | EVEN
 public     | dist_all        | ALL
 public     | dist_auto       | AUTO(ALL)

Evaluating Query Patterns

Choosing distribution styles is just one aspect of database design. You should consider distribution styles

only within the context of the entire system, balancing distribution with other important factors such as

cluster size, compression encoding methods, sort keys, and table constraints.

Test your system with data that is as close to real data as possible.

In order to make good choices for distribution styles, you need to understand the query patterns for

your Amazon Redshift application. Identify the most costly queries in your system and base your initial

database design on the demands of those queries. Factors that determine the total cost of a query

are how long the query takes to execute, how many computing resources it consumes, how often it is

executed, and how disruptive it is to other queries and database operations.

Identify the tables that are used by the most costly queries, and evaluate their role in query execution.

Consider how the tables are joined and aggregated.

Use the guidelines in this section to choose a distribution style for each table. When you have done so,

create the tables, load them with data that is as close as possible to real data, and then test the tables

for the types of queries that you expect to use. You can evaluate the query explain plans to identify

tuning opportunities. Compare load times, storage space, and query execution times in order to balance

your system's overall requirements.

Designating Distribution Styles

The considerations and recommendations for designating distribution styles in this section use a star

schema as an example. Your database design might be based on a star schema, some variant of a star

schema, or an entirely diﬀerent schema. Amazon Redshift is designed to work eﬀectively with whatever

schema design you choose. The principles in this section can be applied to any design schema.

1. Specify the primary key and foreign keys for all your tables.

Amazon Redshift does not enforce primary key and foreign key constraints, but the query optimizer

uses them when it generates query plans. If you set primary keys and foreign keys, your application

must maintain the validity of the keys.

2. Distribute the fact table and its largest dimension table on their common columns.

Choose the largest dimension based on the size of data set that participates in the most common join,

not just the size of the table. If a table is commonly ﬁltered, using a WHERE clause, only a portion

of its rows participate in the join. Such a table has less impact on redistribution than a smaller table

that contributes more data. Designate both the dimension table's primary key and the fact table's

corresponding foreign key as DISTKEY. If multiple tables use the same distribution key, they will also

be collocated with the fact table. Your fact table can have only one distribution key. Any tables that

join on another key will not be collocated with the fact table.

3. Designate distribution keys for the other dimension tables.

Distribute the tables on their primary keys or their foreign keys, depending on how they most

commonly join with other tables.

4. Evaluate whether to change some of the dimension tables to use ALL distribution.

If a dimension table cannot be collocated with the fact table or other important joining tables, you

can improve query performance signiﬁcantly by distributing the entire table to all of the nodes. Using

ALL distribution multiplies storage space requirements and increases load times and maintenance


operations, so you should weigh all factors before choosing ALL distribution. The following section

explains how to identify candidates for ALL distribution by evaluating the EXPLAIN plan.

5. Use EVEN distribution for the remaining tables.

If a table is largely denormalized and does not participate in joins, or if you don't have a clear choice

for another distribution style, use EVEN distribution.

To let Amazon Redshift choose the appropriate distribution style, don't explicitly specify a distribution

style.

You cannot change the distribution style of a table after it is created. To use a diﬀerent distribution

style, you can recreate the table and populate the new table with a deep copy. For more information, see

Performing a Deep Copy (p. 221).

Evaluating the Query Plan

You can use query plans to identify candidates for optimizing the distribution style.

After making your initial design decisions, create your tables, load them with data, and test them. Use

a test data set that is as close as possible to the real data. Measure load times to use as a baseline for

comparisons.

Evaluate queries that are representative of the most costly queries you expect to execute; speciﬁcally,

queries that use joins and aggregations. Compare execution times for various design options. When you

compare execution times, do not count the ﬁrst time the query is executed, because the ﬁrst run time

includes the compilation time.

DS_DIST_NONE

No redistribution is required, because corresponding slices are collocated on the compute nodes. You

will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension

table.

DS_DIST_ALL_NONE

No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is

located on every node.

DS_DIST_INNER

The inner table is redistributed.

DS_DIST_OUTER

The outer table is redistributed.

DS_BCAST_INNER

A copy of the entire inner table is broadcast to all the compute nodes.

DS_DIST_ALL_INNER

The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.

DS_DIST_BOTH

Both tables are redistributed.

DS_DIST_NONE and DS_DIST_ALL_NONE are good. They indicate that no redistribution was required for

that step because all of the joins are collocated.

DS_DIST_INNER means that the step will probably have a relatively high cost because the inner table

is being redistributed to the nodes. DS_DIST_INNER indicates that the outer table is already properly


distributed on the join key. Set the inner table's distribution key to the join key to convert this to

DS_DIST_NONE. If distributing the inner table on the join key is not possible because the outer table is

not distributed on the join key, evaluate whether to use ALL distribution for the inner table. If the table is

relatively slow moving, that is, it is not updated frequently or extensively, and it is large enough to carry

a high redistribution cost, change the distribution style to ALL and test again. ALL distribution causes

increased load times, so when you retest, include the load time in your evaluation factors.

DS_DIST_ALL_INNER is not good. It means the entire inner table is redistributed to a single slice because

the outer table uses DISTSTYLE ALL, so that a copy of the entire outer table is located on each node. This

results in inefficient serial execution of the join on a single node instead of taking advantage of parallel

execution using all of the nodes. DISTSTYLE ALL is meant to be used only for the inner join table.

Instead, specify a distribution key or use even distribution for the outer table.

DS_BCAST_INNER and DS_DIST_BOTH are not good. Usually these redistributions occur because the

tables are not joined on their distribution keys. If the fact table does not already have a distribution

key, specify the joining column as the distribution key for both tables. If the fact table already has a

distribution key on another column, you should evaluate whether changing the distribution key to

collocate this join will improve overall performance. If changing the distribution key of the outer table is

not an optimal choice, you can achieve collocation by specifying DISTSTYLE ALL for the inner table.
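When a plan is long, it can help to count the DS_ labels before reading it line by line. The following Python sketch does that with a regular expression. The function name `summarize_labels` and the short descriptions in `LABEL_NOTES` are this sketch's own paraphrases of the guidance above, and the sample text is an excerpt of the kind of EXPLAIN output shown in this section.

```python
import re

# Short paraphrases of the guidance above; wording is this sketch's own.
LABEL_NOTES = {
    "DS_DIST_NONE": "good: no redistribution required",
    "DS_DIST_ALL_NONE": "good: inner table uses DISTSTYLE ALL",
    "DS_DIST_INNER": "probably costly: inner table is redistributed",
    "DS_DIST_OUTER": "outer table is redistributed",
    "DS_BCAST_INNER": "not good: inner table is broadcast to all nodes",
    "DS_DIST_ALL_INNER": "not good: inner table is sent to a single slice",
    "DS_DIST_BOTH": "not good: both tables are redistributed",
}

LABEL_PATTERN = re.compile(r"\bDS_[A-Z_]+\b")


def summarize_labels(plan_text):
    """Count how often each DS_ label appears in EXPLAIN output."""
    counts = {}
    for label in LABEL_PATTERN.findall(plan_text):
        counts[label] = counts.get(label, 0) + 1
    return counts


plan_excerpt = """
-> XN Hash Join DS_BCAST_INNER (cost=112.50..3272334142.59 rows=170771 width=84)
-> XN Hash Join DS_BCAST_INNER (cost=109.98..3167290276.71 rows=172456 width=47)
-> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
"""

for label, count in summarize_labels(plan_excerpt).items():
    print(count, label, "-", LABEL_NOTES.get(label, "unrecognized label"))
```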

The following example shows a portion of a query plan with DS_BCAST_INNER and DS_DIST_NONE

labels.

-> XN Hash Join DS_BCAST_INNER (cost=112.50..3272334142.59 rows=170771 width=84)
      Hash Cond: ("outer".venueid = "inner".venueid)
      -> XN Hash Join DS_BCAST_INNER (cost=109.98..3167290276.71 rows=172456 width=47)
            Hash Cond: ("outer".eventid = "inner".eventid)
            -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
                  Merge Cond: ("outer".listid = "inner".listid)
                  -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
                  -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)

After changing the dimension tables to use DISTSTYLE ALL, the query plan for the same query shows

DS_DIST_ALL_NONE in place of DS_BCAST_INNER. Also, there is a dramatic change in the relative cost

for the join steps.

-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
      Hash Cond: ("outer".venueid = "inner".venueid)
      -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
            Hash Cond: ("outer".eventid = "inner".eventid)
            -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
                  Merge Cond: ("outer".listid = "inner".listid)
                  -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
                  -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
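To quantify the change in relative cost between the two plans, you can extract the upper cost figure from the matching join steps and compare them. The following Python sketch is one way to do so. The function name `total_cost` is hypothetical, and the two plan lines are copied from the examples above. Plan costs are relative values, so the ratio only indicates the direction and rough size of the improvement.

```python
import re

# Matches the "cost=<startup>..<total>" field of a plan line.
COST_PATTERN = re.compile(r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)")


def total_cost(plan_line):
    """Return the second (total) figure from a plan line's cost field."""
    match = COST_PATTERN.search(plan_line)
    if match is None:
        raise ValueError("no cost=... field found")
    return float(match.group(2))


before = "-> XN Hash Join DS_BCAST_INNER (cost=112.50..3272334142.59 rows=170771 width=84)"
after = "-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)"

ratio = total_cost(before) / total_cost(after)
print(f"relative cost dropped by a factor of about {ratio:,.0f}")
```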

Query Plan Example

This example shows how to evaluate a query plan to ﬁnd opportunities to optimize the distribution.

Run the following query with an EXPLAIN command to produce a query plan.

explain
select lastname, catname, venuename, venuecity, venuestate, eventname,
    month, sum(pricepaid) as buyercost, max(totalprice) as maxtotalprice
from category join event on category.catid = event.catid
    join venue on venue.venueid = event.venueid
    join sales on sales.eventid = event.eventid
    join listing on sales.listid = listing.listid
    join date on sales.dateid = date.dateid
    join users on users.userid = sales.buyerid
group by lastname, catname, venuename, venuecity, venuestate, eventname, month
having sum(pricepaid)>9999
order by catname, buyercost desc;

In the TICKIT database, SALES is a fact table and LISTING is its largest dimension. In order to collocate

the tables, SALES is distributed on the LISTID, which is the foreign key for LISTING, and LISTING is

distributed on its primary key, LISTID. The following example shows the CREATE TABLE commands for

SALES and LISTING.

create table sales(
    salesid integer not null,
    listid integer not null distkey,
    sellerid integer not null,
    buyerid integer not null,
    eventid integer not null encode mostly16,
    dateid smallint not null,
    qtysold smallint not null encode mostly8,
    pricepaid decimal(8,2) encode delta32k,
    commission decimal(8,2) encode delta32k,
    saletime timestamp,
    primary key(salesid),
    foreign key(listid) references listing(listid),
    foreign key(sellerid) references users(userid),
    foreign key(buyerid) references users(userid),
    foreign key(dateid) references date(dateid))
sortkey(listid,sellerid);

create table listing(
    listid integer not null distkey sortkey,
    sellerid integer not null,
    eventid integer not null encode mostly16,
    dateid smallint not null,
    numtickets smallint not null encode mostly8,
    priceperticket decimal(8,2) encode bytedict,
    totalprice decimal(8,2) encode mostly32,
    listtime timestamp,
    primary key(listid),
    foreign key(sellerid) references users(userid),
    foreign key(eventid) references event(eventid),
    foreign key(dateid) references date(dateid));

In the following query plan, the Merge Join step for the join on SALES and LISTING shows DS_DIST_NONE, which indicates that no redistribution is required for the step. However, moving up the query plan, the other inner joins show DS_BCAST_INNER, which indicates that the inner table is broadcast as part of the query execution. Because only one pair of tables can be collocated using key distribution, five tables need to be rebroadcast.

QUERY PLAN
XN Merge (cost=1015345167117.54..1015345167544.46 rows=1000 width=103)
  Merge Key: category.catname, sum(sales.pricepaid)
  -> XN Network (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
     Send to leader
     -> XN Sort (cost=1015345167117.54..1015345167544.46 rows=170771 width=103)
        Sort Key: category.catname, sum(sales.pricepaid)
        -> XN HashAggregate (cost=15345150568.37..15345152276.08 rows=170771 width=103)
           Filter: (sum(pricepaid) > 9999.00)
           -> XN Hash Join DS_BCAST_INNER (cost=742.08..15345146299.10 rows=170771 width=103)
              Hash Cond: ("outer".catid = "inner".catid)
              -> XN Hash Join DS_BCAST_INNER (cost=741.94..15342942456.61 rows=170771 width=97)
                 Hash Cond: ("outer".dateid = "inner".dateid)
                 -> XN Hash Join DS_BCAST_INNER (cost=737.38..15269938609.81 rows=170766 width=90)
                    Hash Cond: ("outer".buyerid = "inner".userid)
                    -> XN Hash Join DS_BCAST_INNER (cost=112.50..3272334142.59 rows=170771 width=84)
                       Hash Cond: ("outer".venueid = "inner".venueid)
                       -> XN Hash Join DS_BCAST_INNER (cost=109.98..3167290276.71 rows=172456 width=47)
                          Hash Cond: ("outer".eventid = "inner".eventid)
                          -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
                             Merge Cond: ("outer".listid = "inner".listid)
                             -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
                             -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
                          -> XN Hash (cost=87.98..87.98 rows=8798 width=25)
                             -> XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=25)
                       -> XN Hash (cost=2.02..2.02 rows=202 width=41)
                          -> XN Seq Scan on venue (cost=0.00..2.02 rows=202 width=41)
                    -> XN Hash (cost=499.90..499.90 rows=49990 width=14)
                       -> XN Seq Scan on users (cost=0.00..499.90 rows=49990 width=14)
                 -> XN Hash (cost=3.65..3.65 rows=365 width=11)
                    -> XN Seq Scan on date (cost=0.00..3.65 rows=365 width=11)
              -> XN Hash (cost=0.11..0.11 rows=11 width=10)
                 -> XN Seq Scan on category (cost=0.00..0.11 rows=11 width=10)

One solution is to recreate the tables with DISTSTYLE ALL. You cannot change a table's distribution style after it is created. To recreate tables with a different distribution style, use a deep copy.

First, rename the tables.

alter table users rename to userscopy;

alter table venue rename to venuecopy;

alter table category rename to categorycopy;

alter table date rename to datecopy;

alter table event rename to eventcopy;

Run the following script to recreate USERS, VENUE, CATEGORY, DATE, and EVENT. Don't make any changes to SALES and LISTING.

create table users(

userid integer not null sortkey,

username char(8),

firstname varchar(30),

lastname varchar(30),

city varchar(30),

state char(2),

email varchar(100),

phone char(14),

likesports boolean,

liketheatre boolean,

likeconcerts boolean,

likejazz boolean,

likeclassical boolean,

likeopera boolean,

likerock boolean,

likevegas boolean,

likebroadway boolean,

likemusicals boolean,

primary key(userid)) diststyle all;

create table venue(

venueid smallint not null sortkey,

venuename varchar(100),

venuecity varchar(30),

venuestate char(2),

venueseats integer,

primary key(venueid)) diststyle all;

create table category(

catid smallint not null,

catgroup varchar(10),

catname varchar(10),

catdesc varchar(50),

primary key(catid)) diststyle all;

create table date(

dateid smallint not null sortkey,

caldate date not null,

day character(3) not null,

week smallint not null,

month character(5) not null,

qtr character(5) not null,

year smallint not null,

holiday boolean default('N'),

primary key (dateid)) diststyle all;

create table event(

eventid integer not null sortkey,

venueid smallint not null,

catid smallint not null,

dateid smallint not null,

eventname varchar(200),

starttime timestamp,

primary key(eventid),

foreign key(venueid) references venue(venueid),

foreign key(catid) references category(catid),

foreign key(dateid) references date(dateid)) diststyle all;

Insert the data back into the tables and run an ANALYZE command to update the statistics.

insert into users select * from userscopy;

insert into venue select * from venuecopy;

insert into category select * from categorycopy;

insert into date select * from datecopy;

insert into event select * from eventcopy;

analyze;

Finally, drop the copies.

drop table userscopy;

drop table venuecopy;

drop table categorycopy;

drop table datecopy;

drop table eventcopy;

Run the same query with EXPLAIN again, and examine the new query plan. The joins now show

DS_DIST_ALL_NONE, indicating that no redistribution is required because the data was distributed to

every node using DISTSTYLE ALL.

QUERY PLAN
XN Merge (cost=1000000047117.54..1000000047544.46 rows=1000 width=103)
  Merge Key: category.catname, sum(sales.pricepaid)
  -> XN Network (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
     Send to leader
     -> XN Sort (cost=1000000047117.54..1000000047544.46 rows=170771 width=103)
        Sort Key: category.catname, sum(sales.pricepaid)
        -> XN HashAggregate (cost=30568.37..32276.08 rows=170771 width=103)
           Filter: (sum(pricepaid) > 9999.00)
           -> XN Hash Join DS_DIST_ALL_NONE (cost=742.08..26299.10 rows=170771 width=103)
              Hash Cond: ("outer".buyerid = "inner".userid)
              -> XN Hash Join DS_DIST_ALL_NONE (cost=117.20..21831.99 rows=170766 width=97)
                 Hash Cond: ("outer".dateid = "inner".dateid)
                 -> XN Hash Join DS_DIST_ALL_NONE (cost=112.64..17985.08 rows=170771 width=90)
                    Hash Cond: ("outer".catid = "inner".catid)
                    -> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
                       Hash Cond: ("outer".venueid = "inner".venueid)
                       -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
                          Hash Cond: ("outer".eventid = "inner".eventid)
                          -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
                             Merge Cond: ("outer".listid = "inner".listid)
                             -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
                             -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
                          -> XN Hash (cost=87.98..87.98 rows=8798 width=25)
                             -> XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=25)
                       -> XN Hash (cost=2.02..2.02 rows=202 width=41)
                          -> XN Seq Scan on venue (cost=0.00..2.02 rows=202 width=41)
                    -> XN Hash (cost=0.11..0.11 rows=11 width=10)
                       -> XN Seq Scan on category (cost=0.00..0.11 rows=11 width=10)
                 -> XN Hash (cost=3.65..3.65 rows=365 width=11)
                    -> XN Seq Scan on date (cost=0.00..3.65 rows=365 width=11)
              -> XN Hash (cost=499.90..499.90 rows=49990 width=14)
                 -> XN Seq Scan on users (cost=0.00..499.90 rows=49990 width=14)

Distribution Examples

The following examples show how data is distributed according to the options that you define in the CREATE TABLE statement.

DISTKEY Examples

Look at the schema of the USERS table in the TICKIT database. USERID is defined as the SORTKEY column and the DISTKEY column:

select "column", type, encoding, distkey, sortkey

from pg_table_def where tablename = 'users';

column | type | encoding | distkey | sortkey

---------------+------------------------+----------+---------+---------

userid | integer | none | t | 1

username | character(8) | none | f | 0

firstname | character varying(30) | text32k | f | 0

...

USERID is a good choice for the distribution column on this table. If you query the SVV_DISKUSAGE

system view, you can see that the table is very evenly distributed. Column numbers are zero-based, so

USERID is column 0.

select slice, col, num_values as rows, minvalue, maxvalue

from svv_diskusage

where name='users' and col=0 and rows>0

order by slice, col;

slice| col | rows | minvalue | maxvalue

-----+-----+-------+----------+----------

0 | 0 | 12496 | 4 | 49987

1 | 0 | 12498 | 1 | 49988

2 | 0 | 12497 | 2 | 49989

3 | 0 | 12499 | 3 | 49990

(4 rows)

The table contains 49,990 rows. The rows (num_values) column shows that each slice contains about the same number of rows. The minvalue and maxvalue columns show the range of values on each slice. Each slice includes nearly the entire range of values, so there's a good chance that every slice will participate in executing a query that filters for a range of user IDs.

This example demonstrates distribution on a small test system. The total number of slices is typically

much higher.

If you commonly join or group using the STATE column, you might choose to distribute on the STATE column. The following example shows that if you create a new table with the same data as the USERS table but set the DISTKEY to the STATE column, the distribution is not as even. Slice 0 (13,587 rows) holds about 34% more rows than slice 3 (10,150 rows), and slice 2 (15,008 rows) holds about 48% more. In a much larger table, this amount of distribution skew could have an adverse impact on query processing.

create table userskey distkey(state) as select * from users;

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage

where name = 'userskey' and col=0 and rows>0

order by slice, col;

slice | col | rows | minvalue | maxvalue

------+-----+-------+----------+----------

0 | 0 | 13587 | 5 | 49989

1 | 0 | 11245 | 2 | 49990

2 | 0 | 15008 | 1 | 49976

3 | 0 | 10150 | 4 | 49986

(4 rows)
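As a quicker check than reading per-slice counts, the skew_rows column of the SVV_TABLE_INFO system view reports the ratio of the number of rows in the slice with the most rows to the number in the slice with the fewest. The following is a minimal sketch that assumes the USERS and USERSKEY tables from this example exist in the current database.

```sql
-- skew_rows close to 1 means rows are spread evenly across slices;
-- larger values mean more distribution skew.
select "table", diststyle, tbl_rows, skew_rows
from svv_table_info
where "table" in ('users', 'userskey')
order by skew_rows desc;
```

For the USERSKEY counts shown above, the same ratio computed by hand is 15,008 / 10,150, or about 1.48.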

DISTSTYLE EVEN Example

If you create a new table with the same data as the USERS table but set the DISTSTYLE to EVEN, rows

are always evenly distributed across slices.

create table userseven diststyle even as

select * from users;

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage

where name = 'userseven' and col=0 and rows>0

order by slice, col;

slice | col | rows | minvalue | maxvalue

------+-----+-------+----------+----------

0 | 0 | 12497 | 4 | 49990

1 | 0 | 12498 | 8 | 49984

2 | 0 | 12498 | 2 | 49988

3 | 0 | 12497 | 1 | 49989

(4 rows)

However, because distribution is not based on a specific column, query processing can be degraded, especially if the table is joined to other tables. The lack of distribution on a joining column often influences the type of join operation that can be performed efficiently. Joins, aggregations, and grouping operations are optimized when both tables are distributed and sorted on their respective joining columns.

DISTSTYLE ALL Example

If you create a new table with the same data as the USERS table but set the DISTSTYLE to ALL, all the rows are distributed to the first slice of each node.
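The query that follows reads from a table named USERSALL. A minimal sketch of creating it, assuming the same CREATE TABLE AS pattern used for the USERSKEY and USERSEVEN examples:

```sql
-- Copy USERS into a new table that stores a full copy on every node.
create table usersall diststyle all as
select * from users;
```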

select slice, col, num_values as rows, minvalue, maxvalue from svv_diskusage

where name = 'usersall' and col=0 and rows > 0

order by slice, col;

slice | col | rows | minvalue | maxvalue

------+-----+-------+----------+----------

0 | 0 | 49990 | 4 | 49990

2 | 0 | 49990 | 2 | 49990

(2 rows)

Choosing Sort Keys

When you create a table, you can define one or more of its columns as sort keys. When data is initially loaded into the empty table, the rows are stored on disk in sorted order. Information about sort key columns is passed to the query planner, and the planner uses this information to construct plans that exploit the way that the data is sorted.

Sorting enables efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a query uses a range-restricted predicate, the query processor can use the min and max values to rapidly skip over large numbers of blocks during table scans. For example, if a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98 percent of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned.
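The 98 percent figure follows from one month being 1/60 of five years, so roughly 59 of every 60 blocks fall outside the requested range. To see the per-block min and max values that make this skipping possible, you can inspect STV_BLOCKLIST. The following is a sketch only; it assumes the SALES table from the TICKIT examples, whose sort key column LISTID is column 1 (column numbers are zero-based), and superuser access to the STV tables.

```sql
-- List the min and max values recorded for each block of the LISTID column.
-- STV_TBL_PERM pads table names, so trim before comparing.
select b.slice, b.blocknum, b.minvalue, b.maxvalue, b.num_values
from stv_blocklist b
join (select distinct id, name from stv_tbl_perm) p on p.id = b.tbl
where trim(p.name) = 'sales'
  and b.col = 1
order by b.slice, b.blocknum;
```

On a well-sorted table, the minvalue and maxvalue ranges of consecutive blocks on a slice overlap very little, which is what lets a range predicate rule out most blocks.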

You can specify either a compound or interleaved sort key. A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order. An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order. For examples of using compound sort keys and interleaved sort keys, see Comparing Sort Styles (p. 142).

To understand the impact of the chosen sort key on query performance, use the EXPLAIN (p. 511) command. For more information, see Query Planning And Execution Workflow (p. 257).

To define a sort type, use either the INTERLEAVED or COMPOUND keyword with your CREATE TABLE or CREATE TABLE AS statement. The default is COMPOUND. An INTERLEAVED sort key can use a maximum of eight columns.

To view the sort keys for a table, query the SVV_TABLE_INFO (p. 926) system view.
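As an illustration, the following sketch reads the sort key columns of SVV_TABLE_INFO for the SALES and LISTING tables used earlier in this chapter; substitute your own table names.

```sql
-- sortkey1    : first column in the sort key
-- sortkey_num : number of columns defined as sort keys
-- unsorted    : percent of rows in the unsorted region
select "table", sortkey1, sortkey_num, unsorted
from svv_table_info
where "table" in ('sales', 'listing');
```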

Topics

• Compound Sort Key (p. 141)
• Interleaved Sort Key (p. 141)
• Comparing Sort Styles (p. 142)

Compound Sort Key

A compound key is made up of all of the columns listed in the sort key definition, in the order they are listed. A compound sort key is most useful when a query applies conditions, such as filters and joins, that use a prefix of the sort keys. The performance benefits of compound sorting decrease when queries depend only on secondary sort columns, without referencing the primary columns. COMPOUND is the default sort type.

Compound sort keys might speed up joins, GROUP BY and ORDER BY operations, and window functions

that use PARTITION BY and ORDER BY. For example, a merge join, which is often faster than a hash join,

is feasible when the data is distributed and presorted on the joining columns. Compound sort keys also

help improve compression.

As you add rows to a sorted table that already contains data, the unsorted region grows, which has a significant effect on performance. The effect is greater when the table uses interleaved sorting, especially when the sort columns include data that increases monotonically, such as date or timestamp columns. You should run a VACUUM operation regularly, especially after large data loads, to re-sort and re-analyze the data. For more information, see Managing the Size of the Unsorted Region (p. 231).

After vacuuming to re-sort the data, it's a good practice to run an ANALYZE command to update the statistical metadata for the query planner. For more information, see Analyzing Tables (p. 223).

Interleaved Sort Key

An interleaved sort gives equal weight to each column, or subset of columns, in the sort key. If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style. When a query uses restrictive predicates on secondary sort columns, interleaved sorting significantly improves query performance as compared to compound sorting.

Important

Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as

identity columns, dates, or timestamps.

The performance improvements you gain by implementing an interleaved sort key should be weighed

against increased load and vacuum times.

Interleaved sorts are most effective with highly selective queries that filter on one or more of the sort key columns in the WHERE clause, for example select c_name from customer where c_region = 'ASIA'. The benefits of interleaved sorting increase with the number of sorted columns that are restricted.

An interleaved sort is more effective with large tables. Sorting is applied on each slice, so an interleaved sort is most effective when a table is large enough to require multiple 1 MB blocks per slice and the query processor is able to skip a significant proportion of the blocks using restrictive predicates. To view the number of blocks a table uses, query the STV_BLOCKLIST (p. 869) system view.
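A minimal sketch of such a block count follows. The table name CUST_SALES_DATE_INTERLEAVED is taken from the comparison later in this section and stands in for your own table; the query assumes superuser access to the STV tables.

```sql
-- Count the 1 MB blocks that each slice holds for one table,
-- summed across all of the table's columns.
select b.slice, count(*) as blocks
from stv_blocklist b
join (select distinct id, name from stv_tbl_perm) p on p.id = b.tbl
where trim(p.name) = 'cust_sales_date_interleaved'
group by b.slice
order by b.slice;
```

If each slice holds only a handful of blocks per column, there is little for the query processor to skip, and an interleaved sort key is unlikely to pay for its extra load and vacuum cost.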

When sorting on a single column, an interleaved sort might give better performance than a compound sort if the column values have a long common prefix. For example, URLs commonly begin with "http://www". Compound sort keys use a limited number of characters from the prefix, which results in a lot of duplication of keys. Interleaved sorts use an internal compression scheme for zone map values that enables them to better discriminate among column values that have a long common prefix.

VACUUM REINDEX

As you add rows to a sorted table that already contains data, performance might deteriorate over time. This deterioration occurs for both compound and interleaved sorts, but it has a greater effect on interleaved tables. A VACUUM restores the sort order, but the operation can take longer for interleaved tables because merging new interleaved data might involve modifying every data block.

When tables are initially loaded, Amazon Redshift analyzes the distribution of the values in the sort key columns and uses that information for optimal interleaving of the sort key columns. As a table grows, the distribution of the values in the sort key columns can change, or skew, especially with date or timestamp columns. If the skew becomes too large, performance might be affected. To re-analyze the sort keys and restore performance, run the VACUUM command with the REINDEX keyword. Because it needs to take an extra analysis pass over the data, VACUUM REINDEX can take longer than a standard VACUUM for interleaved tables. To view information about key distribution skew and last reindex time, query the SVV_INTERLEAVED_COLUMNS (p. 905) system view.
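The two operations described above might look like the following sketch. The table name is taken from the comparison later in this section and is only a placeholder for your own interleaved table.

```sql
-- Re-analyze the interleaved sort key distribution, then re-sort the table.
vacuum reindex cust_sales_date_interleaved;

-- Inspect the skew for each interleaved sort key column and the time of
-- the last reindex. Values of interleaved_skew near 1 indicate little skew.
select tbl, col, interleaved_skew, last_reindex
from svv_interleaved_columns;
```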

For more information about how to determine how often to run VACUUM and when to run a VACUUM

REINDEX, see Deciding Whether to Reindex (p. 230).

Comparing Sort Styles

This section compares the performance differences when using a single-column sort key, a compound sort key, and an interleaved sort key for different types of queries.

For this example, you'll create a denormalized table named CUST_SALES_DATE, using data from the CUSTOMER and LINEORDER tables. CUSTOMER and LINEORDER are part of the SSB data set, which is used in the Tutorial: Tuning Table Design (p. 45).

The new CUST_SALES_DATE table has 480 million rows, which is not large by Amazon Redshift standards, but it is large enough to show the performance differences. Larger tables will tend to show greater differences, especially for interleaved sorting.

To compare the three sort methods, perform the following steps:

1. Create the SSB data set.

2. Create the CUST_SALES_DATE table.

3. Create three tables to compare sort styles.

4. Execute queries and compare the results.

Create the SSB Data Set

If you haven't already done so, follow the steps in Step 1: Create a Test Data Set (p. 45) in the Tuning

Table Design tutorial to create the tables in the SSB data set and load them with data. The data load will

take about 10 to 15 minutes.

The example in the Tuning Table Design tutorial uses a four-node cluster. The comparisons in this example use a two-node cluster. Your results will vary with different cluster configurations.

Create the CUST_SALES_DATE Table

The CUST_SALES_DATE table is a denormalized table that contains data about customers and revenues.

To create the CUST_SALES_DATE table, execute the following statement.

create table cust_sales_date as

(select c_custkey, c_nation, c_region, c_mktsegment, d_date::date, lo_revenue

from customer, lineorder, dwdate

where lo_custkey = c_custkey

and lo_orderdate = dwdate.d_datekey

and lo_revenue > 0);

The following query shows the row count for CUST_SALES_DATE.

select count(*) from cust_sales_date;

count

-----------

480027069

(1 row)

Execute the following query to view the first row of the CUST_SALES_DATE table.

select * from cust_sales_date limit 1;

c_custkey | c_nation | c_region | c_mktsegment | d_date     | lo_revenue
----------+----------+----------+--------------+------------+-----------
        1 | MOROCCO  | AFRICA   | BUILDING     | 1994-10-28 |    1924330

Create Tables for Comparing Sort Styles

To compare the sort styles, create three tables. The first will use a single-column sort key; the second will use a compound sort key; the third will use an interleaved sort key. The single-column sort will use the c_custkey column. The compound sort and the interleaved sort will both use the c_custkey, c_region, c_mktsegment, and d_date columns.

To create the tables for comparison, execute the following CREATE TABLE statements.

create table cust_sales_date_single

sortkey (c_custkey)

as select * from cust_sales_date;

create table cust_sales_date_compound

compound sortkey (c_custkey, c_region, c_mktsegment, d_date)

as select * from cust_sales_date;

create table cust_sales_date_interleaved

interleaved sortkey (c_custkey, c_region, c_mktsegment, d_date)

as select * from cust_sales_date;

Execute Queries and Compare the Results

Execute the same queries against each of the tables to compare execution times for each table. To eliminate differences due to compile time, run each of the queries twice, and record the second time.
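If your SQL client does not report elapsed time, one way to record it is to read the most recent query's start and end times from the STL_QUERY system table. This is a sketch, not the method used to produce the results in this section; it assumes you run it in the same session, immediately after the query you want to time.

```sql
-- pg_last_query_id() returns the ID of the query most recently
-- executed in the current session.
select query,
       trim(querytxt) as sql_text,
       datediff(ms, starttime, endtime) as elapsed_ms
from stl_query
where query = pg_last_query_id();
```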

1. Test a query that restricts on the c_custkey column, which is the first column in the sort key for each table. Execute the following queries.

-- Query 1

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_single

where c_custkey < 100000;

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_compound

where c_custkey < 100000;

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_interleaved

where c_custkey < 100000;

2. Test a query that restricts on the c_region and c_mktsegment columns, which are the second and third columns in the sort key for the compound and interleaved keys. Execute the following queries.

-- Query 2

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_single

where c_region = 'ASIA'

and c_mktsegment = 'FURNITURE';

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_compound

where c_region = 'ASIA'

and c_mktsegment = 'FURNITURE';

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_interleaved

where c_region = 'ASIA'

and c_mktsegment = 'FURNITURE';

3. Test a query that restricts on the d_date column in addition to the c_region and c_mktsegment columns. Execute the following queries.

-- Query 3

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_single

where d_date between '01/01/1996' and '01/14/1996'

and c_mktsegment = 'FURNITURE'

and c_region = 'ASIA';

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_compound

where d_date between '01/01/1996' and '01/14/1996'

and c_mktsegment = 'FURNITURE'

and c_region = 'ASIA';

select max(lo_revenue), min(lo_revenue)

from cust_sales_date_interleaved

where d_date between '01/01/1996' and '01/14/1996'

and c_mktsegment = 'FURNITURE'

and c_region = 'ASIA';

4. Evaluate the results.

The following table summarizes the performance of the three sort styles.

Important

These results show relative performance for the two-node cluster that was used for these

examples. Your results will vary, depending on multiple factors, such as your node type,

number of nodes, and other concurrent operations contending for resources.

Sort Style  | Query 1 | Query 2 | Query 3
------------+---------+---------+--------
Single      | 0.25 s  | 18.37 s | 30.04 s
Compound    | 0.27 s  | 18.24 s | 30.14 s
Interleaved | 0.94 s  | 1.46 s  | 0.80 s

In Query 1, the results for all three sort styles are very similar, because the WHERE clause restricts only on the first column. There is a small overhead cost for accessing an interleaved table.

In Query 2, there is no benefit to the single-column sort key because that column is not used in the WHERE clause. There is no performance improvement for the compound sort key, because the query was restricted using the second and third columns in the sort key. The query against the interleaved table shows the best performance because interleaved sorting is able to efficiently filter on secondary columns in the sort key.

In Query 3, the interleaved sort is much faster than the other styles because it is able to filter on the combination of the d_date, c_mktsegment, and c_region columns.

This example uses a relatively small table, by Amazon Redshift standards, with 480 million rows. With

larger tables, containing billions of rows and more, interleaved sorting can improve performance by an

order of magnitude or more for certain types of queries.

Defining Constraints

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by

Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should

be declared if your ETL process or some other process in your application enforces their integrity.

For example, the query planner uses primary and foreign keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins.

The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.

Amazon Redshift does enforce NOT NULL column constraints.
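Because key constraints are not enforced, it can be worth checking the data before you declare them. The following sketch uses the SALES and LISTING tables from the TICKIT examples; the same pattern applies to any candidate primary key or foreign key.

```sql
-- Any rows returned mean SALESID is not unique, so declaring it as a
-- primary key would give the planner a false assumption.
select salesid, count(*) as copies
from sales
group by salesid
having count(*) > 1;

-- Any rows returned are SALES rows whose LISTID has no match in LISTING,
-- so the foreign key would not be valid.
select s.salesid, s.listid
from sales s
left join listing l on l.listid = s.listid
where l.listid is null;
```

If both queries return no rows, the constraints hold for the data as currently loaded; your load process still has to keep them valid afterward.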

Analyzing Table Design

As you have seen in the previous sections, specifying sort keys, distribution keys, and column encodings can significantly improve storage, I/O, and query performance. This section provides a SQL script that you can run to help you identify tables where these options are missing or performing poorly.

Copy and paste the following code to create a SQL script named table_inspector.sql, then execute

the script in your SQL client application as superuser.

SELECT SCHEMA schemaname,
       "table" tablename,
       table_id tableid,
       size size_in_mb,
       CASE
         WHEN diststyle NOT IN ('EVEN','ALL') THEN 1
         ELSE 0
       END has_dist_key,
       CASE
         WHEN sortkey1 IS NOT NULL THEN 1
         ELSE 0
       END has_sort_key,
       CASE
         WHEN encoded = 'Y' THEN 1
         ELSE 0
       END has_col_encoding,
       CAST(max_blocks_per_slice - min_blocks_per_slice AS FLOAT) / GREATEST(NVL(min_blocks_per_slice,0)::int,1) ratio_skew_across_slices,
       CAST(100*dist_slice AS FLOAT) / (SELECT COUNT(DISTINCT slice) FROM stv_slices) pct_slices_populated
FROM svv_table_info ti
  JOIN (SELECT tbl,
               MIN(c) min_blocks_per_slice,
               MAX(c) max_blocks_per_slice,
               COUNT(DISTINCT slice) dist_slice
        FROM (SELECT b.tbl,
                     b.slice,
                     COUNT(*) AS c
              FROM STV_BLOCKLIST b
              GROUP BY b.tbl,
                       b.slice)
        WHERE tbl IN (SELECT table_id FROM svv_table_info)
        GROUP BY tbl) iq ON iq.tbl = ti.table_id;

The following sample shows the results of running the script with two sample tables, SKEW1 and SKEW2, that demonstrate the effects of data skew.

          |          |        |     |has_ |has_ |has_    |ratio_skew|pct_
          |          |        |size_|dist_|sort_|col_    |_across_  |slices_
schemaname|tablename |tableid |in_mb|key  |key  |encoding|slices    |populated
----------+----------+--------+-----+-----+-----+--------+----------+---------
public    |category  |100553  |   28|    1|    1|       0|         0|      100
public    |date      |100555  |   44|    1|    1|       0|         0|      100
public    |event     |100558  |   36|    1|    1|       1|         0|      100
public    |listing   |100560  |   44|    1|    1|       1|         0|      100
public    |nation    |100563  |  175|    0|    0|       0|         0|    39.06
public    |region    |100566  |   30|    0|    0|       0|         0|     7.81
public    |sales     |100562  |   52|    1|    1|       0|         0|      100
public    |skew1     |100547  |18978|    0|    0|       0|       .15|       50
public    |skew2     |100548  |  353|    1|    0|       0|         0|     1.56
public    |venue     |100551  |   32|    1|    1|       0|         0|      100
public    |users     |100549  |   82|    1|    1|       1|         0|      100

The following list describes the columns in the result:

has_dist_key

Indicates whether the table has a distribution key. 1 indicates a key exists; 0 indicates there is no key. For example, nation does not have a distribution key.

has_sort_key

Indicates whether the table has a sort key. 1 indicates a key exists; 0 indicates there is no key. For example, nation does not have a sort key.

has_col_encoding

Indicates whether the table has any compression encodings defined for any of the columns. 1 indicates at least one column has an encoding. 0 indicates there is no encoding. For example, region has no compression encoding.

ratio_skew_across_slices

An indication of the data distribution skew. A smaller value is good.

pct_slices_populated

The percentage of slices populated. A larger value is good.

Tables for which there is significant data distribution skew will have either a large value in the ratio_skew_across_slices column or a small value in the pct_slices_populated column. This indicates that you have not chosen an appropriate distribution key column. In the example above, the SKEW1 table has a .15 skew ratio across slices, but that's not necessarily a problem. What's more significant is the 1.56% value for the slices populated for the SKEW2 table. The small value is an indication that the SKEW2 table has the wrong distribution key.

Run the table_inspector.sql script whenever you add new tables to your database or whenever you have significantly modified your tables.

Using Amazon Redshift Spectrum to Query External Data

Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster.

Topics

• Amazon Redshift Spectrum Overview (p. 148)
• Getting Started with Amazon Redshift Spectrum (p. 150)
• IAM Policies for Amazon Redshift Spectrum (p. 154)
• Creating Data Files for Queries in Amazon Redshift Spectrum (p. 164)
• Creating External Schemas for Amazon Redshift Spectrum (p. 165)
• Creating External Tables for Amazon Redshift Spectrum (p. 171)
• Improving Amazon Redshift Spectrum Query Performance (p. 179)
• Monitoring Metrics in Amazon Redshift Spectrum (p. 181)
• Troubleshooting Queries in Amazon Redshift Spectrum (p. 181)

Amazon Redshift Spectrum Overview

Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Thus, Redshift Spectrum queries use much less of your cluster's processing capacity than other queries. Redshift Spectrum also scales intelligently. Based on the demands of your queries, Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing.

You create Redshift Spectrum tables by defining the structure for your files and registering them as
tables in an external data catalog. The external data catalog can be AWS Glue, the data catalog that
comes with Amazon Athena, or your own Apache Hive metastore. You can create and manage external
tables either from Amazon Redshift using data definition language (DDL) commands or using any other
tool that connects to the external data catalog. Changes to the external data catalog are immediately
available to any of your Amazon Redshift clusters.

Optionally, you can partition the external tables on one or more columns. Defining partitions as part
of the external table can improve performance. The improvement occurs because the Amazon Redshift
query optimizer eliminates partitions that don't contain data for the query.
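
As a rough illustration of partition elimination, the following Python sketch models an external table's partitions as a mapping from partition value to Amazon S3 prefix and keeps only the prefixes that satisfy a filter. The partition layout, the bucket name example-bucket, and the prune helper are hypothetical. This is not how the Amazon Redshift optimizer is implemented; it only shows why a filter on the partition column can reduce the data that has to be scanned.

```python
from datetime import date

# Hypothetical partition layout: one Amazon S3 prefix per sale date.
partitions = {
    date(2008, 1, 1): "s3://example-bucket/sales/saledate=2008-01-01/",
    date(2008, 2, 1): "s3://example-bucket/sales/saledate=2008-02-01/",
    date(2008, 3, 1): "s3://example-bucket/sales/saledate=2008-03-01/",
}


def prune(partitions, predicate):
    """Return only the prefixes whose partition value satisfies the predicate."""
    return [prefix for value, prefix in sorted(partitions.items()) if predicate(value)]


# A filter on the partition column leaves a single prefix to scan.
to_scan = prune(partitions, lambda d: d >= date(2008, 3, 1))
print(to_scan)
```

Without a filter on the partition column, every prefix stays in the list, which corresponds to scanning the whole table.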

After your Redshift Spectrum tables have been defined, you can query and join the tables just as you
do any other Amazon Redshift table. Amazon Redshift doesn't support update operations on external
tables. You can add Redshift Spectrum tables to multiple Amazon Redshift clusters and query the same
data on Amazon S3 from any cluster in the same AWS Region. When you update Amazon S3 data files,
the data is immediately available for query from any of your Amazon Redshift clusters.

The AWS Glue Data Catalog that you access might be encrypted to increase security. If the AWS Glue

catalog is encrypted, you need the AWS Key Management Service (AWS KMS) key for AWS Glue to access

the AWS Glue catalog. AWS Glue catalog encryption is not available in all AWS Regions. For a list of

supported AWS Regions, see Encryption and Secure Access for AWS Glue in the AWS Glue Developer

Guide. For more information about AWS Glue Data Catalog encryption, see Encrypting Your AWS Glue

Data Catalog in the AWS Glue Developer Guide.

Note

You can't view details for Redshift Spectrum tables using the same resources that you use for
standard Amazon Redshift tables, such as PG_TABLE_DEF (p. 940), STV_TBL_PERM (p. 886),
PG_CLASS, or information_schema. If your business intelligence or analytics tool doesn't
recognize Redshift Spectrum external tables, configure your application to query
SVV_EXTERNAL_TABLES (p. 904) and SVV_EXTERNAL_COLUMNS (p. 902).

Amazon Redshift Spectrum Regions

Redshift Spectrum is available only in the following AWS Regions:

• US East (N. Virginia) Region (us-east-1)

• US East (Ohio) Region (us-east-2)

• US West (N. California) Region (us-west-1)

• US West (Oregon) Region (us-west-2)

• Asia Pacific (Mumbai) Region (ap-south-1)

• Asia Pacific (Seoul) Region (ap-northeast-2)

• Asia Pacific (Singapore) Region (ap-southeast-1)

• Asia Pacific (Sydney) Region (ap-southeast-2)

• Asia Pacific (Tokyo) Region (ap-northeast-1)

• Canada (Central) Region (ca-central-1)

• EU (Frankfurt) Region (eu-central-1)

• EU (Ireland) Region (eu-west-1)

• EU (London) Region (eu-west-2)

• South America (São Paulo) Region (sa-east-1)

Amazon Redshift Spectrum Considerations

Note the following considerations when you use Amazon Redshift Spectrum:

• The Amazon Redshift cluster and the Amazon S3 bucket must be in the same AWS Region.

• If your cluster uses Enhanced VPC Routing, you might need to perform additional configuration steps.
For more information, see Using Amazon Redshift Spectrum with Enhanced VPC Routing.

• External tables are read-only. You can't perform insert, update, or delete operations on external tables.

• You can't control user permissions on an external table. Instead, you can grant and revoke permissions

on the external schema.

• To run Redshift Spectrum queries, the database user must have permission to create temporary tables

in the database. The following example grants temporary permission on the database spectrumdb to

the spectrumusers user group.

grant temp on database spectrumdb to group spectrumusers;

For more information, see GRANT (p. 516).

• When using the Athena data catalog or AWS Glue Data Catalog, the following limits apply:

• A maximum of 10,000 databases per account.

• A maximum of 100,000 tables per database.

• A maximum of 1,000,000 partitions per table.

• A maximum of 10,000,000 partitions per account.

You can request a limit increase by contacting AWS Support.

These limits don’t apply to an Apache Hive metastore.

For more information, see Creating External Schemas for Amazon Redshift Spectrum (p. 165).

Getting Started with Amazon Redshift Spectrum

In this tutorial, you learn how to use Amazon Redshift Spectrum to query data directly from files on
Amazon S3. If you already have a cluster and a SQL client, you can complete this tutorial in ten minutes
or less.

Note

Redshift Spectrum queries incur additional charges. The cost of running the sample queries in

this tutorial is nominal. For more information about pricing, see Redshift Spectrum Pricing.

Prerequisites

To use Redshift Spectrum, you need an Amazon Redshift cluster and a SQL client that's connected to
your cluster so that you can execute SQL commands. The cluster and the data files in Amazon S3 must
be in the same AWS Region. For this example, the sample data is in the US West (Oregon) Region
(us-west-2), so you need a cluster that is also in us-west-2. If you don't have an Amazon Redshift
cluster, you can create a new cluster in us-west-2 and install a SQL client by following the steps in
Getting Started with Amazon Redshift.

If you already have a cluster, your cluster needs to be version 1.0.1294 or later to use Amazon Redshift
Spectrum. To find the version number for your cluster, run the following command.

select version();

To force your cluster to update to the latest cluster version, adjust your maintenance window.
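
If you want to check the result of select version() in a script, a comparison along the following lines is one option. This Python sketch assumes that the returned text ends with a fragment of the form "Redshift 1.0.NNNN"; the sample string below is illustrative and the helper names are hypothetical, so verify the format against the output of your own cluster.

```python
import re

MINIMUM = (1, 0, 1294)  # minimum cluster version stated above


def redshift_version(version_string):
    """Extract (major, minor, build) from the text returned by select version()."""
    match = re.search(r"Redshift (\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("no Redshift version found in: " + version_string)
    return tuple(int(part) for part in match.groups())


def supports_spectrum(version_string):
    """Compare the parsed version with the documented minimum."""
    return redshift_version(version_string) >= MINIMUM


# Assumed sample output; the real string depends on your cluster.
sample = ("PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) "
          "3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.1499")
print(supports_spectrum(sample))
```

Comparing the parts as a tuple of integers avoids the pitfalls of comparing version strings lexically.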

Steps to Get Started

To get started using Amazon Redshift Spectrum, follow these steps:

•Step 1. Create an IAM Role for Amazon Redshift (p. 150)

•Step 2: Associate the IAM Role with Your Cluster (p. 151)

•Step 3: Create an External Schema and an External Table (p. 152)

•Step 4: Query Your Data in Amazon S3 (p. 152)

Step 1. Create an IAM Role for Amazon Redshift

Your cluster needs authorization to access your external data catalog in AWS Glue or Amazon Athena and
your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access
Management (IAM) role that is attached to your cluster. For more information about using roles with
Amazon Redshift, see Authorizing COPY and UNLOAD Operations Using IAM Roles.

Note

If your cluster is in an AWS Region where AWS Glue is supported and you have Redshift

Spectrum external tables in the Athena data catalog, you can migrate your Athena data catalog

to an AWS Glue Data Catalog. To use the AWS Glue Data Catalog with Redshift Spectrum, you

might need to change your IAM policies. For more information, see Upgrading to the AWS Glue

Data Catalog in the Athena User Guide.

To create an IAM role for Amazon Redshift

1. Open the IAM console.

2. In the navigation pane, choose Roles.

3. Choose Create role.

4. Choose AWS service, and then choose Redshift.

5. Under Select your use case, choose Redshift - Customizable and then choose Next: Permissions.

6. The Attach permissions policy page appears. Choose AmazonS3ReadOnlyAccess. Also choose
AWSGlueConsoleFullAccess if you're using the AWS Glue Data Catalog, or
AmazonAthenaFullAccess if you're using the Athena data catalog. Choose Next: Review.

Note

The AmazonS3ReadOnlyAccess policy gives your cluster read-only access to all Amazon

S3 buckets. To grant access to only the AWS sample data bucket, create a new policy and

add the following permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "arn:aws:s3:::awssampledbuswest2/*"
        }
    ]
}

7. For Role name, type a name for your role, for example mySpectrumRole.

8. Review the information, and then choose Create role.

9. In the navigation pane, choose Roles. Choose the name of your new role to view the summary, and
then copy the Role ARN to your clipboard. This value is the Amazon Resource Name (ARN) for the
role that you just created. You use that value when you create external tables to reference your data
files on Amazon S3.

Step 2: Associate the IAM Role with Your Cluster

After you have created an IAM role that authorizes Amazon Redshift to access the external data catalog

and Amazon S3 on your behalf, you must associate that role with your Amazon Redshift cluster.

To associate the IAM role with your cluster

1. Sign in to the AWS Management Console and open the Amazon Redshift console at
https://console.aws.amazon.com/redshift/.

2. In the navigation pane, choose Clusters.

3. In the list, choose the cluster that you want to manage IAM role associations for.

4. Choose Manage IAM Roles.

5. Select your IAM role from the Available roles list.

6. Choose Apply Changes to update the IAM roles that are associated with the cluster.

Step 3: Create an External Schema and an External Table

External tables must be created in an external schema. The external schema references a database in
the external data catalog and provides the IAM role ARN that authorizes your cluster to access Amazon
S3 on your behalf. You can create an external database in an Amazon Athena data catalog or an Apache
Hive metastore, such as Amazon EMR. For this example, you create the external database in an Amazon
Athena data catalog when you create the external schema in Amazon Redshift. For more information, see
Creating External Schemas for Amazon Redshift Spectrum (p. 165).

To create an external schema and an external table

1. To create an external schema, replace the IAM role ARN in the following command with the role ARN

you created in step 1 (p. 150), and then execute the command in your SQL client.

create external schema spectrum

from data catalog

database 'spectrumdb'

iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'

create external database if not exists;

2. To create an external table, run the following CREATE EXTERNAL TABLE command.

Note

The Amazon S3 bucket with the sample data for this example is located in the us-west-2
region. Your cluster and the Redshift Spectrum files must be in the same AWS Region, so,
for this example, your cluster must also be located in us-west-2.

create external table spectrum.sales(

salesid integer,

listid integer,

sellerid integer,

buyerid integer,

eventid integer,

dateid smallint,

qtysold smallint,

pricepaid decimal(8,2),

commission decimal(8,2),

saletime timestamp)

row format delimited

fields terminated by '\t'

stored as textfile

location 's3://awssampledbuswest2/tickit/spectrum/sales/'

table properties ('numRows'='172000');

Step 4: Query Your Data in Amazon S3

After your external tables are created, you can query them using the same SELECT statements that you
use to query other Amazon Redshift tables. These queries can join tables, aggregate data, and filter
on predicates.

To query your data in Amazon S3

1. Get the number of rows in the SPECTRUM.SALES table.

select count(*) from spectrum.sales;

count

------

172462

2. Keep your larger fact tables in Amazon S3 and your smaller dimension tables in Amazon Redshift,

as a best practice. If you loaded the sample data in Getting Started with Amazon Redshift, you

have a table named EVENT in your database. If not, create the EVENT table by using the following

command.

create table event(

eventid integer not null distkey,

venueid smallint not null,

catid smallint not null,

dateid smallint not null sortkey,

eventname varchar(200),

starttime timestamp);

3. Load the EVENT table by replacing the IAM role ARN in the following COPY command with the role

ARN you created in Step 1. Create an IAM Role for Amazon Redshift (p. 150).

copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt'

iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'

delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';

The following example joins the external table SPECTRUM.SALES with the local table EVENT to find
the total sales for the top 10 events.

select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid) from

spectrum.sales, event

where spectrum.sales.eventid = event.eventid

and spectrum.sales.pricepaid > 30

group by spectrum.sales.eventid

order by 2 desc;

eventid | sum

--------+---------

289 | 51846.00

7895 | 51049.00

1602 | 50301.00

851 | 49956.00

7315 | 49823.00

6471 | 47997.00

2118 | 47863.00

984 | 46780.00

7851 | 46661.00

5638 | 46280.00

4. View the query plan for the previous query. Note the S3 Seq Scan, S3 HashAggregate, and S3

Query Scan steps that were executed against the data on Amazon S3.

explain
select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
from spectrum.sales, event
where spectrum.sales.eventid = event.eventid
and spectrum.sales.pricepaid > 30
group by spectrum.sales.eventid
order by 2 desc;

QUERY PLAN
-----------------------------------------------------------------------------
XN Limit (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
  -> XN Merge (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
       Merge Key: sum(sales.derived_col2)
       -> XN Network (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
            Send to leader
            -> XN Sort (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
                 Sort Key: sum(sales.derived_col2)
                 -> XN HashAggregate (cost=1055770620.49..1055770620.99 rows=200 width=31)
                      -> XN Hash Join DS_BCAST_INNER (cost=3119.97..1055769620.49 rows=200000 width=31)
                           Hash Cond: ("outer".derived_col1 = "inner".eventid)
                           -> XN S3 Query Scan sales (cost=3010.00..5010.50 rows=200000 width=31)
                                -> S3 HashAggregate (cost=3010.00..3010.50 rows=200000 width=16)
                                     -> S3 Seq Scan spectrum.sales location:"s3://awssampledbuswest2/tickit/spectrum/sales" format:TEXT (cost=0.00..2150.00 rows=172000 width=16)
                                          Filter: (pricepaid > 30.00)
                           -> XN Hash (cost=87.98..87.98 rows=8798 width=4)
                                -> XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=4)

IAM Policies for Amazon Redshift Spectrum

By default, Amazon Redshift Spectrum uses the AWS Glue Data Catalog in AWS Regions that support
AWS Glue. In other AWS Regions, Redshift Spectrum uses the Athena data catalog. Your cluster needs
authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon
S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role
that is attached to your cluster. If you use an Apache Hive metastore to manage your data catalog, you
don't need to provide access to Athena.

You can chain roles so that your cluster can assume other roles not attached to the cluster. For more

information, see Chaining IAM Roles in Amazon Redshift Spectrum (p. 158).

The AWS Glue catalog that you access might be encrypted to increase security. If the AWS Glue catalog

is encrypted, you need the AWS KMS key for AWS Glue to access the AWS Glue catalog. For more

information, see Encrypting Your AWS Glue Data Catalog in the AWS Glue Developer Guide.

Topics

•Amazon S3 Permissions (p. 155)

•Cross-Account Amazon S3 Permissions (p. 156)

•Policies to Grant or Restrict Redshift Spectrum Access (p. 156)

•Policies to Grant Minimum Permissions (p. 157)

•Chaining IAM Roles in Amazon Redshift Spectrum (p. 158)

•Controlling Access to the AWS Glue Data Catalog (p. 158)

Note

If you currently have Redshift Spectrum external tables in the Athena data catalog, you can

migrate your Athena data catalog to an AWS Glue Data Catalog. To use the AWS Glue Data

Catalog with Redshift Spectrum, you might need to change your IAM policies. For more

information, see Upgrading to the AWS Glue Data Catalog in the Athena User Guide.

Amazon S3 Permissions

At a minimum, your cluster needs GET and LIST access to your Amazon S3 bucket. If your bucket is not in

the same AWS account as your cluster, your bucket must also authorize your cluster to access the data.

For more information, see Authorizing Amazon Redshift to Access Other AWS Services on Your Behalf.

Note

The Amazon S3 bucket can't use a bucket policy that restricts access to only specific VPC
endpoints.

The following policy grants GET and LIST access to any Amazon S3 bucket. The policy allows access to

Amazon S3 buckets for Redshift Spectrum as well as COPY and UNLOAD operations.

{

"Version": "2012-10-17",

"Statement": [{

"Effect": "Allow",

"Action": ["s3:Get*", "s3:List*"],

"Resource": "*"

}]

}

The following policy grants GET and LIST access to your Amazon S3 bucket named myBucket.

{

"Version": "2012-10-17",

"Statement": [{

"Effect": "Allow",

"Action": ["s3:Get*", "s3:List*"],

"Resource": "arn:aws:s3:::myBucket/*"

}]

}

Cross-Account Amazon S3 Permissions

To grant Redshift Spectrum permission to access data in an Amazon S3 bucket that belongs to another

AWS account, add the following policy to the Amazon S3 bucket. For more information, see Granting

Cross-Account Bucket Permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Example permissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::redshift-account:role/spectrumrole"
            },
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": [
                "arn:aws:s3:::bucketname",
                "arn:aws:s3:::bucketname/*"
            ]
        }
    ]
}

Policies to Grant or Restrict Redshift Spectrum Access

To grant access to an Amazon S3 bucket only through Redshift Spectrum, include a condition that allows
access for the user agent AWS Redshift/Spectrum. The following policy allows access to Amazon S3
buckets only for Redshift Spectrum. It excludes other access, such as COPY and UNLOAD operations.

{

"Version": "2012-10-17",

"Statement": [{

"Effect": "Allow",

"Action": ["s3:Get*", "s3:List*"],

"Resource": "arn:aws:s3:::myBucket/*",

"Condition": {"StringEquals": {"aws:UserAgent": "AWS Redshift/Spectrum"}}

}]

}

Similarly, you might want to create an IAM role that allows access for COPY and UNLOAD operations, but

excludes Redshift Spectrum access. To do so, include a condition that denies access for the user agent

"AWS Redshift/Spectrum". The following policy allows access to an Amazon S3 bucket with the exception

of Redshift Spectrum.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:Get*", "s3:List*"],
        "Resource": "arn:aws:s3:::myBucket/*",
        "Condition": {"StringNotEquals": {"aws:UserAgent": "AWS Redshift/Spectrum"}}
    }]
}
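
The two policies in this section differ only in the condition operator. As a minimal sketch of that relationship, the following Python code builds either variant with the standard library. The function name bucket_policy and its spectrum_only parameter are hypothetical, and the generated document still has to be attached to a role with a tool of your choice.

```python
import json

SPECTRUM_USER_AGENT = "AWS Redshift/Spectrum"


def bucket_policy(bucket, spectrum_only):
    """Allow GET and LIST on a bucket's objects either only for Redshift Spectrum
    (spectrum_only=True) or for requests other than Redshift Spectrum."""
    operator = "StringEquals" if spectrum_only else "StringNotEquals"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:Get*", "s3:List*"],
            "Resource": "arn:aws:s3:::" + bucket + "/*",
            "Condition": {operator: {"aws:UserAgent": SPECTRUM_USER_AGENT}},
        }],
    }


spectrum_policy = bucket_policy("myBucket", spectrum_only=True)
copy_unload_policy = bucket_policy("myBucket", spectrum_only=False)
print(json.dumps(copy_unload_policy, indent=2))
```

Building the document as a dictionary and serializing it with json.dumps avoids hand-editing mistakes such as an unclosed array.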

Policies to Grant Minimum Permissions

The following policy grants the minimum permissions required to use Redshift Spectrum with Amazon

S3, AWS Glue, and Athena.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListMultipartUploadParts",
                "s3:ListBucket",
                "s3:ListBucketMultipartUploads"
            ],
            "Resource": [
                "arn:aws:s3:::bucketname",
                "arn:aws:s3:::bucketname/folder1/folder2/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:UpdateDatabase",
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:BatchDeleteTable",
                "glue:UpdateTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:CreatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:UpdatePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

If you use Athena for your data catalog instead of AWS Glue, the policy requires full Athena access. The

following policy grants access to Athena resources. If your external database is in a Hive metastore, you

don't need Athena access.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["athena:*"],
        "Resource": ["*"]
    }]
}

Chaining IAM Roles in Amazon Redshift Spectrum

When you attach a role to your cluster, your cluster can assume that role to access Amazon S3, Athena,

and AWS Glue on your behalf. If a role attached to your cluster doesn't have access to the necessary

resources, you can chain another role, possibly belonging to another account. Your cluster then

temporarily assumes the chained role to access the data. You can also grant cross-account access by

chaining roles. You can chain a maximum of 10 roles. Each role in the chain assumes the next role in the
chain, until the cluster assumes the role at the end of the chain.

To chain roles, you establish a trust relationship between the roles. A role that assumes another role
must have a permissions policy that allows it to assume the specified role. In turn, the role that passes
permissions must have a trust policy that allows it to pass its permissions to another role. For more
information, see Chaining IAM Roles in Amazon Redshift.

When you run the CREATE EXTERNAL SCHEMA command, you can chain roles by including a comma-

separated list of role ARNs.

Note

The list of chained roles must not include spaces.

In the following example, MyRedshiftRole is attached to the cluster. MyRedshiftRole assumes the

role AcmeData, which belongs to account 111122223333.

create external schema acme from data catalog
database 'acmedb' region 'us-west-2'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole,arn:aws:iam::111122223333:role/AcmeData';
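
Because the role list must be comma-separated, contain no spaces, and hold at most 10 roles, it can be convenient to assemble the iam_role value in code. The following Python sketch does that; the helper name chain_roles is hypothetical, and the check on the arn:aws:iam:: prefix is a simplifying assumption that doesn't cover other AWS partitions.

```python
MAX_CHAINED_ROLES = 10  # limit stated above


def chain_roles(role_arns):
    """Join role ARNs into the comma-separated, space-free form used by iam_role."""
    if not 1 <= len(role_arns) <= MAX_CHAINED_ROLES:
        raise ValueError("expected between 1 and 10 role ARNs")
    cleaned = [arn.strip() for arn in role_arns]
    for arn in cleaned:
        # Assumption: standard-partition IAM ARNs only.
        if " " in arn or "," in arn or not arn.startswith("arn:aws:iam::"):
            raise ValueError("not a usable role ARN: " + repr(arn))
    return ",".join(cleaned)


iam_role = chain_roles([
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
    "arn:aws:iam::111122223333:role/AcmeData",
])
print(iam_role)
```

The first ARN in the list is the role attached to the cluster, and each later ARN is the role assumed next in the chain.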

Controlling Access to the AWS Glue Data Catalog

If you use AWS Glue for your data catalog, you can apply fine-grained access control to the data catalog
with your IAM policy. For example, you might want to expose only a few databases and tables to a
specific IAM role.

The following sections describe the IAM policies for various levels of access to data stored in the AWS

Glue Data Catalog.

Topics

•Policy for Database Operations (p. 158)

•Policy for Table Operations (p. 159)

•Policy for Partition Operations (p. 162)

Policy for Database Operations

If you want to give users permissions to view and create a database, they need access rights to both the

database and the AWS Glue Data Catalog.

The following example query creates a database.

CREATE EXTERNAL SCHEMA example_db
FROM DATA CATALOG DATABASE 'example_db' region 'us-west-2'
IAM_ROLE 'arn:aws:iam::redshift-account:role/spectrumrole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

The following IAM policy gives the minimum permissions required for creating a database.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:catalog"
            ]
        }
    ]
}

The following example query lists the current databases.

SELECT * FROM SVV_EXTERNAL_DATABASES WHERE

databasename = 'example_db1' or databasename = 'example_db2';

The following IAM policy gives the minimum permissions required to list the current databases.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:database/example_db1",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db2",
                "arn:aws:glue:us-west-2:redshift-account:catalog"
            ]
        }
    ]
}

Policy for Table Operations

If you want to give users permissions to view, create, drop, alter, or take other actions on tables, they

need access to the tables, the databases they belong to, and the catalog.

The following example query creates an external table.

CREATE EXTERNAL TABLE example_db.example_tbl0(

col0 INT,

col1 VARCHAR(255)

) PARTITIONED BY (part INT) STORED AS TEXTFILE

LOCATION 's3://test/s3/location/';

The following IAM policy gives the minimum permissions required to create an external table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0"
            ]
        }
    ]
}
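
The Resource lists in these policies follow a regular pattern: the catalog, then the database, then each table. As a small sketch of that pattern, the following Python helper assembles the ARNs from their parts. The function name glue_resources is hypothetical, and redshift-account is the same placeholder for the AWS account ID that the policies in this section use.

```python
def glue_resources(region, account, database, tables=()):
    """List the catalog, database, and table ARNs for an IAM policy statement."""
    prefix = "arn:aws:glue:" + region + ":" + account + ":"
    arns = [prefix + "catalog", prefix + "database/" + database]
    arns += [prefix + "table/" + database + "/" + table for table in tables]
    return arns


resources = glue_resources("us-west-2", "redshift-account", "example_db", ["example_tbl0"])
for arn in resources:
    print(arn)
```

Generating the list this way keeps the table ARNs consistent with the database they belong to.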

The following example queries each list the current external tables.

SELECT * FROM svv_external_tables

WHERE tablename = 'example_tbl0' OR

tablename = 'example_tbl1';

SELECT * FROM svv_external_columns

WHERE tablename = 'example_tbl0' OR

tablename = 'example_tbl1';

SELECT parameters FROM svv_external_tables

WHERE tablename = 'example_tbl0' OR

tablename = 'example_tbl1';

The following IAM policy gives the minimum permissions required to list the current external tables.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTables"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl1"
            ]
        }
    ]
}

The following example query alters an existing table.

ALTER TABLE example_db.example_tbl0

SET TABLE PROPERTIES ('numRows' = '100');

The following IAM policy gives the minimum permissions required to alter an existing table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0"
            ]
        }
    ]
}

The following example query drops an existing table.

DROP TABLE example_db.example_tbl0;

The following IAM policy gives the minimum permissions required to drop an existing table.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0"
            ]
        }
    ]
}

Policy for Partition Operations

If you want to give users permissions to perform partition-level operations (view, create, drop, alter, and

so on), they need permissions to the tables that the partitions belong to. They also need permissions to

the related databases and the AWS Glue Data Catalog.

The following example query creates a partition.

ALTER TABLE example_db.example_tbl0
ADD PARTITION (part=0) LOCATION 's3://test/s3/location/part=0/';

ALTER TABLE example_db.example_tbl0
ADD PARTITION (part=1) LOCATION 's3://test/s3/location/part=1/';

The following IAM policy gives the minimum permissions required to create a partition.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:BatchCreatePartition"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0"
            ]
        }
    ]
}

The following example query lists the current partitions.

SELECT * FROM svv_external_partitions
WHERE schemaname = 'example_db' AND
tablename = 'example_tbl0';

The following IAM policy gives the minimum permissions required to list the current partitions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetPartitions",
                "glue:GetTables",
                "glue:GetTable"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0"
            ]
        }
    ]
}

The following example query alters an existing partition.

ALTER TABLE example_db.example_tbl0 PARTITION(part='0')

SET LOCATION 's3://test/s3/new/location/part=0/';

The following IAM policy gives the minimum permissions required to alter an existing partition.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetPartition",
                "glue:UpdatePartition"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0"
            ]
        }
    ]
}

The following example query drops an existing partition.

ALTER TABLE example_db.example_tbl0 DROP PARTITION(part='0');

The following IAM policy gives the minimum permissions required to drop an existing partition.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:DeletePartition"
            ],
            "Resource": [
                "arn:aws:glue:us-west-2:redshift-account:catalog",
                "arn:aws:glue:us-west-2:redshift-account:database/example_db",
                "arn:aws:glue:us-west-2:redshift-account:table/example_db/example_tbl0"
            ]
        }
    ]
}

Creating Data Files for Queries in Amazon Redshift Spectrum

The data files that you use for queries in Amazon Redshift Spectrum are commonly the same types of
files that you use for other applications such as Amazon Athena, Amazon EMR, and Amazon QuickSight.
If the files are in a format that Redshift Spectrum supports and are located in an Amazon S3
bucket that your cluster can access, you can query the data in its original format directly from Amazon
S3.

The Amazon S3 bucket with the data files and the Amazon Redshift cluster must be in the same
AWS Region. For information about supported AWS Regions, see Amazon Redshift Spectrum
Regions (p. 149).

Redshift Spectrum supports the following structured and semistructured data formats:

• AVRO

• PARQUET

• TEXTFILE

• SEQUENCEFILE

• RCFILE

• RegexSerDe

• Optimized row columnar (ORC)

• Grok

• OpenCSV

• Ion

• JSON

Note

Timestamp values in text files must be in the format yyyy-MM-dd HH:mm:ss.SSSSSS, as the
following timestamp value shows: 2017-05-01 11:30:59.000000.
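
If you generate text files from a program, you can produce timestamps in this layout with ordinary date formatting. The following Python sketch is one way to do it; the constant SPECTRUM_TIMESTAMP and the helper format_timestamp are hypothetical names, and the strftime pattern is my translation of the format string above.

```python
from datetime import datetime

# strftime equivalent of yyyy-MM-dd HH:mm:ss.SSSSSS (%f is six-digit microseconds).
SPECTRUM_TIMESTAMP = "%Y-%m-%d %H:%M:%S.%f"


def format_timestamp(value):
    """Render a datetime in the layout required for text files."""
    return value.strftime(SPECTRUM_TIMESTAMP)


formatted = format_timestamp(datetime(2017, 5, 1, 11, 30, 59))
print(formatted)  # 2017-05-01 11:30:59.000000
```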

We recommend using a columnar storage file format, such as Parquet. With a columnar storage file
format, you can minimize data transfer out of Amazon S3 by selecting only the columns you need.

Compression

To reduce storage space, improve performance, and minimize costs, we strongly recommend
compressing your data files. Redshift Spectrum recognizes file compression types based on the file
extension.

Redshift Spectrum supports the following compression types and extensions:

• gzip – .gz

• Snappy – .snappy

• bzip2 – .bz2
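Because the compression type is inferred from the file extension, it can help to check object names before you upload them. The following sketch encodes only the three extensions listed above. The function name is hypothetical, and the exact, case-sensitive match is an assumption of this sketch, not documented Redshift Spectrum behavior.

```python
import os

# Extension-to-compression mapping taken from the list above.
COMPRESSION_BY_EXTENSION = {
    ".gz": "gzip",
    ".snappy": "Snappy",
    ".bz2": "bzip2",
}


def compression_type(key):
    """Return the compression type implied by the file extension, or None."""
    _, extension = os.path.splitext(key)
    return COMPRESSION_BY_EXTENSION.get(extension)


for name in ("sales/part-0000.txt.gz", "sales/part-0001.txt.bz2", "sales/part-0002.txt"):
    print(name, "->", compression_type(name))
```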

Redshift Spectrum transparently decrypts data ﬁles that are encrypted using the following encryption

options:

• Server-side encryption (SSE-S3) using an AES-256 encryption key managed by Amazon S3.

• Server-side encryption with keys managed by AWS Key Management Service (SSE-KMS).

Redshift Spectrum doesn't support Amazon S3 client-side encryption. For more information, see

Protecting Data Using Server-Side Encryption.

Amazon Redshift uses massively parallel processing (MPP) to achieve fast execution of complex queries

operating on large amounts of data. Redshift Spectrum extends the same principle to query external

data, using multiple Redshift Spectrum instances as needed to scan ﬁles. Place the ﬁles in a separate

folder for each table.

You can optimize your data for parallel processing by following these practices:

• Break large ﬁles into many smaller ﬁles. We recommend using ﬁle sizes of 64 MB or larger. Store ﬁles

for a table in the same folder.

• Keep all the ﬁles about the same size. If some ﬁles are much larger than others, Redshift Spectrum

can't distribute the workload evenly.
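The two practices above can be checked before loading with a small script. The following is a minimal sketch, assuming you already have the object sizes in bytes (for example, from an S3 listing). The 64 MB figure is interpreted here as 64 * 1024 * 1024 bytes, and the `max_ratio` skew threshold is a hypothetical value chosen for illustration, not a documented limit.

```python
MIN_RECOMMENDED_BYTES = 64 * 1024 * 1024  # 64 MB, per the recommendation above


def review_file_sizes(sizes, max_ratio=2.0):
    """Return (names of files under 64 MB, whether sizes look skewed).

    sizes: dict mapping file name to size in bytes.
    max_ratio: hypothetical largest/smallest ratio treated as skew.
    """
    if not sizes:
        return [], False
    too_small = sorted(
        name for name, size in sizes.items() if size < MIN_RECOMMENDED_BYTES
    )
    largest = max(sizes.values())
    smallest = min(sizes.values())
    skewed = smallest == 0 or largest / smallest > max_ratio
    return too_small, skewed


MB = 1024 * 1024
print(review_file_sizes({"a.gz": 128 * MB, "b.gz": 130 * MB, "c.gz": 10 * MB}))
```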

Creating External Schemas for Amazon Redshift Spectrum

All external tables must be created in an external schema, which you create using a CREATE EXTERNAL

SCHEMA (p. 449) statement.

Note

Some applications use the term database and schema interchangeably. In Amazon Redshift, we

use the term schema.

An Amazon Redshift external schema references an external database in an external data catalog. You

can create the external database in Amazon Redshift, in Amazon Athena, or in an Apache Hive metastore,

such as Amazon EMR. If you create an external database in Amazon Redshift, the database resides in the

Athena data catalog. To create a database in a Hive metastore, you need to create the database in your

Hive application.

Amazon Redshift needs authorization to access the data catalog in Athena and the data files in Amazon S3 on your behalf. To provide that authorization, you first create an AWS Identity and Access Management (IAM) role. Then you attach the role to your cluster and provide the Amazon Resource Name (ARN) for the role in the Amazon Redshift CREATE EXTERNAL SCHEMA statement. For more information about authorization, see IAM Policies for Amazon Redshift Spectrum (p. 154).

Note

If you currently have Redshift Spectrum external tables in the Athena data catalog, you can

migrate your Athena data catalog to an AWS Glue Data Catalog. To use an AWS Glue Data

Catalog with Redshift Spectrum, you might need to change your IAM policies. For more

information, see Upgrading to the AWS Glue Data Catalog in the Athena User Guide.

To create an external database at the same time you create an external schema, specify FROM DATA

CATALOG and include the CREATE EXTERNAL DATABASE clause in your CREATE EXTERNAL SCHEMA

statement.

The following example creates an external schema named spectrum_schema using the external

database spectrum_db.

create external schema spectrum_schema from data catalog

database 'spectrum_db'

iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'

create external database if not exists;

If you manage your data catalog using Athena, specify the Athena database name and the AWS Region in

which the Athena data catalog is located.

The following example creates an external schema using the default sampledb database in the Athena

data catalog.

create external schema athena_schema from data catalog

database 'sampledb'

iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'

region 'us-east-2';

Note

The region parameter references the AWS Region in which the Athena data catalog is located,

not the location of the data ﬁles in Amazon S3.

When using the Athena data catalog, the following limits apply:

• A maximum of 100 databases per account.

• A maximum of 100 tables per database.

• A maximum of 20,000 partitions per table.

You can request a limit increase by contacting AWS Support.

To avoid the limits, use a Hive metastore instead of an Athena data catalog.

If you manage your data catalog using a Hive metastore, such as Amazon EMR, your security groups must

be conﬁgured to allow traﬃc between the clusters.

In the CREATE EXTERNAL SCHEMA statement, specify FROM HIVE METASTORE and include the

metastore's URI and port number. The following example creates an external schema using a Hive

metastore database named hive_db.

create external schema hive_schema

from hive metastore

database 'hive_db'

uri '172.10.10.10' port 99

iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole';
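The data catalog and Hive metastore examples in this section differ only in a few clauses. The following Python sketch renders either form from parameters, following the clause order used in the examples. The function and its parameter names are hypothetical, and it performs no quoting or validation, so it assumes trusted input.

```python
def create_external_schema_sql(schema, database, iam_role, *, region=None,
                               hive_uri=None, hive_port=None,
                               create_database=False):
    """Render a CREATE EXTERNAL SCHEMA statement (illustrative helper only)."""
    lines = [f"create external schema {schema}"]
    if hive_uri is not None:
        lines.append("from hive metastore")
    else:
        lines.append("from data catalog")
    lines.append(f"database '{database}'")
    if hive_uri is not None:
        uri_clause = f"uri '{hive_uri}'"
        if hive_port is not None:
            uri_clause += f" port {hive_port}"
        lines.append(uri_clause)
    lines.append(f"iam_role '{iam_role}'")
    if region is not None:
        lines.append(f"region '{region}'")
    if create_database:
        lines.append("create external database if not exists")
    return "\n".join(lines) + ";"


role = "arn:aws:iam::123456789012:role/MySpectrumRole"
print(create_external_schema_sql("athena_schema", "sampledb", role,
                                 region="us-east-2"))
print(create_external_schema_sql("hive_schema", "hive_db", role,
                                 hive_uri="172.10.10.10", hive_port=99))
```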

To view external schemas for your cluster, query the PG_EXTERNAL_SCHEMA catalog table or the

SVV_EXTERNAL_SCHEMAS view. The following example queries SVV_EXTERNAL_SCHEMAS, which joins

PG_EXTERNAL_SCHEMA and PG_NAMESPACE.

select * from svv_external_schemas;

For the full command syntax and examples, see CREATE EXTERNAL SCHEMA (p. 449).

Working with Amazon Redshift Spectrum External Catalogs

The metadata for Amazon Redshift Spectrum external databases and external tables is stored in an

external data catalog. By default, Redshift Spectrum metadata is stored in an Athena data catalog. You

can view and manage Redshift Spectrum databases and tables in your Athena console.

You can also create and manage external databases and external tables with Hive data definition language (DDL), using Athena or a Hive metastore, such as Amazon EMR.

Note

We recommend using Amazon Redshift to create and manage external databases and external

tables in Redshift Spectrum.

Viewing Redshift Spectrum Databases in Athena

If you created an external database by including the CREATE EXTERNAL DATABASE IF NOT EXISTS clause

as part of your CREATE EXTERNAL SCHEMA statement, the external database metadata is stored in your

Athena data catalog. The metadata for external tables that you create qualiﬁed by the external schema is

also stored in your Athena data catalog.

Athena maintains a data catalog for each supported AWS Region. To view table metadata, log on to

the Athena console and choose Catalog Manager. The following example shows the Athena Catalog

Manager for the US West (Oregon) Region.

If you create and manage your external tables using Athena, register the database using CREATE

EXTERNAL SCHEMA. For example, the following command registers the Athena database named

sampledb.

create external schema athena_sample

from data catalog

database 'sampledb'

iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole'

region 'us-east-1';

When you query the SVV_EXTERNAL_TABLES system view, you see tables in the Athena sampledb

database and also tables that you created in Amazon Redshift.

select * from svv_external_tables;

schemaname    | tablename        | location
--------------+------------------+--------------------------------------------------------
athena_sample | elb_logs         | s3://athena-examples/elb/plaintext
athena_sample | lineitem_1t_csv  | s3://myspectrum/tpch/1000/lineitem_csv
athena_sample | lineitem_1t_part | s3://myspectrum/tpch/1000/lineitem_partition
spectrum      | sales            | s3://awssampledbuswest2/tickit/spectrum/sales
spectrum      | sales_part       | s3://awssampledbuswest2/tickit/spectrum/sales_part

Registering an Apache Hive Metastore Database

If you create external tables in an Apache Hive metastore, you can use CREATE EXTERNAL SCHEMA to register those tables in Redshift Spectrum.

In the CREATE EXTERNAL SCHEMA statement, specify the FROM HIVE METASTORE clause and provide

the Hive metastore URI and port number. The IAM role must include permission to access Amazon S3 but

doesn't need any Athena permissions. The following example registers a Hive metastore.

create external schema if not exists hive_schema

from hive metastore

database 'hive_database'

uri 'ip-10-0-111-111.us-west-2.compute.internal' port 9083

iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole';

Enabling Your Amazon Redshift Cluster to Access Your Amazon EMR Cluster

If your Hive metastore is in Amazon EMR, you must give your Amazon Redshift cluster access to your Amazon EMR cluster. To do so, you create an Amazon EC2 security group and allow all inbound traffic to the EC2 security group from your Amazon Redshift cluster's security group and your Amazon EMR cluster's security group. Then you add the EC2 security group to both your Amazon Redshift cluster and your Amazon EMR cluster.

To enable your Amazon Redshift cluster to access your Amazon EMR cluster

1. In Amazon Redshift, make a note of your cluster's security group name. In the Amazon Redshift

dashboard, choose your cluster. Find your cluster security groups in the Cluster Properties group.

2. In Amazon EMR, make a note of the EMR master node security group name.

3. Create or modify an Amazon EC2 security group to allow connection between Amazon Redshift and

Amazon EMR:

1. In the Amazon EC2 dashboard, choose Security Groups.

2. Choose Create Security Group.

3. If using VPC, choose the VPC that both your Amazon Redshift and Amazon EMR clusters are in.

4. Add an inbound rule.

5. For Type, choose TCP.

6. For Source, choose Custom.

7. Type the name of your Amazon Redshift security group.

8. Add another inbound rule.

9. For Type, choose TCP.

10. For Port Range, type 9083.

Note

The default port for an EMR Hive metastore (HMS) is 9083. If your HMS uses a different port, specify that port in the inbound rule and in the external schema definition.

11. For Source, choose Custom.

12. Type the name of your Amazon EMR security group.

13. Choose Create.

4. Add the Amazon EC2 security group you created in the previous step to your Amazon Redshift

cluster and to your Amazon EMR cluster:

1. In Amazon Redshift, choose your cluster.

2. Choose Cluster, Modify.

3. In VPC Security Groups, add the new security group by pressing CTRL and choosing the new security group name.

4. In Amazon EMR, choose your cluster.

5. Under Hardware, choose the link for the Master node.

6. Choose the link in the EC2 Instance ID column.

7. Choose Actions, Networking, Change Security Groups.

8. Choose the new security group.

9. Choose Assign Security Groups.

Creating External Tables for Amazon Redshift Spectrum

Amazon Redshift Spectrum uses external tables to query data that is stored in Amazon S3. You can query

an external table using the same SELECT syntax you use with other Amazon Redshift tables. External

tables are read-only. You can't write to an external table.

You create an external table in an external schema. To create external tables, you must be the

owner of the external schema or a superuser. To transfer ownership of an external schema, use

ALTER SCHEMA (p. 364) to change the owner. The following example changes the owner of the

spectrum_schema schema to newowner.

alter schema spectrum_schema owner to newowner;

To run a Redshift Spectrum query, you need the following permissions:

• Usage permission on the schema

• Permission to create temporary tables in the current database

The following example grants usage permission on the schema spectrum_schema to the

spectrumusers user group.

grant usage on schema spectrum_schema to group spectrumusers;

The following example grants temporary permission on the database spectrumdb to the

spectrumusers user group.

grant temp on database spectrumdb to group spectrumusers;

You can create an external table in Amazon Redshift, AWS Glue, Amazon Athena, or an Apache Hive

metastore. For more information, see Getting Started Using AWS Glue in the AWS Glue Developer Guide,

Getting Started in the Amazon Athena User Guide, or Apache Hive in the Amazon EMR Developer Guide.

If your external table is deﬁned in AWS Glue, Athena, or a Hive metastore, you ﬁrst create an external

schema that references the external database. Then you can reference the external table in your

SELECT statement by preﬁxing the table name with the schema name, without needing to create the

table in Amazon Redshift. For more information, see Creating External Schemas for Amazon Redshift

Spectrum (p. 165).

For example, suppose that you have an external table named lineitem_athena deﬁned in an Athena

external catalog. In this case, you can deﬁne an external schema named athena_schema, then query

the table using the following SELECT statement.

select count(*) from athena_schema.lineitem_athena;

To deﬁne an external table in Amazon Redshift, use the CREATE EXTERNAL TABLE (p. 452) command.

The external table statement defines the table columns, the format of your data files, and the location of your data in Amazon S3. Redshift Spectrum scans the files in the specified folder and any subfolders. Redshift Spectrum ignores hidden files and files that begin with a period, underscore, or hash mark (., _, or #) or end with a tilde (~).
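To predict which objects under a table location would be skipped, you can apply the naming rule above to each object key. The following is a minimal sketch; the function name is hypothetical, and applying the rule to the final path component of the key is an assumption of this sketch.

```python
import posixpath


def is_ignored_by_name(key):
    """Return True if the file name matches the ignore rule described above.

    Assumption: the rule is evaluated against the last path component.
    """
    name = posixpath.basename(key)
    return name.startswith((".", "_", "#")) or name.endswith("~")


for key in ("tickit/spectrum/sales/part-0000.txt",
            "tickit/spectrum/sales/_SUCCESS",
            "tickit/spectrum/sales/.hidden",
            "tickit/spectrum/sales/part-0000.txt~"):
    print(key, "->", "ignored" if is_ignored_by_name(key) else "scanned")
```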

The following example creates a table named SALES in the Amazon Redshift external schema named

spectrum. The data is in tab-delimited text ﬁles.

create external table spectrum.sales(

salesid integer,

listid integer,

sellerid integer,

buyerid integer,

eventid integer,

dateid smallint,

qtysold smallint,

pricepaid decimal(8,2),

commission decimal(8,2),

saletime timestamp)

row format delimited

fields terminated by '\t'

stored as textfile

location 's3://awssampledbuswest2/tickit/spectrum/sales/'

table properties ('numRows'='172000');

To view external tables, query the SVV_EXTERNAL_TABLES (p. 904) system view.

Pseudocolumns

By default, Amazon Redshift creates external tables with the pseudocolumns $path and $size. Select these columns to view the path to the data files on Amazon S3 and the size of the data files for each row returned by a query. The $path and $size column names must be delimited with double quotation marks. A SELECT * clause doesn't return the pseudocolumns. You must explicitly include the $path and $size column names in your query, as the following example shows.

select "$path", "$size"

from spectrum.sales_part

where saledate = '2008-12-01';

You can disable creation of pseudocolumns for a session by setting the

spectrum_enable_pseudo_columns conﬁguration parameter to false.

Important

Selecting $size or $path incurs charges because Redshift Spectrum scans the data ﬁles on

Amazon S3 to determine the size of the result set. For more information, see Amazon Redshift

Pricing.

Pseudocolumns Example

The following example returns the total size of related data ﬁles for an external table.

select distinct "$path", "$size"

from spectrum.sales_part;

$path                                                                     | $size
--------------------------------------------------------------------------+-------
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/ | 1616
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/ | 1644
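The query returns one row for each distinct path, so a single total still has to be computed from those rows. The following Python sketch sums the sizes on the client side. The rows are copied from the sample output above, and the function name is illustrative only.

```python
# (path, size) pairs copied from the sample output above.
rows = [
    ("s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/", 1616),
    ("s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/", 1444),
    ("s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/", 1644),
]


def total_size(path_size_rows):
    """Sum $size over distinct ($path, $size) pairs."""
    return sum(size for _, size in set(path_size_rows))


print(total_size(rows))  # 4704
```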

Partitioning Redshift Spectrum External Tables

When you partition your data, you can restrict the amount of data that Redshift Spectrum scans by

ﬁltering on the partition key. You can partition your data by any key.

A common practice is to partition the data based on time. For example, you might choose to partition

by year, month, date, and hour. If you have data coming from multiple sources, you might partition by a

data source identiﬁer and date.

The following procedure describes how to partition your data.

To partition your data

1. Store your data in folders in Amazon S3 according to your partition key.

Create one folder for each partition value and name the folder with the partition key and value. For example, if you partition by date, you might have folders named saledate=2017-04-29, saledate=2017-04-30, and so on. Redshift Spectrum scans the files in the partition folder and any subfolders. Redshift Spectrum ignores hidden files and files that begin with a period, underscore, or hash mark (., _, or #) or end with a tilde (~).

2. Create an external table and specify the partition key in the PARTITIONED BY clause.

The partition key can't be the name of a table column. The data type can be any standard Amazon

Redshift data type except TIMESTAMPTZ.

3. Add the partitions.

Using ALTER TABLE (p. 365) … ADD PARTITION, add each partition, specifying the partition

column and key value, and the location of the partition folder in Amazon S3. You can add multiple

partitions in a single ALTER TABLE … ADD statement. The following example adds partitions for

'2008-01-01' and '2008-02-01'.

alter table spectrum.sales_part add
partition(saledate='2008-01-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/'
partition(saledate='2008-02-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';

Note

If you use the AWS Glue catalog, you can add up to 100 partitions using a single ALTER

TABLE statement.
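When a table has many partitions, writing the ALTER TABLE … ADD statements by hand is error-prone. The following Python sketch generates them in batches for a single partition key. The function name and parameters are hypothetical, the default batch size of 100 is taken from the preceding note about the AWS Glue catalog, and no escaping is performed, so the sketch assumes trusted values.

```python
def add_partition_statements(table, key, locations, batch_size=100):
    """Render ALTER TABLE ... ADD PARTITION statements in batches.

    locations: list of (partition value, S3 folder) pairs.
    """
    statements = []
    for start in range(0, len(locations), batch_size):
        batch = locations[start:start + batch_size]
        clauses = [
            f"partition({key}='{value}')\nlocation '{folder}'"
            for value, folder in batch
        ]
        statements.append(f"alter table {table} add\n" + "\n".join(clauses) + ";")
    return statements


root = "s3://awssampledbuswest2/tickit/spectrum/sales_partition/"
months = ["2008-01", "2008-02", "2008-03"]
pairs = [(month, f"{root}saledate={month}/") for month in months]
for statement in add_partition_statements("spectrum.sales_part", "saledate", pairs):
    print(statement)
```

Each generated statement ends with a single semicolon after its last LOCATION clause.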

Partitioning Data Examples

In this example, you create an external table that is partitioned by a single partition key and an external

table that is partitioned by two partition keys.

The sample data for this example is located in an Amazon S3 bucket that gives read access to all

authenticated AWS users. Your cluster and your external data ﬁles must be in the same AWS Region.

The sample data bucket is in the US West (Oregon) Region (us-west-2). To access the data using Redshift

Spectrum, your cluster must also be in us-west-2. To list the folders in Amazon S3, run the following

command.

aws s3 ls s3://awssampledbuswest2/tickit/spectrum/sales_partition/

PRE saledate=2008-01/

PRE saledate=2008-02/

PRE saledate=2008-03/

If you don't already have an external schema, run the following command. Substitute the Amazon

Resource Name (ARN) for your AWS Identity and Access Management (IAM) role.

create external schema spectrum

from data catalog

database 'spectrumdb'

iam_role 'arn:aws:iam::123456789012:role/myspectrumrole'

create external database if not exists;

Example 1: Partitioning with a Single Partition Key

In the following example, you create an external table that is partitioned by month.

To create an external table partitioned by month, run the following command.

create external table spectrum.sales_part(

salesid integer,

listid integer,

sellerid integer,

buyerid integer,

eventid integer,

dateid smallint,

qtysold smallint,

pricepaid decimal(8,2),

commission decimal(8,2),

saletime timestamp)

partitioned by (saledate char(10))

row format delimited

fields terminated by '|'

stored as textfile

location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/'

table properties ('numRows'='172000');

To add the partitions, run the following ALTER TABLE commands.

alter table spectrum.sales_part add

partition(saledate='2008-01')

location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/'

partition(saledate='2008-02')

location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/'

partition(saledate='2008-03')

location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/';

Run the following query to select data from the partitioned table.

select top 5 spectrum.sales_part.eventid, sum(spectrum.sales_part.pricepaid)

from spectrum.sales_part, event

where spectrum.sales_part.eventid = event.eventid

and spectrum.sales_part.pricepaid > 30

and saledate = '2008-01'

group by spectrum.sales_part.eventid

order by 2 desc;

eventid | sum

--------+---------

4124 | 21179.00

1924 | 20569.00

2294 | 18830.00

2260 | 17669.00

6032 | 17265.00

To view external table partitions, query the SVV_EXTERNAL_PARTITIONS (p. 903) system view.

select schemaname, tablename, values, location from svv_external_partitions

where tablename = 'sales_part';

schemaname | tablename  | values      | location
-----------+------------+-------------+-------------------------------------------------------------------------
spectrum   | sales_part | ["2008-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01
spectrum   | sales_part | ["2008-02"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02
spectrum   | sales_part | ["2008-03"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03

Example 2: Partitioning with Multiple Partition Keys

To create an external table partitioned by month and event (the salesmonth and event partition keys), run the following command.

create external table spectrum.sales_event(

salesid integer,

listid integer,

sellerid integer,

buyerid integer,

eventid integer,

dateid smallint,

qtysold smallint,

pricepaid decimal(8,2),

commission decimal(8,2),

saletime timestamp)

partitioned by (salesmonth char(10), event integer)

row format delimited

fields terminated by '|'

stored as textfile

location 's3://awssampledbuswest2/tickit/spectrum/salesevent/'

table properties ('numRows'='172000');

To add the partitions, run the following ALTER TABLE commands.

alter table spectrum.sales_event add
partition(salesmonth='2008-01', event='101')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-01/event=101/'
partition(salesmonth='2008-01', event='102')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-01/event=102/'
partition(salesmonth='2008-01', event='103')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-01/event=103/'
partition(salesmonth='2008-02', event='101')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-02/event=101/'
partition(salesmonth='2008-02', event='102')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-02/event=102/'
partition(salesmonth='2008-02', event='103')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-02/event=103/'
partition(salesmonth='2008-03', event='101')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-03/event=101/'
partition(salesmonth='2008-03', event='102')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-03/event=102/'
partition(salesmonth='2008-03', event='103')
location 's3://awssampledbuswest2/tickit/spectrum/salesevent/salesmonth=2008-03/event=103/';
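Each location in the preceding statement encodes its partition values in the folder names, as key=value pairs below the table location. The following Python sketch reads those pairs back from a location, which can be useful when you generate PARTITION clauses from an S3 listing. It is a hypothetical client-side helper; the partitions themselves are still registered with ALTER TABLE … ADD PARTITION.

```python
def parse_partition_path(location, table_root):
    """Extract key=value folder names below table_root into a dict."""
    if not location.startswith(table_root):
        raise ValueError("location is not under the table root")
    relative = location[len(table_root):].strip("/")
    values = {}
    for segment in relative.split("/"):
        key, separator, value = segment.partition("=")
        if separator:  # skip folder names that are not key=value
            values[key] = value
    return values


root = "s3://awssampledbuswest2/tickit/spectrum/salesevent/"
print(parse_partition_path(root + "salesmonth=2008-01/event=101/", root))
```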

Run the following query to select data from the partitioned table.

select spectrum.sales_event.salesmonth, event.eventname,

sum(spectrum.sales_event.pricepaid)

from spectrum.sales_event, event

where spectrum.sales_event.eventid = event.eventid

and salesmonth = '2008-02'

and (event = '101'

or event = '102'

or event = '103')

group by event.eventname, spectrum.sales_event.salesmonth

order by 3 desc;

salesmonth | eventname       | sum
-----------+-----------------+--------
2008-02    | The Magic Flute | 5062.00
2008-02    | La Sonnambula   | 3498.00
2008-02    | Die Walkure     |  534.00

Mapping External Table Columns to ORC Columns

You use Amazon Redshift Spectrum external tables to query data from ﬁles in ORC format. Optimized

row columnar (ORC) format is a columnar storage ﬁle format that supports nested data structures.

For more information about querying nested data, see Querying Nested Data with Amazon Redshift

Spectrum (p. 104).

When you create an external table that references data in an ORC ﬁle, you map each column in the

external table to a column in the ORC data. To do so, you use one of the following methods:

• Mapping by position (p. 177)

• Mapping by column name (p. 178)

Mapping by column name is the default.

Mapping by Position

With position mapping, the ﬁrst column deﬁned in the external table maps to the ﬁrst column in the

ORC data ﬁle, the second to the second, and so on. Mapping by position requires that the order of

columns in the external table and in the ORC ﬁle match. If the order of the columns doesn't match, then

you can map the columns by name.

Important

In earlier releases, Redshift Spectrum used position mapping by default. If you need to continue

using position mapping for existing tables, set the table property orc.schema.resolution to

position, as the following example shows.

alter table spectrum.orc_example

set table properties('orc.schema.resolution'='position');

For example, the table SPECTRUM.ORC_EXAMPLE is deﬁned as follows.

create external table spectrum.orc_example(
int_col int,
float_col float,
nested_col struct<
  "int_col" : int,
  "map_col" : map<int, array<float >>
>
) stored as orc
location 's3://example/orc/files/';

The table structure can be abstracted as follows.

• 'int_col' : int
• 'float_col' : float
• 'nested_col' : struct
   o 'int_col' : int
   o 'map_col' : map
      - key : int
      - value : array
         - value : float

The underlying ORC file has the following file structure.

• ORC file root(id = 0)
   o 'int_col' : int (id = 1)
   o 'float_col' : float (id = 2)
   o 'nested_col' : struct (id = 3)
      - 'int_col' : int (id = 4)
      - 'map_col' : map (id = 5)
         - key : int (id = 6)
         - value : array (id = 7)
            - value : float (id = 8)

In this example, you can map each column in the external table to a column in ORC ﬁle strictly by

position. The following shows the mapping.

External Table Column Name    | ORC Column ID | ORC Column Name
------------------------------+---------------+----------------
int_col                       | 1             | int_col
float_col                     | 2             | float_col
nested_col                    | 3             | nested_col
nested_col.int_col            | 4             | int_col
nested_col.map_col            | 5             | map_col
nested_col.map_col.key        | 6             | NA
nested_col.map_col.value      | 7             | NA
nested_col.map_col.value.item | 8             | NA
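The column IDs in this mapping are consistent with numbering the columns depth-first, in the order they are declared, starting at 1 below the file root. The following Python sketch reproduces the table under that assumption. The nested-tuple schema representation and the function name are hypothetical and exist only for this illustration.

```python
def assign_ids(fields, prefix="", next_id=1):
    """Number columns depth-first in declaration order.

    fields: list of (name, child fields) tuples.
    Returns ([(dotted path, id), ...], next unused id).
    """
    mapping = []
    for name, children in fields:
        path = f"{prefix}.{name}" if prefix else name
        mapping.append((path, next_id))
        nested, next_id = assign_ids(children, path, next_id + 1)
        mapping.extend(nested)
    return mapping, next_id


# Structure of SPECTRUM.ORC_EXAMPLE as described above.
orc_example = [
    ("int_col", []),
    ("float_col", []),
    ("nested_col", [
        ("int_col", []),
        ("map_col", [
            ("key", []),
            ("value", [("item", [])]),
        ]),
    ]),
]

mapping, _ = assign_ids(orc_example)
for path, column_id in mapping:
    print(column_id, path)
```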

Mapping by Column Name

Using name mapping, you map columns in an external table to named columns in ORC ﬁles on the same

level, with the same name.

For example, suppose that you want to map the table from the previous example,

SPECTRUM.ORC_EXAMPLE, with an ORC ﬁle that uses the following ﬁle structure.

• ORC file root(id = 0)
   o 'nested_col' : struct (id = 1)
      - 'map_col' : map (id = 2)
         - key : int (id = 3)
         - value : array (id = 4)
            - value : float (id = 5)
      - 'int_col' : int (id = 6)
   o 'int_col' : int (id = 7)
   o 'float_col' : float (id = 8)

Using position mapping, Redshift Spectrum attempts the following mapping.

External Table Column Name | ORC Column ID | ORC Column Name
---------------------------+---------------+----------------
int_col                    | 1             | struct
float_col                  | 7             | int_col
nested_col                 | 8             | float_col

When you query a table with the preceding position mapping, the SELECT command fails on type

validation because the structures are diﬀerent.

You can map the same external table to both ﬁle structures shown in the previous examples by using

column name mapping. The table columns int_col, float_col, and nested_col map by column

name to columns with the same names in the ORC ﬁle. The column named nested_col in the external

table is a struct column with subcolumns named map_col and int_col. The subcolumns also map

correctly to the corresponding columns in the ORC ﬁle by column name.
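The following Python sketch illustrates why name mapping works for both file layouts: at each level, a table column is matched to the ORC column with the same name, regardless of position. The dictionary-based schema representation and the function name are hypothetical, and the map key and value entries are omitted for brevity.

```python
def map_by_name(table_fields, orc_fields, prefix=""):
    """Match table columns to ORC column IDs by name, level by level.

    table_fields: {name: {child name: ...}}
    orc_fields:   {name: (column id, {child name: (id, ...)})}
    """
    mapping = {}
    for name, children in table_fields.items():
        column_id, orc_children = orc_fields[name]  # KeyError if name is absent
        path = f"{prefix}.{name}" if prefix else name
        mapping[path] = column_id
        mapping.update(map_by_name(children, orc_children, path))
    return mapping


table = {"int_col": {}, "float_col": {},
         "nested_col": {"int_col": {}, "map_col": {}}}

# The two ORC layouts shown in this section, with their column IDs.
first_layout = {"int_col": (1, {}), "float_col": (2, {}),
                "nested_col": (3, {"int_col": (4, {}), "map_col": (5, {})})}
second_layout = {"nested_col": (1, {"map_col": (2, {}), "int_col": (6, {})}),
                 "int_col": (7, {}), "float_col": (8, {})}

print(map_by_name(table, first_layout))
print(map_by_name(table, second_layout))
```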

Improving Amazon Redshift Spectrum Query Performance

Look at the query plan to ﬁnd what steps have been pushed to the Amazon Redshift Spectrum layer.

The following steps are related to the Redshift Spectrum query:

• S3 Seq Scan

• S3 HashAggregate

• S3 Query Scan

• Seq Scan PartitionInfo

• Partition Loop

The following example shows the query plan for a query that joins an external table with a local table.

Note the S3 Seq Scan and S3 HashAggregate steps that were executed against the data on Amazon S3.

explain

select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)

from spectrum.sales, event

where spectrum.sales.eventid = event.eventid

and spectrum.sales.pricepaid > 30

group by spectrum.sales.eventid

order by 2 desc;

QUERY PLAN
-----------------------------------------------------------------------------
XN Limit  (cost=1001055770628.63..1001055770628.65 rows=10 width=31)
  ->  XN Merge  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
        Merge Key: sum(sales.derived_col2)
        ->  XN Network  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
              Send to leader
              ->  XN Sort  (cost=1001055770628.63..1001055770629.13 rows=200 width=31)
                    Sort Key: sum(sales.derived_col2)
                    ->  XN HashAggregate  (cost=1055770620.49..1055770620.99 rows=200 width=31)
                          ->  XN Hash Join DS_BCAST_INNER  (cost=3119.97..1055769620.49 rows=200000 width=31)
                                Hash Cond: ("outer".derived_col1 = "inner".eventid)
                                ->  XN S3 Query Scan sales  (cost=3010.00..5010.50 rows=200000 width=31)
                                      ->  S3 HashAggregate  (cost=3010.00..3010.50 rows=200000 width=16)
                                            ->  S3 Seq Scan spectrum.sales location:"s3://awssampledbuswest2/tickit/spectrum/sales" format:TEXT  (cost=0.00..2150.00 rows=172000 width=16)
                                                  Filter: (pricepaid > 30.00)
                                ->  XN Hash  (cost=87.98..87.98 rows=8798 width=4)
                                      ->  XN Seq Scan on event  (cost=0.00..87.98 rows=8798 width=4)

Note the following elements in the query plan:

• The S3 Seq Scan node shows the ﬁlter pricepaid > 30.00 was processed in the Redshift

Spectrum layer.

A ﬁlter node under the XN S3 Query Scan node indicates predicate processing in Amazon Redshift

on top of the data returned from the Redshift Spectrum layer.

• The S3 HashAggregate node indicates aggregation in the Redshift Spectrum layer for the group by

clause (group by spectrum.sales.eventid).
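If you save EXPLAIN output as text, you can scan it for the Redshift Spectrum steps listed earlier in this section. The following Python sketch does a plain substring search; the function name is hypothetical, and the sample plan is an abbreviated copy of the plan above.

```python
# Step names taken from the list earlier in this section.
SPECTRUM_STEPS = (
    "S3 Seq Scan",
    "S3 HashAggregate",
    "S3 Query Scan",
    "Seq Scan PartitionInfo",
    "Partition Loop",
)


def spectrum_steps_in_plan(plan_text):
    """Return the Redshift Spectrum step names that appear in the plan text."""
    return [step for step in SPECTRUM_STEPS if step in plan_text]


sample_plan = """
-> XN S3 Query Scan sales (cost=3010.00..5010.50 rows=200000 width=31)
  -> S3 HashAggregate (cost=3010.00..3010.50 rows=200000 width=16)
    -> S3 Seq Scan spectrum.sales format:TEXT (cost=0.00..2150.00 rows=172000 width=16)
-> XN Hash (cost=87.98..87.98 rows=8798 width=4)
  -> XN Seq Scan on event (cost=0.00..87.98 rows=8798 width=4)
"""

print(spectrum_steps_in_plan(sample_plan))
```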

Following are ways to improve Redshift Spectrum performance:

• Use Parquet formatted data ﬁles. Parquet stores data in a columnar format, so Redshift Spectrum can

eliminate unneeded columns from the scan. When data is in text-ﬁle format, Redshift Spectrum needs

to scan the entire ﬁle.

• Use the fewest columns possible in your queries.

• Use multiple ﬁles to optimize for parallel processing. Keep your ﬁle sizes larger than 64 MB. Avoid data

size skew by keeping ﬁles about the same size.

• Put your large fact tables in Amazon S3 and keep your frequently used, smaller dimension tables in

your local Amazon Redshift database.

• Update external table statistics by setting the TABLE PROPERTIES numRows parameter. Use CREATE

EXTERNAL TABLE (p. 452) or ALTER TABLE (p. 365) to set the TABLE PROPERTIES numRows

parameter to reﬂect the number of rows in the table. Amazon Redshift doesn't analyze external

tables to generate the table statistics that the query optimizer uses to generate a query plan. If table

statistics aren't set for an external table, Amazon Redshift generates a query execution plan based on

an assumption that external tables are the larger tables and local tables are the smaller tables.

• The Amazon Redshift query planner pushes predicates and aggregations to the Redshift Spectrum

query layer whenever possible. When large amounts of data are returned from Amazon S3, the

processing is limited by your cluster's resources. Redshift Spectrum scales automatically to process

large requests. Thus, your overall performance improves whenever you can push processing to the

Redshift Spectrum layer.

• Write your queries to use ﬁlters and aggregations that are eligible to be pushed to the Redshift

Spectrum layer.

The following are examples of some operations that can be pushed to the Redshift Spectrum layer:

• GROUP BY clauses

• Comparison conditions and pattern-matching conditions, such as LIKE.

• Aggregate functions, such as COUNT, SUM, AVG, MIN, and MAX.

• String functions.

Operations that can't be pushed to the Redshift Spectrum layer include DISTINCT and ORDER BY.

• Use partitions to limit the data that is scanned. Partition your data based on your most common

query predicates, then prune partitions by ﬁltering on partition columns. For more information, see

Partitioning Redshift Spectrum External Tables (p. 173).

Query SVL_S3PARTITION (p. 919) to view total partitions and qualiﬁed partitions.
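
As a minimal sketch of the numRows tip above, the following Python helper builds the ALTER TABLE statement that sets the row count on an external table. The function name, the table name spectrum.sales, and the row count 170000 are illustrative assumptions; run the resulting SQL with whatever client you already use.

```python
def build_numrows_statement(table: str, num_rows: int) -> str:
    """Return SQL that sets the numRows table property on an external table."""
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    # The property name and its value are both passed as quoted strings.
    return (
        f"alter table {table} "
        f"set table properties ('numRows'='{num_rows}');"
    )


# Hypothetical table name and row count, for illustration only.
statement = build_numrows_statement("spectrum.sales", 170000)
print(statement)
```

The helper only formats text, so it is suitable for trusted table names; it does not validate or escape identifiers.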

Monitoring Metrics in Amazon Redshift Spectrum

You can monitor Amazon Redshift Spectrum queries using the following system views:

•SVL_S3QUERY (p. 920)

Use the SVL_S3QUERY view to get details about Redshift Spectrum queries (S3 queries) at the

segment and node slice level.

•SVL_S3QUERY_SUMMARY (p. 921)

Use the SVL_S3QUERY_SUMMARY view to get a summary of all Amazon Redshift Spectrum queries

(S3 queries) that have been run on the system.

The following are some things to look for in SVL_S3QUERY_SUMMARY:

• The number of ﬁles that were processed by the Redshift Spectrum query.

• The number of bytes scanned from Amazon S3. The cost of a Redshift Spectrum query is reﬂected in

the amount of data scanned from Amazon S3.

• The number of bytes returned from the Redshift Spectrum layer to the cluster. A large amount of data

returned might aﬀect system performance.

• The maximum duration and average duration of Redshift Spectrum requests. Long-running requests

might indicate a bottleneck.

Troubleshooting Queries in Amazon Redshift Spectrum

Following, you can ﬁnd a quick reference for identifying and addressing some of the most common and

most serious issues you are likely to encounter with Amazon Redshift Spectrum queries. To view errors

generated by Redshift Spectrum queries, query the SVL_S3LOG (p. 918) system table.

Topics

•Retries Exceeded (p. 182)

•No Rows Returned for a Partitioned Table (p. 182)

•Not Authorized Error (p. 182)

•Incompatible Data Formats (p. 182)

•Syntax Error When Using Hive DDL in Amazon Redshift (p. 183)

•Permission to Create Temporary Tables (p. 183)

Retries Exceeded

If an Amazon Redshift Spectrum request times out, the request is canceled and resubmitted. After ﬁve

failed retries, the query fails with the following error.

error: S3Query Exception (Fetch), retries exceeded

Possible causes include the following:

• Large ﬁle sizes (greater than 1 GB). Check your ﬁle sizes in Amazon S3 and look for large ﬁles and ﬁle

size skew. Break up large ﬁles into smaller ﬁles, between 100 MB and 1 GB. Try to make ﬁles about the

same size.

• Slow network throughput. Try your query later.
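
One way to check for the file-size problems listed above is to review object sizes before running the query. The following Python sketch assumes you have already collected a mapping of object key to size in bytes (for example, from a bucket listing). The function name, the sample keys, and the skew threshold of 2.0 are assumptions chosen for illustration, not values from this guide.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def review_file_sizes(sizes, max_bytes=GB, skew_ratio=2.0):
    """Return (oversized, skewed) for a mapping of object key -> size in bytes.

    oversized: keys whose size exceeds max_bytes.
    skewed: True when the largest file is more than skew_ratio times the smallest.
    """
    if not sizes:
        return [], False
    oversized = sorted(key for key, size in sizes.items() if size > max_bytes)
    smallest, largest = min(sizes.values()), max(sizes.values())
    skewed = smallest > 0 and largest / smallest > skew_ratio
    return oversized, skewed


# Hypothetical object sizes, for illustration only.
sample = {
    "sales/part-0001": 300 * MB,
    "sales/part-0002": 280 * MB,
    "sales/part-0003": 3 * GB,
}
oversized, skewed = review_file_sizes(sample)
print(oversized, skewed)
```

Files reported as oversized are candidates for splitting into smaller files of about the same size.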

No Rows Returned for a Partitioned Table

If your query returns zero rows from a partitioned external table, check whether a partition has been added for this external table. Redshift Spectrum only scans files in an Amazon S3 location that has been explicitly added using ALTER TABLE … ADD PARTITION. Query the SVV_EXTERNAL_PARTITIONS (p. 903) view to find existing partitions. Run ALTER TABLE … ADD PARTITION for each missing partition.
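
The following Python sketch illustrates one way to generate a statement for each missing partition. It assumes you already know which partition values should exist and which ones the SVV_EXTERNAL_PARTITIONS view reports. The table name, partition column, bucket, and the column=value folder layout are hypothetical; adjust them to match how your data is actually organized in Amazon S3.

```python
def missing_partition_statements(table, column, expected, existing, base_location):
    """Build one ALTER TABLE ... ADD PARTITION statement per missing partition value."""
    statements = []
    for value in expected:
        if value in existing:
            continue  # already registered, nothing to do
        statements.append(
            f"alter table {table} add partition({column}='{value}') "
            f"location '{base_location}/{column}={value}/';"
        )
    return statements


# Hypothetical values; `existing` would come from SVV_EXTERNAL_PARTITIONS.
stmts = missing_partition_statements(
    table="spectrum.sales_part",
    column="saledate",
    expected=["2008-01-01", "2008-02-01"],
    existing={"2008-01-01"},
    base_location="s3://mybucket/sales_partition",
)
for stmt in stmts:
    print(stmt)
```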

Not Authorized Error

Verify that the AWS Identity and Access Management (IAM) role for the cluster allows access to the Amazon S3 file objects. If your external database is on Amazon Athena, verify that the IAM role allows access to Athena resources. For more information, see IAM Policies for Amazon Redshift Spectrum (p. 154).

Incompatible Data Formats

For a columnar ﬁle format, such as Parquet, the column type is embedded with the data. The column

type in the CREATE EXTERNAL TABLE deﬁnition must match the column type of the data ﬁle. If there is a

mismatch, you receive an error similar to the following:

Task failed due to an internal error.
File 'https://s3bucket/location/file has an incompatible Parquet schema
for column 's3://s3bucket/location.col1'. Column type: VARCHAR, Par

The error message might be truncated due to the limit on message length. To retrieve the complete error

message, including column name and column type, query the SVL_S3LOG (p. 918) system view.

The following example queries SVL_S3LOG for the last query executed.

select message

from svl_s3log

where query = pg_last_query_id()

order by query,segment,slice;

The following is an example of a result that shows the full error message.

message

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––-

S3 Query Exception (Fetch). Task failed due to an internal error.

File 'https://s3bucket/location/file has an incompatible

Parquet schema for column ' s3bucket/location.col1'.

Column type: VARCHAR, Parquet schema:\noptional int64 l_orderkey [i:0 d:1 r:0]\n

To correct the error, alter the external table to match the column type of the Parquet ﬁle.

Syntax Error When Using Hive DDL in Amazon Redshift

Amazon Redshift supports data deﬁnition language (DDL) for CREATE EXTERNAL TABLE that is similar to

Hive DDL. However, the two types of DDL aren't always exactly the same. If you copy Hive DDL to create

or alter Amazon Redshift external tables, you might encounter syntax errors. The following are examples

of diﬀerences between Amazon Redshift and Hive DDL:

• Amazon Redshift requires single quotation marks (') where Hive DDL supports double quotation marks

(").

• Amazon Redshift doesn't support the STRING data type. Use VARCHAR instead.

Permission to Create Temporary Tables

To run Redshift Spectrum queries, the database user must have permission to create temporary tables in the database. The following example grants the TEMP privilege, which allows creating temporary tables, on the database spectrumdb to the spectrumusers user group.

grant temp on database spectrumdb to group spectrumusers;

For more information, see GRANT (p. 516).

Loading Data

Topics

•Using a COPY Command to Load Data (p. 184)

•Updating Tables with DML Commands (p. 216)

•Updating and Inserting New Data (p. 216)

•Performing a Deep Copy (p. 221)

•Analyzing Tables (p. 223)

•Vacuuming Tables (p. 228)

•Managing Concurrent Write Operations (p. 238)

A COPY command is the most eﬃcient way to load a table. You can also add data to your tables using

INSERT commands, though it is much less eﬃcient than using COPY. The COPY command is able to

read from multiple data ﬁles or multiple data streams simultaneously. Amazon Redshift allocates the

workload to the cluster nodes and performs the load operations in parallel, including sorting the rows

and distributing data across node slices.

Note

Amazon Redshift Spectrum external tables are read-only. You can't COPY or INSERT to an

external table.

To access data on other AWS resources, your cluster must have permission to access those resources and to perform the necessary actions to access the data. You can use AWS Identity and Access Management (IAM) to limit the access users have to your cluster resources and data.

After your initial data load, if you add, modify, or delete a signiﬁcant amount of data, you should follow

up by running a VACUUM command to reorganize your data and reclaim space after deletes. You should

also run an ANALYZE command to update table statistics.

This section explains how to load data and troubleshoot data loads and presents best practices for

loading data.

Using a COPY Command to Load Data

Topics

•Credentials and Access Permissions (p. 185)

•Preparing Your Input Data (p. 186)

•Loading Data from Amazon S3 (p. 187)

•Loading Data from Amazon EMR (p. 196)

•Loading Data from Remote Hosts (p. 200)

•Loading Data from an Amazon DynamoDB Table (p. 206)

•Verifying That the Data Was Loaded Correctly (p. 208)

•Validating Input Data (p. 208)

•Loading Tables with Automatic Compression (p. 209)

•Optimizing Storage for Narrow Tables (p. 211)

•Loading Default Column Values (p. 211)

•Troubleshooting Data Loads (p. 211)

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture

to read and load data in parallel from ﬁles on Amazon S3, from a DynamoDB table, or from text output

from one or more remote hosts.

Note

We strongly recommend using the COPY command to load large amounts of data. Using

individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if

your data already exists in other Amazon Redshift database tables, use INSERT INTO ... SELECT

or CREATE TABLE AS to improve performance. For information, see INSERT (p. 520) or CREATE

TABLE AS (p. 483).

To load data from another AWS resource, your cluster must have permission to access the resource and

perform the necessary actions.

To grant or revoke privilege to load data into a table using a COPY command, grant or revoke the INSERT

privilege.

Your data needs to be in the proper format for loading into your Amazon Redshift table. This section

presents guidelines for preparing and verifying your data before the load and for validating a COPY

statement before you execute it.

To protect the information in your ﬁles, you can encrypt the data ﬁles before you upload them to your

Amazon S3 bucket; COPY will decrypt the data as it performs the load. You can also limit access to your

load data by providing temporary security credentials to users. Temporary security credentials provide

enhanced security because they have short life spans and cannot be reused after they expire.

You can compress the ﬁles using gzip, lzop, or bzip2 to save time uploading the ﬁles. COPY can then

speed up the load process by uncompressing the ﬁles as they are read.

To help keep your data secure in transit within the AWS cloud, Amazon Redshift uses hardware

accelerated SSL to communicate with Amazon S3 or Amazon DynamoDB for COPY, UNLOAD, backup,

and restore operations.

When you load your table directly from an Amazon DynamoDB table, you have the option to control the

amount of Amazon DynamoDB provisioned throughput you consume.

You can optionally let COPY analyze your input data and automatically apply optimal compression

encodings to your table as part of the load process.

Credentials and Access Permissions

To load or unload data using another AWS resource, such as Amazon S3, Amazon DynamoDB, Amazon

EMR, or Amazon EC2, your cluster must have permission to access the resource and perform the

necessary actions to access the data. For example, to load data from Amazon S3, COPY must have LIST

access to the bucket and GET access for the bucket objects.
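
As an illustration of the LIST and GET requirement, the following Python sketch builds a minimal IAM policy document for a single bucket. The bucket name mybucket and the function name are assumptions. The sketch maps LIST access to the s3:ListBucket action on the bucket and GET access to the s3:GetObject action on its objects; treat it as a starting point rather than a complete policy for your environment.

```python
import json


def copy_read_policy(bucket: str) -> dict:
    """Return a policy document granting list access on a bucket and read access on its objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # LIST access applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {   # GET access applies to the objects in the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


policy = copy_read_policy("mybucket")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```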

To obtain authorization to access a resource, your cluster must be authenticated. You can choose either

role-based access control or key-based access control. This section presents an overview of the two

methods. For complete details and examples, see Permissions to Access Other AWS Resources (p. 424).

Role-Based Access Control

With role-based access control, your cluster temporarily assumes an AWS Identity and Access

Management (IAM) role on your behalf. Then, based on the authorizations granted to the role, your

cluster can access the required AWS resources.

We recommend using role-based access control because it provides more secure, fine-grained control of access to AWS resources and sensitive user data, in addition to safeguarding your AWS credentials.

To use role-based access control, you must ﬁrst create an IAM role using the Amazon Redshift service

role type, and then attach the role to your cluster. The role must have, at a minimum, the permissions

listed in IAM Permissions for COPY, UNLOAD, and CREATE LIBRARY (p. 427). For steps to create an IAM

role and attach it to your cluster, see Creating an IAM Role to Allow Your Amazon Redshift Cluster to

Access AWS Services in the Amazon Redshift Cluster Management Guide.

You can add a role to a cluster or view the roles associated with a cluster by using the Amazon Redshift

Management Console, CLI, or API. For more information, see Authorizing COPY and UNLOAD Operations

Using IAM Roles in the Amazon Redshift Cluster Management Guide.

When you create an IAM role, IAM returns an Amazon Resource Name (ARN) for the role. To execute

a COPY command using an IAM role, provide the role ARN using the IAM_ROLE parameter or the

CREDENTIALS parameter.

The following COPY command example uses the IAM_ROLE parameter with the role MyRedshiftRole for authentication.

copy customer from 's3://mybucket/mydata'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';

Key-Based Access Control

With key-based access control, you provide the access key ID and secret access key for an IAM user that is authorized to access the AWS resources that contain the data.

Note

We strongly recommend using an IAM role for authentication instead of supplying a plain-text

access key ID and secret access key. If you choose key-based access control, never use your AWS

account (root) credentials. Always create an IAM user and provide that user's access key ID and

secret access key. For steps to create an IAM user, see Creating an IAM User in Your AWS Account.

To authenticate using IAM user credentials, replace <access-key-id> and <secret-access-key> with an authorized user's access key ID and full secret access key for the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters, as shown following.

ACCESS_KEY_ID '<access-key-id>'

SECRET_ACCESS_KEY '<secret-access-key>';

The AWS IAM user must have, at a minimum, the permissions listed in IAM Permissions for COPY,

UNLOAD, and CREATE LIBRARY (p. 427).

Preparing Your Input Data

If your input data is not compatible with the table columns that will receive it, the COPY command will

fail.

Use the following guidelines to help ensure that your input data is valid:

• Your data can only contain UTF-8 characters up to four bytes long.

• Verify that CHAR and VARCHAR strings are no longer than the lengths of the corresponding columns.

VARCHAR strings are measured in bytes, not characters, so, for example, a four-character string of

Chinese characters that occupy four bytes each requires a VARCHAR(16) column.

• Multibyte characters can only be used with VARCHAR columns. Verify that multibyte characters are no

more than four bytes long.

• Verify that data for CHAR columns only contains single-byte characters.

• Do not include any special characters or syntax to indicate the last ﬁeld in a record. This ﬁeld can be a

delimiter.

• If your data includes null terminators, also referred to as NUL (UTF-8 0000) or binary zero (0x000),

you can load these characters as NULLS into CHAR or VARCHAR columns by using the NULL AS

option in the COPY command: null as '\0' or null as '\000' . If you do not use NULL AS, null

terminators will cause your COPY to fail.

• If your strings contain special characters, such as delimiters and embedded newlines, use the ESCAPE

option with the COPY (p. 390) command.

• Verify that all single and double quotes are appropriately matched.

• Verify that ﬂoating-point strings are in either standard ﬂoating-point format, such as 12.123, or an

exponential format, such as 1.0E4.

• Verify that all timestamp and date strings follow the speciﬁcations for DATEFORMAT and

TIMEFORMAT Strings (p. 432). The default timestamp format is YYYY-MM-DD hh:mm:ss, and the

default date format is YYYY-MM-DD.

• For more information about boundaries and limitations on individual data types, see Data Types (p. 315). For information about multibyte character errors, see Multibyte Character Load Errors (p. 214).
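
The byte-length guidelines above can be checked before a load. The following Python sketch is one possible pre-check, assuming your input is already decoded text; the helper names fits_varchar and fits_char are hypothetical. It measures strings in UTF-8 bytes rather than characters, and it treats "single-byte characters" as ASCII.

```python
def fits_varchar(value: str, max_bytes: int) -> bool:
    """True if the UTF-8 encoding of value fits in a VARCHAR(max_bytes) column."""
    return len(value.encode("utf-8")) <= max_bytes


def fits_char(value: str, length: int) -> bool:
    """True if value holds only single-byte characters and fits in CHAR(length)."""
    return value.isascii() and len(value) <= length


# U+20000 is a CJK ideograph that needs four bytes in UTF-8.
four_byte_text = "\U00020000" * 4
print(len(four_byte_text), len(four_byte_text.encode("utf-8")))
print(fits_varchar(four_byte_text, 4), fits_varchar(four_byte_text, 16))
print(fits_char("Bridgeview", 10), fits_char("Zürich", 10))
```

Many common Chinese characters occupy three bytes in UTF-8 rather than four, so measuring the encoded length is more reliable than multiplying the character count by a fixed factor.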

Loading Data from Amazon S3

Topics

•Splitting Your Data into Multiple Files (p. 187)

•Uploading Files to Amazon S3 (p. 188)

•Using the COPY Command to Load from Amazon S3 (p. 191)

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture

to read and load data in parallel from ﬁles in an Amazon S3 bucket. You can take maximum advantage

of parallel processing by splitting your data into multiple ﬁles and by setting distribution keys on your

tables. For more information about distribution keys, see Choosing a Data Distribution Style (p. 129).

Data from the ﬁles is loaded into the target table, one line per row. The ﬁelds in the data ﬁle are

matched to table columns in order, left to right. Fields in the data ﬁles can be ﬁxed-width or character

delimited; the default delimiter is a pipe (|). By default, all the table columns are loaded, but you can

optionally deﬁne a comma-separated list of columns. If a table column is not included in the column

list speciﬁed in the COPY command, it is loaded with a default value. For more information, see Loading

Default Column Values (p. 211).

Follow this general process to load data from Amazon S3:

1. Split your data into multiple ﬁles.

2. Upload your ﬁles to Amazon S3.

3. Run a COPY command to load the table.

4. Verify that the data was loaded correctly.

The rest of this section explains these steps in detail.

Splitting Your Data into Multiple Files

You can load table data from a single ﬁle, or you can split the data for each table into multiple ﬁles. The

COPY command can load data from multiple ﬁles in parallel. You can load multiple ﬁles by specifying a

common preﬁx, or preﬁx key, for the set, or by explicitly listing the ﬁles in a manifest ﬁle.

Note

We strongly recommend that you divide your data into multiple ﬁles to take advantage of

parallel processing.

Split your data into ﬁles so that the number of ﬁles is a multiple of the number of slices in your cluster.

That way Amazon Redshift can divide the data evenly among the slices. The number of slices per node

depends on the node size of the cluster. For example, each DS1.XL compute node has two slices, and

each DS1.8XL compute node has 32 slices. For more information about the number of slices that each

node size has, go to About Clusters and Nodes in the Amazon Redshift Cluster Management Guide.

The nodes all participate in parallel query execution, working on data that is distributed as evenly as

possible across the slices. If you have a cluster with two DS1.XL nodes, you might split your data into four

ﬁles or some multiple of four. Amazon Redshift does not take ﬁle size into account when dividing the

workload, so you need to ensure that the ﬁles are roughly the same size, between 1 MB and 1 GB after

compression.

If you intend to use object preﬁxes to identify the load ﬁles, name each ﬁle with a common preﬁx. For

example, the venue.txt ﬁle might be split into four ﬁles, as follows:

venue.txt.1

venue.txt.2

venue.txt.3

venue.txt.4
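
The splitting and naming scheme above can be sketched in Python as follows. This is a simplified illustration: it divides rows by count rather than by bytes, so the parts are about equal in size only when rows are of similar length. The function names and the generated sample rows are assumptions; the slice count uses the two-node example from this section with two slices per node.

```python
def split_lines(lines, parts):
    """Split lines into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def part_names(filename, parts):
    """Name the parts with a common prefix, for example venue.txt.1 to venue.txt.4."""
    return [f"{filename}.{i}" for i in range(1, parts + 1)]


# Two nodes with two slices each give four slices, so split into four files.
slices = 2 * 2
rows = [f"{n}|Venue {n}|City|ST|0\n" for n in range(1, 11)]  # made-up rows
chunks = split_lines(rows, slices)
for name, chunk in zip(part_names("venue.txt", slices), chunks):
    print(name, len(chunk))
```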

If you put multiple ﬁles in a folder in your bucket, you can specify the folder name as the preﬁx and

COPY will load all of the ﬁles in the folder. If you explicitly list the ﬁles to be loaded by using a manifest

ﬁle, the ﬁles can reside in diﬀerent buckets or folders.

For more information about manifest ﬁles, see Example: COPY from Amazon S3 using a

manifest (p. 435).

Uploading Files to Amazon S3

Topics

•Managing Data Consistency (p. 189)

•Uploading Encrypted Data to Amazon S3 (p. 189)

•Verifying That the Correct Files Are Present in Your Bucket (p. 191)

After splitting your ﬁles, you can upload them to your bucket. You can optionally compress or encrypt

the ﬁles before you load them.

Create an Amazon S3 bucket to hold your data ﬁles, and then upload the data ﬁles to the bucket. For

information about creating buckets and uploading ﬁles, see Working with Amazon S3 Buckets in the

Amazon Simple Storage Service Developer Guide.

Amazon S3 provides eventual consistency for some operations, so it is possible that new data will not be available immediately after the upload. For more information, see Managing Data Consistency (p. 189).

Important

The Amazon S3 bucket that holds the data ﬁles must be created in the same region as your

cluster unless you use the REGION (p. 397) option to specify the region in which the Amazon

S3 bucket is located.

You can create an Amazon S3 bucket in a speciﬁc region either by selecting the region when you create

the bucket by using the Amazon S3 console, or by specifying an endpoint when you create the bucket

using the Amazon S3 API or CLI.

Following the data load, verify that the correct ﬁles are present on Amazon S3.

Managing Data Consistency

Amazon S3 provides eventual consistency for some operations, so it is possible that new data will not

be available immediately after the upload, which could result in an incomplete data load or loading

stale data. COPY operations where the cluster and the bucket are in diﬀerent regions are eventually

consistent. All regions provide read-after-write consistency for uploads of new objects with unique

object keys. For more information about data consistency, see Amazon S3 Data Consistency Model in the

Amazon Simple Storage Service Developer Guide.

To ensure that your application loads the correct data, we recommend the following practices:

• Create new object keys.

Amazon S3 provides eventual consistency in all regions for overwrite operations. Creating new ﬁle

names, or object keys, in Amazon S3 for each data load operation provides strong consistency in all

regions.

• Use a manifest ﬁle with your COPY operation.

The manifest explicitly names the ﬁles to be loaded. Using a manifest ﬁle enforces strong consistency.

The rest of this section explains these steps in detail.

Creating New Object Keys

Because of potential data consistency issues, we strongly recommend creating new ﬁles with unique

Amazon S3 object keys for each data load operation. If you overwrite existing ﬁles with new data, and

then issue a COPY command immediately following the upload, it is possible for the COPY operation

to begin loading from the old ﬁles before all of the new data is available. For more information about

eventual consistency, see Amazon S3 Data Consistency Model in the Amazon S3 Developer Guide.
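
One simple way to avoid overwriting objects is to put a timestamp and a random suffix into each key. The following Python sketch shows the idea; the key layout (prefix, then a per-load folder, then the file name) and the function name are assumptions, not a required convention.

```python
import uuid
from datetime import datetime, timezone


def new_object_key(prefix: str, filename: str) -> str:
    """Return a key that is unique per load, so no existing object is overwritten."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # The random suffix keeps keys unique even within the same second.
    return f"{prefix}/{stamp}-{uuid.uuid4().hex}/{filename}"


first = new_object_key("loads/venue", "venue.txt.1")
second = new_object_key("loads/venue", "venue.txt.1")
print(first)
print(second)
```

In practice you would generate the per-load folder once and reuse it for every file in that load, so that all parts share a common prefix.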

Using a Manifest File

You can explicitly specify which ﬁles to load by using a manifest ﬁle. When you use a manifest ﬁle, COPY

enforces strong consistency by searching secondary servers if it does not ﬁnd a listed ﬁle on the primary

server. The manifest ﬁle can be conﬁgured with an optional mandatory ﬂag. If mandatory is true and

the ﬁle is not found, COPY returns an error.
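
A manifest is a JSON document with an entries list, where each entry names one file by URL and can carry the mandatory flag. The following Python sketch generates such a document; the bucket and file names are hypothetical, and the helper name build_manifest is an assumption.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return manifest JSON text listing each file to load."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


# The files may live in different buckets or folders (hypothetical names).
manifest_text = build_manifest([
    "s3://mybucket/venue.txt.1",
    "s3://mybucket/venue.txt.2",
    "s3://otherbucket/archive/venue.txt.3",
])
print(manifest_text)
```

Upload the resulting text as its own object, then reference that object in the COPY command together with the MANIFEST option.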

For more information about using a manifest ﬁle, see the copy_from_s3_manifest_ﬁle (p. 395) option

for the COPY command and Example: COPY from Amazon S3 using a manifest (p. 435) in the COPY

examples.

Because Amazon S3 provides eventual consistency for overwrites in all regions, it is possible to load stale

data if you overwrite existing objects with new data. As a best practice, never overwrite existing ﬁles with

new data.

Uploading Encrypted Data to Amazon S3

Amazon S3 supports both server-side encryption and client-side encryption. This topic discusses the

diﬀerences between the server-side and client-side encryption and describes the steps to use client-side

encryption with Amazon Redshift. Server-side encryption is transparent to Amazon Redshift.

Server-Side Encryption

Server-side encryption is data encryption at rest—that is, Amazon S3 encrypts your data as it uploads

it and decrypts it for you when you access it. When you load tables using a COPY command, there is no

diﬀerence in the way you load from server-side encrypted or unencrypted objects on Amazon S3. For

more information about server-side encryption, see Using Server-Side Encryption in the Amazon Simple

Storage Service Developer Guide.

Client-Side Encryption

In client-side encryption, your client application manages encryption of your data, the encryption keys,

and related tools. You can upload data to an Amazon S3 bucket using client-side encryption, and then

load the data using the COPY command with the ENCRYPTED option and a private encryption key to

provide greater security.

You encrypt your data using envelope encryption. With envelope encryption, your application handles all encryption exclusively. Your private encryption keys and your unencrypted data are never sent to AWS, so it's very important that you safely manage your encryption keys. If you lose your encryption keys, you won't be able to decrypt your data, and you can't recover your encryption keys from AWS. Envelope encryption combines the performance of fast symmetric encryption with the greater security that key management with asymmetric keys provides. A one-time-use symmetric key (the envelope symmetric key) is generated by your Amazon S3 encryption client to encrypt your data. That key is then encrypted by your master key and stored alongside your data in Amazon S3. When Amazon Redshift accesses your data during a load, the encrypted symmetric key is retrieved and decrypted with your master key, and then the data is decrypted.

To work with Amazon S3 client-side encrypted data in Amazon Redshift, follow the steps outlined in

Protecting Data Using Client-Side Encryption in the Amazon Simple Storage Service Developer Guide, with

the additional requirements that you use:

•Symmetric encryption – The AWS SDK for Java AmazonS3EncryptionClient class uses envelope

encryption, described preceding, which is based on symmetric key encryption. Use this class to create

an Amazon S3 client to upload client-side encrypted data.

•A 256-bit AES master symmetric key – A master key encrypts the envelope key. You pass the master

key to your instance of the AmazonS3EncryptionClient class. Save this key, because you will need

it to copy data into Amazon Redshift.

•Object metadata to store encrypted envelope key – By default, Amazon S3 stores the envelope key

as object metadata for the AmazonS3EncryptionClient class. The encrypted envelope key that is

stored as object metadata is used during the decryption process.

Note

If you get a cipher encryption error message when you use the encryption API for the ﬁrst time,

your version of the JDK may have a Java Cryptography Extension (JCE) jurisdiction policy ﬁle

that limits the maximum key length for encryption and decryption transformations to 128 bits.

For information about addressing this issue, go to Specifying Client-Side Encryption Using the

AWS SDK for Java in the Amazon Simple Storage Service Developer Guide.

For information about loading client-side encrypted ﬁles into your Amazon Redshift tables using the

COPY command, see Loading Encrypted Data Files from Amazon S3 (p. 195).

Example: Uploading Client-Side Encrypted Data

For an example of how to use the AWS SDK for Java to upload client-side encrypted data, go to Example

1: Encrypt and Upload a File Using a Client-Side Symmetric Master Key in the Amazon Simple Storage

Service Developer Guide.

The example shows the choices you must make during client-side encryption so that the data can

be loaded in Amazon Redshift. Speciﬁcally, the example shows using object metadata to store the

encrypted envelope key and the use of a 256-bit AES master symmetric key.

The linked example provides code that uses the AWS SDK for Java to create a 256-bit AES symmetric master key and save it to a file. It then uploads an object to Amazon S3 using an S3 encryption client that first encrypts sample data on the client side. The example also downloads the object and verifies that the data is the same.

Verifying That the Correct Files Are Present in Your Bucket

After you upload your ﬁles to your Amazon S3 bucket, we recommend listing the contents of the

bucket to verify that all of the correct ﬁles are present and that no unwanted ﬁles are present. For

example, if the bucket mybucket holds a ﬁle named venue.txt.back, that ﬁle will be loaded, perhaps

unintentionally, by the following command:

copy venue from 's3://mybucket/venue' … ;

If you want to control speciﬁcally which ﬁles are loaded, you can use a manifest ﬁle to

explicitly list the data ﬁles. For more information about using a manifest ﬁle, see the

copy_from_s3_manifest_ﬁle (p. 395) option for the COPY command and Example: COPY from Amazon

S3 using a manifest (p. 435) in the COPY examples.
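
Because a key prefix matches every object whose key starts with it, a quick local check can reveal files that would be loaded unintentionally. The following Python sketch works on a list of keys that you have already retrieved from the bucket; the sample keys and the rule "a load file ends in a numeric part suffix" are assumptions for illustration.

```python
def keys_matching_prefix(keys, prefix):
    """Return the object keys that a load with this key prefix would pick up."""
    return sorted(key for key in keys if key.startswith(prefix))


# Hypothetical bucket listing.
bucket_keys = ["venue.txt.1", "venue.txt.2", "venue.txt.back", "sales.txt.1"]
matched = keys_matching_prefix(bucket_keys, "venue")
print(matched)

# Flag matched keys that do not end in a numeric part suffix.
unexpected = [key for key in matched if not key.rsplit(".", 1)[-1].isdigit()]
print(unexpected)
```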

For more information about listing the contents of the bucket, see Listing Object Keys in the Amazon S3

Developer Guide.

Using the COPY Command to Load from Amazon S3

Topics

•Using a Manifest to Specify Data Files (p. 193)

•Loading Compressed Data Files from Amazon S3 (p. 193)

•Loading Fixed-Width Data from Amazon S3 (p. 194)

•Loading Multibyte Data from Amazon S3 (p. 195)

•Loading Encrypted Data Files from Amazon S3 (p. 195)

Use the COPY (p. 390) command to load a table in parallel from data ﬁles on Amazon S3. You can

specify the ﬁles to be loaded by using an Amazon S3 object preﬁx or by using a manifest ﬁle.

The syntax to specify the ﬁles to be loaded by using a preﬁx is as follows:

copy <table_name> from 's3://<bucket_name>/<object_prefix>'

authorization;

The manifest ﬁle is a JSON-formatted ﬁle that lists the data ﬁles to be loaded. The syntax to specify the

ﬁles to be loaded by using a manifest ﬁle is as follows:

copy <table_name> from 's3://<bucket_name>/<manifest_file>'

authorization

manifest;

The table to be loaded must already exist in the database. For information about creating a table, see

CREATE TABLE (p. 471) in the SQL Reference.

The values for authorization provide the AWS authorization your cluster needs to access the Amazon

S3 objects. For information about required permissions, see IAM Permissions for COPY, UNLOAD,

and CREATE LIBRARY (p. 427). The preferred method for authentication is to specify the IAM_ROLE

parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary

permissions. Alternatively, you can specify the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters

and provide the access key ID and secret access key for an authorized IAM user as plain text. For more

information, see Role-Based Access Control (p. 424) or Key-Based Access Control (p. 425).
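
To show how the pieces of the syntax fit together, the following Python sketch assembles a COPY statement that uses the IAM_ROLE parameter, with the MANIFEST option added on request. The function name and the object names are hypothetical, and the role ARN is left as a placeholder. The helper only formats text, so use it with trusted values.

```python
def build_copy(table, s3_path, iam_role_arn, manifest=False):
    """Assemble a COPY statement that authenticates with an IAM role."""
    lines = [
        f"copy {table} from '{s3_path}'",
        f"iam_role '{iam_role_arn}'",
    ]
    if manifest:
        lines.append("manifest")  # s3_path then names the manifest file
    return "\n".join(lines) + ";"


# Placeholder account ID and role name; substitute your own.
role = "arn:aws:iam::<aws-account-id>:role/<role-name>"
print(build_copy("customer", "s3://mybucket/mydata", role))
print(build_copy("customer", "s3://mybucket/cust.manifest", role, manifest=True))
```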

To authenticate using the IAM_ROLE parameter, replace <aws-account-id> and <role-name> as

shown in the following syntax.

IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'

The following example shows authentication using an IAM role.

copy customer

from 's3://mybucket/mydata'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

To authenticate using IAM user credentials, replace <access-key-id> and <secret-access-key> with an authorized user's access key ID and full secret access key for the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters, as shown following.

ACCESS_KEY_ID '<access-key-id>'

SECRET_ACCESS_KEY '<secret-access-key>';

The following example shows authentication using IAM user credentials.

copy customer

from 's3://mybucket/mydata'

access_key_id '<access-key-id>'

secret_access_key '<secret-access-key>';

For more information about other authorization options, see Authorization Parameters (p. 404).

If you want to validate your data without actually loading the table, use the NOLOAD option with the COPY (p. 390) command.

The following example shows the first few rows of pipe-delimited data in a file named venue.txt.

1|Toyota Park|Bridgeview|IL|0

2|Columbus Crew Stadium|Columbus|OH|0

3|RFK Stadium|Washington|DC|0

Before uploading the file to Amazon S3, split the file into multiple files so that the COPY command can load it using parallel processing. The number of files should be a multiple of the number of slices in your cluster. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. For more information, see Splitting Your Data into Multiple Files (p. 187).

For example, the venue.txt file might be split into four files, as follows:

venue.txt.1

venue.txt.2

venue.txt.3

venue.txt.4
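The split described above can be sketched in Python. This is only an illustration, not a tool from this guide: the function names are hypothetical, and it balances parts by line count, which approximates equal size only when rows are of similar length.

```python
from typing import List


def split_lines(lines: List[str], parts: int) -> List[List[str]]:
    """Split lines into `parts` chunks whose lengths differ by at most one."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    base, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


def part_names(filename: str, parts: int) -> List[str]:
    """Name the parts filename.1, filename.2, ... as in the venue.txt example."""
    return [f"{filename}.{i}" for i in range(1, parts + 1)]
```

Choose `parts` as a multiple of the number of slices in your cluster, then write each chunk to the matching name before uploading.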

The following COPY command loads the VENUE table using the pipe-delimited data in the data files with the prefix 'venue' in the Amazon S3 bucket mybucket.

Note
The Amazon S3 bucket mybucket in the following examples does not exist. For sample COPY commands that use real data in an existing Amazon S3 bucket, see Step 4: Load Sample Data (p. 15).

copy venue from 's3://mybucket/venue'


iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

delimiter '|';

If no Amazon S3 objects with the key prefix 'venue' exist, the load fails.

Using a Manifest to Specify Data Files

You can use a manifest to ensure that the COPY command loads all of the required files, and only the required files, for a data load. Instead of supplying an object path for the COPY command, you supply the name of a JSON-formatted text file that explicitly lists the files to be loaded. The URL in the manifest must specify the bucket name and full object path for the file, not just a prefix. You can use a manifest to load files from different buckets or files that do not share the same prefix. The following example shows the JSON to load files from different buckets and with file names that begin with date stamps.

{

"entries": [

{"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},

{"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},

{"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},

{"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}

]

}

The optional mandatory flag specifies whether COPY should return an error if the file is not found. The default of mandatory is false. Regardless of any mandatory settings, COPY will terminate if no files are found.

The following example runs the COPY command with the manifest in the previous example, which is

named cust.manifest.

copy customer

from 's3://mybucket/cust.manifest'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

manifest;

For more information, see Example: COPY from Amazon S3 using a manifest (p. 435).
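A manifest like cust.manifest can be generated rather than written by hand. The following Python sketch is an illustration only; the function name is hypothetical, and it uses just the url and mandatory keys described above.

```python
import json
from typing import Iterable


def build_manifest(urls: Iterable[str], mandatory: bool = True) -> str:
    """Return manifest JSON listing each full S3 object URL once."""
    entries = []
    for url in urls:
        # The manifest needs the bucket name and full object path, so reject
        # anything that is not an s3:// URL.
        if not url.startswith("s3://"):
            raise ValueError(f"not an s3:// URL: {url}")
        entries.append({"url": url, "mandatory": mandatory})
    return json.dumps({"entries": entries}, indent=2)
```

Upload the returned text to Amazon S3 and pass its object path to COPY with the MANIFEST option.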

Using a Manifest Created by UNLOAD

A manifest created by an UNLOAD (p. 566) operation using the MANIFEST parameter might have keys that are not required for the COPY operation. For example, the following UNLOAD manifest includes a meta key that is required for an Amazon Redshift Spectrum external table and for loading data files in an ORC or Parquet file format. The COPY operation requires only the url key and an optional mandatory key.

{
  "entries": [
    {"url":"s3://mybucket/unload/manifest_0000_part_00", "meta": { "content_length": 5956875 }},
    {"url":"s3://mybucket/unload/unload/manifest_0001_part_00", "meta": { "content_length": 5997091 }}
  ]
}
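If you prefer a manifest that carries only the keys COPY requires, an UNLOAD manifest can be reduced as in the following Python sketch. The function name is hypothetical. Keep the original manifest wherever the meta key is needed, such as for Redshift Spectrum external tables or ORC and Parquet loads.

```python
import json


def to_minimal_manifest(unload_manifest: str) -> str:
    """Keep only the url key and, when present, the mandatory key of each entry."""
    data = json.loads(unload_manifest)
    entries = []
    for entry in data["entries"]:
        slim = {"url": entry["url"]}
        if "mandatory" in entry:
            slim["mandatory"] = entry["mandatory"]
        entries.append(slim)
    return json.dumps({"entries": entries}, indent=2)
```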

Loading Compressed Data Files from Amazon S3

To load data files that are compressed using gzip, lzop, or bzip2, include the corresponding option: GZIP, LZOP, or BZIP2.

COPY does not support files compressed using the lzop --filter option.

For example, the following command loads from files that were compressed using lzop.

copy customer from 's3://mybucket/customer.lzo'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

delimiter '|' lzop;

Loading Fixed-Width Data from Amazon S3

Fixed-width data files have uniform lengths for each column of data. Each field in a fixed-width data file has exactly the same length and position. For character data (CHAR and VARCHAR) in a fixed-width data file, you must include leading or trailing spaces as placeholders in order to keep the width uniform. For integers, you must use leading zeros as placeholders. A fixed-width data file has no delimiter to separate columns.

To load a fixed-width data file into an existing table, use the FIXEDWIDTH parameter in the COPY command. Your table specifications must match the value of fixedwidth_spec in order for the data to load correctly.

To load fixed-width data from a file to a table, issue the following command:

copy table_name from 's3://mybucket/prefix'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

fixedwidth 'fixedwidth_spec';

The fixedwidth_spec parameter is a string that contains an identifier for each column and the width of each column, separated by a colon. The column:width pairs are delimited by commas. The identifier can be anything that you choose: numbers, letters, or a combination of the two. The identifier has no relation to the table itself, so the specification must contain the columns in the same order as the table.

The following two examples show the same specification, with the first using numeric identifiers and the second using string identifiers:

'0:3,1:25,2:12,3:2,4:6'

'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6'

The following example shows fixed-width sample data that could be loaded into the VENUE table using the above specifications:

1  Toyota Park              Bridgeview  IL0
2  Columbus Crew Stadium    Columbus    OH0
3  RFK Stadium              Washington  DC0
4  CommunityAmerica BallparkKansas City KS0
5  Gillette Stadium         Foxborough  MA68756

The following COPY command loads this data set into the VENUE table:

copy venue

from 's3://mybucket/data/venue_fw.txt'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

fixedwidth 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';
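Before uploading, you can check locally that a fixedwidth_spec string lines up with your data. The following Python sketch parses the column:width pairs and slices one line accordingly. The function names are hypothetical, and the code only previews the column boundaries; it makes no claim about how COPY itself treats padding.

```python
from typing import Dict, List, Tuple


def parse_fixedwidth_spec(spec: str) -> List[Tuple[str, int]]:
    """Turn 'label:width,label:width,...' into ordered (label, width) pairs."""
    columns = []
    for pair in spec.split(","):
        label, width = pair.split(":")
        columns.append((label.strip(), int(width)))
    return columns


def split_fixed_width(line: str, spec: str) -> Dict[str, str]:
    """Slice one data line into fields by consecutive column widths."""
    fields, pos = {}, 0
    for label, width in parse_fixedwidth_spec(spec):
        # Strip the placeholder spaces used to pad each field.
        fields[label] = line[pos:pos + width].strip()
        pos += width
    return fields
```

Running `split_fixed_width` on a few sample lines shows quickly whether a width is off by a character, because later fields then start in the wrong position.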


Loading Multibyte Data from Amazon S3

If your data includes non-ASCII multibyte characters (such as Chinese or Cyrillic characters), you must load the data to VARCHAR columns. The VARCHAR data type supports four-byte UTF-8 characters, but the CHAR data type only accepts single-byte ASCII characters. You cannot load five-byte or longer characters into Amazon Redshift tables. For more information about CHAR and VARCHAR, see Data Types (p. 315).

To check which encoding an input file uses, use the Linux file command:

$ file ordersdata.txt

ordersdata.txt: ASCII English text

$ file uni_ordersdata.dat

uni_ordersdata.dat: UTF-8 Unicode text
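Where the file command is not available, a similar check can be sketched in Python. The function name below is hypothetical. It reports whether the bytes are pure ASCII (suitable for CHAR columns), valid UTF-8 with multibyte characters (VARCHAR only), or not valid UTF-8 at all.

```python
def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes as ASCII, multibyte UTF-8, or invalid UTF-8."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8"
    if data.isascii():  # bytes.isascii() requires Python 3.7 or later
        return "ASCII"
    # Width in bytes of the widest character once encoded as UTF-8.
    widest = max(len(ch.encode("utf-8")) for ch in text)
    return f"UTF-8 (widest character: {widest} bytes)"
```

For a real file, pass the result of `open(path, "rb").read()`; for very large files, checking a sample of the data may be enough.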

Loading Encrypted Data Files from Amazon S3

You can use the COPY command to load data files that were uploaded to Amazon S3 using server-side encryption, client-side encryption, or both.

The COPY command supports the following types of Amazon S3 encryption:

• Server-side encryption with Amazon S3-managed keys (SSE-S3)

• Server-side encryption with AWS KMS-managed keys (SSE-KMS)

• Client-side encryption using a client-side symmetric master key

The COPY command doesn't support the following types of Amazon S3 encryption:

• Server-side encryption with customer-provided keys (SSE-C)

• Client-side encryption using an AWS KMS-managed customer master key

• Client-side encryption using a customer-provided asymmetric master key

For more information about Amazon S3 encryption, see Protecting Data Using Server-Side Encryption and Protecting Data Using Client-Side Encryption in the Amazon Simple Storage Service Developer Guide.

The UNLOAD (p. 566) command automatically encrypts files using SSE-S3. You can also unload using SSE-KMS or client-side encryption with a customer-managed symmetric key. For more information, see Unloading Encrypted Data Files (p. 245).

The COPY command automatically recognizes and loads files encrypted using SSE-S3 and SSE-KMS.

You can load files encrypted using a client-side symmetric master key by specifying the ENCRYPTED option and providing the key value. For more information, see Uploading Encrypted Data to Amazon S3 (p. 189).

To load client-side encrypted data files, provide the master key value using the MASTER_SYMMETRIC_KEY parameter and include the ENCRYPTED option.

copy customer from 's3://mybucket/encrypted/customer'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

master_symmetric_key '<master_key>'

encrypted

delimiter '|';

To load encrypted data files that are gzip, lzop, or bzip2 compressed, include the GZIP, LZOP, or BZIP2 option along with the master key value and the ENCRYPTED option.


copy customer from 's3://mybucket/encrypted/customer'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

master_symmetric_key '<master_key>'

encrypted

delimiter '|'

gzip;

Loading Data from Amazon EMR

You can use the COPY command to load data in parallel from an Amazon EMR cluster configured to write text files to the cluster's Hadoop Distributed File System (HDFS) in the form of fixed-width files, character-delimited files, CSV files, or JSON-formatted files.

Loading Data From Amazon EMR Process

This section walks you through the process of loading data from an Amazon EMR cluster. The following sections provide the details you need to accomplish each step.

• Step 1: Configure IAM Permissions

The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must have the necessary permissions.

• Step 2: Create an Amazon EMR Cluster

Configure the cluster to output text files to the Hadoop Distributed File System (HDFS). You will need the Amazon EMR cluster ID and the cluster's master public DNS (the endpoint for the Amazon EC2 instance that hosts the cluster).

• Step 3: Retrieve the Amazon Redshift Cluster Public Key and Cluster Node IP Addresses

The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the hosts. You will use the IP address for each cluster node to configure the host security groups to permit access from your Amazon Redshift cluster using these IP addresses.

• Step 4: Add the Amazon Redshift Cluster Public Key to Each Amazon EC2 Host's Authorized Keys File

You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will recognize the Amazon Redshift cluster and accept the SSH connection.

• Step 5: Configure the Hosts to Accept All of the Amazon Redshift Cluster's IP Addresses

Modify the Amazon EMR instance's security groups to add ingress rules to accept the Amazon Redshift IP addresses.

• Step 6: Run the COPY Command to Load the Data

From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift table.

Step 1: Configure IAM Permissions

The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must have the necessary permissions.

To configure IAM permissions

1. Add the following permissions for the IAM user that will create the Amazon EMR cluster.


ec2:DescribeSecurityGroups

ec2:RevokeSecurityGroupIngress

ec2:AuthorizeSecurityGroupIngress

redshift:DescribeClusters

2. Add the following permission for the IAM role or IAM user that will execute the COPY command.

elasticmapreduce:ListInstances

3. Add the following permission to the Amazon EMR cluster's IAM role.

redshift:DescribeClusters
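The three permission sets above can be written out as IAM policy documents. The following Python sketch builds them as plain dictionaries. The constant and function names are hypothetical, and the "Resource": "*" default is an assumption made for brevity; scope it to specific resources where your security requirements call for it.

```python
import json
from typing import Dict, Iterable

# Action lists copied from the three steps above.
EMR_CREATOR_ACTIONS = [
    "ec2:DescribeSecurityGroups",
    "ec2:RevokeSecurityGroupIngress",
    "ec2:AuthorizeSecurityGroupIngress",
    "redshift:DescribeClusters",
]
COPY_RUNNER_ACTIONS = ["elasticmapreduce:ListInstances"]
EMR_ROLE_ACTIONS = ["redshift:DescribeClusters"]


def allow_policy(actions: Iterable[str], resource: str = "*") -> Dict:
    """Build an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": resource}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(allow_policy(COPY_RUNNER_ACTIONS), indent=2))
```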

Step 2: Create an Amazon EMR Cluster

The COPY command loads data from files on the Amazon EMR Hadoop Distributed File System (HDFS). When you create the Amazon EMR cluster, configure the cluster to output data files to the cluster's HDFS.

To create an Amazon EMR cluster

1. Create an Amazon EMR cluster in the same AWS region as the Amazon Redshift cluster.

If the Amazon Redshift cluster is in a VPC, the Amazon EMR cluster must be in the same VPC group. If the Amazon Redshift cluster uses EC2-Classic mode (that is, it is not in a VPC), the Amazon EMR cluster must also use EC2-Classic mode. For more information, see Managing Clusters in Virtual Private Cloud (VPC) in the Amazon Redshift Cluster Management Guide.

2. Configure the cluster to output data files to the cluster's HDFS.

Important
The HDFS file names must not include asterisks (*) or question marks (?).

3. Specify No for the Auto-terminate option in the Amazon EMR cluster configuration so that the cluster remains available while the COPY command executes.

Important
If any of the data files are changed or deleted before the COPY completes, you might have unexpected results, or the COPY operation might fail.

4. Note the cluster ID and the master public DNS (the endpoint for the Amazon EC2 instance that hosts the cluster). You will use that information in later steps.

Step 3: Retrieve the Amazon Redshift Cluster Public Key and Cluster Node IP Addresses

To retrieve the Amazon Redshift cluster public key and cluster node IP addresses for your cluster using the console

1. Access the Amazon Redshift Management Console.

2. Click the Clusters link in the left navigation pane.

3. Select your cluster from the list.

4. Locate the SSH Ingestion Settings group.

Note the Cluster Public Key and Node IP addresses. You will use them in later steps.


You will use the Private IP addresses in Step 5 to configure the Amazon EC2 hosts to accept the connection from Amazon Redshift.

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon

Redshift CLI, execute the describe-clusters command. For example:

aws redshift describe-clusters --cluster-identifier <cluster-identifier>

The response will include a ClusterPublicKey value and the list of private and public IP addresses, similar

to the following:

{
    "Clusters": [
        {
            "VpcSecurityGroups": [],
            "ClusterStatus": "available",
            "ClusterNodes": [
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn",
                    "NodeRole": "LEADER",
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                },
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn",
                    "NodeRole": "COMPUTE-0",
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                },
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn",
                    "NodeRole": "COMPUTE-1",
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }
            ],
            "AutomatedSnapshotRetentionPeriod": 1,
            "PreferredMaintenanceWindow": "wed:05:30-wed:06:00",
            "AvailabilityZone": "us-east-1a",
            "NodeType": "ds1.xlarge",
            "ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift",
            ...
        }
    ]
}

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon

Redshift API, use the DescribeClusters action. For more information, see describe-clusters in the

Amazon Redshift CLI Guide or DescribeClusters in the Amazon Redshift API Guide.
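Once you have the describe-clusters response, the values needed in later steps can be pulled out programmatically. The following Python sketch assumes the response has already been parsed into a dictionary (for example with json.loads) and that it describes a single cluster. The function name and the lowercase keys in its return value are hypothetical; the response keys are those shown in the sample output above.

```python
from typing import Dict, List, Tuple


def ssh_ingestion_settings(response: Dict) -> Tuple[str, List[Dict[str, str]]]:
    """Return the cluster public key and the role and IP addresses of each node."""
    cluster = response["Clusters"][0]
    nodes = [
        {
            "role": node["NodeRole"],
            "private_ip": node["PrivateIPAddress"],
            "public_ip": node["PublicIPAddress"],
        }
        for node in cluster["ClusterNodes"]
    ]
    return cluster["ClusterPublicKey"], nodes
```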

Step 4: Add the Amazon Redshift Cluster Public Key to Each Amazon EC2 Host's Authorized Keys File

You add the cluster public key to each host's authorized keys file for all of the Amazon EMR cluster nodes so that the hosts will recognize Amazon Redshift and accept the SSH connection.

To add the Amazon Redshift cluster public key to the host's authorized keys file

1. Access the host using an SSH connection.

For information about connecting to an instance using SSH, see Connect to Your Instance in the

Amazon EC2 User Guide.

2. Copy the Amazon Redshift public key from the console or from the CLI response text.

3. Copy and paste the contents of the public key into the /home/<ssh_username>/.ssh/authorized_keys file on the host. Include the complete string, including the prefix "ssh-rsa " and suffix "Amazon-Redshift". For example:

ssh-rsa AAAACTP3isxgGzVWoIWpbVvRCOzYdVifMrh… uA70BnMHCaMiRdmvsDOedZDOedZ Amazon-Redshift
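If you script this step instead of pasting by hand, the update should be idempotent so that rerunning it does not duplicate the key. The following Python sketch works on the text of an authorized_keys file. The function name is hypothetical, and reading and writing the actual file on the host is left to you.

```python
def add_authorized_key(existing: str, public_key: str) -> str:
    """Return authorized_keys content that contains public_key exactly once."""
    key = public_key.strip()
    # The complete string must be on one line, with the prefix and suffix intact.
    if "\n" in key or not key.startswith("ssh-rsa ") or not key.endswith("Amazon-Redshift"):
        raise ValueError("expected one line: 'ssh-rsa <key> Amazon-Redshift'")
    lines = existing.splitlines()
    if key not in lines:
        lines.append(key)
    return "\n".join(lines) + "\n"
```

Calling the function twice with the same key returns the same content, so it is safe to run on every host each time you repeat the setup.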

Step 5: Configure the Hosts to Accept All of the Amazon Redshift Cluster's IP Addresses

To allow inbound traffic to the host instances, edit the security group and add one Inbound rule for each Amazon Redshift cluster node. For Type, select SSH with TCP protocol on Port 22. For Source, enter the Amazon Redshift cluster node Private IP addresses you retrieved in Step 3: Retrieve the Amazon Redshift Cluster Public Key and Cluster Node IP Addresses (p. 197). For information about adding rules to an Amazon EC2 security group, see Authorizing Inbound Traffic for Your Instances in the Amazon EC2 User Guide.
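The inbound rules described above can also be prepared as data for a scripted security group update. The following Python sketch builds one TCP port 22 rule per cluster node, using the IpPermissions layout that the Amazon EC2 AuthorizeSecurityGroupIngress action accepts. The function name is hypothetical, the /32 suffix (a single address) is an assumption, and the code makes no AWS calls.

```python
import ipaddress
from typing import Dict, Iterable, List


def ssh_ingress_permissions(node_ips: Iterable[str]) -> List[Dict]:
    """Build one SSH (TCP port 22) ingress rule per cluster node IP address."""
    permissions = []
    for ip in node_ips:
        ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
        permissions.append(
            {
                "IpProtocol": "tcp",
                "FromPort": 22,
                "ToPort": 22,
                "IpRanges": [{"CidrIp": f"{ip}/32"}],
            }
        )
    return permissions
```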

Step 6: Run the COPY Command to Load the Data

Run a COPY (p. 390) command to connect to the Amazon EMR cluster and load the data into an Amazon Redshift table. The Amazon EMR cluster must continue running until the COPY command completes. For example, do not configure the cluster to auto-terminate.

Important
If any of the data files are changed or deleted before the COPY completes, you might have unexpected results, or the COPY operation might fail.

In the COPY command, specify the Amazon EMR cluster ID and the HDFS file path and file name.

copy sales
from 'emr://myemrclusterid/myoutput/part*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

You can use the wildcard characters asterisk ( * ) and question mark ( ? ) as part of the file name argument. For example, part* loads the files part-0000, part-0001, and so on. If you specify only a folder name, COPY attempts to load all files in the folder.

Important
If you use wildcard characters or use only the folder name, verify that no unwanted files will be loaded. Otherwise, the COPY command will fail. For example, some processes might write a log file to the output folder.
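To see in advance which files a wildcard would pick up, you can test the pattern against a listing of the output folder. The following Python sketch uses shell-style matching from the standard library as an approximation. It assumes that COPY treats * and ? the way shell wildcards do; Python's fnmatch also gives square brackets a special meaning, which this guide does not describe for COPY. The function name and the file names in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def preview_matches(names: Iterable[str], pattern: str) -> List[str]:
    """Return the file names that a shell-style pattern such as part* selects."""
    return [name for name in names if fnmatchcase(name, pattern)]
```

For a folder that contains part-0000, part-0001, and job.log, the pattern part* selects only the two part files, which suggests the log file would be left out of the load.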

Loading Data from Remote Hosts

You can use the COPY command to load data in parallel from one or more remote hosts, such as Amazon EC2 instances or other computers. COPY connects to the remote hosts using SSH and executes commands on the remote hosts to generate text output.

The remote host can be an Amazon EC2 Linux instance or another Unix or Linux computer configured to accept SSH connections. This guide assumes your remote host is an Amazon EC2 instance. Where the procedure is different for another computer, the guide will point out the difference.

Amazon Redshift can connect to multiple hosts, and can open multiple SSH connections to each host. Amazon Redshift sends a unique command through each connection to generate text output to the host's standard output, which Amazon Redshift then reads as it would a text file.

Before You Begin

Before you begin, you should have the following in place:

• One or more host machines, such as Amazon EC2 instances, that you can connect to using SSH.

• Data sources on the hosts.

You will provide commands that the Amazon Redshift cluster will run on the hosts to generate the text output. After the cluster connects to a host, the COPY command runs the commands, reads the text from the hosts' standard output, and loads the data in parallel into an Amazon Redshift table. The text output must be in a form that the COPY command can ingest. For more information, see Preparing Your Input Data (p. 186).

• Access to the hosts from your computer.

For an Amazon EC2 instance, you will use an SSH connection to access the host. You will need to access the host to add the Amazon Redshift cluster's public key to the host's authorized keys file.

• A running Amazon Redshift cluster.

For information about how to launch a cluster, see Amazon Redshift Getting Started.

Loading Data Process

This section walks you through the process of loading data from remote hosts. The following sections

provide the details you need to accomplish each step.

• Step 1: Retrieve the Cluster Public Key and Cluster Node IP Addresses

The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the remote hosts. You will use the IP address for each cluster node to configure the host security groups or firewall to permit access from your Amazon Redshift cluster using these IP addresses.

• Step 2: Add the Amazon Redshift Cluster Public Key to the Host's Authorized Keys File

You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will recognize the Amazon Redshift cluster and accept the SSH connection.

• Step 3: Configure the Host to Accept All of the Amazon Redshift Cluster's IP Addresses

For Amazon EC2, modify the instance's security groups to add ingress rules to accept the Amazon Redshift IP addresses. For other hosts, modify the firewall so that your Amazon Redshift nodes are able to establish SSH connections to the remote host.

• Step 4: Get the Public Key for the Host

You can optionally specify that Amazon Redshift should use the public key to identify the host. You will need to locate the public key and copy the text into your manifest file.

• Step 5: Create a Manifest File

The manifest is a JSON-formatted text file with the details Amazon Redshift needs to connect to the hosts and fetch the data.

• Step 6: Upload the Manifest File to an Amazon S3 Bucket

Amazon Redshift reads the manifest and uses that information to connect to the remote host. If the Amazon S3 bucket does not reside in the same region as your Amazon Redshift cluster, you must use the REGION (p. 397) option to specify the region in which the manifest is located.

• Step 7: Run the COPY Command to Load the Data

From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift table.

Step 1: Retrieve the Cluster Public Key and Cluster Node IP Addresses

To retrieve the cluster public key and cluster node IP addresses for your cluster using the console

1. Access the Amazon Redshift Management Console.

2. Click the Clusters link in the left navigation pane.

3. Select your cluster from the list.

4. Locate the SSH Ingestion Settings group.

Note the Cluster Public Key and Node IP addresses. You will use them in later steps.


You will use the IP addresses in Step 3 to configure the host to accept the connection from Amazon Redshift. Depending on what type of host you connect to and whether it is in a VPC, you will use either the public IP addresses or the private IP addresses.

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon

Redshift CLI, execute the describe-clusters command.

For example:

aws redshift describe-clusters --cluster-identifier <cluster-identifier>

The response will include the ClusterPublicKey and the list of Private and Public IP addresses, similar to

the following:

{
    "Clusters": [
        {
            "VpcSecurityGroups": [],
            "ClusterStatus": "available",
            "ClusterNodes": [
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn",
                    "NodeRole": "LEADER",
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                },
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn",
                    "NodeRole": "COMPUTE-0",
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                },
                {
                    "PrivateIPAddress": "10.nnn.nnn.nnn",
                    "NodeRole": "COMPUTE-1",
                    "PublicIPAddress": "10.nnn.nnn.nnn"
                }
            ],
            "AutomatedSnapshotRetentionPeriod": 1,
            "PreferredMaintenanceWindow": "wed:05:30-wed:06:00",
            "AvailabilityZone": "us-east-1a",
            "NodeType": "ds1.xlarge",
            "ClusterPublicKey": "ssh-rsa AAAABexamplepublickey...Y3TAl Amazon-Redshift",
            ...
        }
    ]
}

To retrieve the cluster public key and cluster node IP addresses for your cluster using the Amazon

Redshift API, use the DescribeClusters action. For more information, see describe-clusters in the Amazon

Redshift CLI Guide or DescribeClusters in the Amazon Redshift API Guide.

Step 2: Add the Amazon Redshift Cluster Public Key to the Host's Authorized Keys File

You add the cluster public key to each host's authorized keys file so that the host will recognize Amazon Redshift and accept the SSH connection.

To add the Amazon Redshift cluster public key to the host's authorized keys file

1. Access the host using an SSH connection.

For information about connecting to an instance using SSH, see Connect to Your Instance in the

Amazon EC2 User Guide.

2. Copy the Amazon Redshift public key from the console or from the CLI response text.

3. Copy and paste the contents of the public key into the /home/<ssh_username>/.ssh/authorized_keys file on the remote host. The <ssh_username> must match the value for the "username" field in the manifest file. Include the complete string, including the prefix "ssh-rsa " and suffix "Amazon-Redshift". For example:

ssh-rsa AAAACTP3isxgGzVWoIWpbVvRCOzYdVifMrh… uA70BnMHCaMiRdmvsDOedZDOedZ Amazon-Redshift

Step 3: Configure the Host to Accept All of the Amazon Redshift Cluster's IP Addresses

If you are working with an Amazon EC2 instance or an Amazon EMR cluster, add Inbound rules to the host's security group to allow traffic from each Amazon Redshift cluster node. For Type, select SSH with TCP protocol on Port 22. For Source, enter the Amazon Redshift cluster node IP addresses you retrieved in Step 1: Retrieve the Cluster Public Key and Cluster Node IP Addresses (p. 201). For information about adding rules to an Amazon EC2 security group, see Authorizing Inbound Traffic for Your Instances in the Amazon EC2 User Guide.

Use the Private IP addresses when:

• You have an Amazon Redshift cluster that is not in a Virtual Private Cloud (VPC), and an Amazon EC2-Classic instance, both of which are in the same AWS region.
• You have an Amazon Redshift cluster that is in a VPC, and an Amazon EC2-VPC instance, both of which are in the same AWS region and in the same VPC.

Otherwise, use the Public IP addresses.

For more information about using Amazon Redshift in a VPC, see Managing Clusters in Virtual Private

Cloud (VPC) in the Amazon Redshift Cluster Management Guide.
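The rule above for choosing between private and public IP addresses can be expressed as a small decision function. The following Python sketch is one reading of the two bullet points; the function and parameter names are hypothetical.

```python
def ip_kind(cluster_in_vpc: bool, host_in_vpc: bool,
            same_region: bool, same_vpc: bool = False) -> str:
    """Return 'private' or 'public' according to the two cases listed above."""
    # Case 1: cluster not in a VPC, EC2-Classic instance, same AWS region.
    both_classic = not cluster_in_vpc and not host_in_vpc and same_region
    # Case 2: cluster and instance in the same VPC and the same AWS region.
    shared_vpc = cluster_in_vpc and host_in_vpc and same_region and same_vpc
    return "private" if both_classic or shared_vpc else "public"
```

For example, a cluster and an instance that are both in VPCs but in different VPCs fall through to the public addresses.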


Step 4: Get the Public Key for the Host

You can optionally provide the host's public key in the manifest file so that Amazon Redshift can identify the host. The COPY command does not require the host public key but, for security reasons, we strongly recommend using a public key to help prevent 'man-in-the-middle' attacks.

You can find the host's public key in the following location, where <ssh_host_rsa_key_name> is the unique name for the host's public key:

/etc/ssh/<ssh_host_rsa_key_name>.pub

Note
Amazon Redshift only supports RSA keys. We do not support DSA keys.

When you create your manifest file in Step 5, you will paste the text of the public key into the "publickey" field in the manifest file entry.

Step 5: Create a Manifest File

The COPY command can connect to multiple hosts using SSH, and can create multiple SSH connections to each host. COPY executes a command through each host connection, and then loads the output from the commands in parallel into the table. The manifest file is a text file in JSON format that Amazon Redshift uses to connect to the host. The manifest file specifies the SSH host endpoints and the commands that will be executed on the hosts to return data to Amazon Redshift. Optionally, you can include the host public key, the login user name, and a mandatory flag for each entry.

The manifest file is in the following format:

{
   "entries": [
     {"endpoint":"<ssh_endpoint_or_IP>",
       "command": "<remote_command>",
       "mandatory":true,
       "publickey": "<public_key>",
       "username": "<host_user_name>"},
     {"endpoint":"<ssh_endpoint_or_IP>",
       "command": "<remote_command>",
       "mandatory":true,
       "publickey": "<public_key>",
       "username": "<host_user_name>"}
    ]
}

The "entries" array in the manifest file contains one entry for each SSH connection. You can have multiple connections to a single host or multiple connections to multiple hosts. The double quotes are required as shown, both for the field names and the values. The only value that does not need double quotes is the Boolean value true or false for the mandatory field.

The following table describes the fields in the manifest file.

endpoint

The URL address or IP address of the host. For example,

"ec2-111-222-333.compute-1.amazonaws.com" or "22.33.44.56"

command

The command that will be executed by the host to generate text or binary (gzip, lzop, or bzip2) output. The command can be any command that the user "host_user_name" has permission to run. The command can be as simple as printing a file, or it could query a database or launch a script. The output (text file, gzip binary file, lzop binary file, or bzip2 binary file) must be in a form the Amazon Redshift COPY command can ingest. For more information, see Preparing Your Input Data (p. 186).

publickey

(Optional) The public key of the host. If provided, Amazon Redshift will use the public key to identify the host. If the public key is not provided, Amazon Redshift will not attempt host identification. For example, if the remote host's public key is: ssh-rsa AbcCbaxxx…xxxDHKJ root@amazon.com, enter the following text in the publickey field: AbcCbaxxx…xxxDHKJ.

mandatory

(Optional) Indicates whether the COPY command should fail if the connection fails. The default is

false. If Amazon Redshift does not successfully make at least one connection, the COPY command

fails.

username

(Optional) The username that will be used to log on to the host system and execute the remote command. The user login name must be the same as the login that was used to add the public key to the host's authorized keys file in Step 2. The default username is "redshift".
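As the publickey description above indicates, only the key text goes into the manifest, without the "ssh-rsa" prefix or the trailing comment. The following Python sketch extracts that middle portion from the contents of a host's .pub file. The function name is hypothetical, and the key text in the usage note is a made-up placeholder.

```python
def publickey_field(pub_line: str) -> str:
    """Return the key text between the 'ssh-rsa' prefix and the trailing comment."""
    parts = pub_line.split()
    if len(parts) < 2 or parts[0] != "ssh-rsa":
        # Only RSA keys are supported, so anything else is rejected.
        raise ValueError("expected an RSA public key line starting with 'ssh-rsa'")
    return parts[1]
```

For a line such as `ssh-rsa AbcCbaDHKJ root@amazon.com`, the function returns `AbcCbaDHKJ`.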

The following example shows a completed manifest to open four connections to the same host and execute a different command through each connection:

{

"entries": [

{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",

"command": "cat loaddata1.txt",

"mandatory":true,

"publickey": "ec2publickeyportionoftheec2keypair",

"username": "ec2-user"},

{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",

"command": "cat loaddata2.txt",

"mandatory":true,

"publickey": "ec2publickeyportionoftheec2keypair",

"username": "ec2-user"},

{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",

"command": "cat loaddata3.txt",

"mandatory":true,

"publickey": "ec2publickeyportionoftheec2keypair",

"username": "ec2-user"},

{"endpoint":"ec2-184-72-204-112.compute-1.amazonaws.com",

"command": "cat loaddata4.txt",

"mandatory":true,

"publickey": "ec2publickeyportionoftheec2keypair",

"username": "ec2-user"}

]

}
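A manifest with many entries is easier to generate than to type by hand. The following Python sketch rebuilds the four-entry manifest shown above. The function name build_ssh_manifest and its parameters are illustrative only and are not part of any AWS tool.

```python
import json


def build_ssh_manifest(endpoint, commands, publickey=None, username=None, mandatory=False):
    """Return a manifest dict with one entry per remote command (hypothetical helper)."""
    entries = []
    for command in commands:
        entry = {"endpoint": endpoint, "command": command, "mandatory": mandatory}
        # publickey and username are optional fields; omit them when not supplied.
        if publickey is not None:
            entry["publickey"] = publickey
        if username is not None:
            entry["username"] = username
        entries.append(entry)
    return {"entries": entries}


manifest = build_ssh_manifest(
    endpoint="ec2-184-72-204-112.compute-1.amazonaws.com",
    commands=[f"cat loaddata{i}.txt" for i in range(1, 5)],
    publickey="ec2publickeyportionoftheec2keypair",
    username="ec2-user",
    mandatory=True,
)
# Serialize to the JSON text that would be uploaded as the manifest file.
manifest_text = json.dumps(manifest, indent=2)
```

Generating the entries from a list keeps the endpoint, key, and user name consistent across connections, which is where hand-edited manifests tend to drift.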

Step 6: Upload the Manifest File to an Amazon S3 Bucket

Upload the manifest ﬁle to an Amazon S3 bucket. If the Amazon S3 bucket does not reside in the same

region as your Amazon Redshift cluster, you must use the REGION (p. 397) option to specify the region

in which the manifest is located. For information about creating an Amazon S3 bucket and uploading a

ﬁle, see Amazon Simple Storage Service Getting Started Guide.

Step 7: Run the COPY Command to Load the Data

Run a COPY (p. 390) command to connect to the host and load the data into an Amazon Redshift table.

In the COPY command, specify the explicit Amazon S3 object path for the manifest ﬁle and include the

SSH option. For example,

copy sales

from 's3://mybucket/ssh_manifest'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

delimiter '|'

ssh;

Note

If you use automatic compression, the COPY command performs two data reads, which

means it will execute the remote command twice. The ﬁrst read is to provide a sample for

compression analysis, then the second read actually loads the data. If executing the remote

command twice might cause a problem because of potential side eﬀects, you should disable

automatic compression. To disable automatic compression, run the COPY command with the

COMPUPDATE option set to OFF. For more information, see Loading Tables with Automatic

Compression (p. 209).

Loading Data from an Amazon DynamoDB Table

You can use the COPY command to load a table with data from a single Amazon DynamoDB table.

Important

The Amazon DynamoDB table that provides the data must be created in the same region as your

cluster unless you use the REGION (p. 397) option to specify the region in which the Amazon

DynamoDB table is located.

The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to

read and load data in parallel from an Amazon DynamoDB table. You can take maximum advantage of

parallel processing by setting distribution styles on your Amazon Redshift tables. For more information,

see Choosing a Data Distribution Style (p. 129).

Important

When the COPY command reads data from the Amazon DynamoDB table, the resulting data

transfer is part of that table's provisioned throughput.

To avoid consuming excessive amounts of provisioned read throughput, we recommend that you not

load data from Amazon DynamoDB tables that are in production environments. If you do load data from

production tables, we recommend that you set the READRATIO option much lower than the average

percentage of unused provisioned throughput. A low READRATIO setting will help minimize throttling

issues. To use the entire provisioned throughput of an Amazon DynamoDB table, set READRATIO to 100.
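As a rough worked example of that recommendation, the following Python sketch derives a READRATIO value from a table's provisioned and average consumed read capacity. The function name, the safety_factor parameter, and its default of 0.5 are assumptions made for illustration; they are not values that Amazon Redshift or Amazon DynamoDB prescribes.

```python
def suggest_readratio(provisioned_rcu, average_consumed_rcu, safety_factor=0.5):
    """Suggest a READRATIO below the average unused percentage (hypothetical helper).

    The result is clamped to 1 through 100, the range discussed in this section.
    """
    if provisioned_rcu <= 0:
        raise ValueError("provisioned_rcu must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    # Percentage of provisioned read throughput that is unused on average.
    unused_pct = max(0.0, 100.0 * (provisioned_rcu - average_consumed_rcu) / provisioned_rcu)
    # Scale down so the load stays well below the unused share.
    return max(1, min(100, int(unused_pct * safety_factor)))


# 1000 provisioned units with 600 consumed on average leaves 40% unused,
# so a safety factor of 0.5 suggests READRATIO 20.
ratio = suggest_readratio(1000, 600)
```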

The COPY command matches attribute names in the items retrieved from the DynamoDB table to

column names in an existing Amazon Redshift table by using the following rules:

• Amazon Redshift table columns are case-insensitively matched to Amazon DynamoDB item attributes.

If an item in the DynamoDB table contains multiple attributes that diﬀer only in case, such as Price and

PRICE, the COPY command will fail.

• Amazon Redshift table columns that do not match an attribute in the Amazon DynamoDB table are

loaded as either NULL or empty, depending on the value speciﬁed with the EMPTYASNULL option in

the COPY (p. 390) command.

• Amazon DynamoDB attributes that do not match a column in the Amazon Redshift table are

discarded. Attributes are read before they are matched, and so even discarded attributes consume part

of that table's provisioned throughput.

• Only Amazon DynamoDB attributes with scalar STRING and NUMBER data types are supported. The

Amazon DynamoDB BINARY and SET data types are not supported. If a COPY command tries to load

an attribute with an unsupported data type, the command will fail. If the attribute does not match an

Amazon Redshift table column, COPY does not attempt to load it, and it does not raise an error.
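The matching rules above can be sketched in Python as follows. This is a simplified model of the documented behavior, not the COPY implementation: match_item is a hypothetical function, attributes are represented as (type code, value) pairs using the DynamoDB type codes S and N, and None stands in for either NULL or empty.

```python
SUPPORTED_TYPES = {"S", "N"}  # DynamoDB scalar STRING and NUMBER type codes


def match_item(columns, item):
    """Map one DynamoDB item to Redshift columns following the rules above.

    columns: list of Redshift column names.
    item: dict of attribute name -> (type_code, value).
    Returns a dict of column -> value, where None means NULL or empty.
    """
    by_lower = {}
    for name, typed_value in item.items():
        key = name.lower()
        # Attributes that differ only in case make the load fail.
        if key in by_lower:
            raise ValueError(f"attributes differ only in case: {name!r}")
        by_lower[key] = typed_value

    row = {}
    for column in columns:
        typed_value = by_lower.get(column.lower())
        if typed_value is None:
            row[column] = None  # no matching attribute
            continue
        type_code, value = typed_value
        # Unsupported types fail only when the attribute matches a column.
        if type_code not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported type {type_code!r} for column {column!r}")
        row[column] = value
    # Attributes that matched no column were never copied into row: they are discarded.
    return row


row = match_item(
    ["id", "price", "title"],
    {"ID": ("N", "7"), "Price": ("N", "9.99"), "Tags": ("SS", ["a"])},
)
```

In this usage, the unmatched Tags attribute is dropped without an error even though its type is unsupported, and the title column has no matching attribute.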

The COPY command uses the following syntax to load data from an Amazon DynamoDB table:

copy <redshift_tablename> from 'dynamodb://<dynamodb_table_name>'

authorization

readratio '<integer>';

The values for authorization are the AWS credentials needed to access the Amazon DynamoDB table. If

these credentials correspond to an IAM user, that IAM user must have permission to SCAN and DESCRIBE

the Amazon DynamoDB table that is being loaded.

The values for authorization provide the AWS authorization your cluster needs to access the Amazon

DynamoDB table. The permission must include SCAN and DESCRIBE for the Amazon DynamoDB table

that is being loaded. For more information about required permissions, see IAM Permissions for COPY,

UNLOAD, and CREATE LIBRARY (p. 427). The preferred method for authentication is to specify the

IAM_ROLE parameter and provide the Amazon Resource Name (ARN) for an IAM role with the necessary

permissions. Alternatively, you can specify the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters

and provide the access key ID and secret access key for an authorized IAM user as plain text. For more

information, see Role-Based Access Control (p. 424) or Key-Based Access Control (p. 425).

To authenticate using the IAM_ROLE parameter, replace <aws-account-id> and <role-name> as shown in the following syntax.

IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'

The following example shows authentication using an IAM role.

copy favoritemovies

from 'dynamodb://ProductCatalog'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

To authenticate using IAM user credentials, replace <access-key-id> and <secret-access-key> with an authorized user's access key ID and full secret access key for the ACCESS_KEY_ID and SECRET_ACCESS_KEY parameters, as shown in the following syntax.

ACCESS_KEY_ID '<access-key-id>'

SECRET_ACCESS_KEY '<secret-access-key>';

The following example shows authentication using IAM user credentials.

copy favoritemovies

from 'dynamodb://ProductCatalog'

access_key_id '<access-key-id>'

secret_access_key '<secret-access-key>';

For more information about other authorization options, see Authorization Parameters (p. 404).

If you want to validate your data without actually loading the table, use the NOLOAD option with the

COPY (p. 390) command.

The following example loads the FAVORITEMOVIES table with data from the DynamoDB table my-favorite-movies-table. The read activity can consume up to 50% of the provisioned throughput.

copy favoritemovies from 'dynamodb://my-favorite-movies-table'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

readratio 50;

To maximize throughput, the COPY command loads data from an Amazon DynamoDB table in parallel

across the compute nodes in the cluster.

Provisioned Throughput with Automatic Compression

By default, the COPY command applies automatic compression whenever you specify an empty target

table with no compression encoding. The automatic compression analysis initially samples a large

number of rows from the Amazon DynamoDB table. The sample size is based on the value of the

COMPROWS parameter. The default is 100,000 rows per slice.

After sampling, the sample rows are discarded and the entire table is loaded. As a result, many rows are

read twice. For more information about how automatic compression works, see Loading Tables with

Automatic Compression (p. 209).

Important

When the COPY command reads data from the Amazon DynamoDB table, including the rows

used for sampling, the resulting data transfer is part of that table's provisioned throughput.
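To get a feel for the extra read volume, the following Python sketch estimates how many rows a load with automatic compression reads in total. It assumes the sample is the default 100,000 rows per slice stated above and that the sample cannot exceed the table size. The function name is hypothetical, and the result is a row count, not a measure of read capacity units.

```python
def estimate_rows_read(table_rows, slices, comprows_per_slice=100_000):
    """Estimate total rows read: the compression sample plus the full load."""
    # The sample cannot be larger than the table itself (assumption of this sketch).
    sample_rows = min(table_rows, slices * comprows_per_slice)
    return sample_rows + table_rows


# A 3,000,000-row table on a 4-slice cluster: 400,000 sampled + 3,000,000 loaded.
large_table = estimate_rows_read(3_000_000, slices=4)
# A 200,000-row table on the same cluster is read twice in full.
small_table = estimate_rows_read(200_000, slices=4)
```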

Loading Multibyte Data from Amazon DynamoDB

If your data includes non-ASCII multibyte characters (such as Chinese or Cyrillic characters), you must

load the data to VARCHAR columns. The VARCHAR data type supports four-byte UTF-8 characters,

but the CHAR data type only accepts single-byte ASCII characters. You cannot load ﬁve-byte or longer

characters into Amazon Redshift tables. For more information about CHAR and VARCHAR, see Data

Types (p. 315).
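Before choosing column types, it can help to check how many UTF-8 bytes the widest character in your data needs. The following Python sketch does that for a string; the function names are illustrative.

```python
def max_utf8_bytes(text):
    """Return the largest number of UTF-8 bytes used by any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def needs_varchar(text):
    """Return True if text contains any character outside single-byte ASCII."""
    return max_utf8_bytes(text) > 1


ascii_width = max_utf8_bytes("abc")
cyrillic_width = max_utf8_bytes("Привет")
chinese_width = max_utf8_bytes("中文")
```

A string whose widest character is one byte fits a CHAR column; anything wider calls for VARCHAR, sized in bytes rather than characters.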

Verifying That the Data Was Loaded Correctly

After the load operation is complete, query the STL_LOAD_COMMITS (p. 823) system table to verify that the expected files were loaded. You should execute the COPY command and load verification within the same transaction so that if there is a problem with the load you can roll back the entire transaction.

The following query returns entries for loading the tables in the TICKIT database:

select query, trim(filename) as filename, curtime, status

from stl_load_commits

where filename like '%tickit%' order by query;

 query |         filename          |          curtime           | status
-------+---------------------------+----------------------------+--------
 22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 |      1
 22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 |      1
 22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 |      1
 22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 |      1
 22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  |      1
 22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 |      1
 22489 | tickit/sales_tab.txt      | 2013-02-08 20:58:37.632939 |      1
(7 rows)

Validating Input Data

To validate the data in the Amazon S3 input ﬁles or Amazon DynamoDB table before you actually load

the data, use the NOLOAD option with the COPY (p. 390) command. Use NOLOAD with the same COPY

commands and options you would use to actually load the data. NOLOAD checks the integrity of all of

the data without loading it into the database. The NOLOAD option displays any errors that would occur if

you had attempted to load the data.

For example, if you speciﬁed the incorrect Amazon S3 path for the input ﬁle, Amazon Redshift would

display the following error:

ERROR: No such file or directory

DETAIL:

-----------------------------------------------

Amazon Redshift error: The specified key does not exist

code: 2

context: S3 key being read :

location: step_scan.cpp:1883

process: xenmaster [pid=22199]

-----------------------------------------------

To troubleshoot error messages, see the Load Error Reference (p. 215).

Loading Tables with Automatic Compression

Topics

• How Automatic Compression Works (p. 209)

• Automatic Compression Example (p. 210)

You can apply compression encodings to columns in tables manually, based on your own evaluation

of the data, or you can use the COPY command to analyze and apply compression automatically. We

strongly recommend using the COPY command to apply automatic compression.

You can use automatic compression when you create and load a brand new table. The COPY command

will perform a compression analysis. You can also perform a compression analysis without loading data

or changing the compression on a table by running the ANALYZE COMPRESSION (p. 382) command

against an already populated table. For example, you can run the ANALYZE COMPRESSION command

when you want to analyze compression on a table for future use, while preserving the existing DDL.

Automatic compression balances overall performance when choosing compression encodings. Range-

restricted scans might perform poorly if sort key columns are compressed much more highly than other

columns in the same query. As a result, automatic compression will choose a less eﬃcient compression

encoding to keep the sort key columns balanced with other columns. However, ANALYZE COMPRESSION

does not take sort keys into account, so it might recommend a diﬀerent encoding for the sort key than

what automatic compression would choose. If you use ANALYZE COMPRESSION, consider changing the

encoding to RAW for sort keys.

How Automatic Compression Works

By default, the COPY command applies automatic compression whenever you run the COPY command

with an empty target table and all of the table columns either have RAW encoding or no encoding.

To apply automatic compression to an empty table, regardless of its current compression encodings, run

the COPY command with the COMPUPDATE option set to ON. To disable automatic compression, run the

COPY command with the COMPUPDATE option set to OFF.

You cannot apply automatic compression to a table that already contains data.
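The conditions above can be summarized as a small decision function. The following Python sketch is only a restatement of those rules: the function name is hypothetical, and it treats None, "raw", and "none" as meaning that a column has no compression encoding.

```python
def auto_compression_applies(table_is_empty, column_encodings, compupdate=None):
    """Return True if COPY would run automatic compression under the rules above.

    compupdate is None when the option is omitted, or the string "ON" or "OFF".
    """
    # Never applies to a table that already contains data, or when switched off.
    if not table_is_empty or compupdate == "OFF":
        return False
    # COMPUPDATE ON applies to an empty table regardless of current encodings.
    if compupdate == "ON":
        return True
    # Default: every column must have RAW encoding or no encoding.
    return all(enc is None or enc.lower() in ("raw", "none") for enc in column_encodings)


default_on_raw = auto_compression_applies(True, ["raw", None])
default_on_encoded = auto_compression_applies(True, ["raw", "delta"])
forced_on_encoded = auto_compression_applies(True, ["raw", "delta"], compupdate="ON")
```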

Note

Automatic compression analysis requires enough rows in the load data (at least 100,000 rows

per slice) to generate a meaningful sample.

Automatic compression performs these operations in the background as part of the load transaction:

1. An initial sample of rows is loaded from the input ﬁle. Sample size is based on the value of the

COMPROWS parameter. The default is 100,000.

2. Compression options are chosen for each column.

3. The sample rows are removed from the table.

4. The table is recreated with the chosen compression encodings.

5. The entire input ﬁle is loaded and compressed using the new encodings.

After you run the COPY command, the table is fully loaded, compressed, and ready for use. If you load

more data later, appended rows are compressed according to the existing encoding.

If you only want to perform a compression analysis, run ANALYZE COMPRESSION, which is more

eﬃcient than running a full COPY. Then you can evaluate the results to decide whether to use automatic

compression or recreate the table manually.

Automatic compression is supported only for the COPY command. Alternatively, you can manually apply

compression encoding when you create the table. For information about manual compression encoding,

see Choosing a Column Compression Type (p. 118).

Automatic Compression Example

In this example, assume that the TICKIT database contains a copy of the LISTING table called BIGLIST,

and you want to apply automatic compression to this table when it is loaded with approximately 3

million rows.

To load and automatically compress the table

1. Ensure that the table is empty. You can apply automatic compression only to an empty table:

truncate biglist;

2. Load the table with a single COPY command. Although the table is empty, some earlier encoding

might have been speciﬁed. To ensure that Amazon Redshift performs a compression analysis, set the

COMPUPDATE parameter to ON.

copy biglist from 's3://mybucket/biglist.txt'

iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'

delimiter '|' COMPUPDATE ON;

Because no COMPROWS option is speciﬁed, the default and recommended sample size of 100,000

rows per slice is used.

3. Look at the new schema for the BIGLIST table in order to review the automatically chosen encoding

schemes.

select "column", type, encoding

from pg_table_def where tablename = 'biglist';

    Column     |            Type             | Encoding
---------------+-----------------------------+----------
listid         | integer                     | delta
sellerid       | integer                     | delta32k
eventid        | integer                     | delta32k
dateid         | smallint                    | delta
numtickets     | smallint                    | delta
priceperticket | numeric(8,2)                | delta32k
totalprice     | numeric(8,2)                | mostly32
listtime       | timestamp without time zone | none

4. Verify that the expected number of rows were loaded:

select count(*) from biglist;

count

---------

3079952

(1 row)

When rows are later appended to this table using COPY or INSERT statements, the same compression

encodings will be applied.

Optimizing Storage for Narrow Tables

If you have a table with very few columns but a very large number of rows, the three hidden metadata

identity columns (INSERT_XID, DELETE_XID, ROW_ID) will consume a disproportionate amount of the

disk space for the table.

In order to optimize compression of the hidden columns, load the table in a single COPY transaction

where possible. If you load the table with multiple separate COPY commands, the INSERT_XID column

will not compress well. You will need to perform a vacuum operation if you use multiple COPY

commands, but it will not improve compression of INSERT_XID.

Loading Default Column Values

You can optionally deﬁne a column list in your COPY command. If a column in the table is omitted from

the column list, COPY will load the column with either the value supplied by the DEFAULT option that

was speciﬁed in the CREATE TABLE command, or with NULL if the DEFAULT option was not speciﬁed.

If COPY attempts to assign NULL to a column that is deﬁned as NOT NULL, the COPY command fails. For

information about assigning the DEFAULT option, see CREATE TABLE (p. 471).

When loading from data ﬁles on Amazon S3, the columns in the column list must be in the same order as

the ﬁelds in the data ﬁle. If a ﬁeld in the data ﬁle does not have a corresponding column in the column

list, the COPY command fails.

When loading from an Amazon DynamoDB table, order does not matter. Any Amazon DynamoDB attributes that do not match a column in the Amazon Redshift table are discarded.

The following restrictions apply when using the COPY command to load DEFAULT values into a table:

• If an IDENTITY (p. 474) column is included in the column list, the EXPLICIT_IDS option must also be

speciﬁed in the COPY (p. 390) command, or the COPY command will fail. Similarly, if an IDENTITY

column is omitted from the column list, and the EXPLICIT_IDS option is speciﬁed, the COPY operation

will fail.

• Because the evaluated DEFAULT expression for a given column is the same for all loaded rows, a DEFAULT expression that uses a RANDOM() function will assign the same value to all the rows.

• DEFAULT expressions that contain CURRENT_DATE or SYSDATE are set to the timestamp of the current

transaction.

For an example, see "Load data from a ﬁle with default values" in COPY Examples (p. 434).
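The column-list rules in this section can be modeled with a short Python sketch. It is a simplified illustration, not the COPY implementation: resolve_row and the table description format are hypothetical, IDENTITY columns are not modeled, and the sketch treats any mismatch between the number of fields and the number of listed columns as a failure, although the text above states only the extra-field case.

```python
def resolve_row(table_columns, column_list, values):
    """Build one loaded row from a column list and the fields of one input line.

    table_columns: dict of column name -> {"default": value, "not_null": bool}
    (both keys optional). None stands for NULL.
    """
    if len(values) != len(column_list):
        raise ValueError("number of fields does not match the column list")
    supplied = dict(zip(column_list, values))
    row = {}
    for name, spec in table_columns.items():
        if name in supplied:
            row[name] = supplied[name]
        else:
            # Omitted column: use the DEFAULT value, or NULL if none was defined.
            row[name] = spec.get("default")
        if row[name] is None and spec.get("not_null", False):
            raise ValueError(f"NULL assigned to NOT NULL column {name!r}")
    return row


# Hypothetical table: venueid is NOT NULL, venueseats has DEFAULT 1000.
table = {"venueid": {"not_null": True}, "venuename": {}, "venueseats": {"default": 1000}}
loaded = resolve_row(table, ["venueid", "venuename"], [1, "Arena"])
```

Omitting venueid from the column list in this model raises an error, which mirrors the rule that COPY fails when NULL would be assigned to a NOT NULL column.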

Troubleshooting Data Loads

Topics

• S3ServiceException Errors (p. 212)

• System Tables for Troubleshooting Data Loads (p. 213)

• Multibyte Character Load Errors (p. 214)

• Load Error Reference (p. 215)

This section provides information about identifying and resolving data loading errors.

S3ServiceException Errors

The most common S3ServiceException errors are caused by an improperly formatted or incorrect credentials string, having your cluster and your bucket in different regions, and insufficient Amazon S3 privileges.

This section provides troubleshooting information for each type of error.

Invalid Credentials String

If your credentials string was improperly formatted, you will receive the following error message:

ERROR: Invalid credentials. Must be of the format: credentials

'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>

[;token=<temporary-session-token>]'

Verify that the credentials string does not contain any spaces or line breaks, and is enclosed in single

quotes.
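A quick client-side check can catch formatting mistakes before the COPY command is submitted. The following Python sketch tests a credentials string against the format shown in the error message. The function name is illustrative, and the check covers only the format: it cannot tell whether the keys themselves are valid.

```python
import re

# Key/value pairs separated by semicolons, with no spaces or line breaks,
# and an optional trailing token for temporary credentials.
_CREDENTIALS_RE = re.compile(
    r"aws_access_key_id=[^;\s]+;aws_secret_access_key=[^;\s]+(;token=[^;\s]+)?"
)


def credentials_string_ok(credentials):
    """Return True if the string has the expected credentials format."""
    return _CREDENTIALS_RE.fullmatch(credentials) is not None


# Placeholder values only; never hard-code real keys.
good = credentials_string_ok("aws_access_key_id=AKIDEXAMPLE;aws_secret_access_key=SECRETEXAMPLE")
with_space = credentials_string_ok("aws_access_key_id=AKIDEXAMPLE; aws_secret_access_key=SECRETEXAMPLE")
```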

Invalid Access Key ID

If your access key ID does not exist, you will receive the following error message:

[Amazon](500310) Invalid operation: S3ServiceException:The AWS Access Key Id you provided

does not exist in our records.

This is often a copy and paste error. Verify that the access key ID was entered correctly.

Invalid Secret Access Key

If your secret access key is incorrect, you will receive the following error message:

[Amazon](500310) Invalid operation: S3ServiceException:The request signature we calculated

does not match the signature you provided.

Check your key and signing method.,Status 403,Error SignatureDoesNotMatch

This is often a copy and paste error. Verify that the secret access key was entered correctly and that it is

the correct key for the access key ID.

Bucket is in a Different Region

The Amazon S3 bucket speciﬁed in the COPY command must be in the same region as the cluster. If

your Amazon S3 bucket and your cluster are in diﬀerent regions, you will receive an error similar to the

following:

ERROR: S3ServiceException:The bucket you are attempting to access must be addressed using

the specified endpoint.

You can create an Amazon S3 bucket in a speciﬁc region either by selecting the region when you create

the bucket by using the Amazon S3 Management Console, or by specifying an endpoint when you

create the bucket using the Amazon S3 API or CLI. For more information, see Uploading Files to Amazon

S3 (p. 188).

For more information about Amazon S3 regions, see Accessing a Bucket in the Amazon Simple Storage

Service Developer Guide.

Alternatively, you can specify the region using the REGION (p. 397) option with the COPY command.

Access Denied

The user account identiﬁed by the credentials must have LIST and GET access to the Amazon S3 bucket.

If the user does not have suﬃcient privileges, you will receive the following error message:

ERROR: S3ServiceException:Access Denied,Status 403,Error AccessDenied

For information about managing user access to buckets, see Access Control in the Amazon S3 Developer

Guide.

System Tables for Troubleshooting Data Loads

The following Amazon Redshift system tables can be helpful in troubleshooting data load issues:

• Query STL_LOAD_ERRORS (p. 825) to discover the errors that occurred during speciﬁc loads.

• Query STL_FILE_SCAN (p. 816) to view load times for speciﬁc ﬁles or to see if a speciﬁc ﬁle was even

read.

• Query STL_S3CLIENT_ERROR (p. 847) to ﬁnd details for errors encountered while transferring data

from Amazon S3.

To ﬁnd and diagnose load errors

1. Create a view or deﬁne a query that returns details about load errors. The following example joins

the STL_LOAD_ERRORS table to the STV_TBL_PERM table to match table IDs with actual table

names.

create view loadview as

(select distinct tbl, trim(name) as table_name, query, starttime,

trim(filename) as input, line_number, colname, err_code,

trim(err_reason) as reason

from stl_load_errors sl, stv_tbl_perm sp

where sl.tbl = sp.id);

2. Set the MAXERRORS option in your COPY command to a large enough value to enable COPY to

return useful information about your data. If the COPY encounters errors, an error message directs

you to consult the STL_LOAD_ERRORS table for details.

3. Query the LOADVIEW view to see error details. For example:

select * from loadview where table_name='venue';

  tbl   | table_name | query |         starttime
--------+------------+-------+----------------------------
 100551 | venue      | 20974 | 2013-01-29 19:05:58.365391

     input      | line_number | colname | err_code |       reason
----------------+-------------+---------+----------+---------------------
 venue_pipe.txt |           1 |       0 |     1214 | Delimiter not found

4. Fix the problem in the input ﬁle or the load script, based on the information that the view returns.

Some typical load errors to watch for include:

• Mismatch between data types in table and values in input data ﬁelds.

• Mismatch between number of columns in table and number of ﬁelds in input data.

• Mismatched quotes. Amazon Redshift supports both single and double quotes; however, these

quotes must be balanced appropriately.

• Incorrect format for date/time data in input ﬁles.

• Out-of-range values in input ﬁles (for numeric columns).

• Number of distinct values for a column exceeds the limitation for its compression encoding.
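Some of these problems can be caught before the load runs. As one example, the following Python sketch reports input lines whose field count does not match the number of table columns. The function name is illustrative, and the check is deliberately naive: it splits on the delimiter only and does not understand quoted or escaped delimiters.

```python
def find_field_count_mismatches(lines, expected_fields, delimiter="|"):
    """Return (line_number, field_count) for each line with the wrong number of fields."""
    mismatches = []
    for line_number, line in enumerate(lines, start=1):
        count = len(line.rstrip("\n").split(delimiter))
        if count != expected_fields:
            mismatches.append((line_number, count))
    return mismatches


# Hypothetical pipe-delimited sample for a five-column table; line 2 is missing a field.
sample = [
    "1|Toyota Park|Bridgeview|IL|0\n",
    "2|Columbus Crew Stadium|Columbus|OH\n",
    "3|RFK Stadium|Washington|DC|0\n",
]
problems = find_field_count_mismatches(sample, expected_fields=5)
```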

Multibyte Character Load Errors

Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F

hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a

maximum of four bytes. For more information, see Character Types (p. 323).

If a line in your load data contains a character that is invalid for the column data type, COPY returns

an error and logs a row in the STL_LOAD_ERRORS system log table with error number 1220. The

ERR_REASON ﬁeld includes the byte sequence, in hex, for the invalid character.

An alternative to ﬁxing invalid characters in your load data is to replace the invalid characters during the

load process. To replace invalid UTF-8 characters, specify the ACCEPTINVCHARS option with the COPY

command. For more information, see ACCEPTINVCHARS (p. 417).

The following example shows the error reason when COPY attempts to load UTF-8 character e0 a1 c7a4

into a CHAR column:

Multibyte character not supported for CHAR

(Hint: Try using VARCHAR). Invalid char: e0 a1 c7a4

If the error is related to a VARCHAR data type, the error reason includes an error code as well as the invalid UTF-8 hex sequence. The following example shows the error reason when COPY attempts to load UTF-8 a4 into a VARCHAR field:

String contains invalid or unsupported UTF-8 codepoints.

Bad UTF-8 hex sequence: a4 (error 3)

The following table lists the descriptions and suggested workarounds for VARCHAR load errors. If one of

these errors occurs, replace the character with a valid UTF-8 code sequence or remove the character.

Error code Description

1 The UTF-8 byte sequence exceeds the four-byte maximum supported by VARCHAR.

2 The UTF-8 byte sequence is incomplete. COPY did not ﬁnd the expected number of

continuation bytes for a multibyte character before the end of the string.

3 The UTF-8 single-byte character is out of range. The starting byte must not be 254,

255 or any character between 128 and 191 (inclusive).

4 The value of the trailing byte in the byte sequence is out of range. The continuation

byte must be between 128 and 191 (inclusive).

5 The UTF-8 character is reserved as a surrogate. Surrogate code points (U+D800

through U+DFFF) are invalid.

6 The character is not a valid UTF-8 character (code points 0xFDD0 to 0xFDEF).

7 The character is not a valid UTF-8 character (code points 0xFFFE and 0xFFFF).

8 The byte sequence exceeds the maximum UTF-8 code point.

9 The UTF-8 byte sequence does not have a matching code point.
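The byte-level rules behind error codes 1 through 4 can be illustrated with a short Python sketch. It inspects only the first UTF-8 sequence in a byte string and applies the ranges given in the table; it is an illustration of those descriptions, not the validation that COPY actually performs, and it does not cover codes 5 through 9. The function name is hypothetical.

```python
def classify_first_sequence(data):
    """Return error code 1-4 for the first UTF-8 sequence in data, or None if none applies."""
    if not data:
        return None
    lead = data[0]
    if lead < 0x80:
        return None  # single-byte ASCII character
    if 0x80 <= lead <= 0xBF or lead >= 0xFE:
        return 3  # starting byte is 128-191, 254, or 255
    if lead >= 0xF8:
        return 1  # lead byte announces a sequence longer than four bytes
    # Expected total length from the lead byte: 2, 3, or 4 bytes.
    length = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    continuation = data[1:length]
    if any(not 0x80 <= b <= 0xBF for b in continuation):
        return 4  # a continuation byte is outside 128-191
    if len(continuation) < length - 1:
        return 2  # the string ended before the sequence was complete
    return None


# The lone byte a4 from the example above is a continuation byte used as a starting byte.
code_for_a4 = classify_first_sequence(b"\xa4")
```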

Load Error Reference

If any errors occur while loading data from a ﬁle, query the STL_LOAD_ERRORS (p. 825) table to

identify the error and determine the possible explanation. The following table lists all error codes that

might occur during data loads:

Load Error Codes

Error code Description

1200 Unknown parse error. Contact support.

1201 Field delimiter was not found in the input ﬁle.

1202 Input data had more columns than were deﬁned in the DDL.

1203 Input data had fewer columns than were deﬁned in the DDL.

1204 Input data exceeded the acceptable range for the data type.

1205 Date format is invalid. See DATEFORMAT and TIMEFORMAT Strings (p. 432) for valid

formats.

1206 Timestamp format is invalid. See DATEFORMAT and TIMEFORMAT Strings (p. 432)

for valid formats.

1207 Data contained a value outside of the expected range of 0-9.

1208 FLOAT data type format error.

1209 DECIMAL data type format error.

1210 BOOLEAN data type format error.

1211 Input line contained no data.

1212 Load ﬁle was not found.

1213 A ﬁeld speciﬁed as NOT NULL contained no data.

1214 Delimiter not found.

1215 CHAR ﬁeld error.

1216 Invalid input line.

1217 Invalid identity column value.

1218 When using NULL AS '\0', a ﬁeld containing a null terminator (NUL, or UTF-8 0000)

contained more than one byte.

1219 UTF-8 hexadecimal contains an invalid digit.

1220 String contains invalid or unsupported UTF-8 codepoints.

1221 Encoding of the ﬁle is not the same as that speciﬁed in the COPY command.

Updating Tables with DML Commands

Amazon Redshift supports standard Data Manipulation Language (DML) commands (INSERT, UPDATE,

and DELETE) that you can use to modify rows in tables. You can also use the TRUNCATE command to do

fast bulk deletes.

Note

We strongly encourage you to use the COPY (p. 390) command to load large amounts of

data. Using individual INSERT statements to populate a table might be prohibitively slow.

Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT

INTO ... SELECT FROM or CREATE TABLE AS to improve performance. For information, see

INSERT (p. 520) or CREATE TABLE AS (p. 483).

If you insert, update, or delete a signiﬁcant number of rows in a table, relative to the number of rows

before the changes, run the ANALYZE and VACUUM commands against the table when you are done.

If a number of small changes accumulate over time in your application, you might want to schedule

the ANALYZE and VACUUM commands to run at regular intervals. For more information, see Analyzing

Tables (p. 223) and Vacuuming Tables (p. 228).

Updating and Inserting New Data

You can eﬃciently add new data to an existing table by using a combination of updates and inserts from

a staging table. While Amazon Redshift does not support a single merge, or upsert, command to update a

table from a single data source, you can perform a merge operation by creating a staging table and then

using one of the methods described in this section to update the target table from the staging table.

Topics

• Merge Method 1: Replacing Existing Rows (p. 216)

• Merge Method 2: Specifying a Column List (p. 217)

• Creating a Temporary Staging Table (p. 217)

• Performing a Merge Operation by Replacing Existing Rows (p. 217)

• Performing a Merge Operation by Specifying a Column List (p. 218)

• Merge Examples (p. 219)

Note

You should run the entire merge operation, except for creating and dropping the temporary

staging table, in a single transaction so that the transaction will roll back if any step fails. Using

a single transaction also reduces the number of commits, which saves time and resources.

Merge Method 1: Replacing Existing Rows

If you are overwriting all of the columns in the target table, the fastest method for performing a merge

is by replacing the existing rows because it scans the target table only once, by using an inner join to

delete rows that will be updated. After the rows are deleted, they are replaced along with new rows by a

single insert operation from the staging table.

Use this method if all of the following are true:

• Your target table and your staging table contain the same columns.

• You intend to replace all of the data in the target table columns with all of the staging table columns.

• You will use all of the rows in the staging table in the merge.

If any of these criteria do not apply, use Merge Method 2: Specifying a Column List, described in the following section.

If you will not use all of the rows in the staging table, you can ﬁlter the DELETE and INSERT statements

by using a WHERE clause to leave out rows that are not actually changing. However, if most of the rows

in the staging table will not participate in the merge, we recommend performing an UPDATE and an

INSERT in separate steps, as described later in this section.
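The delete-then-insert pattern of Merge Method 1 can be tried locally with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: SQLite stands in for Amazon Redshift, the target and stage tables and their rows are invented for illustration, and the SQL is SQLite syntax rather than the Amazon Redshift statements given in the merge procedures later in this guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER);
    CREATE TABLE stage  (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER);
    INSERT INTO target VALUES (1, 'apple', 10), (2, 'pear', 20);
    INSERT INTO stage  VALUES (2, 'pear', 25), (3, 'plum', 30);
""")

# Run the delete and the insert in one transaction so a failure rolls back both.
with conn:
    # Delete the target rows that have a replacement in the staging table.
    conn.execute("DELETE FROM target WHERE id IN (SELECT id FROM stage)")
    # Insert every staging row: replacements and new rows in a single statement.
    conn.execute("INSERT INTO target SELECT * FROM stage")

merged = conn.execute("SELECT id, name, qty FROM target ORDER BY id").fetchall()
```

After the transaction, row 1 is untouched, row 2 carries the staged values, and row 3 is new.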

Merge Method 2: Specifying a Column List

Use this method to update speciﬁc columns in the target table instead of overwriting entire rows.

This method takes longer than the previous method because it requires an extra update step. Use this

method if any of the following are true:

• Not all of the columns in the target table are to be updated.

• Most rows in the staging table will not be used in the updates.

Creating a Temporary Staging Table

The staging table is a temporary table that holds all of the data that will be used to make changes to the

target table, including both updates and inserts.

A merge operation requires a join between the staging table and the target table. To collocate the

joining rows, set the staging table's distribution key to the same column as the target table's distribution

key. For example, if the target table uses a foreign key column as its distribution key, use the same

column for the staging table's distribution key. If you create the staging table by using a CREATE TABLE

LIKE (p. 475) statement, the staging table will inherit the distribution key from the parent table. If

you use a CREATE TABLE AS statement, the new table does not inherit the distribution key. For more

information, see Choosing a Data Distribution Style (p. 129).
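
The two ways of creating the staging table might look like the following sketch. It assumes the SALES and SALES_UPDATE tables used in the merge examples in this section, and it assumes that LISTID is the distribution key of SALES. The table name stagesales_ctas is hypothetical.

```sql
-- Option 1: LIKE copies the distribution key from the parent table.
create temp table stagesales (like sales);

-- Option 2: CTAS with the distribution key declared explicitly.
-- listid is assumed to be the distribution key of SALES.
create temp table stagesales_ctas
distkey (listid)
as
select * from sales_update
where saletime > '2008-11-30';
```

With the first option you populate the staging table in a separate INSERT step. With the second, the table is created and populated in a single statement.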

If the distribution key is not the same as the primary key and the distribution key is not updated as part

of the merge operation, add a redundant join predicate on the distribution key columns to enable a

collocated join. For example:

where target.primarykey = stage.primarykey

and target.distkey = stage.distkey

To verify that the query will use a collocated join, run the query with EXPLAIN (p. 511) and check for

DS_DIST_NONE on all of the joins. For more information, see Evaluating the Query Plan (p. 133).
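
As a sketch of that check, the following statement asks for the plan of a merge delete without running it. It assumes the SALES target table and a STAGESALES staging table like the ones in the merge examples in this section, with LISTID as the distribution key.

```sql
-- Show the plan for the merge join; the DELETE itself is not executed.
explain
delete from sales
using stagesales
where sales.salesid = stagesales.salesid
and sales.listid = stagesales.listid;
```

In the plan output, look at the label on the join step. A label such as DS_DIST_INNER or DS_BCAST_INNER instead of DS_DIST_NONE indicates that rows are being redistributed or broadcast for the join.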

Performing a Merge Operation by Replacing Existing Rows

To perform a merge operation by replacing existing rows

1. Create a staging table, and then populate it with data to be merged, as shown in the following

pseudocode.

create temp table stage (like target);

insert into stage

select * from source

where source.filter = 'filter_expression';

2. Use an inner join with the staging table to delete the rows from the target table that are being

updated.

Put the delete and insert operations in a single transaction block so that if there is a problem,

everything will be rolled back.

begin transaction;

delete from target

using stage

where target.primarykey = stage.primarykey;

3. Insert all of the rows from the staging table.

insert into target

select * from stage;

end transaction;

4. Drop the staging table.

drop table stage;

Performing a Merge Operation by Specifying a Column List

To perform a merge operation by specifying a column list

1. Put the entire operation in a single transaction block so that if there is a problem, everything will be

rolled back.

begin transaction;

…

end transaction;

2. Create a staging table, and then populate it with data to be merged, as shown in the following

pseudocode.

create temp table stage (like target);

insert into stage

select * from source

where source.filter = 'filter_expression';

3. Update the target table by using an inner join with the staging table.

• In the UPDATE clause, explicitly list the columns to be updated.

• Perform an inner join with the staging table.

• If the distribution key is different from the primary key and the distribution key is not being

updated, add a redundant join on the distribution key. To verify that the query will use a

collocated join, run the query with EXPLAIN (p. 511) and check for DS_DIST_NONE on all of the

joins. For more information, see Evaluating the Query Plan (p. 133).

• If your target table is sorted by time stamp, add a predicate to take advantage of range-restricted

scans on the target table. For more information, see Amazon Redshift Best Practices for Designing

Queries (p. 32).

• If you will not use all of the rows in the merge, add a clause to filter the rows that need to be

changed. For example, add an inequality filter on one or more columns to exclude rows that have

not changed.

• Put the update, delete, and insert operations in a single transaction block so that if there is a

problem, everything will be rolled back.

For example:

begin transaction;

update target

set col1 = stage.col1,

col2 = stage.col2,

col3 = 'expression'

from stage

where target.primarykey = stage.primarykey

and target.distkey = stage.distkey

and target.col3 > 'last_update_time'

and (target.col1 != stage.col1

or target.col2 != stage.col2

or target.col3 = 'filter_expression');

4. Delete unneeded rows from the staging table by using an inner join with the target table. Some

rows in the target table already match the corresponding rows in the staging table, and others were

updated in the previous step. In either case, they are not needed for the insert.

delete from stage

using target

where stage.primarykey = target.primarykey;

5. Insert the remaining rows from the staging table. In the SELECT list, use the same columns

that you used in the UPDATE statement in step 3.

insert into target

(select col1, col2, 'expression'

from stage);

end transaction;

6. Drop the staging table.

drop table stage;

Merge Examples

The following examples perform a merge to update the SALES table. The first example uses the simpler

method of deleting from the target table and then inserting all of the rows from the staging table. The

second example updates only selected columns in the target table, so it includes an extra update

step.

Sample merge data source

The examples in this section need a sample data source that includes both updates and inserts. For the

examples, we will create a sample table named SALES_UPDATE that uses data from the SALES table.

We’ll populate the new table with random data that represents new sales activity for December. We will

use the SALES_UPDATE sample table to create the staging table in the examples that follow.

-- Create a sample table as a copy of the SALES table

create table sales_update as

select * from sales;

-- Change every fifth row so we have updates

update sales_update

set qtysold = qtysold*2,

pricepaid = pricepaid*0.8,

commission = commission*1.1

where saletime > '2008-11-30'

and mod(sellerid, 5) = 0;

-- Add some new rows so we have insert examples

-- This example creates a duplicate of every fourth row

insert into sales_update

select (salesid + 172456) as salesid, listid, sellerid, buyerid, eventid, dateid, qtysold,

pricepaid, commission, getdate() as saletime

from sales_update

where saletime > '2008-11-30'

and mod(sellerid, 4) = 0;
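
Before running the merge examples, you might want to confirm that SALES_UPDATE contains both kinds of rows. The following queries are a hedged sketch of such a check. The column aliases are illustrative, and the counts depend on your copy of the TICKIT data.

```sql
-- Rows that exist in both tables but carry changed values (merge updates).
select count(*) as update_rows
from sales_update u
join sales s on s.salesid = u.salesid and s.listid = u.listid
where u.qtysold != s.qtysold
or u.pricepaid != s.pricepaid;

-- Rows whose salesid does not exist in SALES yet (merge inserts).
select count(*) as insert_rows
from sales_update u
left join sales s on s.salesid = u.salesid
where s.salesid is null;
```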

Example of a merge that replaces existing rows

The following script uses the SALES_UPDATE table to perform a merge operation on the SALES table

with new data for December sales activity. This example deletes rows in the SALES table that have

updates so they can be replaced with the updated rows in the staging table. The staging table should

contain only rows that will participate in the merge, so the CREATE TABLE statement includes a filter to

exclude rows that have not changed.

-- Create a staging table and populate it with updated rows from SALES_UPDATE

create temp table stagesales as

select * from sales_update

where sales_update.saletime > '2008-11-30'

and sales_update.salesid = (select sales.salesid from sales

where sales.salesid = sales_update.salesid

and sales.listid = sales_update.listid

and (sales_update.qtysold != sales.qtysold

or sales_update.pricepaid != sales.pricepaid));

-- Start a new transaction

begin transaction;

-- Delete any rows from SALES that exist in STAGESALES, because they are updates

-- The join includes a redundant predicate to collocate on the distribution key

-- A filter on saletime enables a range-restricted scan on SALES

delete from sales

using stagesales

where sales.salesid = stagesales.salesid

and sales.listid = stagesales.listid

and sales.saletime > '2008-11-30';

-- Insert all the rows from the staging table into the target table

insert into sales

select * from stagesales;

-- End transaction and commit

end transaction;

-- Drop the staging table

drop table stagesales;

Example of a merge that specifies a column list

The following example performs a merge operation to update SALES with new data for December

sales activity. We need sample data that includes both updates and inserts, along with rows that have

not changed. For this example, we want to update the QTYSOLD and PRICEPAID columns but leave

COMMISSION and SALETIME unchanged. The following script uses the SALES_UPDATE table to perform

a merge operation on the SALES table.

-- Create a staging table and populate it with rows from SALES_UPDATE for Dec

create temp table stagesales as select * from sales_update

where saletime > '2008-11-30';

-- Start a new transaction

begin transaction;

-- Update the target table using an inner join with the staging table

-- The join includes a redundant predicate to collocate on the distribution key

-- A filter on saletime enables a range-restricted scan on SALES

update sales

set qtysold = stagesales.qtysold,

pricepaid = stagesales.pricepaid

from stagesales

where sales.salesid = stagesales.salesid

and sales.listid = stagesales.listid

and sales.saletime > '2008-11-30'

and (sales.qtysold != stagesales.qtysold

or sales.pricepaid != stagesales.pricepaid);

-- Delete matching rows from the staging table

-- using an inner join with the target table

delete from stagesales

using sales

where sales.salesid = stagesales.salesid

and sales.listid = stagesales.listid;

-- Insert the remaining rows from the staging table into the target table

insert into sales

select * from stagesales;

-- End transaction and commit

end transaction;

-- Drop the staging table

drop table stagesales;

Performing a Deep Copy

A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table.

If a table has a large unsorted region, a deep copy is much faster than a vacuum. The trade-off is that

you should not make concurrent updates during a deep copy operation unless you can track them and move

the delta updates into the new table after the process has completed. A VACUUM operation supports

concurrent updates automatically.
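
To judge whether a table has a large unsorted region, one option is to query the SVV_TABLE_INFO system view, as in the following sketch. The view is not otherwise discussed in this section, and what counts as large depends on your workload.

```sql
-- unsorted is the percentage of rows in the unsorted region of the table.
select "table", tbl_rows, unsorted
from svv_table_info
where "table" = 'sales';
```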

You can choose one of the following methods to create a copy of the original table:

• Use the original table DDL.

If the CREATE TABLE DDL is available, this is the fastest and preferred method. If you create a new

table, you can specify all table and column attributes, including primary key and foreign keys.

Note

If the original DDL is not available, you might be able to recreate the DDL by running a script

called v_generate_tbl_ddl. You can download the script from amazon-redshift-utils,

which is part of the Amazon Web Services - Labs GitHub repository.

• Use CREATE TABLE LIKE.

If the original DDL is not available, you can use CREATE TABLE LIKE to recreate the original table. The

new table inherits the encoding, distkey, sortkey, and notnull attributes of the parent table. The new

table doesn't inherit the primary key and foreign key attributes of the parent table, but you can add

them using ALTER TABLE (p. 365).

• Create a temporary table and truncate the original table.

If you need to retain the primary key and foreign key attributes of the parent table, or if the parent

table has dependencies, you can use CREATE TABLE ... AS (CTAS) to create a temporary table, then

truncate the original table and populate it from the temporary table.

Using a temporary table improves performance significantly compared to using a permanent table,

but there is a risk of losing data. A temporary table is automatically dropped at the end of the session

in which it is created. TRUNCATE commits immediately, even if it is inside a transaction block. If the

TRUNCATE succeeds but the session terminates before the subsequent INSERT completes, the data is

lost. If data loss is unacceptable, use a permanent table.
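
If data loss is unacceptable, the permanent-table variant of the third method might look like the following sketch. The table name salesbackup is hypothetical. Because the intermediate table is permanent, it survives a session failure between the TRUNCATE and the INSERT.

```sql
-- salesbackup is a hypothetical name for the permanent intermediate table.
create table salesbackup as select * from sales;

truncate sales;

insert into sales (select * from salesbackup);

-- Drop the intermediate table only after confirming that SALES was repopulated.
drop table salesbackup;
```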

To perform a deep copy using the original table DDL

1. (Optional) Recreate the table DDL by running a script called v_generate_tbl_ddl.

2. Create a copy of the table using the original CREATE TABLE DDL.

3. Use an INSERT INTO … SELECT statement to populate the copy with data from the original table.

4. Drop the original table.

5. Use an ALTER TABLE statement to rename the copy to the original table name.

The following example performs a deep copy on the SALES table using a duplicate of SALES named

SALESCOPY.

create table salescopy ( … );

insert into salescopy (select * from sales);

drop table sales;

alter table salescopy rename to sales;

To perform a deep copy using CREATE TABLE LIKE

1. Create a new table using CREATE TABLE LIKE.

2. Use an INSERT INTO … SELECT statement to copy the rows from the current table to the new table.

3. Drop the current table.

4. Use an ALTER TABLE statement to rename the new table to the original table name.

The following example performs a deep copy on the SALES table using CREATE TABLE LIKE.

create table likesales (like sales);

insert into likesales (select * from sales);

drop table sales;

alter table likesales rename to sales;
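
Because CREATE TABLE LIKE does not carry over primary key and foreign key attributes, you can add them back after the rename. The following is a sketch that assumes SALESID is the primary key of SALES and that LISTID references the LISTING table, as in the TICKIT sample schema. Adjust the columns to match your original DDL.

```sql
-- The key columns below are assumptions based on the TICKIT sample schema.
alter table sales add primary key (salesid);

alter table sales add foreign key (listid) references listing (listid);
```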

To perform a deep copy by creating a temporary table and truncating the original table

1. Use CREATE TABLE AS to create a temporary table with the rows from the original table.

2. Truncate the current table.

3. Use an INSERT INTO … SELECT statement to copy the rows from the temporary table to the original

table.

4. Drop the temporary table.

The following example performs a deep copy on the SALES table by creating a temporary table and

truncating the original table:

create temp table salestemp as select * from sales;

truncate sales;

insert into sales (select * from salestemp);

drop table salestemp;

Analyzing Tables

The ANALYZE operation updates the statistical metadata that the query planner uses to choose optimal

plans.

In most cases, you don't need to explicitly run the ANALYZE command. Amazon Redshift monitors

changes to your workload and automatically updates statistics in the background. In addition, the COPY

command performs an analysis automatically when it loads data into an empty table.

To explicitly analyze a table or the entire database, run the ANALYZE (p. 380) command.

Topics

• Automatic Analyze (p. 223)

• Analysis of New Table Data (p. 224)

• ANALYZE Command History (p. 227)

Automatic Analyze

Amazon Redshift continuously monitors your database and automatically performs analyze operations in

the background. To minimize impact to your system performance, automatic analyze runs during periods

when workloads are light.

Automatic analyze is enabled by default. To disable automatic analyze, set the auto_analyze

parameter to false by modifying your cluster's parameter group.

To reduce processing time and improve overall system performance, Amazon Redshift skips automatic

analyze for any table where the extent of modifications is small.

An analyze operation skips tables that have up-to-date statistics. If you run ANALYZE as part of your

extract, transform, and load (ETL) workflow, automatic analyze skips tables that have current statistics.

Similarly, an explicit ANALYZE skips tables when automatic analyze has updated the table's statistics.

Analysis of New Table Data

By default, the COPY command performs an ANALYZE after it loads data into an empty table. You can

force an ANALYZE regardless of whether a table is empty by setting STATUPDATE ON. If you specify

STATUPDATE OFF, an ANALYZE is not performed. Only the table owner or a superuser can run the

ANALYZE command or run the COPY command with STATUPDATE set to ON.
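
For example, a COPY command that forces a statistics update on a table that already contains data might look like the following sketch. The Amazon S3 path and the IAM role ARN are hypothetical placeholders. Replace them with your own values.

```sql
-- The bucket path and IAM role ARN are hypothetical placeholders.
copy listing
from 's3://mybucket/tickit/listings_pipe.txt'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|'
statupdate on;
```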

Amazon Redshift also analyzes new tables that you create with the following commands:

• CREATE TABLE AS (CTAS)

• CREATE TEMP TABLE AS

• SELECT INTO

Amazon Redshift returns a warning message when you run a query against a new table that was not

analyzed after its data was initially loaded. No warning occurs when you query a table after a subsequent

update or load. The same warning message is returned when you run the EXPLAIN command on a query

that references tables that have not been analyzed.

Whenever adding data to a nonempty table significantly changes the size of the table, you can explicitly

update statistics. You do so either by running an ANALYZE command or by using the STATUPDATE ON

option with the COPY command. To view details about the number of rows that have been inserted or

deleted since the last ANALYZE, query the PG_STATISTIC_INDICATOR (p. 939) system catalog table.
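
A query along the following lines shows those counts for a single table. It is a sketch that assumes the catalog columns STAIRELID, STAIROWS, STAIINS, and STAIDELS described in the PG_STATISTIC_INDICATOR reference. The column aliases are illustrative.

```sql
-- Join to PG_CLASS to resolve the table ID to a table name.
select c.relname as table_name,
i.stairows as total_rows,
i.staiins as rows_inserted,
i.staidels as rows_deleted
from pg_statistic_indicator i
join pg_class c on c.oid = i.stairelid
where c.relname = 'listing';
```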

You can limit the scope of the ANALYZE (p. 380) command to one of the following:

• The entire current database

• A single table

• One or more specific columns in a single table

• Columns that are likely to be used as predicates in queries

The ANALYZE command gets a sample of rows from the table, does some calculations, and saves

resulting column statistics. By default, Amazon Redshift runs a sample pass for the DISTKEY column

and another sample pass for all of the other columns in the table. If you want to generate statistics for

a subset of columns, you can specify a comma-separated column list. You can run ANALYZE with the

PREDICATE COLUMNS clause to skip columns that aren’t used as predicates.
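
The four scopes listed above correspond to the following forms of the command, shown here against the LISTING table from the TICKIT database. The column choices are illustrative only.

```sql
-- The entire current database
analyze;

-- A single table
analyze listing;

-- One or more specific columns in a single table
analyze listing (listid, totalprice);

-- Only the columns that are likely to be used as predicates
analyze listing predicate columns;
```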

ANALYZE operations are resource intensive, so run them only on tables and columns that actually require

statistics updates. You don't need to analyze all columns in all tables regularly or on the same schedule.

If the data changes substantially, analyze the columns that are frequently used in the following:

• Sorting and grouping operations

• Joins

• Query predicates

To reduce processing time and improve overall system performance, Amazon Redshift skips

ANALYZE for any table that has a low percentage of changed rows, as determined by the

analyze_threshold_percent (p. 948) parameter. By default, the analyze threshold is set to 10 percent.

You can change the analyze threshold for the current session by running a SET (p. 560) command.
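
As a sketch, the following statements change the threshold for the current session and then run ANALYZE. The value 20 is an arbitrary illustration, not a recommendation.

```sql
-- Skip ANALYZE for tables where fewer than 20 percent of rows have changed.
set analyze_threshold_percent to 20;
analyze listing;

-- Analyze the table even if no rows have changed.
set analyze_threshold_percent to 0;
analyze listing;
```

The setting applies only to the session in which you run SET.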

Columns that are less likely to require frequent analysis are those that represent facts and measures and

any related attributes that are never actually queried, such as large VARCHAR columns. For example,

consider the LISTING table in the TICKIT database.

select "column", type, encoding, distkey, sortkey

from pg_table_def where tablename = 'listing';

column         | type               | encoding | distkey | sortkey
---------------+--------------------+----------+---------+---------
listid         | integer            | none     | t       | 1
sellerid       | integer            | none     | f       | 0
eventid        | integer            | mostly16 | f       | 0
dateid         | smallint           | none     | f       | 0
numtickets     | smallint           | mostly8  | f       | 0
priceperticket | numeric(8,2)       | bytedict | f       | 0
totalprice     | numeric(8,2)       | mostly32 | f       | 0
listtime       | timestamp with...  | none     | f       | 0

If this table is loaded every day with a large number of new records, the LISTID column, which is

frequently used in queries as a join key, needs to be analyzed regularly. If TOTALPRICE and LISTTIME are

the frequently used constraints in queries, you can analyze those columns and the distribution key on

every weekday.

analyze listing(listid, totalprice, listtime);

Suppose that the sellers and events in the application are much more static, and the date IDs refer to a

fixed set of days covering only two or three years. In this case, the unique values for these columns don't

change significantly. However, the number of instances of each unique value will increase steadily.

In addition, consider the case where the NUMTICKETS and PRICEPERTICKET measures are queried

infrequently compared to the TOTALPRICE column. In this case, you can run the ANALYZE command on

the whole table once every weekend to update statistics for the five columns that are not analyzed daily.
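
The two schedules described above can be sketched as the following pair of commands. How you schedule them, for example from an external job scheduler, is outside the scope of this sketch.

```sql
-- Weekday run: the join key and the frequently filtered columns only.
analyze listing (listid, totalprice, listtime);

-- Weekend run: the whole table, which also covers the remaining five columns.
analyze listing;
```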

Predicate Columns

As a convenient alternative to specifying a column list, you can choose to analyze only the columns

that are likely to be used as predicates. When you run a query, any columns that are used in a join, filter

condition, or group by clause are marked as predicate columns in the system catalog. When you run

ANALYZE with the PREDICATE COLUMNS clause, the analyze operation includes only columns that meet

the following criteria:

• The column is marked as a predicate column.

• The column is a distribution key.

• The column is part of a sort key.

If none of a table's columns are marked as predicates, ANALYZE includes all of the columns, even when

PREDICATE COLUMNS is specified. If no columns are marked as predicate columns, it might be because

the table has not yet been queried.

You might choose to use PREDICATE COLUMNS when your workload's query pattern is relatively stable.

When the query pattern is variable, with different columns frequently being used as predicates, using

PREDICATE COLUMNS might temporarily result in stale statistics. Stale statistics can lead to suboptimal

query execution plans and long execution times. However, the next time you run ANALYZE using

PREDICATE COLUMNS, the new predicate columns are included.

To view details for predicate columns, use the following SQL to create a view named

PREDICATE_COLUMNS.

CREATE VIEW predicate_columns AS

WITH predicate_column_info as (

SELECT ns.nspname AS schema_name, c.relname AS table_name, a.attnum as col_num, a.attname

as col_name,

CASE

WHEN 10002 = s.stakind1 THEN array_to_string(stavalues1, '||')

WHEN 10002 = s.stakind2 THEN array_to_string(stavalues2, '||')

WHEN 10002 = s.stakind3 THEN array_to_string(stavalues3, '||')

WHEN 10002 = s.stakind4 THEN array_to_string(stavalues4, '||')

ELSE NULL::varchar

END AS pred_ts

FROM pg_statistic s

JOIN pg_class c ON c.oid = s.starelid

JOIN pg_namespace ns ON c.relnamespace = ns.oid

JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum)

SELECT schema_name, table_name, col_num, col_name,

pred_ts NOT LIKE '2000-01-01%' AS is_predicate,

CASE WHEN pred_ts NOT LIKE '2000-01-01%' THEN (split_part(pred_ts,

'||',1))::timestamp ELSE NULL::timestamp END as first_predicate_use,

CASE WHEN pred_ts NOT LIKE '%||2000-01-01%' THEN (split_part(pred_ts,

'||',2))::timestamp ELSE NULL::timestamp END as last_analyze

FROM predicate_column_info;

Suppose you run the following query against the LISTING table. Note that LISTID, LISTTIME, and

EVENTID are used in the join, filter, and group by clauses.

select s.buyerid,l.eventid, sum(l.totalprice)

from listing l

join sales s on l.listid = s.listid

where l.listtime > '2008-12-01'

group by l.eventid, s.buyerid;

When you query the PREDICATE_COLUMNS view, as shown in the following example, you see that

LISTID, EVENTID, and LISTTIME are marked as predicate columns.

select * from predicate_columns

where table_name = 'listing';

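schema_name | table_name | col_num | col_name       | is_predicate | first_predicate_use | last_analyze
------------+------------+---------+----------------+--------------+---------------------+--------------------
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41
...         | ...        | ...     | ...            | ...          | ...                 | 2017-05-03 18:27:41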

Keeping statistics current improves query performance by enabling the query planner to choose optimal

plans. Amazon Redshift refreshes statistics automatically in the background, and you can also explicitly

run the ANALYZE command. If you choose to explicitly run ANALYZE, do the following:

• Run the ANALYZE command before running queries.

• Run the ANALYZE command on the database routinely at the end of every regular load or update

cycle.

• Run the ANALYZE command on any new tables that you create and any existing tables or columns that

undergo significant change.

• Consider running ANALYZE operations on different schedules for different types of tables and

columns, depending on their use in queries and their propensity to change.

• To save time and cluster resources, use the PREDICATE COLUMNS clause when you run ANALYZE.

An analyze operation skips tables that have up-to-date statistics. If you run ANALYZE as part of your

extract, transform, and load (ETL) workflow, automatic analyze skips tables that have current statistics.

Similarly, an explicit ANALYZE skips tables when automatic analyze has updated the table's statistics.

ANALYZE Command History

It's useful to know when the last ANALYZE command was run on a table or database. When an ANALYZE

command is run, Amazon Redshift executes multiple queries that look like this:

padb_fetch_sample: select * from table_name

Query STL_ANALYZE to view the history of analyze operations. If Amazon Redshift analyzes a table using

automatic analyze, the is_background column is set to t (true). Otherwise, it is set to f (false). The

following example joins STV_TBL_PERM to show the table name and execution details.

select distinct a.xid, trim(t.name) as name, a.status, a.rows, a.modified_rows,

a.starttime, a.endtime

from stl_analyze a

join stv_tbl_perm t on t.id=a.table_id

where name = 'users'

order by starttime;

xid    | name  | status          | rows  | modified_rows | starttime           | endtime
-------+-------+-----------------+-------+---------------+---------------------+--------------------

1582 | users | Full | 49990 | 49990 | 2016-09-22 22:02:23 | 2016-09-22