TF V1400.4 Student Manual Teradata Factory
Teradata Factory
Course # 9038, Version 14.00.4
Student Guide

Module 0 – Course Overview

Teradata Factory
  • Teradata Concepts
  • MPP System Architectures
  • Physical Design and Implementation
  • Application Utilities
  • Database Administration

Teradata Proprietary and Confidential

Tenth Edition, April 2012

Trademarks

The following names are registered names or trademarks and are used throughout this manual. The product or products described in this book are licensed products of Teradata Corporation or its affiliates.

Teradata, BYNET, DBC/1012, DecisionCast, DecisionFlow, DecisionPoint, Eye logo design, InfoWise, Meta Warehouse, MyCommerce, SeeChain, SeeCommerce, SeeRisk, Teradata Decision Experts, Teradata Source Experts, WebAnalyst, You've Never Seen Your Business Like This Before, and Raising Intelligence are trademarks or registered trademarks of Teradata Corporation or its affiliates.
Adaptec and SCSISelect are trademarks or registered trademarks of Adaptec, Inc.
AMD Opteron and Opteron are trademarks of Advanced Micro Devices, Inc.
BakBone and NetVault are trademarks or registered trademarks of BakBone Software, Inc.
EMC2, PowerPath, SRDF, and Symmetrix are registered trademarks of EMC2 Corporation.
GoldenGate is a trademark of GoldenGate Software, a division of Oracle Corporation.
Hewlett-Packard and HP are registered trademarks of Hewlett-Packard Company.
Intel, Pentium, and XEON are registered trademarks of Intel Corporation.
IBM, CICS, RACF, Tivoli, z/OS, and z/VM are registered trademarks of International Business Machines Corporation.
Linux is a registered trademark of Linus Torvalds.
Engenio is a registered trademark of NetApp Corporation.
Microsoft, Active Directory, Windows, Windows NT, and Windows Server are registered trademarks of Microsoft Corporation in the United States and other countries.
Novell and SUSE are registered trademarks of Novell, Inc., in the United States and other countries.
QLogic and SANbox are trademarks or registered trademarks of QLogic Corporation.
SAS and SAS/C are trademarks or registered trademarks of SAS Institute Inc.
SPARC is a registered trademark of SPARC International, Inc.
Symantec, NetBackup, and VERITAS are trademarks or registered trademarks of Symantec Corporation or its affiliates in the United States and other countries.
Unicode is a collective membership mark and a service mark of Unicode, Inc.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Other product and company names mentioned herein may be the trademarks of their respective owners.

The materials included in this book are a licensed product of Teradata Corporation.
Copyright Teradata Corporation © 2010-2012, Miamisburg, Ohio, U.S.A. All Rights Reserved.
Material developed by: Teradata Learning

Table of Contents

Trademarks
Course Materials
Options for Displaying PDF Files
Example of Left Page – Right Page Display
View and search a PDF
PDF Comment and Markup Tools
Example of Highlighter and Sticky Note Tools
Example of Typewriter Tool
Course Description
Who Should Attend
Prerequisites
Class Format
Classroom Rules
Outline of the Two Weeks
Teradata Certification Tests

Course Materials

The Teradata Factory course materials that are provided on a USB flash drive are listed on the facing page. These materials are provided at the beginning of the class.

The Teradata Factory Student Manual and the Lab Workbook have been created as PDF files which can be viewed using Adobe® Reader®. These PDF files were created using Adobe Acrobat® and commenting has been enabled for both files. This allows you to use Adobe® Reader® Comment and Markup tools to place your own notes and comments within the files.

Teradata Factory course materials include:
  • Paper copy of TF Lab Workbook
  • Electronic copy (PDF files) of Student Manual and Lab Workbook

Contents of the flash drive include:
  • Teradata Factory Class Files
    – Class Files (these PDF files allow use of Comment and Markup tools)
      • TF v1400.4 Lab Workbook.pdf
      • TF v1400.4 Student Manual.pdf
    – Miscellaneous Software
      • Acrobat Reader
      • Microsoft .NET Packages
      • Putty – use for secure shell Linux connections
      • Secure FTP – use for secure FTP to Linux servers
    – TD 14.0 Reference Manuals
    – TD 14.0 TTU – Subset of tools and utilities (numbered in order of installation)
      • 01_piom__windows_i386.14.00.00.06.zip
      • 02_TeraGSS__windows_i386.14.00.00.01.zip
      • :
    – TD Demo Lab Setup (numbered in order of installation)

Options for Displaying PDF Files

Adobe® Reader® is a tool that you can use to open and view the Teradata Factory PDF course files. You can also use Adobe Reader to make comments or notes and save your PDF file.

Since the Teradata Factory course materials have been created in a book format (left page – right page), you may want to set options in Adobe Reader to view the materials in a book format. The left page contains additional information about the right or slide page. The right page is a copy of the PPT slide that is used during the presentation.

To view the Teradata Factory Student Manual in a book format using Adobe Reader 9.2 or earlier, use the View menu > Page Display and set the following options:
  • Two-Up Continuous
  • Show Gaps Between Pages (normally checked or set by default)
  • Show Cover Page During Two-Up

Options for Displaying PDF Files

The Teradata Factory course materials are created in a left page – right page format.
  • Left page – contains additional information about the slide page
  • Right page – copy of the PPT slide that is used during the presentation

To display PDF files in a book-type (left page – right page) format, Adobe Reader options need to be set. In Adobe Reader 9.2 and earlier versions, the options are named:
  • Two-Up Continuous
  • Show Gaps Between Pages
  • Show Cover Page During Two-Up

Example of Left Page – Right Page Display

The facing page illustrates an example of displaying the Teradata Factory Student Manual in a left page – right page format.

View and search a PDF

In the Adobe Reader toolbar, use the Zoom tools and the Magnification menu to enlarge or reduce the page. Use the options on the View menu to change the page display. There are various options in the Tools menu to provide you with more ways to adjust the page for better viewing (Tools > Select & Zoom).

This is an example of menus using Adobe Reader 9.2. These Adobe Reader toolbars open by default:
  A. File toolbar
  B. Page Navigation toolbar
  C. Select & Zoom toolbar
  D. Page Display toolbar
  E. Find toolbar

Example of Left Page – Right Page Display

After setting the Page Display options, the PDF file is displayed as shown below. This PDF file has been created to allow the use of comment and markup tools.

PDF Comment and Markup Tools

The Teradata Factory course materials have "commenting" enabled. Therefore, you can make comments in these files using the commenting and markup tools. Of the many commenting and markup tools that are available, you may find it easier to use the following tools (highlighted on the facing page):
  • Add Sticky Note
  • Highlight Text Tool
  • Typewriter

Comments can include both notes and drawings (if you have the time during class). You can enter a text message using the Sticky Note tool. You can use a drawing tool to add a line, circle, or other shape and then type a note in the associated pop-up note.

You can enable the Comment & Markup toolbar, or you can simply select the tools using the pull-down menus. The example below is for Adobe Reader 9.2: enable the Comment & Markup toolbar (View > Toolbars > Comment & Markup) and select the tool to use to add notes or comments.

Options on the Comment & Markup toolbar:
  A. Sticky Note tool
  B. Text Edits tool
  C. Stamp tool and menu
  D. Highlight Text tool
  E. Callout tool
  F. Text Box tool
  G. Cloud tool
  H. Arrow tool
  I. Line tool
  J. Rectangle tool
  K. Oval tool
  L. Pencil tool
  M. Show menu

After you add a note or comment, it stays selected until you click elsewhere on the page. A selected comment is highlighted by a blue halo to help you find the markup on the page.

PDF Comment and Markup Tools

Comment and markup tools that may be useful include:
  • Sticky Note (Comment)
  • Highlight Text Tool (Comment)
  • Add a Text Box (Extended) or Typewriter

In Adobe Reader 9.2 and earlier versions, the options are in the Tools menu:
  • Comment & Markup > Sticky Note
  • Comment & Markup > Highlight Text Tool
  • Typewriter

Example of Highlighter and Sticky Note Tools

The facing page illustrates an example of using the Highlighter and Sticky Note tools.
Select a commenting or markup tool: choose Tools > Comment & Markup > Highlighter or Sticky Note (or another tool).

Note: After you make an initial comment, the tool changes back to the Select tool so that you can move, resize, or edit your comment. (The Pencil, Highlight Text, and Line tools stay selected.)

To keep a commenting tool selected so you can add multiple comments without reselecting the tool, do the following:
  1. Select the tool you want to use (but don't use it yet).
  2. Choose View > Toolbars > Properties Bar.
  3. Select Keep Tool Selected.

You can change the font of the text in a sticky note. Open the sticky note, choose View > Toolbars > Properties Bar, select the text in the note, and then change the font size in the Properties Bar.

Example of Highlighter and Sticky Note Tools

The left page illustrates the Highlighter tool and the right page illustrates the Sticky Note tool.

Example of Typewriter Tool

The facing page illustrates an example of using the Typewriter tool. This example also illustrates that the Typewriter toolbar is enabled. The Typewriter toolbar may be useful when completing review questions, as shown on the facing page. You already have the answer to one of hundreds of questions in this course.

After making notes and comments, save your changes. You may want to save your changes to a different PDF file name in order to preserve the original PDF file.

Example of Typewriter Tool

The Typewriter tool can be used to add text at any location in the PDF file.

To enable the Typewriter toolbar in Adobe Reader 9.2 or earlier:
  • Tools > Typewriter > Show Typewriter Toolbar

Course Description

This course provides information on the following major topics:
  • Teradata Concepts
  • System Architectures (e.g., 2650, 2690, 6650, and 6690 Systems)
  • Teradata Physical Database Design
  • Teradata SQL ANSI Differences for Version 2
  • Teradata Application Utilities
  • Teradata Database Administration

Course Description

Description: The primary focus of this ten-day course is to teach you about the design, implementation, and administration of the Teradata Database.

The major topics in this course include:
  • Teradata Database features and functions
  • The parallelism of the Teradata Database
  • How Teradata is implemented on MPP systems (e.g., 6690 systems)
  • How to perform physical database design for the Teradata Database
  • Teradata SQL ANSI differences
  • How to load and export data using the Teradata application utilities
  • How to perform common administrative functions for the Teradata Database

Who Should Attend

This class is a learning event for relational database experienced individuals who need to learn the Teradata Database. This course is designed for Teradata practitioners who need to get hands-on practice with the Teradata Database in a learning environment:
  • Professional Services Consultants
  • Channel Partners

Prerequisites

An understanding of relational databases, SQL, and the logical data model is necessary before attending this course. Experience with large systems, relational databases and SQL, and an understanding of the UNIX operating system is useful, but not required, before attending this course.

There are Web Based Training classes that provide information about Teradata concepts and SQL:
  • Overview of Teradata
  • Teradata SQL

Who Should Attend and Prerequisites

Who Should Attend – This course is designed for:
  • Teradata Professional Services Consultants
  • Channel Partners

Prerequisites

Required:
  • An understanding of the logical data model, relational, SQL, and data processing concepts.

Useful, but not required:
  • Experience with relational databases and SQL
  • Experience with large systems used with Teradata

Class Format

This ten-day class will be conducted as a series of lectures with classroom discussions, review questions, and workshops.

Classroom Rules

The classroom rules are listed on the facing page.

Class Format and Rules

Class Format – This ten-day class consists of:
  • Instructor presentations
  • Class discussions
  • Workshop exercises

Classroom Rules – The classroom rules are:
  • Turn off your cellular phones.
  • During lecture, only use your laptop to follow the class materials.
  • Come to class on time in the morning and after breaks.
  • Enjoy the two weeks.

Outline of the Two Weeks

An outline of the two weeks is described on the following page. Major topic examples are listed for each week.

Outline of the Two Weeks

1. (Week 1)
  • Teradata Concepts
    – Teradata features and functions
    – Parallelism and Teradata
  • MPP System Architectures
    – Characteristics of MPP (e.g., 6690) systems – typical configurations
    – Disk array subsystems and how Teradata utilizes disk arrays
  • Teradata Physical Database Design (continued in week #2)
    – Primary and secondary index selection; partitioned, NoPI, and columnar tables
    – How the Teradata database works
    – Collecting Statistics and Explains
    – SQL ANSI syntax and features; Teradata and ANSI transaction modes
    – Temporary tables, System Calendar, and Teradata System Limits

2. (Week 2)
  • Teradata Application Utilities
    – Load utilities (e.g., BTEQ, FastLoad, MultiLoad, and TPump)
    – Export utilities (e.g., BTEQ and FastExport)
  • Teradata Database Administration
    – Dictionary tables and views; system hierarchy and space management
    – Users, Databases, Access Rights, Roles, and Profiles
    – Administrator and System Utilities – Teradata Administrator, Viewpoint, DBSControl
    – How to use the archive facility to do Archive, Restore, and Recovery procedures

Teradata Certification Tests

The facing page lists the various Teradata certification tests. Depending upon the tests that are completed, you can earn various Teradata Certified designations such as Teradata Certified Professional.

The Teradata 12 Certification tests require knowledge plus experience with Teradata. This manual will help you prepare for these Teradata 12 tests, but many of the test questions are scenario-based and Teradata experience is needed to answer these types of questions.

The Teradata V2R5 Certification tests were retired on March 31, 2010.

Teradata Certification Tests

Teradata 12.0 Certification Tests:
  1 – Teradata 12 Basics
  2 – Teradata 12 SQL
  3 – Teradata 12 Physical Design and Implementation
  4 – Teradata 12 Database Administration
  5 – Teradata 12 Solutions Development
  6 – Teradata 12 Enterprise Architecture
  7 – Teradata 12 Comprehensive Mastery

By passing all seven Teradata 12 certification tests, you become a Teradata 12 Certified Master. This course (along with Teradata experience) will prepare you for these tests.
Options for Teradata V2R5 Certified Masters:
  • The Teradata 12 Qualifying Exam is available as an alternative to taking tests 1 – 6.
  • To achieve the Teradata 12 Master certification:
    1. Pass the Teradata 12 Qualifying Exam OR pass each of the 6 tests
    2. Pass the Teradata 12 Comprehensive Mastery exam

Module 1 – Teradata Overview

After completing this module, you will be able to:
  • Describe the purpose of the Teradata product
  • Understand the history of the Teradata Corporation
  • List major architectural features of the product

Teradata Proprietary and Confidential

Table of Contents

What is Teradata?
  How large is a Trillion and a Quadrillion?
Teradata – A Brief History
What is a Data Warehouse?
  Data Marts
    Independent Data Marts
    Logical Data Marts
    Dependent Data Marts
What is Active Data Warehousing?
What is a Relational Database?
  Primary Key
Answering Questions with a Relational Database
  Foreign Key
Teradata Database Competitive Advantages
Module 1: Review Questions

What is Teradata?

Teradata is a Relational Database Management System (RDBMS) for the world's largest commercial databases. Databases of over 100 terabytes of data are possible, which makes Teradata an obvious choice for large data warehousing applications; however, a Teradata system may also be as small as 100 gigabytes. With its parallelism and scalability, Teradata allows you to start small with a single node and grow large with many nodes through linear expandability.

Teradata is comparable to a large database server, with multiple client applications making inquiries against it concurrently.

Teradata 14.0 was released on February 14, 2012.
The acronym SUSE comes from the German name "Software und System Entwicklung", which means "Software and Systems Development".

The ability to manage terabytes of data is accomplished using the concept of parallelism, wherein many individual processors perform smaller tasks concurrently to accomplish an operation against a huge repository of data. To date, only parallel architectures can handle databases of this size.

Acronyms:
  SLES – SUSE Linux Enterprise Server
  SUSE – Software und System Entwicklung (German name meaning Software and Systems Development)

How large is a Trillion and a Quadrillion?

The Teradata Database was the first commercial database system to support a trillion bytes of data. It is hard to imagine the size of a trillion. To put it in perspective, the life span of the average person is 2.5 gigaseconds (2,500,000,000 seconds). A trillion seconds is 31,688 years!

Teradata has customers with multiple petabytes of data. One petabyte is one quadrillion bytes of data; a petabyte is effectively 1000 terabytes.

  1 Kilobyte (KB)  = 1024 bytes
  1 Megabyte (MB)  = 1024^2 bytes  (> 1,000,000 bytes)
  1 Gigabyte (GB)  = 1024^3 bytes  (> 1,000,000,000 bytes)
  1 Terabyte (TB)  = 1024^4 bytes  (> 1,000,000,000,000 bytes)
  1 Petabyte (PB)  = 1024^5 bytes  (> 1,000,000,000,000,000 bytes)
  1 Exabyte (EB)   = 1024^6 bytes  (> 1,000,000,000,000,000,000 bytes)
  1 Zettabyte (ZB) = 1024^7 bytes  (> 1,000,000,000,000,000,000,000 bytes)
  1 Yottabyte (YB) = 1024^8 bytes  (> 1,000,000,000,000,000,000,000,000 bytes)

What is Teradata?

The Teradata Database is a Relational Database Management System designed to run the world's largest commercial databases.
  • Preferred solution for enterprise data warehousing
  • Acts as a "database server" to client applications throughout the enterprise
  • Uses parallelism to manage terabytes or petabytes of data
    – A terabyte is a trillion bytes of data – 10^12.
    – A petabyte is a quadrillion bytes of data – 10^15, effectively 1000 terabytes.
  • Capable of supporting many concurrent users from various client platforms (over TCP/IP or IBM channel connections).
  • The latest Teradata release is 14.0 and executes as a SUSE Linux application.

(Diagram: Windows XP, Windows 7, Linux, and mainframe clients connecting to the Teradata Database.)

Teradata – A Brief History

The Teradata Corporation was founded in 1979 in Los Angeles, California. The corporate goal was the creation of a "database computer" which could handle billions of rows of data, up to and beyond a terabyte of data storage. It took five years of development before a product was shipped to a first customer in 1984.

In 1982, the YNET technology was patented as the enabling technology for the parallelism that was at the heart of the architecture. The YNET was the interconnect which allowed hundreds of individual processors to share the same bandwidth.

In 1987, Teradata went public with its first stock offering. In 1988, Teradata partnered with the NCR Corporation to build the next generation of database computers (e.g., 3700). Before either company could market its next-generation product, NCR was purchased by AT&T Corporation at the end of 1991. AT&T purchased Teradata and folded Teradata into the NCR structure in January of 1992. The new division was named AT&T GIS (Global Information Solutions).

In 1996, AT&T spun off three separate companies, one of which was NCR, which then returned to its old name. Teradata was a division of NCR from 1997 until 2001. In 1997, Teradata (as part of NCR) had become the world leader in scalable data warehouse solutions.
In 2007, NCR and Teradata separated into two corporations.

Teradata – A Brief History

1979 – Teradata Corp founded in Los Angeles, California; development begins on a massively parallel computer
1982 – YNET technology is patented.
1984 – Teradata markets the first database computer, the DBC/1012; first system purchased by Wells Fargo Bank of California
1989 – Teradata and NCR partner on the next generation of DBC.
1992 – NCR Corporation is acquired by AT&T; Teradata is merged into NCR within AT&T and named AT&T GIS (Global Information Solutions).
1996 – AT&T spins off NCR Corporation with Teradata; Teradata Version 2 is released.
1997 – The Teradata Database becomes the industry leader in data warehousing.
2000 – The first 100+ terabyte system is put into production.
2002 – Teradata V2R5 released 12/2002; a major release including features such as PPI, roles and profiles, multi-value compression, and more.
2007 – NCR and Teradata become two separate corporations. Teradata 12.0 is released.
2010 – Teradata 13.10 is released, as well as 2650/4600/5600/5650 systems.
2011 – Teradata releases 6650/6680/6690 systems; more than 20 customers with 1 PB or larger systems.
2012 – Teradata 14.0 is released on February 14, 2012.

What is a Data Warehouse?

A data warehouse is a central, enterprise-wide database that contains information extracted from the operational data stores. Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data.

Many data warehouses get their data directly from operational systems so that the data is timely and accurate. While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise.

Data Warehousing is a process, not a product. It is a technique to properly assemble and manage data from various sources to answer business questions not previously possible or known.

Data Marts

A data mart is a special-purpose subset of enterprise data used by a particular department, function, or application. Data marts may have both summary and detail data; however, usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community.

Independent Data Marts

Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized, or summarized historically, as the requirements of the data mart dictate.

Logical Data Marts

Logical data marts are not separate physical structures but rather are an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology, it removes the need for massive data loading and transforming, making a single data store available for all user needs.
Dependent Data Marts

Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data, but may provide a better vehicle for performance-critical user queries.

What is a Data Warehouse?

A Data Warehouse is a central, enterprise-wide database that contains information extracted from Operational Data Stores (ODS).
  • Based on an enterprise-wide model
  • Can begin small but may grow large rapidly
  • Populated by extraction/loading of data from operational systems
  • Responds to end-user "what if" queries
  • Can store detailed as well as summary data

(Diagram: operational data sources such as ATM, PeopleSoft®, and Point of Service (POS) feed the Teradata Database data warehouse, which end users access through tools such as Teradata Warehouse Miner, Cognos®, and MicroStrategy®.)

What is Active Data Warehousing?

The facing page provides a simple definition of Active Data Warehousing (ADW). Examples of why ADW is important (possibly mission-critical applications) to different industries include:
  • Airlines want an accurate view of customer value contribution so as to provide optimum customer service to the appropriate customer, whether or not they are frequent flyers.
  • Health care organizations need to control costs, but not at the expense of jeopardizing quality of care. Proactive intervention programs where high-risk patients are identified and steered into case-management programs accomplish both.
  • Financial institutions must fully understand a customer's profitability characteristics to automate appropriate and timely communications for increased revenue opportunity and/or better customer service.
  • Retailers need to have a single, integrated view of each customer across multiple channels of opportunity – web, in-store, and catalog – to provide the right offer through the right vehicle.
  • Communications companies must manage a constantly changing competitive environment and offer products and services to reduce customer churn rates.

One of the capabilities of ADW is to execute tactical queries in a timely fashion. Tactical queries are not the same as OLTP queries. Characteristics of a tactical query include:
  • More read-oriented
  • Focused on decision making
  • More casual arrival rate than OLTP queries

Examples of tactical queries include determining the best offer for a customer or altering an advertising campaign based on current demand and results.

Another example of utilizing Active Data Warehousing is in the rental car business. Assume a service provider has a limited, relatively fixed inventory of cars. The goal is to rent the maximum number of vehicles at the maximum price possible, under the constraint that all prices offered exceed the variable cost of the rental.
  • Pricing can be determined by forecasting demand and price elasticity as it relates to demand.
  • Differentiated pricing is the ultimate yield management strategy.

In order to do this, the business requires up-to-date, complete, and detailed data across the entire company.

What is Active Data Warehousing?

Data Warehousing … is the timely, integrated, logically consistent store of detailed data available for analytic business decision making.
  • Primarily batch feeds and updates
  • Ad hoc (or decision support) queries to support strategic decisions that return in minutes and maybe hours

Active Data Warehousing … is the timely, integrated, logically consistent store of detailed data available for strategic, tactically driven business decisions.
  • Timely updates – close to real time
  • Short, tactical queries that return in seconds
  • Event-driven activity plus strategic queries

Business requirements for an ADW (Active Data Warehouse):
  • Performance – response within seconds
  • Scalability – support for large data volumes, mixed workloads, and concurrent users
  • Availability – 7 x 24 x 365
  • Data Freshness – accurate, up-to-the-minute data

What is a Relational Database?

A database is a collection of permanently stored data that is used by an application or enterprise. A database contains logically related data. Basically, that means that the database was created with a purpose in mind. A database supports shared access by many users. A database also is protected to control access and managed to retain its value and integrity.

The key to understanding relational databases is the concept of the table, made up of rows and columns. A column always contains like data. In the example on the following page, the column named LAST NAME contains last names, and never anything else. The position of the column in the table is arbitrary. A row is one instance of all the columns of a table. In our example, all of the information about a single employee is in one row. The sequence of the rows in a table is arbitrary.

Specifically, in a Relational Database, tables are defined as a named collection of one or more named columns by zero or more rows of related information. Notice that each row of the table is about a person. There are no rows with data on two people, nor are there rows with information on anything other than people. This may seem obvious, but the concept underlying it is very important. Each row represents an occurrence of an entity defined by the table. An entity is defined as a person, place, or thing about which the table contains information. In this case the entity is the employee.

Primary Key

Tables, made up of rows and columns, represent entities or relationships. Entities are the people, places, things, or events that the entity tables model. Each table holds only one kind of row, and each row is uniquely identified within a table by a Primary Key (PK).

A Primary Key is required, and it can be more than one column. A Primary Key uniquely identifies each row in a table; no duplicate values are allowed. Only one Primary Key is allowed per table. The Primary Key for the EMPLOYEE table is the employee number. No two employees can have the same number. Because it is used to identify, the Primary Key cannot be NULL – there must be something in that field to uniquely identify each occurrence. Primary Key values should not be changed; historical information as well as relationships with other entities may be lost if a PK value is changed or re-used.

What is a Relational Database?

  • A Relational Database consists of a set of logically related tables.
  • A table is a two-dimensional representation of data consisting of rows and columns.
  • Each row in the table is uniquely identified by a Primary Key (PK) – 1 or more columns.
    – A PK cannot have duplicate values and cannot be NULL; only one PK per table.
    – PK values are considered "non-changing".
  • A table may optionally have 1 or more Foreign Keys (FK).
    – A FK can be 1 or more columns, can have duplicate values, and allows NULLs.
    – Each FK value must exist somewhere as a PK value.

Employee Table (sample rows)

EMPLOYEE NUMBER (PK) | MANAGER EMPLOYEE NUMBER (FK) | DEPT NUMBER (FK) | JOB CODE (FK) | LAST NAME | FIRST NAME | HIRE DATE | BIRTH DATE | SALARY AMOUNT
1006 | 1019 | 301 | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008 | 1019 | 301 | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1007 | 1005 | ?   | 432101 | Villegas | Arnando | 870102 | 470131 | 5970000
1003 | 0801 | 401 | 411100 | Trader   | James   | 860731 | 570619 | 4785000

This Employee table has 9 columns and 4 rows of sample data – one row per employee. There is no prescribed order for the rows of the table. There is only one row "format" for the entire table. Missing data values are represented by NULLs (shown above as "?").

Answering Questions with a Relational Database

A relational database is a collection of relational tables stored in a single installation of a relational database management system (RDBMS). The words "management system" indicate that not only is this a relational database, but there is also underlying software to provide additional functions that the industry expects. This includes transaction integrity, security, journaling, and other features that are expected of databases in general. The Teradata Database is a Relational Database Management System.

Relational databases do not use access paths to locate data; rather, data connections are made by data values. In other words, data connections are made by matching values in one column with the values in a corresponding column in another table. This connection is referred to as a JOIN in relational terminology.

The diagram on the facing page shows how the values in one table may be matched to values in another. Both tables have a column named "Department Number". That connection allows the database to answer questions like, "What is the name of the department in which an employee works?"

One reason relational databases are so powerful is that, unlike other databases, they are based on a mathematical model developed by Dr. Edgar Codd and implement a query language solidly founded in set theory.

To summarize, a relational database is a collection of tables. The data contained in the tables can be associated using data values – specifically, columns with matching data values.

Foreign Key

Relational databases permit associations by data value across more than one table. Foreign Keys (FKs) model the relationships between entities. On the facing page you will see that the Employee table has 3 FK columns, one of which models the relationship between employees and their departments. A second one models the relationship between employees and their job codes. A third FK column is used to model the relationship between employees and each other – this is called a "recursive" relationship.

Rules of Foreign Keys include:
  • Duplicate values are allowed in a FK column.
  • Missing values are allowed in a FK column.
  • Values may be changed in a FK column.
  • Each FK value must exist as a Primary Key.

Note that Dept_Number is the Primary Key for the DEPARTMENT table.
Answering Questions with a Relational Database

Employee (partial listing)

EMPLOYEE NUMBER (PK) | MANAGER EMPLOYEE NUMBER (FK) | DEPT NUMBER (FK) | JOB CODE (FK) | LAST NAME | FIRST NAME | HIRE DATE | BIRTH DATE | SALARY AMOUNT
1006 | 1019 | 301 | 312101 | Stein    | John    | 861015 | 631015 | 3945000
1008 | 1019 | 301 | 312102 | Kanieski | Carol   | 870201 | 680517 | 3925000
1005 | 0801 | 403 | 431100 | Ryan     | Loretta | 861015 | 650910 | 4120000
1004 | 1003 | 401 | 412101 | Johnson  | Darlene | 861015 | 560423 | 4630000
1007 | 1005 | 403 | 432101 | Villegas | Arnando | 870102 | 470131 | 5970000
1003 | 0801 | 401 | 411100 | Trader   | James   | 860731 | 570619 | 4785000

Department

DEPT NUMBER (PK) | DEPARTMENT NAME | BUDGET AMOUNT | MANAGER EMPLOYEE NUMBER (FK)
501 | marketing sales          | 80050000 | 1017
301 | research and development | 46560000 | 1019
403 | education                | 93200000 | 1005
402 | software support         | 30800000 | 1011
401 | customer support         | 98230000 | 1003

Questions:
  1. Name the department in which James Trader works.
  2. Who manages the Education Department?
  3. Identify by name an employee who works for James Trader.
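As a quick illustration (this sketch is not part of the original example set), question 1 can be answered by joining the two tables on their Department Number columns. The table and column names used below (Employee, Department, Dept_Number, Department_Name, Last_Name, First_Name) are assumed from the sample tables above and may differ slightly from the actual course database.

    -- Hedged sketch: match Dept_Number values in Employee with
    -- Dept_Number values in Department (a JOIN by data value).
    SELECT d.Department_Name
    FROM   Employee e
    JOIN   Department d
      ON   e.Dept_Number = d.Dept_Number
    WHERE  e.Last_Name  = 'Trader'
      AND  e.First_Name = 'James';

With the sample rows shown above, this request would return the customer support department.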
Teradata Database Competitive Advantages

As technology has improved, a number of aspects of the decision support environment have changed (improved). DSS systems are expected to:
  • Store and efficiently process detailed data (reduces the need for summarized data).
  • Process ad hoc queries in a timely fashion.
  • Contain current (up-to-date) data.

Teradata meets these requirements. The facing page lists a number of the key competitive advantages that Teradata provides. This course will look at these features in detail and explain why these are competitive advantages.

Teradata provides a central, enterprise-wide database that contains information extracted from operational data stores. It provides for a single version of the business (or truth). Characteristics include:
  • Based on an enterprise-wide model – this type of model provides the ability to look/work across functional processes.
  • Customers can begin small (right size), but may grow large rapidly.
  • Populated by extraction/loading of data from operational systems.
  • Allows end users to submit "what if" queries.

Examples of applications that Teradata enables include:
  • Customer Relationship Management (CRM)
  • Campaign Management
  • Yield Management
  • Supply Chain Management

Some of the reasons that Teradata is the leader in data warehousing include:
  • Scalable – supports a small (10 GB) to a massive (petabytes) database.
  • Provides a query optimizer with approximately 30+ years of experience in large-table query planning.
  • Does not require complex indexing schemes, complex data partitioning, or time-consuming reorganizations (re-orgs).
  • Supports ad hoc querying against the detail data in the warehouse, not just summary data in the data mart.
  • Designed and built with parallelism from day one (not a parallel retrofit).

Teradata Database Competitive Advantages

  • Unlimited, Proven Scalability – amount of data and number of users; allows for an enterprise-wide model of the data.
  • Unlimited Parallelism – parallel access, sorts, and aggregations.
  • Mature Optimizer – handles complex queries, up to 128 joins per query, ad hoc processing.
  • Models the Business – normalized data (usually in 3NF), robust view processing, and star schema capabilities.
  • Provides a "single version of the business".
  • Low TCO (Total Cost of Ownership) – ease of setup, maintenance, and administration; no re-orgs, lowest disk-to-data ratio, and a robust expansion utility (reconfig).
  • High Availability – no single point of failure.
  • Parallel Load and Unload utilities – robust, parallel, and scalable load and unload utilities such as FastLoad, MultiLoad, TPump, and FastExport.

Module 1: Review Questions

Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor.

1. Which feature allows the Teradata Database to process enormous volumes of data quickly? ____
   a. High availability software and hardware components
   b. High performance servers from Intel
   c. Proven Scalability
   d. Parallelism

2. The Teradata Database is primarily a ____ .
   a. Client
   b. Server

3. Which choice represents a quadrillion bytes or a Petabyte (PB) of data? ____
   a. 10^9
   b. 10^12
   c. 10^15
   d. 10^18

4. In a relational table, the set of columns that uniquely identifies a row is the _________ _________.

Module 2 – Teradata Basics

After completing this module, you will be able to:
  • List and describe the major components of the Teradata architecture.
  • Describe how the components interact to manage incoming and outgoing data.
  • List 5 types of Teradata database objects.

Teradata Proprietary and Confidential

Table of Contents

Major Components of Teradata
Teradata Storage Architecture
Teradata Retrieval Architecture
Multiple Tables on Multiple AMPs
Linear Growth and Expandability
Teradata Objects
  Tables
  Views
  Macros
  Triggers
  Stored Procedures
The Data Dictionary Directory (DD/D)
Structured Query Language (SQL)
  Data Definition Language (DDL)
  Data Manipulation Language (DML)
  Data Control Language (DCL)
  User Assistance
CREATE TABLE – Example of DDL
Views
  Single-table View
Multi-Table Views
Macros
  Features of Macros
  Benefits of Macros
HELP Commands
SHOW Command
EXPLAIN Facility
Summary
Module 2: Review Questions

Major Components of Teradata

Up until now we have discussed relational databases in terms of how the user perceives them – as a collection of tables that relate to one another. Now it's time to describe the components of the system. The major software components are the Parsing Engine (PE) and the Access Module Processor (AMP).

The Parsing Engine is a component that interprets SQL requests, receives input records, and passes data. To do that, it sends the messages through the Message Passing Layer to the AMPs.

The Message Passing Layer (MPL) handles the internal communication of the Teradata Database. The MPL is a combination of hardware and software (BYNET and PDE, as we will see later). All communication between PEs and AMPs is done via the Message Passing Layer.

The Access Module Processor (AMP) is responsible for managing a portion of the database. An AMP will control some portion of each table on the system. AMPs do all of the physical work associated with generating an answer set, including sorting, aggregating, formatting, and converting.

A Virtual Disk is disk space associated with an AMP. Tables/data rows are stored in this space. A virtual disk is usually assigned to two or more disk drives in a disk array. This concept will be discussed in detail later in the course.
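Because the Parsing Engine is the component that parses and optimizes a request into AMP execution steps, one simple way to see that work is the EXPLAIN modifier (covered later in this module under the EXPLAIN Facility). The following is a minimal sketch only; the Employee table name is an assumption carried over from the Module 1 examples.

    -- Hedged sketch: EXPLAIN asks the Parsing Engine to return its
    -- optimized plan (the AMP steps) for a request instead of executing it.
    EXPLAIN
    SELECT Last_Name, First_Name
    FROM   Employee
    WHERE  Dept_Number = 403;

The output is English-like text describing the steps the AMPs would perform; the SELECT itself is not executed.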
Major Components of Teradata

(Diagram: a SQL request flows from the client to a Parsing Engine; the Message Passing Layer connects the PEs to the AMPs; each AMP owns a Vdisk; the answer set response flows back to the client.)

Parsing Engines (PE)
  • Manage sessions for users
  • Parse, optimize, and send your request to the AMPs as execution steps
  • Return the answer set response back to the client

Message Passing Layer (MPL)
  • Allows PEs and AMPs to communicate with each other

Access Module Processors (AMP)
  • Own and manage their storage
  • Perform the steps sent by the PEs

Virtual Disks (Vdisk)
  • Space owned by the AMP and used to hold user data (rows within tables)
  • Maps to physical space in a disk array

AMPs store and retrieve rows to and from disk.

Teradata Storage Architecture

On the facing page you will see a simplified view of how the physical components of a Teradata database work to insert a row of data. The PEs and AMPs are actually implemented as virtual processors (vprocs) in the system. A vproc is effectively a group of processes that represents a Teradata software component.

The Parsing Engine interprets the SQL command and converts the data record from the host into an AMP message. The Parsing Engine is a component that interprets SQL requests, receives input records, and passes data. To do that, it sends the messages through the Message Passing Layer to the AMPs.

The Message Passing Layer distributes the row to the appropriate Access Module Processor (AMP). The Message Passing Layer is implemented as hardware and/or software, depending on the platform used. It determines which vprocs should receive a message.

The AMP formats the row and writes it to its associated disks (Vdisks), which are assigned to physical disks in a disk array. The physical disk holds the row for subsequent access.

The Host or Client system supplies the records. These records are the raw data from which the database will be constructed.

Think of the AMP (Access Module Processor) as an independent computer designed for and dedicated to managing a portion of the entire database. It performs all the database management functions – such as sorting, aggregating, and formatting the data. It receives data from the PE, formats the rows, and distributes the rows to the disk storage units it controls. It also retrieves the rows requested by the parsing engine.

Teradata Storage Architecture

(Diagram: records arrive from the client in random sequence – 2, 32, 67, 12, 90, 6, 54, 75, 18, 25, 80, 41 – and are spread across AMPs 1 through 4.)

  • The Parsing Engine dispatches a request to insert a row.
  • The Message Passing Layer ensures that the row gets to the appropriate AMP (Access Module Processor).
  • The AMP stores the row on its associated (logical) disk. An AMP manages a logical or virtual disk which is mapped to multiple physical disks in a disk array.
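For orientation, the storage flow shown above is what happens behind even a simple single-row INSERT. This is only a hedged sketch; the Employee table, its columns, and the sample values are assumptions based on the Module 1 examples.

    -- Hedged sketch of a single-row insert:
    --   1. A Parsing Engine parses and optimizes the request and builds an AMP message.
    --   2. The Message Passing Layer delivers the message to the appropriate AMP.
    --   3. That AMP formats the row and writes it to its Vdisk in the disk array.
    INSERT INTO Employee
      (Employee_Number, Dept_Number, Last_Name, First_Name, Salary_Amount)
    VALUES
      (1022, 401, 'Machado', 'Albert', 3800000);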
Teradata Retrieval Architecture

Retrieving data from the Teradata Database simply reverses the process of the storage model. A request is made for data and is passed on to a Parsing Engine (PE). The PE optimizes the request for efficient processing and creates tasks for the AMPs to perform, which will result in the request being satisfied. These tasks are then dispatched to the AMPs via the Message Passing Layer. Oftentimes all AMPs must participate in creating the answer set, such as in returning all rows of a table. Other times, only one or a few AMPs need to participate, depending on the nature of the request. The PE will ensure that only the AMPs that are needed are assigned tasks on behalf of the request.

Once the AMPs have been given their assignments, they will retrieve the desired rows from their respective disks. If sorting, aggregating, or formatting of any kind is needed, the AMPs will also take care of that. The rows are then returned to the requesting PE via the Message Passing Layer. The PE takes the returned answer set and returns it to the requesting client application.

Teradata Retrieval Architecture

(Diagram: rows 2, 32, 67, 12, 90, 6, 54, 75, 18, 25, 80, 41 are retrieved from tables spread across AMPs 1 through 4.)

  • The Parsing Engine dispatches a request to retrieve one or more rows.
  • The Message Passing Layer ensures that the appropriate AMP(s) are activated.
  • The AMP(s) locate and retrieve the desired row(s) in parallel.
  • The Message Passing Layer returns the retrieved rows to the PE.
  • The PE returns the row(s) to the requesting client application.

Multiple Tables on Multiple AMPs

Logically, you might think that the Teradata Database would assign each table to a particular AMP, and that the AMP would put that table on a single disk. However, as you see in the diagram on the facing page, that's not what happens. The system takes the rows that compose a table and divides those rows up among all available AMPs – to make the work parallel.

Here's how it works:
  • Tables are distributed across all AMPs. This distribution of rows should be even across all AMPs. This way, a request to get the rows of a given table will result in the workload being evenly distributed across the AMPs.
  • Each table has some rows distributed to each AMP.
  • Each AMP controls one logical storage unit (Vdisk), which may consist of several physical disks.
  • Each AMP places, maintains, and manages the rows on its own disks.
  • Large configurations may have hundreds of AMPs.
  • Full table scans – operations that require looking at all the rows of a table – access all AMPs in parallel. That parallelism is what makes accessing enormous amounts of data possible, and faster.

Consider the following three tables: EMPLOYEE, DEPARTMENT, and JOB. The Teradata Database takes the rows from each of the tables and divides them up among all the AMPs. The AMPs divide the rows up among their disks. Notice that each AMP gets part of each table. Dividing up the tables this way means that all the AMPs and their associated disks will be activated in a full table scan, thus speeding up requests against these tables.

In our example, if you assume four AMPs, each AMP would get approximately 25% of each table. If, however, AMP #1 were to get 90% of the rows from the EMPLOYEE table, that would be called "lumpy" data distribution. Lumpy data distribution would slow the system down because any request that required scanning all the rows of EMPLOYEE would have three AMPs sitting idle while AMP #1 finished its work. It is better to divide all the tables up evenly among all the available AMPs. You will see how this distribution is controlled in a later chapter.
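If you want to see this distribution for yourself, Teradata provides hash-related SQL functions that report which AMP a row's primary index value maps to. The following is a hedged sketch of a common row-distribution check; it assumes the Employee table exists and that Employee_Number is its primary index column.

    -- Hedged sketch: count rows per AMP for the assumed Employee table.
    -- HASHROW computes the row hash of the primary index value,
    -- HASHBUCKET maps the row hash to a hash bucket, and
    -- HASHAMP maps the bucket to the AMP that owns the row.
    SELECT HASHAMP(HASHBUCKET(HASHROW(Employee_Number))) AS AMP_Number,
           COUNT(*)                                      AS Row_Count
    FROM   Employee
    GROUP  BY 1
    ORDER  BY 1;

Roughly equal counts per AMP indicate even (non-"lumpy") distribution.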
Multiple Tables on Multiple AMPs

(Diagram: rows from the EMPLOYEE, DEPARTMENT, and JOB tables pass through the Parsing Engine and the Message Passing Layer; AMPs 1 through 4 each hold EMPLOYEE rows, DEPARTMENT rows, and JOB rows.)

  • A row from each table will usually be stored on each AMP.
  • Each AMP may have rows from all tables.
  • Ideally, each AMP will hold roughly the same amount of data.

Linear Growth and Expandability

The Teradata DBS is the first commercial database system to offer true parallelism and the performance increase that goes with it. Think back to the example of how rows are divided up among AMPs that we just discussed. Assume that our three tables, EMPLOYEE, DEPARTMENT, and JOB, total 100,000 rows, with a certain number of users, say 50.

What happens if you double the number of AMPs and the number of users stays the same? Performance doubles, because each AMP only works on half as many rows as it used to.

Now think of that system in a situation where the number of users is doubled, as well as the number of AMPs. We now have 100 users, but we also have twice as many AMPs. What happens to performance? It stays the same. There is no drop-off in the speed with which requests are executed. That's because the system is modular and the workload is easily partitioned into independent pieces. In the last example, each AMP is still doing the same amount of work.

This feature – that the amount of time (or money) required to do a task is directly proportional to the size of the system – is unique to the Teradata Database. Traditional databases show a sharp drop in performance when the system approaches a critical size.

Look at the diagram on the facing page. As the number of Parsing Engines increases, the number of SQL requests that can be supported increases. As you add AMPs, data is spread out more even as you add processing power to handle the data. As you add disks, you add space for each AMP to store and process more information. All AMPs must have the same amount of disk storage space.

There are numerous advantages to having a system that has linear scalability. Two advantages include:
  • Linear scalability allows for increased workload without decreased throughput.
  • Investment protection for application development.
A column represents attributes of the table. Column names are given to each column of the table. All the information in a column is the same type, for example, date of birth. Each occurrence of an entity is stored in the table as a row. Entities are the people, things, or events that the table is about. Thus a row would represent a particular person, thing, or event. Views A view is a pre-defined subset of one of more tables or other views. It does not exist as a real table, but serves as a reference to existing tables or views. One way to think of a view is as a virtual table. Views have definitions in the data dictionary, but do not contain any physical rows. The database administrator can use views to control access to the underlying tables. Views can be used to hide columns from users, to insulate applications from database changes, and to simplify or standardize access techniques. Macros A macro is a predefined, stored set of one or more SQL commands and optionally, report formatting commands. Macros are used to simplify the execution of frequently used SQL commands. Triggers A trigger is a set of SQL statements usually associated with a column or a table and when that column changes, the trigger is fired – effectively executing the SQL statements. Stored Procedures A stored procedure is a program that is stored within Teradata and executes within the Teradata Database. A stored procedure uses permanent disk space. A stored procedure is a pre-defined set of statements invoked through a single SQL CALL statement. Stored procedures may contain both Teradata SQL statements and procedural statements (in Teradata, referred to as Stored Procedure Language, or SPL). Page 2-14 Teradata Basics Teradata Objects Examples of objects within a Teradata database or user include: Tables – rows and columns of data Views – predefined subsets of existing tables Macros – predefined, stored SQL statements Triggers – SQL statements associated with a table Stored Procedures – program stored within Teradata User-Defined Function – function (C or Java program) to provide additional SQL functionality Join and Hash Indexes – separate index structures stored as objects within a database Permanent Journals – table used to store before and/or after images for recovery DATABASE or USER can have a mix of various objects. * - require Permanent Space These objects are created, maintained, and deleted using SQL. Object definitions are stored in the DD/D. TABLE 1 * TABLE 2 * TABLE 3 * VIEW 1 VIEW 2 VIEW 3 MACRO 1 Stored Procedure 1 * TRIGGER 1 UDF 1 * Join/Hash Index 1 * These aren't directly accessed by users. Permanent Journal * Teradata Basics Page 2-15 The Data Dictionary Directory (DD/D) The Data Dictionary/Directory is an integrated set of system tables which store database object definitions and accumulate information about users, databases, resource usage, data demographics, and security rules. It records specifications about tables, views, and macros. It also contains information about ownership, space allocation, accounting, and access rights (privileges) for these objects. Data Dictionary/Directory information is updated automatically during the processing of Teradata SQL data definition (DDL) statements. It is used by the Parser to obtain information needed to process all Teradata SQL statements. Users may access the DD/D through Teradata-supplied views, if permitted by the system administrator. Page 2-16 Teradata Basics The Data Dictionary Directory (DD/D) The DD/D ... 
– is an integrated set of system tables – contains definitions of and information about all objects in the system – is entirely maintained by the Teradata Database – is “data about the data” or “metadata” – is distributed across all AMPs like all tables – may be queried by administrators or support staff – is normally accessed via Teradata supplied views Examples of DD/D views: DBC.TablesV – information about all tables DBC.UsersV – information about all users DBC.AllRightsV – information about access rights DBC.AllSpaceV – information about space utilization Teradata Basics Page 2-17 Structure Query Language (SQL) Structured Query Language (SQL) is the language of relational databases. It is sometimes referred to as a "Fourth Generation Language (4GL)" to differentiate it from "Third Generation Languages" such as FORTRAN and COBOL, though it is quite different from other 4GL’s. It acts as an intermediary between the user and the database. SQL is different in some very important ways from other computer languages. Its statements resemble English-like structures. It provides powerful, set-oriented database manipulation including structural modification, data retrieval, modification, and security functions. SQL is a non-procedural language. Because of its set orientation it does not require IF, GOTO, DO, FOR NEXT or PERFORM statements. We'll describe three important subsets of SQL – the Data Definition Language, the Data Manipulation Language, and the Data Control Language. Data Definition Language (DDL) The DDL allows a user to define the database objects and the relationships that exist among them. Examples of DDL uses are creating or modifying tables and views. Data Manipulation Language (DML) The DML consists of the statements that manipulate, change or retrieve the data rows of the database. If the DDL defines the database, the DML lets the user change the information contained in the database. The DML is the most commonly used subset of SQL. It is used to select, update, delete, and insert rows. Data Control Language (DCL) The Data Control Language is used to restrict or permit a user's access in various ways. It can selectively limit a user's ability to retrieve, add, or modify data. It is used to grant and revoke access privileges on tables and views. An example is granting update privileges on a table, or read privileges on a view to specified users. User Assistance These commands allow you to list the objects in a database, or the characteristics of a table, see how a query will execute, or show you the details of your system. They vary widely from vendor to vendor. Page 2-18 Teradata Basics Structured Query Language (SQL) SQL is a query language for Relational Database Systems and is used to access Teradata. – A fourth-generation language – A set-oriented language – A non-procedural language (e.g., doesn’t have IF, DO, FOR NEXT, etc. ) SQL consists of: Data Definition Language (DDL) – Defines database structures (tables, users, views, macros, triggers, etc.) CREATE DROP ALTER Data Manipulation Language (DML) – Manipulates rows and data values SELECT INSERT UPDATE DELETE Data Control Language (DCL) – Grants and revokes access rights GRANT REVOKE Teradata SQL also includes Teradata Extensions to SQL HELP Teradata Basics SHOW EXPLAIN CREATE MACRO Page 2-19 CREATE TABLE – Example of DDL To create and store the table structure definition in the DD/D, you can execute the CREATE TABLE DDL statement as shown on the facing page. 
An example of the output from a SHOW TABLE command follows: SHOW TABLE Employee; CREATE SET TABLE Per_DB.Employee, FALLBACK, NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT, DEFAULT MERGEBLOCKRATIO ( employee_number INTEGER NOT NULL, manager_emp_number INTEGER NOT NULL, dept_number INTEGER COMPRESS, job_code INTEGER COMPRESS , last_name CHAR(20) NOT CASESPECIFIC NOT NULL, first_name VARCHAR(20) NOT CASESPECIFIC, hire_date DATE FORMAT 'YYYY-MM-DD' birth_date DATE FORMAT 'YYYY-MM-DD', salary_amount DECIMAL(10,2) COMPRESS 0 ) UNIQUE PRIMARY INDEX (employee_number) INDEX (dept_number); You can create secondary indexes after a table has been created by executing the CREATE INDEX command. An example of creating an index for the job_code column is shown on the facing page. Examples of the DROP INDEX and DROP TABLE commands are also shown on the facing page. Page 2-20 Teradata Basics CREATE TABLE – Example of DDL CREATE TABLE Employee (employee_number INTEGER NOT NULL ,manager_emp_number INTEGER COMPRESS ,dept_number INTEGER COMPRESS ,job_code INTEGER COMPRESS ,last_name CHAR(20) NOT NULL ,first_name VARCHAR (20) ,hire_date DATE FORMAT 'YYYY-MM-DD' ,birth_date DATE FORMAT 'YYYY-MM-DD' ,salary_amount DECIMAL (10,2) COMPRESS 0 ) UNIQUE PRIMARY INDEX (employee_number) INDEX (dept_number); Other DDL Examples CREATE INDEX (job_code) ON Employee ; DROP INDEX (job_code) ON Employee ; DROP TABLE Employee ; Teradata Basics Page 2-21 Views A view is a pre-defined subset or filter of one or more tables. Views are used to control access to the underlying tables and simplify access to data. Authorized users may use views to read data specified in the view and/or to update data specified in the view. Views are used to simplify query requests, to limit access to data, and to allow different users to look at the same data from different perspectives. A view is a window that accesses selected portions of a database. Views can show parts of one table (single-table view), more than one table (multi-table view), or a combination of tables and other views. To the user, views look just like tables. Views are an alternate way of organizing and presenting information. A view, like a table, has rows and columns. However, the rows and columns of a view are not stored directly but are derived from the rows and columns of tables whenever the view is referenced. A view looks like a table, but has no data of its own, and therefore takes up no storage space except for its definition. One way to think of a view is as if it was a window through which you can look at selected portions of a table or tables. Single-table View A single-table view takes specified columns and/or rows from a table and makes them available in a fashion that looks like a table. An example might be an employee table from which you select only certain columns for employees in a particular department number, for example, department 403, and present them in a view. Example of a CREATE VIEW statement: CREATE VIEW Emp403_v AS SELECT employee_number ,department_number ,last_name ,first_name ,hire_date FROM Employee WHERE department_number = 403; It is also possible to execute SHOW VIEW viewname; Page 2-22 Teradata Basics Views Views are pre-defined filters of existing tables consisting of specified columns and/or rows from the table(s). 
A single table view: – is a window into an underlying table – allows users to read and update a subset of the underlying table – has no data of its own EMPLOYEE (Table) MANAGER EMPLOYEE EMP NUMBER NUMBER DEPT NUMBER JOB CODE PK FK FK FK 1006 1008 1005 1004 1007 1003 1019 1019 0801 1003 1005 0801 301 301 403 401 403 401 312101 312102 431100 412101 432101 411100 LAST NAME Stein Kanieski Ryan Johnson Villegas Trader FIRST NAME HIRE DATE BIRTH DATE SALARY AMOUNT John Carol Loretta Darlene Arnando James 861015 870201 861015 861015 870102 860731 631015 680517 650910 560423 470131 570619 3945000 3925000 4120000 4630000 5970000 4785000 Emp403_v (View) EMP NO 1005 801 Teradata Basics DEPT NO 403 403 LAST NAME Villegas Ryan FIRST NAME Arnando Loretta HIRE DATE 870102 861015 Page 2-23 Multi-Table Views A multi-table view combines data from more than one table into one pre-defined view. These views are also called “join views” because more than one table is involved. An example might be a view that shows employees and the name of their department, information that comes from two different tables. Note: Multi-table Views are read only. The user cannot update the data via the view. One might wish to create a view containing the last name and department name for all employees. A Join operation joins rows of multiple tables and creates rows in work space or spool. These are rows that contain data from more than one table but are not maintained anywhere in permanent storage. These rows in spool are created dynamically as part of a join operation. Rows are matched up based on Primary and Foreign Key relationships. Example of SQL to create a join view: CREATE VIEW EmpDept_v AS SELECT Last_Name ,Department_Name FROM Employee E INNER JOIN Department D ON E.dept_number = D.dept_number ; An example of reading via this view is: SELECT FROM Last_Name ,Department_Name EmpDept_v; This example utilizes an alias name of E for the Employee table and D for the Department table. Page 2-24 Teradata Basics Multi-Table Views A multi-table view allows users to access data from multiple tables as if it were in a single table. Multi-table views (i.e., join views) are used for reading only, not updating. EMPLOYEE (Table) MANAGER EMPLOYEE EMP NUMBER NUMBER DEPARTMENT (Table) DEPT NUMBER JOB CODE LAST NAME FIRST NAME DEPT NUMBER DEPARTMENT NAME BUDGET AMOUNT PK PK FK FK FK 1006 1008 1005 1004 1007 1003 1019 1019 0801 1003 1005 0801 301 301 403 401 403 401 312101 312102 431100 412101 432101 411100 Stein Kanieski Ryan Johnson Villegas Trader John Carol Loretta Darlene Arnando James 501 301 302 403 402 401 MANAGER EMP NUMBER FK Marketing Sales Research & Development Product Planning Education Software Support Customer Support 80050000 46560000 22600000 93200000 30800000 98230000 1017 1019 1016 1005 1011 1003 Joined Together Example of SQL to create a join view: EmpDept_v CREATE VIEW EmpDept_v AS SELECT Last_Name ,Department_Name FROM Employee E INNER JOIN Department D ON E.dept_number = D.dept_number; Teradata Basics (View) Last_Name Department_Name Stein Kanieski Ryan Johnson Villegas Trader Research & Development Research & Development Education Customer Support Education Customer Support Page 2-25 Macros The Macro facility allows you to define a sequence of Teradata SQL statements (and optionally Teradata report formatting statements) so that they execute as a single transaction. Macros reduce the number of keystrokes needed to perform a complex task. 
This saves you time, reduces the chance of errors, reduces the communication volume to Teradata, and allows efficiencies internal to Teradata. Macros are a Teradata SQL extension. Features of Macros Macros are source code stored on the DBC. They can be modified and executed at will. They are re-optimized at execution time. They can be executed by interactive or batch applications. They are executed by one EXECUTE command. They can accept user-provided parameter values. Benefits of Macros Macros simplify and control access to the system. They enhance system security. They provide an easy way of installing referential integrity. They reduce the amount of source code transmitted from the client application. They are stored in the Teradata DD/D and are available to all connected hosts. To create a macro: CREATE MACRO Customer_List AS (SELECT customer_name FROM Customer; ); To execute a macro: EXEC Customer_List; To replace a macro: REPLACE MACRO Customer_List AS (SELECT customer_name, customer_number FROM Customer; ); To drop a macro: DROP MACRO Customer_List; Page 2-26 Teradata Basics Macros A MACRO is a predefined set of SQL statements which is logically stored in a database. Macros may be created for frequently occurring queries of sets of operations. Macros have many features and benefits: • • • • • • • Simplify end-user access Control which operations may be performed by users May accept user-provided parameter values Are stored in the Teradata Database, thus available to all clients Reduces query size, thus reduces LAN/channel traffic Are optimized at execution time May contain multiple SQL statements To create a macro: CREATE MACRO Customer_List AS (SELECT customer_name FROM Customer;); To execute a macro: EXEC Customer_List; To replace a macro: REPLACE MACRO Customer_List AS (SELECT customer_name, customer_number FROM Customer;); Teradata Basics Page 2-27 HELP Commands HELP commands (a Teradata SQL extension) are available to display information on database objects: Databases and Users Tables Views Macros Triggers Join Indexes Hash Indexes Stored Procedures User-Defined Functions The facing page contains an example of a HELP DATABASE command. This command lists the tables, views, macros, triggers, etc. in the specified database. The Kind (TableKind) column codes represent the following: T O V M G P F I N J A B D E H Q U X Page 2-28 – – – – – – – – – – – – – – – – – – Table Table without a Primary Index View Macro Trigger Stored Procedure User-defined Function Join Index Hash Index Permanent Journal Aggregate Function Combined aggregate and ordered analytical function JAR External Stored Procedure Instance or Constructor Method Queue Table User-defined data type Authorization Teradata Basics HELP Commands Databases and Users HELP DATABASE HELP USER Customer_Service; Dave_Jones; Tables, Views, Macros, etc. HELP HELP HELP HELP TABLE VIEW MACRO COLUMN Employee; Emp_v; Payroll_3; Employee.*; Employee.last_name; HELP INDEX Employee; HELP TRIGGER Raise_Trigger; HELP STATISTICS Employee; HELP CONSTRAINT Employee.over_21; HELP JOIN INDEX Cust_Order_JI; HELP SESSION; Example: HELP DATABASE Customer_Service; *** Help information returned. 15 rows. *** Total elapsed time was 1 second. Table/View/Macro name Contact Customer Cust_Comp_Orders Cust_Order_JI Department : Orders Orders_Temp Orders_HI Raise_Trigger Set_Ansidate_on Kind T T V I T : T O N G M Comment ? ? ? ? ? : ? ? ? ? ? This is not an inclusive list of HELP commands. 
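In addition to the HELP commands shown above, much of the same information can be retrieved by querying the DD/D views listed earlier in this module. The following is a minimal sketch of such a dictionary query (it assumes SELECT access on the DBC views; the TableName and TableKind column names are as documented for DBC.TablesV, so verify them against your release):

SELECT   TableName, TableKind
FROM     DBC.TablesV
WHERE    DatabaseName = 'Customer_Service'
ORDER BY TableName;

The TableKind column returns the same one-character codes (T, V, M, G, and so on) listed on the previous page.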
Teradata Basics Page 2-29 SHOW Command HELP commands display information about database objects (users/databases, tables, views, macros, triggers, and stored procedures) and session characteristics. SHOW commands (another Teradata extension) display the data definition (DDL) associated with database objects (tables, views, macros, triggers, or stored procedures). BTEQ contains a SHOW command, in addition to and separate from the SQL SHOW command. The BTEQ SHOW provides information on the formatting and display settings for the current BTEQ session, if applicable. Page 2-30 Teradata Basics SHOW Command SHOW commands display how an object was created. Examples include: Command SHOW TABLE SHOW VIEW SHOW MACRO SHOW TRIGGER SHOW PROCEDURE SHOW JOIN INDEX Returns statement table_name; view_name; macro_name; trigger_name; procedure_name; join_index_name; CREATE TABLE statement … CREATE VIEW ... CREATE MACRO ... CREATE TRIGGER … CREATE PROCEDURE … CREATE JOIN INDEX … SHOW TABLE Employee; CREATE SET TABLE PD.Employee, FALLBACK, NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT, DEFAULT MERGEBLOCKRATIO ( Employee_Number INTEGER NOT NULL, Emp_Mgr_Number INTEGER COMPRESS, Dept_Number INTEGER COMPRESS, Job_Code INTEGER COMPRESS, Last_Name CHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC, First_Name VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC, Salary_Amount DECIMAL(10,2) COMPRESS 0) UNIQUE PRIMARY INDEX ( Employee_Number ) INDEX ( Dept_Number ); Teradata Basics Page 2-31 EXPLAIN Facility The EXPLAIN facility (a very useful and robust Teradata extension) allows you to preview how Teradata will execute a query you have requested. It returns a summary of the steps the Teradata Database would perform to execute the request. EXPLAIN also discloses the strategy and access method to be used, how many rows will be involved, and its “cost” in minutes and seconds. You can use EXPLAIN to evaluate a query performance and to develop an alternative processing strategy that may be more efficient. EXPLAIN works on any SQL request. The request is fully parsed and optimized, but it is not run. Instead, the complete plan is returned to the user in readable English statements. EXPLAIN also provides information about locking, sorting, row selection criteria, join strategy and conditions, access method, and parallel step processing. There are a lot of reasons for using EXPLAIN. The main ones we’ve already pointed out – it lets you know how the system will do the job, what kind of results you will get back, and the relative cost of the query. EXPLAIN is also useful for performance tuning, debugging, pre-validation of requests, and for technical training. The following is an example of an EXPLAIN on a very simple query doing a FTS (Full Table Scan). EXPLAIN SELECT * FROM Employee WHERE Dept_Number = 1018; Explanation (full) --------------------------------------------------------------------------1) First, we lock a distinct PD."pseudo table" for read on a RowHash to prevent global deadlock for PD.Employee. 2) Next, we lock PD.Employee for read. 3) We do an all-AMPs RETRIEVE step from PD.Employee by way of an all-rows scan with a condition of ("PD.Employee.Dept_Number = 1018") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 10 rows (730 bytes). The estimated time for this step is 0.14 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. 
-> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.14 seconds. Page 2-32 Teradata Basics EXPLAIN Facility The EXPLAIN modifier in front of any SQL statement generates an English translation of the Parser’s plan. The request is fully parsed and optimized, but not actually executed. EXPLAIN returns: • Text showing how a statement will be processed (a plan) • An estimate of how many rows will be involved • A relative cost of the request (in units of time) This information is useful for: • • • • predicting row counts predicting performance testing queries before production analyzing various approaches to a problem EXPLAIN SELECT * FROM Employee WHERE Dept_Number = 1018; : 3) We do an all-AMPs RETRIEVE step from PD.Employee by way of an all-rows scan with a condition of ("PD.Employee.Dept_Number = 1018") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 10 rows (730 bytes). The estimated time for this step is 0.14 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.14 seconds. Teradata Basics Page 2-33 Summary The Teradata system is a high-performance database system that permits the processing of enormous quantities of detail data, quantities which are beyond the capability of conventional systems. The system is specifically designed for large relational databases. From the beginning the Teradata system was created to do one thing: manage enormous amounts of data. Over one thousand terabytes of on-line storage capacity is currently available making it an ideal solution for enterprise data warehouses or even smaller data marts. Uniform data distribution across multiple processors facilitates parallel processing. The system is designed in such a way that the component parts divides the work up into approximately equal pieces. This keeps all the parts busy all the time; this enables the system to accommodate a larger number of users and/or more data. Open architecture adapts readily to new technology. As higher-performance industry standard computer chips and disk drives are made available, they are easily incorporated into the architecture. As the configuration grows, performance increase is linear. Structured Query Language (SQL) is the industry standard for communicating with relational databases. The Teradata Database currently runs as a database server on a variety of Linux, UNIX, and Windows based hardware platforms. Page 2-34 Teradata Basics Summary The major components of the Teradata Database are: Parsing Engines (PE) • Manage sessions for users • Parse, optimize, and send your request to the AMPs as execution steps • Returns answer set response back to client Message Passing Layer (MPL) • Allows PEs and AMPs to communicate with each other Access Module Processors (AMP) • Owns and manages its storage • Performs the steps sent by the PEs Virtual Disks (Vdisk) • Space owned by the AMP and is used to hold user data (rows within tables). • Maps to physical space in a disk array. Teradata Basics Page 2-35 Module 2: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 2-36 Teradata Basics Module 2: Review Questions 1. What language is used to access a Teradata table? 2. 
What are five Teradata database objects? 3. What are four major components of the Teradata architecture? 4. What are views? 5. What are macros? Teradata Basics Page 2-37 Notes Page 2-38 Teradata Basics Module 3 Teradata Database Architecture After completing this module, you will be able to: Describe the purpose of the PE and the AMP. Describe the overall Teradata Database parallel architecture. Describe the relationship of the Teradata Database to its client side applications. Teradata Proprietary and Confidential Teradata Database Architecture Page 3-1 Notes Page 3-2 Teradata Database Architecture Table of Contents Teradata and MPP Systems .......................................................................................................... 3-4 Teradata Functional Overview ..................................................................................................... 3-6 Channel-Attached Client Software Overview.............................................................................. 3-8 Network-Attached Client Software Overview ........................................................................... 3-10 The Parsing Engine .................................................................................................................... 3-12 Message Passing Layer .............................................................................................................. 3-14 The Access Module Processor (AMP) ....................................................................................... 3-16 Teradata Parallelism ................................................................................................................... 3-18 Module 3: Review Questions ..................................................................................................... 3-20 Teradata Database Architecture Page 3-3 Teradata and MPP Systems Teradata is the software that makes a MPP system appear to be a single system to users and administrators. The BYNET (BanYan NETwork) is the software and hardware interconnect that provides high performance networking capabilities to Teradata MPP (Massively Parallel Processing) systems. Using communication switching techniques, the BYNET allows for point-to-point, multicast, and broadcast communications among the nodes, thus supporting a monumental increase in throughput in very large databases. This technology allows Teradata users to grow massively parallel databases without fear of a communications bottleneck for any database operations. Although the BYNET software also supports the multicast protocol, Teradata software uses the point-to-point protocol whenever possible. When an all-AMP operation is needed, Teradata software uses the broadcast protocol to broadcast the request to the AMPs. The BYNET is linearly scalable for point-to-point communications. For each new node added to the system, an additional 960 MB (with BYNET Version 4) of bandwidth is added, thus providing scalability as the system grows. Scalability comes from the fact that multiple point-to-point circuits can be established concurrently. With the addition of another node, more circuits can be established concurrently. Page 3-4 Teradata Database Architecture Teradata and MPP Systems Teradata is the software that makes a MPP system appear to be a single system to users and administrators. BYNET 0 BYNET 1 The major components of the Teradata Database are implemented as virtual processors (vproc). 
• Parsing Engine (PE) • Access Module PE PE PE PE PE PE PE PE AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP : : : : : : : : AMP AMP AMP AMP AMP AMP AMP AMP Processor (AMP) The Communication Layer or Message Passing Layer (MPL) consists of PDE and BYNET SW/HW and connects multiple nodes together. Teradata Database Architecture PDE O.S. PDE O.S. PDE O.S. PDE O.S. Node 0 Node 1 Node 2 Node 3 Page 3-5 Teradata Functional Overview The client may be a mainframe system (e.g., IBM) in which case it is channel-attached to the Teradata Database. Also, a client may be a PC or UNIX-based system that is LAN or network-attached. The client application submits an SQL request to the Teradata Database, receives the response, and submits the response to the user. The Call Level Interface (CLI) is a library of routines that resides on the client side. Client application programs use these routines to perform operations such as logging on and off, submitting SQL queries and receiving responses which contain the answer set. These routines are 98% the same in a network-attached environment as they are in a channelattached. Page 3-6 Teradata Database Architecture Teradata Functional Overview Channel-Attached System Network-Attached System Client Application Client Application ODBC, JDBC, or .NET CLI CLI Teradata Database MTDP Channel TDP Parsing Engine LAN Parsing Engine MOSI Message Passing Layer AMP Teradata Database Architecture AMP AMP AMP Page 3-7 Channel-Attached Client Software Overview In channel-attached systems, there are three major software components, which play important roles in getting the requests to and from the Teradata Database. The client application is either written by a programmer or is one of Teradata’s provided utility programs. Many client applications are written as “front ends” for SQL submission, but they also are written for file maintenance and report generation. Any client-supported language may be used provided it can interface to the Call Level Interface (CLI). For example, a user could write a COBOL application with “embedded SQL”. The application developer would have to use the Teradata COBOL Preprocessor and COBOL compiler programs to generate an object module and link this object module with the CLI. The CLI application interface provides maximum control over Teradata connectivity and access. The Call Level Interface (CLI) is the lowest level interface to the Teradata Database. It consists of system calls which create sessions, allocate request and response buffers, create and de-block “parcels” of information, and fetch response information to the requesting client. The Teradata Director Program (TDP) is a Teradata-supplied program that must run on any client system that will be channel-attached to the Teradata Database. The TDP manages the session traffic between the Call-Level Interface and the Database. Its functions include session initiation and termination, logging, verification, recovery, and restart, as well as physical input to and output from the PEs, (including session balancing) and the maintenance of queues. The TDP may also handle system security. The Host Channel Adapter is a mainframe hardware component that allows the mainframe to connect to an ESCON or Bus/Tag channel. The PBSA (PCI Bus ESCON Adapter) is a PCI adapter card that allows a Teradata server to connect to an ESCON channel. The PBCA (PCI Bus Channel Adapter) is a PCI adapter card that allows a Teradata server to connect to a Bus/Tag channel. 
Page 3-8 Teradata Database Architecture Channel-Attached Client Software Overview Channel-Attached System Client Application Client Application CLI CLI Channel (ESCON or FICON) Host Channel Adapter PBSA TDP Parsing Engine Parsing Engine Client Application – Your own application(s) – Teradata utilities (BTEQ, etc.) CLI (Call-Level Interface) Service Routines – Request and Response Control – Parcel creation and blocking/unblocking – Buffer allocation and initialization TDP (Teradata Director Program) – Session balancing across multiple PEs – Insures proper message routing to/from the Teradata Database – Failure notification (application failure, Teradata restart) Teradata Database Architecture Page 3-9 ` Network-Attached Client Software Overview In a network-attached environment, the SMPs running Teradata will typically have 1 or more Ethernet adapters that are used to connect to Teradata via a LAN connection. One of the key reasons for having multiple Ethernet adapters in a node is redundancy. In network-attached systems, there are four major software components that play important roles in getting the requests to and from the Teradata Database. The client application is written by the programmer using a client-supported language such as “C”. The purpose of the application is usually to submit SQL statements to the Teradata Database and perform processing on the result sets. The application developer can “embed” SQL statements in the application and use the Teradata Preprocessor to interpret the embedded SQL statements. In a networked environment, the application developer can use either the CLI interface or the ODBC driver to access Teradata. The Teradata CLI application interface provides maximum control over Teradata connectivity and access. The ODBC and JDBC drivers are a much more open standard and are widely used with client applications. The Teradata ODBC™ (Open Database Connectivity) or JDBC (Java) drivers use open standards-based ODBC or JDBC interfaces to provide client applications access to Teradata across LAN-based environments. Note: ODBC 3.02.0 is the minimum certified version for Teradata V2R5. The Micro Teradata Director Program (MTDP) is a Teradata-supplied program that must be linked to any application that will be network-attached to the Teradata Database. The MTDP performs many of the functions of the channel based TDP including session management. The MTDP does not control session balancing across PEs. Connect and Assign Servers that run on the Teradata system handle this activity. The Micro Operating System Interface (MOSI) is a library of routines providing operating system independence for clients accessing the Teradata Database. By using MOSI, we only need one version of the MTDP to run on all network-attached platforms. Teradata Gateway software executes on every node. Gateway software runs as a number of tasks. Two of the key tasks are called "ycgastsk" (assign task) and "ycgcntsk" (connect task). On a 4-node system with one gateway, only one node has the assign task (ycgastsk) running on it and every node will have the connect task (ycgcntsk) running on it. Initial session assignment is done by the assign task and will assign a user session to a PE and to the connect task in the same node as the PE. The connect task on a node will handle connections to the PEs on that node. 
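To put these layers in context, a network-attached utility such as BTEQ simply logs on across the LAN; CLI, MTDP, and MOSI handle the messaging, and the gateway's assign task places the session on a PE. The following is a minimal sketch of such a session (the system name tdprod and user name student1 are placeholders, not part of the course environment):

.LOGON tdprod/student1
Password: ********

SELECT DATE, TIME;

.LOGOFF
.QUIT

The name before the slash identifies the Teradata system; everything after the logon is ordinary SQL submitted through the session that was just established.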
Page 3-10 Teradata Database Architecture Network-Attached Client Software Overview LAN-Attached Servers Client Application (ex., FastLoad) Client Application (ex., SQL Assistant) CLI ODBC (CLI) MTDP MTDP MOSI MOSI LAN (TCP/IP) Ethernet Adapter Gateway Software (tgtw) Parsing Engine Parsing Engine Client Application (ex., BTEQ) CLI MTDP MOSI CLI (Call Level Interface) – Library of routines for blocking/unblocking requests and responses to/from the Teradata Database ODBC™ (Open Database Connectivity), JDBC™ (Java), or .NET Drivers – Use open standards-based ODBC, JDBC, or .NET interfaces to provide client applications access to Teradata. MTDP (Micro Teradata Director Program) – Library of session management routines MOSI (Micro Operating System Interface) – Library of routines providing OS independent interface Teradata Database Architecture Page 3-11 The Parsing Engine Parsing Engines (PEs) are made up of the following software components: session control, the Parser, the Optimizer, and the Dispatcher. Once a valid session has been established, the PE is the component that manages the dialogue between the client application and the Teradata Database. The major functions performed by session control are logon and logoff. Logon takes a textual request for session authorization, verifies it, and returns a yes or no answer. Logoff terminates any ongoing activity and deletes the session’s context. When connected to an EBCDIC host the PE converts incoming data to the internal 8-bit ASCII used by the Teradata Database, thus allowing input values to be properly evaluated against the database data. When a PE receives an SQL request from a client application, the Parser interprets the statement, checks it for proper SQL syntax and evaluates it semantically. The PE also must consult the Data Dictionary/Directory to ensure that all objects and columns exist and that the user has authority to access these objects. The Optimizer’s role is to develop the least expensive plan to return the requested response set. Processing alternatives are evaluated and the fastest alternative is chosen. This alternative is converted to executable steps, to be performed by the AMPs, which are then passed to the dispatcher. The Dispatcher controls the sequence in which the steps are executed and passes the steps on to the Message Passing Layer. It is composed of execution control and response control tasks. Execution control receives the step definitions from the Parser, transmits the step definitions to the appropriate AMP or AMPs for processing, receives status reports from the AMPs as they process the steps, and passes the results on to response control once the AMPs have completed processing. Response control returns the results to the user. The Dispatcher sees that all AMPs have finished a step before the next step is dispatched. Depending on the nature of the SQL request, the step will be sent to one AMP, a few AMPs, or all AMPs. Note: Teradata Gateway software can support up to 1200 sessions per processing node. Therefore a maximum of 10 Parsing Engines can be defined for a node using the Gateway. 
Page 3-12 Teradata Database Architecture The Parsing Engine Answer Set Response SQL Request The Parsing Engine is responsible for: Parser • Managing individual sessions (up to 120) Parsing Engine • Parsing and Optimizing your SQL Optimizer requests • Dispatching the optimized plan to the Dispatcher AMPs • Input conversion (EBCDIC / ASCII) - if necessary Message Passing Layer • Sending the answer set response back to the requesting client AMP AMP Teradata Database Architecture AMP AMP Page 3-13 Message Passing Layer The Message Passing Layer (MPL) or Communications Layer handles the internal communication of the Teradata Database. All communication between PEs and AMPs is done via the Message Passing Layer. When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the MPL. The messages are routed to the appropriate AMP(s) where results sets and status information are generated. This response information is also routed back to the requesting PE via the MPL. The Message Passing Layer is a combination of the Teradata PDE software, the BYNET software, and the BYNET interconnect itself. PDE and BYNET software - used for multi-node MPP systems and single-node SMP systems. With a single-node SMP, the BYNET device driver is used in conjunction with the PDE even though a physical BYNET network is not present. Depending on the nature of the dispatch request, the communication may be a: Broadcast - message is routed to all AMPs and PEs on the system Multi-Cast - message is routed to a group of AMPs Point-to-Point - message is routed to one specific AMP or PE on the system The technology of the MPL is a key piece in the system part that makes possible the parallelism of the Teradata Database. Page 3-14 Teradata Database Architecture Message Passing Layer SQL Request Answer Set Response The Message Passing Layer or Communications Layer is responsible for: • Carrying messages between the AMPs and PEs Parsing Engine • Point-to-Point, Multi-Cast, and Broadcast communications • Merging answer sets back to the PE • Making Teradata parallelism possible Message Passing Layer (PDE and BYNET) AMP AMP AMP The Message Passing Layer or Communications Layer is a combination of: AMP • Parallel Database Extensions (PDE) Software • BYNET Software • BYNET Hardware for MPP systems Teradata Database Architecture Page 3-15 The Access Module Processor (AMP) The Access Module Processor (AMP) is responsible for managing a portion of the database. An AMP will control some portion of each table on the system. AMPs do all of the physical work associated with generating an answer set including, sorting, aggregating, formatting and converting. An AMP responds to Parser/Optimizer steps transmitted across the MPL by selecting data from or storing data to its disks. For some requests the AMPs may also redistribute a copy of the data to other AMPs. The Database Manager subsystem resides on each AMP. It receives the steps from the Dispatcher and processes the steps. To do that it has the ability to lock databases and tables, to create, modify, or delete definitions of tables, to insert, delete, or modify rows within the tables, and to retrieve information from definitions and tables. It collects accounting statistics, recording accesses by session so those users can be billed appropriately. Finally, the Database manager returns responses to the Dispatcher. Earlier in this course we discussed the logical organization of data into tables. 
The Database Manager provides a bridge between that logical organization and the physical organization of the data on disks. The Database Manager performs a space management function that controls the use and allocation of space. AMPs also perform output data conversion, checking the session and changing the internal, 8-bit ASCII used by Teradata to the format of the requester. This is the reverse of the process performed by the PE when it converts the incoming data into internal ASCII. Page 3-16 Teradata Database Architecture The Access Module Processor (AMP) SQL Request Answer Set Response Parsing Engine Message Passing Layer AMP AMP AMP AMP AMPs store and retrieve rows to and from disk. Teradata Database Architecture The AMPs are responsible for: • Accesses storage using Teradata's File System Software • Lock management • Sorting rows • Aggregating columns • Join processing • Output conversion and formatting • Creating answer set for client • Disk space management • Accounting • Special utility protocols • Recovery processing Teradata File System Software: • Translates DatabaseID/TableID/RowID into location on storage • Controls a portion of physical storage • Allocates storage space by “Cylinders” Page 3-17 Teradata Parallelism Parallelism is at the very heart of the Teradata Database. There is virtually no part of the system where parallelism has not been built in. Without the parallelism of the system, managing enormous amounts of data would either not be possible or, best case, would be prohibitively expensive and inefficient. Each PE can support up to 120 user sessions in parallel. This could be 120 distinct users, or a single user harnessing the power of all 120 sessions for a single application. Each session may handle multiple requests concurrently. While only one request at a time may be active on behalf of a session, the session itself can manage the activities of 16 requests and their associated answer sets. The Message Passing Layer was designed such that it can never be a bottleneck for the system. Because the MPL is implemented differently for different platforms, this means that it will always be well within the needed bandwidth for each particular platform’s maximum throughput. Each AMP can perform up to 80 tasks in parallel. This means that AMPs are not dedicated at any moment in time to the servicing of only one request, but rather are multi-threading multiple requests concurrently. The value 80 represents the number of AMP Worker Tasks and may be changed on some systems. Because AMPs are designed to operate on only one portion of the database, they must operate in parallel to accomplish their intended results. In addition to this, the optimizer may direct the AMPs to perform certain steps in parallel if there are no contingencies between the steps. This means that an AMP might be concurrently performing more than one step on behalf of the same request. A recently added feature called Parallel CLI allows for parallelizing the client application, particularly useful for multi-session applications. This is accomplished by setting a few environmental variables and requires no changes to the application code. In truth, parallelism is built into the Teradata Database from the ground up! 
Page 3-18 Teradata Database Architecture Teradata Parallelism PE PE PE Session A Session C Session E Session B Session D Session F Message Passing Layer AMP 0 Task 1 Task 2 Task 3 AMP 1 Task 4 Task 5 Task 6 AMP 2 Task 7 Task 8 Task 9 AMP 3 Parallelism is built into Teradata from the ground up! Task 10 Task 11 Task 12 Notes: • Each PE can handle up to 120 sessions in parallel. • Each Session can handle multiple REQUESTS. • The Message Passing Layer can handle all message activity in parallel. • Each AMP can perform up to 80 tasks in parallel. • All AMPs can work together in parallel to service any request. • Each AMP can work on several requests in parallel. Teradata Database Architecture Page 3-19 Module 3: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 3-20 Teradata Database Architecture Module 3: Review Questions 1. What are the two software elements that accompany an application on all client side environments? 2. What is the purpose of the PE? 3. What is the purpose of the AMP? 4. How many sessions can a PE support? Match Quiz ____ 1. CLI a. Does Aggregating and Locking ____ 2. MTDP b. Validates SQL syntax ____ 3. MOSI c. Connects AMPs and PEs ____ 4. Parser d. Balances sessions across PEs ____ 5. AMP e. Provides Client side OS independence ____ 6. Message Passing Layer f. Library of Session Management Routines ____ 7. TDP g. PE S/W turns SQL into AMP steps ____ 8. Optimizer h. PE S/W sends plan steps to AMP ____ 9. Dispatcher i. Library of Teradata Service Routines ____10. Parallelism j. Foundation of Teradata architecture Teradata Database Architecture Page 3-21 Notes Page 3-22 Teradata Database Architecture Module 4 Teradata Databases and Users After completing this module, you will be able to: • Distinguish between a Teradata Database and Teradata User. • Define Perm Space and explain how it is used. • Define Spool Space and its use. • Visualize the hierarchy of objects in a Teradata system. Teradata Proprietary and Confidential Creating a Teradata Database Page 4-1 Notes Page 4-2 Creating a Teradata Database Table of Contents A Teradata Database .................................................................................................................... 4-4 Tables ................................................................................................................................... 4-4 Views ................................................................................................................................... 4-4 Macros .................................................................................................................................. 4-4 Triggers ................................................................................................................................ 4-4 A Teradata User ........................................................................................................................... 4-6 Database – User Comparison ....................................................................................................... 4-8 The Hierarchy of Databases and Users ...................................................................................... 4-10 Example of a System Hierarchy................................................................................................. 
4-12 Permanent Space ........................................................................................................................ 4-14 Spool Space ................................................................................................................................ 4-16 Temporary Space ....................................................................................................................... 4-18 Creating Tables .......................................................................................................................... 4-20 Data Types ................................................................................................................................. 4-22 Access Rights and Privileges ..................................................................................................... 4-24 Module 4: Review Questions ..................................................................................................... 4-26 Creating a Teradata Database Page 4-3 A Teradata Database A Teradata database is a collection of tables, views, macros, triggers, stored procedures, join indexes, hash indexes, UDFs, access rights and space limits used for administration and security. All databases have a defined upper limit of permanent space. Permanent space is used for storing the data rows of tables. Perm space is not pre-allocated. It represents a maximum limit. All databases also have an upper limit of spool space. Spool space is temporary space used to hold intermediate query results or formatted answer sets to queries. Databases provide a logical grouping for information. They are also the foundation for space allocation and access control. We'll review the definitions of tables, views, and macros. Tables A table is the logical structure of data in a database. It is a two-dimensional structure made up of columns and rows. A user defines a table by giving it a table name that refers to the type of data that will be stored in the table. A column represents attributes of the table. Attributes identify, describe, or qualify the table. Column names are given to each column of the table. All the information in a column is the same type, for example, data of birth. Each occurrence of an entity is stored in the table as a row. Entities are the people, things, or events that the table is about. Thus a row would represent a particular person, thing, or event. Views A view is a pre-defined subset of one of more tables or other views. It does not exist as a real table, but serves as a reference to existing tables or views. One way to think of a view is as a virtual table. Views have definitions in the data dictionary, but do not contain any physical rows. Views can be used by the database administrator to control access to the underlying tables. Views can be used to hide columns from users, to insulate applications from database changes, and to simplify or standardize access techniques. Macros A macro is a definition containing one or more SQL commands and report formatting commands that is stored in the Data Dictionary/Directory. Macros are used to simplify the execution of frequently-used SQL commands. Triggers A trigger consists of one or more SQL statements that are associated with a table and are executed when the trigger is “fired”. 
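For illustration, a hedged sketch of a trigger definition follows (the Salary_Log table is hypothetical, and an AFTER UPDATE row trigger is just one of the forms a trigger can take):

CREATE TRIGGER Raise_Trigger
AFTER UPDATE OF (salary_amount) ON Employee
REFERENCING OLD AS OldRow NEW AS NewRow
FOR EACH ROW
  (INSERT INTO Salary_Log
   VALUES (NewRow.employee_number, OldRow.salary_amount,
           NewRow.salary_amount, CURRENT_DATE); );

Whenever a row of Employee has its salary_amount updated, the triggered INSERT records the old and new values in the log table.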
Page 4-4 Creating a Teradata Database A Teradata Database A Teradata database is a defined logical repository for: • • • • • Tables Views • • Join Indexes Hash Indexes Macros Triggers • • Permanent Journals User-defined Functions (UDF) Stored Procedures Attributes that may be specified for a database: • Perm Space – max amount of space available for tables, stored procedures, and UDFs • Spool Space – max amount of work space available for requests • Temp Space – max amount of temporary table space A Teradata database is created with the CREATE DATABASE command. Example CREATE DATABASE Database_2 FROM Sysdba AS PERMANENT = 20E9, SPOOL = 500E6; Notes: "Database_2" is owned by "Sysdba". A database is empty until objects are created within it. Creating a Teradata Database Page 4-5 A Teradata User A user can also be thought of as a collection of tables, views, macros, triggers, stored procedures, join indexes, hash indexes, UDFs, and access rights. A user is almost the same as a database except that a user can actually log on to the DBS. To accomplish this, a user must have a password. A user may or may not have perm space. Even with no perm space, a user can access other databases depending on the privileges the user has been granted. Users are created with the SQL statement CREATE USER. Page 4-6 Creating a Teradata Database A Teradata User A Teradata user is a database with an assigned password. A Teradata user may logon to Teradata and access objects within: • itself • other databases for which it has access rights Examples of attributes that may be specified for a user: • Perm Space – max amount of space available for tables, stored procedures, and UDFs • Spool Space – max amount of work space available for requests • Temp Space – max amount of temporary table space A user is an active repository while a database is a passive repository. A user is created with the CREATE USER command. Example CREATE USER User_C FROM User_A AS PERMANENT = 100E6 ,SPOOL = 500E6 ,TEMPORARY = 150E6 ,PASSWORD = lucky_day ; "User_C" is owned by "User_A". A user is empty until objects are created within it. Creating a Teradata Database Page 4-7 Database – User Comparison In Teradata, a Database and a User are essentially the same. Database/User names must be unique within the entire system and represent the highest level of qualification in an SQL statement. A User represents a logon point within the hierarchy and Access Rights apply only to Users. In many systems, end users do not have Perm space given to them. They are granted rights to access database(s) containing views and macros, which in turn are granted rights to access the corporate production tables. At any time, another authorized User can change the Spool (workspace) limit assigned to a User. Databases may be empty. They may or may not have any tables, views, macros, triggers, or stored procedures. They may or may not have Perm Space allocated. The same is true for Users. The only absolute requirement is that a User must have a password. Once Perm Space is assigned, then and only then can tables be put into the database. Views, macros, and triggers may be added at any time, with or without Perm Space. Remember that databases and users are both repositories for database objects. The main difference is the user ability to logon and acquire a session with the Teradata Database. A row exists in DBC.Dbase for each User and Database. 
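As noted above, space limits are not fixed at creation time; an authorized user can change them later with a MODIFY statement. A brief sketch using the objects from the earlier CREATE examples (the new limits are arbitrary values chosen for illustration):

MODIFY USER User_C AS SPOOL = 750E6 BYTES;
MODIFY DATABASE Database_2 AS PERM = 30E9 BYTES;

As with CREATE, the values are global limits that the system divides evenly across all AMPs.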
Page 4-8 Creating a Teradata Database Database – User Comparison User Database Unique Name Password = Value Define and use Perm space Define and use Spool space Define and use Temporary space Set Fallback protection default Set Permanent Journal defaults Multiple Account strings Logon and establish a session with a priority May have a startup string Default database, dateform, timezone, and default character set Collation Sequence Unique Name • • • • Define and use Perm space Define Spool space Define Temporary space Set Fallback protection default Set Permanent Journal defaults One Account string You can only LOGON as a known User to establish a session with Teradata. Tables, Join/Hash Indexes, Stored Procedures, and UDFs require Perm Space. Views, Macros, and Triggers are definitions in the DD/D and require NO Perm Space. A database (or user) with zero Perm Space may have views, macros, and triggers, but cannot have tables, join/hash indexes, stored procedures, or user-defined functions. Creating a Teradata Database Page 4-9 The Hierarchy of Databases and Users As you define users and databases, a hierarchical relationship among them will evolve. When you create new objects, you subtract permanent space from the assigned limit of an existing database or user. A database or user that subtracts space from its own permanent space to create a new object becomes the immediate owner of that new object. An “owner” or “parent” is any object above you in the hierarchy. (Note that you can use the terms owner and parent interchangeably.) A “child” is any object below you in the hierarchy. An owner or parent can have many children. The term “immediate parent” is sometimes used to describe a database or user just above you in the hierarchy. Page 4-10 Creating a Teradata Database Hierarchy of Databases and Users Maximum Perm Space – maximum available space for a user or database. User DBC Current Perm Space – space that is currently allocated – contains tables, stored procedures, UDFs. User SYSDBA No Box No Perm Space User_A Database_1 Database_2 User_D Database_3 User_B User_C • A new database or user must be created from an existing database or user. • All Perm space specifications are subtracted from the immediate owner or parent. • Perm space is a zero sum game – the total of all Perm Space for all databases and users equals the total amount of disk space available to Teradata. • Perm space is only used for tables, join/hash indexes, stored procedures, and UDFs. • Perm space currently unused is available to be used as Spool or Temp space. Creating a Teradata Database Page 4-11 Example of a System Hierarchy An example of a system structure for the Teradata database is shown on the facing page. Page 4-12 Creating a Teradata Database Example of a System Hierarchy DBC CrashDumps QCD SysAdmin SysDBA SystemFE Customer_Service A User and/or a Database may be given PERM space. In this example, Mark and Tom have no PERM space, but Susan does. Sys_Calendar CS_Users Mark Tom Susan CS_VM CS_Tables View_1 View_2 Table_1 Table_2 Table_3 Table_4 Macro_1 Macro_2 Users may use views and macros to access the actual tables. Creating a Teradata Database Page 4-13 Permanent Space Permanent Space (Perm space) is the maximum amount of storage assigned to a user or database for holding table rows, Fallback tables, secondary index subtables, stored procedures, UDFs, and permanent journals. Perm space is specified in the CREATE statement as illustrated below. 
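Because it is the per-AMP limit that is actually enforced, space problems are usually investigated at the AMP (vproc) level. The following is a hedged example using DBC.DiskSpaceV, a companion to the DBC.AllSpaceV view listed in Module 2 (column names are as documented for the DD/D space views; verify them against your release):

SELECT   Vproc, MaxPerm, CurrentPerm
FROM     DBC.DiskSpaceV
WHERE    DatabaseName = 'CS_Tables'
ORDER BY Vproc;

One row is returned per AMP; a noticeably higher CurrentPerm on one vproc is the kind of skewed distribution that leads to a Database Full condition before the overall limit is reached.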
Perm space is not pre-allocated which means that it is available on demand, as entities are created not reserved ahead of time. Perm space is deducted from the owner’s specified Perm space and is divided equally among the AMPs. Perm space can be dynamically modified. The total amount of Perm space assigned divided by the number of AMPs equals the perAMP limit. Whenever the per AMP limit is exceeded on any AMP, a Database Full message is generated. CREATE DATABASE CS_Tables FROM Customer_Service AS PERMANENT = 100000000000 BYTES … ; Page 4-14 Creating a Teradata Database Permanent Space CREATE DATABASE CS_Tables FROM Customer_Service AS PERMANENT = 100E9 BYTES, ... ; AMP Perm Space Limit per AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP 10 GB 10 GB 10 GB 10 GB 10 GB 10 GB 10 GB 10 GB 10 GB 10 GB • Table rows, index subtable rows, join indexes, hash indexes, stored procedures, and • • • • • • • • UDFs use Perm space. Fallback protection uses twice the Perm space of No Fallback. Perm space is deducted from the owner’s database space. Disk space is not reserved ahead of time, but is available on demand. Perm space is defined globally for a database. Perm space can be dynamically modified. The global limit divided by the number of AMPs is the per/AMP limit. The per/AMP limit cannot be exceeded. Good data distribution is crucial to space management. Creating a Teradata Database Page 4-15 Spool Space Spool Space is work space acquired automatically by the system and used for work space and answer sets for intermediate and final results of Teradata SQL statements (e.g., SELECT statements generally use Spool space to store the SELECTed data). When the spool space is no longer needed by a query, it is released back to the system. A Spool limit is specified in the CREATE statement shown below. This limit cannot exceed the Spool limit of the owner. However, a single user can create multiple databases or users, and each can have a Spool limit as large as the Spool limit of that owner. The total amount of Spool space assigned divided by the number of AMPs equals the per AMP limit. Whenever the per-AMP limit is exceeded on any AMP, an Insufficient Spool message is generated to that client. CREATE USER Susan FROM CS_Users AS PERMANENT = 100000000 BYTES, SPOOL = 500000000 BYTES, PASSWORD = secret ... ; Page 4-16 Creating a Teradata Database Spool Space CREATE USER Susan FROM CS_Users AS PERMANENT = 100E6 BYTES, SPOOL = 500E6 BYTES, PASSWORD = secret … ; AMP Spool Space Limit per AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP 50 MB 50 MB 50 MB 50 MB 50 MB 50 MB 50 MB 50 MB 50 MB 50 MB • Spool space is work space acquired automatically by the system for intermediate query results or answer sets. – SELECT statements generally use Spool space. – Only INSERT, UPDATE, and DELETE statements affect table contents. • The Spool limit cannot exceed the Spool limit of the original owner. • The Spool limit is divided by the number of AMPS in the system, giving a perAMP limit that cannot be exceeded. – "Insufficient Spool" errors often result from poorly distributed data or joins on columns with large numbers of non-unique values. – Keeping Spool rows small and few in number reduces Spool I/O. Creating a Teradata Database Page 4-17 Temporary Space Temporary (Temp) Space is temporary space acquired automatically by the system when Global Temporary tables are materialized and used. A Temporary limit is specified in the CREATE statement shown below. This limit cannot exceed the Temporary limit of the owner. 
However, a single user can create multiple databases or users, and each can have a Temporary limit as large as the Temporary limit of that owner. The total amount of Temporary space assigned divided by the number of AMPs equals the per AMP limit. Whenever the per-AMP limit is exceeded on any AMP, an Insufficient Temporary message is generated to that client. CREATE USER Susan FROM CS_Users AS PERMANENT = 100000000 BYTES, SPOOL = 500000000 BYTES, TEMPORARY = 150000000 BYTES, PASSWORD = secret ... Page 4-18 Creating a Teradata Database Temporary Space CREATE USER Susan FROM CS_Users AS PERMANENT = 100E6 BYTES, SPOOL = 500E6 BYTES, TEMPORARY = 150E6 BYTES, PASSWORD = secret … ; AMP Temporary Space Limit per AMP AMP AMP AMP AMP AMP AMP AMP AMP AMP 15 MB 15 MB 15 MB 15 MB 15 MB 15 MB 15 MB 15 MB 15 MB 15 MB • Temporary space is space acquired automatically by the system when a "Global Temporary" table is used and materialized. • The Temporary limit cannot exceed the Temporary limit of the original owner. • The Temporary limit is divided by the number of AMPS in the system, giving a per-AMP limit that cannot be exceeded. – "Insufficient Temporary" errors often result from poorly distributed data or joins on columns with large numbers of non-unique values. • Note: Volatile Temporary tables and derived tables utilize Spool space. Creating a Teradata Database Page 4-19 Creating Tables Creation of tables is done via the DDL portion of the SQL command vocabulary. The table definition, once accepted, is stored in the DD/D. Prior to Teradata 13.0, creating tables required the definition of at least one column and the assignment of a Primary Index. With Teradata 13.0, it is possible to create tables without a primary index. Columns are assigned data types, attributes and optionally may be assigned constraints, such as a range constraint. Tables, like views and macros, may be dropped when they are no longer needed. Dropping a table both deletes the data from the table and removes the definition of the table from the DD/D. Secondary indexes may also optionally be assigned at table creation, or may be deferred until after the table has been built. Secondary indexes may also be dropped, if they are no longer needed. It is not uncommon to create secondary indexes to assist in the processing of a specific job sequence, then to delete the index, and its associated overhead, once the job is complete. We will have more to say on indexes in general in future modules. Page 4-20 Creating a Teradata Database Creating Tables Creating a table requires ... – defining columns – a primary index (Teradata 13.0 provides an option of a No Primary Index table) – optional assignment of secondary indexes CREATE TABLE Employee (Employee_Number ,Last_Name ,First_Name ,Salary_Amount ,Department_Number ,Job_Code Primary Secondary INTEGER NOT NULL CHAR(20) NOT NULL VARCHAR(20) DECIMAL(10,2) SMALLINT CHAR(3)) UNIQUE PRIMARY INDEX (Employee_Number) INDEX (Last_Name) ; Database objects may be created or dropped as needed. CREATE DROP Tables Views Macros Triggers Procedures Secondary indexes may be – created at table creation – created after table creation – dropped after table creation Creating a Teradata Database CREATE DROP INDEX (secondary only) Page 4-21 Data Types When a table is created, a data type is specified for each column. Data types are divided into three classes – numeric, byte, and character. The facing page shows data types. DATE is a 32-bit integer that represents the date as YYYYMMDD. 
It supports century and year 2000 and is implemented with calendar-based intelligence. TIME WITH ZONE and TIMESTAMP WITH ZONE are ANSI standard data types that allow support of clock and time zone based intelligence. DECIMAL (n, m) is a number of n digits, with m digits to the right of the decimal point. BYTEINT is an 8-bit signed binary whole number that may vary in range from -128 to +127. SMALLINT is a 16-bit signed binary whole number that may vary in range from -32,768 to +32,767. INTEGER is a 32-bit signed binary whole number that may vary in size from -2,147,483,648 to +2,147,483,647. BIGINT is a 64-bit (8 bytes) signed binary whole number that may vary in size from -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807 or as (-263 to 263 - 1). FLOAT, REAL, and DOUBLE PRECISION is a 64-bit IEEE floating point number. BYTE (n) is a fixed-length binary string of n bytes. BYTE and VARBYTE are never converted to a different internal format. They can also be used for digitized objects. VARBYTE (n) is a variable-length binary string of n bytes. BINARY LARGE OBJECT (n) is similar to a VARBYTE; however it may be as large as 2 GB. A BLOB may be used to store graphics, video clips and binary files. CHAR (n) is a fixed-length character string of n characters. VARCHAR (n) is a variable-length character string of n characters. LONG VARCHAR is the longest variable-length character string. It is equivalent to VARCHAR (64000). GRAPHIC, VARGRAPHIC and LONG VARGRAPHIC are the equivalent character types for multi-byte character sets such as Kanji. CHARACTER LARGE OBJECT (n) is similar to a VARCHAR; however it may be as large as 2 GB. A CLOB may be used to store simple text, HTML, or XML documents. Page 4-22 Creating a Teradata Database Data Types TYPE Name Bytes Description Date/Time DATE TIME (WITH ZONE) TIMESTAMP (WITH ZONE) 4 6/8 10 / 12 YYYYMMDD HHMMSSZZ YYYYMMDDHHMMSSZZ Numeric DECIMAL or NUMERIC (n, m) 2, 4, 8 or 16 BYTEINT 1 SMALLINT 2 INTEGER 4 BIGINT 8 FLOAT, REAL, DOUBLE PRECISION 8 Byte BYTE(n) VARBYTE (n) BLOB 0 – 64,000 0 – 64,000 0 – 2 GB Binary Large Object (V2R5.1) Character CHAR (n) VARCHAR (n) LONG VARCHAR GRAPHIC VARGRAPHIC LONG VARGRAPHIC CLOB 0 – 64,000 0 – 64,000 Creating a Teradata Database + OR – (up to 18 digits V2R6.1 and prior) (up to 38 digits is V2R6.2 feature) -128 to +127 -32,768 to +32,767 -2,147,483,648 to +2,147,483,647 -263 to +263 - 1 (+9,223,372,036,854,775,807) IEEE floating pt same as VARCHAR(64,000) 0 – 32,000 0 – 32,000 0 – 2 GB same as VARGRAPHIC(32,000) Character Large Object (V2R5.1) Page 4-23 Access Rights and Privileges The diagram on the facing page shows access rights and privileges as they might be defined for the database administrator, a programmer, a user, a system operator, and an administrative user. The database administrator has right to use all of the commands in the data definition privileges, the data manipulation privileges, and the data control privileges. The programmer has all of those except the ability to GRANT privileges to others. A typical user is limited to data manipulation privileges, while the operator is limited to data control privileges. Finally, the administrative user is limited to a subset of data manipulation privileges, SELECT and EXECUTE. Each site should carefully consider the access rules that best meet their needs. 
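As a concrete illustration of the scenario above, the statements below follow the common practice of giving end users access only to a views-and-macros database, which in turn is granted access to the production tables. This is a hedged sketch using the Customer_Service hierarchy names (CS_Tables, CS_VM, Susan) from the earlier system hierarchy example; actual privileges should follow each site's own access rules.

GRANT SELECT ON CS_Tables TO CS_VM WITH GRANT OPTION;  -- views database may read the production tables
GRANT SELECT, EXECUTE ON CS_VM TO Susan;               -- end user works only through views and macros
REVOKE EXECUTE ON CS_VM FROM Susan;                    -- privileges may be removed when no longer needed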
Page 4-24 Creating a Teradata Database Access Rights and Privileges Data Definition Privileges Command CREATE DROP A Sample Scenario Object Database and/or User Table and/or View Macro and/or Trigger Stored Procedure Role and/or Profile Data Manipulation Privileges SELECT INSERT UPDATE DELETE Table View EXECUTE Macro and/or Stored Procedure Data Control Privileges DUMP RESTORE CHECKPOINT Database Table Journal GRANT REVOKE Privileges on Databases Users Objects Creating a Teradata Database D B A P R O G R A M M E R S U S E R ADMIN O P E R Page 4-25 Module 4: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 4-26 Creating a Teradata Database Module 4: Review Questions True or False ______ 1. ______ 2. ______ 3. ______ 4. ______ 5. ______ 6. ______ 7. A database will always have tables. A user will always have a password. A user creating a subordinate user must give up some of his/her Perm Space. Creating tables requires the definition of at least 1 column and a Primary Index. The sum of all user and database Perm Space will equal the total space on the system. The sum of all user and database Spool Space will equal the total space on the system. Before a user can read a table, a database or table SELECT privilege must exist in the DD/D for that user. ______ 8. Deleting a macro from a database reclaims Perm Space for the database. 9. Which statement is TRUE about PERM space? ____ a. b. c. d. PERM space cannot be dynamically modified. The per/AMP limit of PERM space can be exceeded. Tables, index subtables, and stored procedures use PERM space. Maximum PERM space can be defined at the database or table level. 10. Which statement is TRUE about SPOOL space? ____ a. b. c. d. SPOOL space cannot be dynamically modified. Maximum SPOOL space can be defined at the database or user level. The SPOOL limit is dependent on the database limit where the table is located. Maximum SPOOL space can be defined at a value greater than the immediate parent's value. Creating a Teradata Database Page 4-27 Notes Page 4-28 Creating a Teradata Database Module 5 PI Access and Mechanics After completing this module, you will be able to: Explain the purpose of the Primary Index • Distinguish between Primary Index and Primary Key • Explain the role of the hashing algorithm and the hash map in locating a row. • Explain the makeup of the Row ID and its role in row storage. • Describe the sequence of events for locating a row given its PI value. Teradata Proprietary and Confidential Storing and Accessing Data Rows Page 5-1 Notes Page 5-2 Storing and Accessing Data Rows Table of Contents Primary Keys and Primary Indexes ............................................................................................. 5-4 Distribution of Rows .................................................................................................................... 5-6 Specifying a Primary Index.......................................................................................................... 5-8 Primary Index Values................................................................................................................. 5-10 Accessing Via a Unique Primary Index ..................................................................................... 5-12 Accessing Via a Non-Unique Primary Index ............................................................................. 
5-14 Row Distribution Using a Unique Primary Index (UPI) – Case 1 ............................................. 5-16 Row Distribution Using a Non-Unique Primary Index (NUPI) – Case 2 .................................. 5-18 Row Distribution Using a Highly Non-Unique Primary Index (NUPI) – Case 3...................... 5-20 Which AMP has the Row? ......................................................................................................... 5-22 Hashing Down to the AMPs ...................................................................................................... 5-24 A Hashing Example ................................................................................................................... 5-26 The Hash Map ............................................................................................................................ 5-28 Hash Maps for Different Systems .............................................................................................. 5-30 Identifying Rows ........................................................................................................................ 5-32 The Row ID ................................................................................................................................ 5-34 Storing Rows (1 of 2) ................................................................................................................. 5-36 Storing Rows (2 of 2) ............................................................................................................. 5-38 Locating a Row on an AMP Using a PI ..................................................................................... 5-40 Module 5: Review Questions ..................................................................................................... 5-42 Storing and Accessing Data Rows Page 5-3 Primary Keys and Primary Indexes While it is true that many tables use the same columns for both Primary Indexes and Primary Keys, Indexes are conceptually different from Keys. The table on the facing page summarizes those differences. A Primary Key is relational data modeling term that defines, in the logical model, the columns that uniquely identify a row. A Primary Index is a physical database implementation term that defines the actual columns used to distribute and access rows in a table. It is also true that a significant percentage of the tables in any database will use the same column(s) for both the PI and the PK. However, one should expect that in any real-world scenario there would be some tables that will not conform to this simplistic rule. Only through a careful analysis of the type of processing that will take place can the tables be properly evaluated for PI candidates. Remember, changing your mind about the columns that comprise the PI means recreating (and reloading) the table. Page 5-4 Storing and Accessing Data Rows Primary Keys and Primary Indexes • • • • • Indexes are conceptually different from keys. A PK is a relational modeling convention which allows each row to be uniquely identified. A PI is a Teradata convention which determines how the row will be stored and accessed. A significant percentage of tables may use the same columns for both the PK and the PI. A well-designed database will use a PI that is different from the PK for some tables. 
Primary Key Primary Index Logical concept of data modeling Physical mechanism for access and storage Teradata doesn’t need to recognize Each table can have (at most) one primary index No limit on number of columns 64 column limit Documented in data model Defined in CREATE TABLE statement (Optional in CREATE TABLE) Must be unique May be unique or non-unique Identifies each row Identifies 1 (UPI) or multiple rows (NUPI) Values should not change Values may be changed (Delete + Insert) May not be NULL – requires a value May be NULL Does not imply an access path Defines most efficient access path Chosen for logical correctness Chosen for physical performance Storing and Accessing Data Rows Page 5-5 Distribution of Rows Ideally, the rows of every table will be distributed among all of the AMPs. There may be some circumstances where this is not true. What if there are fewer rows than AMPs? Clearly in this case, at least some AMPs will hold no rows from that table. This should be considered the exceptional situation, and not the rule. Each AMP is designed to hold a portion of the rows of each table. The AMP is responsible for the storage, maintenance and retrieval of the data under its control. More ideally, the rows of each table will be evenly distributed across all of the AMPs. This is desirable because in operations involving all rows of the table (such as a full table scan); each AMP will have an equal portion of the work to do. When workloads are not evenly distributed, the desired response will only be as fast as the slowest AMP. Controlling the distribution of the rows of a table is done by the selection of the Primary Index. The relative uniqueness of the Primary Index will determine the uniformity of distribution of the rows of this table among the AMPs. Page 5-6 Storing and Accessing Data Rows Distribution of Rows AMP AMP AMP AMP Table A rows Table B rows • The rows of every table are distributed among all AMPs • Each AMP is responsible for a subset of the rows of each table. – Ideally, each table will be evenly distributed among all AMPs. – Evenly distributed tables result in evenly distributed workloads. • For tables with a Primary Index (majority of the tables), the uniformity of distribution of the rows of a table depends on the choice of the Primary Index. The actual distribution is determined by the hash value of the Primary Index. • For tables without a Primary Index (Teradata 13.0 feature), the rows of a table are still distributed between the AMPs based on random generator code within the PE or AMP. – A small number of tables will typically be created as NoPI tables. Common uses for NoPI tables are as staging/intermediate tables used in load operations or as column partitioned tables. Storing and Accessing Data Rows Page 5-7 Specifying a Primary Index Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes. The choice will affect the distribution of the rows of the table and, consequently, the performance of the table in a production environment. Although many tables used combined columns as the Primary Index choice, the examples used here are single column indexes, mostly for the sake of simplicity. Unique Primary Indexes (UPI’s) are desirable because they guarantee the uniform distribution of the rows of that table. Because it is not always feasible to pick a Unique Primary Index, it is sometimes necessary to pick a column (or columns) which have non-unique values; that is there are duplicate values. 
This type of index is called a Non-Unique Primary Index or NUPI. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values which are repeated frequently typically do not make good PI candidates. The choosing of a Primary Index is not an exact science. It requires analysis and thoughtfulness for some tables and will be completely self-evident on other tables. The Primary Index is always designated as part of the CREATE TABLE statement. Once a Primary Index choice has been designated for a table, it cannot be changed to something else. If an alternate choice of column(s) is desired for the PI, it is necessary to drop and recreate the table. Teradata, adhering to the ANSI standard, permits duplicate rows by specifying that you wish to create a MULTISET table. In Teradata transaction mode, the default, however, is a SET table that does not permit duplicate rows. Also, if MULTISET is enabled, it will be overridden by choosing a UPI as the Primary Index or by having a unique index (e.g., unique secondary) on another column(s) on the table. Doing this effectively disables the MULTISET. Multiset tables will be covered in more detail later in the course. Starting with Teradata 13.0, the option of NO PRIMARY INDEX is also available. Page 5-8 Storing and Accessing Data Rows Specifying a Primary Index • A Primary Index is defined at table creation. • It may consist of a single column, or a combination of columns (up to 64 columns) – With Teradata 13.0, an option of NO PRIMARY INDEX is available. UPI NUPI NoPI CREATE TABLE sample_1 (col_a INTEGER ,col_b CHAR(10) ,col_c DATE) UNIQUE PRIMARY INDEX (col_b); If the index choice of column(s) is unique, then this is referred to as a UPI (Unique Primary Index). CREATE TABLE sample_2 (col_m INTEGER ,col_n CHAR(10) ,col_o DATE) PRIMARY INDEX (col_m); If the index choice of column(s) isn’t unique, then this is referred to as a NUPI (Non-Unique Primary Index). CREATE TABLE sample_3 (col_x INTEGER ,col_y CHAR(10) ,col_z DATE) NO PRIMARY INDEX; A NoPI choice will result in distribution of the data between AMPs based on random generator code. A UPI choice will result in even distribution of the rows of the table across all AMPs. The distribution of the rows of the table is proportional to the degree of uniqueness of the index. Note: Changing the choice of Primary Index requires dropping and recreating the table. Storing and Accessing Data Rows Page 5-9 Primary Index Values Indexes are used to access rows from a table without having to search the entire table. On Teradata, the Primary Index is the mechanism for assigning a data row to an AMP and a location on the AMP’s disks. Prior to Teradata 13.0, when a table is created, a table must have a Primary Index specified (either user-assigned or Teradata assigned). This cannot be changed without dropping and creating the table. Primary Indexes are very important because they have a powerful effect on the performance of the database. The most important thing to remember is that a Primary Index is the mechanism used to assign each row to an AMP and may be used to retrieve that row from the AMP. Thus retrievals, updates and deletes that specify the Primary Index value will be much faster than those queries that do not specify the PI value. 
Primary Index selection is probably the most important factor in the efficiency of join processing. Earlier we learned that the Primary Key was always unique and unchanging. This is based on the logical model of the data. The Primary Index may (and frequently is) be different than the Primary Key and may be non-unique; it is chosen for the physical performance of the database. There are three types of primary index selection – unique (UPI), non-unique (NUPI), or NO PRIMARY INDEX. Page 5-10 Storing and Accessing Data Rows Primary Index Values • The value of the Primary Index for a specific row determines the AMP assignment for that row. • This is done using a hashing algorithm. PE Row assignment Row access AMP PI Value Other table access techniques: • Secondary index access • Full table scans Hashing Algorithm AMP AMP • Accessing the row by its Primary Index value is: – always a one-AMP operation – the most efficient way to access a row Storing and Accessing Data Rows Page 5-11 Accessing Via a Unique Primary Index A Primary Index operation is always a one-AMP operation. In the case of a UPI, the oneAMP access can return, at most, one row. In the facing example, we are looking for the row whose primary index value is 345. By specifying the PI value as part of our selection criteria, we are guaranteed that only the AMP containing the specified row will need to be searched. The correct AMP is located by taking the PI value and passing it through a hashing algorithm. The hashing takes place in the Parsing Engine. The output of the hashing algorithm contains information that will point the request to a specific AMP. Once it has isolated the appropriate AMP, finding the row is quick and efficient. How this happens we will see in a future module. Page 5-12 Storing and Accessing Data Rows Accessing Via a Unique Primary Index A UPI access is a one-AMP operation which may access at most a single row. CREATE TABLE sample_1 (col_a INTEGER ,col_b INTEGER ,col_c CHAR(4)) UNIQUE PRIMARY INDEX (col_b); PE SELECT col_a ,col_b ,col_c FROM sample_1 WHERE col_b = 345; Hashing Algorithm AMP col_a col_b Storing and Accessing Data Rows UPI = 345 AMP col_c col_a col_b AMP col_c col_a col_b 123 345 567 234 456 678 col_c Page 5-13 Accessing Via a Non-Unique Primary Index A Non-Unique Primary Index operation is also a one-AMP operation. In the case of a NUPI, the one-AMP access can return zero to many rows. In the facing example, we are looking for the rows whose primary index value is 25. By specifying the PI value as part of our selection criteria, we are once again guaranteeing that only the AMP containing the required rows will need to be searched. As before, the correct AMP is located by taking the PI value and passing it through a hashing algorithm executing in the Parsing Engine. The output of the hashing algorithm will once again point to a specific AMP. Once it has isolated the appropriate AMP, it must now find all rows that have the specified value. In this example, the AMP returns two rows. Page 5-14 Storing and Accessing Data Rows Accessing Via a Non-Unique Primary Index A NUPI access is a one-AMP operation which may access multiple rows. CREATE TABLE sample_2 (col_x INTEGER ,col_y INTEGER ,col_z CHAR(4)) PRIMARY INDEX (col_x); PE NUPI = 25 SELECT col_x ,col_y ,col_z FROM sample_2 WHERE col_x = 25; Hashing Algorithm AMP Both UPI and NUPI accesses are one AMP operations. 
Storing and Accessing Data Rows col_x col_y AMP col_z col_x col_y AMP col_z col_x col_y col_z 10 10 30 30 A B 20 25 50 55 A A 5 30 70 80 B B 35 40 B 25 60 B 30 80 A Page 5-15 Row Distribution Using a Unique Primary Index (UPI) – Case 1 At the heart of the Teradata database is a way of predictably distributing and retrieving rows across AMPs. The same value stored in the same data type will always produce the same hash value. If the Primary Index is unique, Teradata can distribute the rows evenly. If the Primary Index is slightly non-unique, that is, there are only four or five rows per index value; the table will still distribute evenly. But if there are hundreds or thousands of rows for some index values the distribution will probably be lumpy. In this example, the Order_Number is used as a unique primary index. Since the primary index value for Order_Number is unique, the distribution of rows among AMPs is very uniform. This assures maximum efficiency because each AMP is doing approximately the same amount of work. No AMPs sit idle waiting for another AMP to finish a task. This way of storing the data provides for maximum efficiency and makes the best use of the parallel features of the Teradata system. Page 5-16 Storing and Accessing Data Rows Row Distribution Using a UPI – Case 1 Orders Notes: O rd e r N um ber C u s to m e r Num ber O rd e r D a te O rd e r S ta tu s • Often, but not always, the PK column(s) will be used as a UPI. – Order_Number can be a UPI since all the PK UPI 7325 7324 7415 7103 7225 7384 7402 7188 7202 2 3 1 1 2 1 3 1 2 4 /1 3 4 /1 3 4 /1 3 4 /1 0 4 /1 5 4 /1 2 4 /1 6 4 /1 3 4 /0 9 AMP o_# c_# 7202 7415 2 1 values are unique. O O C O C C C C C • Teradata will distribute different UPI values evenly across all AMPs. – Resulting row distribution among AMPs is very uniform. – Assures maximum efficiency for parallel operations. AMP AMP o_dt o_st o_# c_# o_dt o_st o_# c_# 4/09 4/13 7325 7103 2 1 4/13 4/10 O O 7188 7225 1 2 7402 3 4/16 C C C Storing and Accessing Data Rows AMP o_dt o_st o_# c_# 4/13 4/15 7324 7384 3 1 C C o_dt o_st 4/13 4/12 O C Page 5-17 Row Distribution Using a Non-Unique Primary Index (NUPI) – Case 2 In the example on the facing page Customer_Number has been used as a non-unique Primary Index (NUPI). Note row distribution among AMPs is uneven. All rows with the same primary index value (in other words, with the same customer number) are stored on the same AMP. Customer_Number has three possible values, so all the rows are hashed to three AMPs, leaving the fourth AMP without rows from this table. While this distribution will work, it is not as efficient as spreading all the rows among all the AMPs. AMP 2 has a disproportionate number of rows and AMP 3 has none. In an all-AMP operation AMP 2 will take longer than the other AMPs. The operation cannot complete until AMP 2 completes its tasks. The overall operation time is increased and some of the AMPs are under-utilized. NUPI’s can create irregular distributions, called “skewed distributions”. AMPs that have more than an average number or rows will take longer for full table operations than the other AMPs will. Because an operation is not complete until all AMPs have finished, this will cause the operation to finish less quickly due to being underutilized. 
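Before committing to a NUPI, it is worth checking how evenly the candidate column will spread the rows. The query below is a sketch of one common technique: it uses the HASHROW, HASHBUCKET, and HASHAMP functions (described later in this module) to count how many rows of the Orders table would land on each AMP if Customer_Number were chosen as the Primary Index. The table and column names are those used in this example; substitute your own.

SELECT HASHAMP(HASHBUCKET(HASHROW(Customer_Number))) AS AMP_Number
      ,COUNT(*)                                      AS Row_Count
FROM   Orders
GROUP BY 1
ORDER BY 2 DESC;

A large difference between the highest and lowest row counts indicates a skewed distribution.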
Page 5-18 Storing and Accessing Data Rows Row Distribution Using a NUPI – Case 2 Orders Notes: O rd e r N um ber C u s to m e r Num ber O rd e r D a te O rd e r S ta tu s • Customer_Number may be the preferred access column for this table, thus a good index candidate. – Since a customer can have multiple PK NUPI 7325 7324 7415 7103 7225 7384 7402 7188 7202 2 3 1 1 2 1 3 1 2 4 /1 3 4 /1 3 4 /1 3 4 /1 0 4 /1 5 4 /1 2 4 /1 6 4 /1 3 4 /0 9 O O C O C C C C C AMP o_# c_# 7325 7202 7225 2 2 2 orders, Customer_Number will be a NUPI. • Rows with the same PI value distribute to the same AMP. – Row distribution is less uniform or skewed. AMP AMP AMP o_dt o_st o_# c_# o_dt o_st o_# c_# 4/13 4/09 4/15 7384 7103 7415 1 1 1 4/12 4/10 4/13 C O C 7402 7324 3 3 7188 1 4/13 C O C C Storing and Accessing Data Rows o_dt o_st 4/16 4/13 C O Page 5-19 Row Distribution Using a Highly Non-Unique Primary Index (NUPI) – Case 3 This example uses Order_Status as a NUPI. Order_Status is a poor choice, because it yields the most uneven distribution. Because there are only two possible values for Order_Status, all of the rows are placed on two AMPs. STATUS is an example of a highly non-unique Primary Index. When choosing a Primary Index, you should never choose a column with such a severely limited value set. The degree of uniqueness is critical to efficiency. Choose NUPI’s that allow all AMPs to participate fairly equally. The degree of uniqueness of a NUPI is critical to efficiency. Page 5-20 Storing and Accessing Data Rows Row Distribution Using a Highly Non-Unique Primary Index (NUPI) – Case 3 Orders Notes: O rd e r N um ber C u s to m e r Num ber O rd e r D a te O rd e r S ta tu s PK NUPI 7325 7324 7415 7103 7225 7384 7402 7188 7202 2 3 1 1 2 1 3 1 2 4 /1 3 4 /1 3 4 /1 3 4 /1 0 4 /1 5 4 /1 2 4 /1 6 4 /1 3 4 /0 9 AMP O O C O C C C C C • Values for Order_Status are “highly” nonunique. – Order_Status would be a NUPI. – If only two values exist, then only two AMPs will be used for this table. • Highly non-unique columns are generally poor PI choices. – The degree of uniqueness is critical to efficiency. AMP AMP o_# c_# o_dt o_st o_# c_# 7402 7202 7225 3 2 2 4/16 4/09 4/15 C C C 7103 7324 7325 1 3 2 7415 7188 7384 1 1 1 4/13 4/13 4/12 C C C Storing and Accessing Data Rows AMP o_dt o_st 4/10 4/13 4/13 O O O Page 5-21 Which AMP has the Row? This discussion (rest of this module) will assume that a table has a primary index assigned and is not using the NO PRIMARY INDEX option. A hashing algorithm is a standard data processing technique that takes in a data value, like last name or order number, and systematically mixes it up so that the incoming values are converted to a number in a range from zero to the specified maximum value. A successful hashing scheme scatters the input evenly over the range of possible output values. It is predictable in that Smith will always hash to the same value and Jones will always hash to another (and they do) different value. With a good hashing algorithm any patterns in the input data should disappear in the output data. If many names begin with “S”, they should and will not all hash to the same group of hash values. If order numbers all have “00” in the hundreds and tens place or if all names are four letters long we should still see the hash values spread fairly evenly over the whole range. Textbooks still say that this requires manually designing and tuning a hash algorithm for each new type of data values. 
However, the Teradata algorithm works predictably well over any data, typically loading each AMP with variations in the range of .1% to .5% between AMPs. For extremely large systems, the variation can be as low as .001% between AMPs. Teradata also uses hashing quite differently than other data storage systems. Other hashed data storage systems equate a bucket with a physical location on disk. In Teradata, a bucket is simply an entry in a hash map. Each hash map entry points to a single AMP. Therefore, changing the number of AMPs does not require any adjustment to the hashing algorithm. Teradata simply adjusts the hash maps and redistributes any affected rows. The hash maps must always be available to the Message Passing Layer. For systems using a 16-bit hash bucket number, the hash map has 65,536 entries. For systems using a 20-bit hash bucket number, the hash map has 1,048,576 entries (approximately 1 million entries). 20-bit hash bucket numbers are available starting with Teradata 12.0. When the hash bucket has determined the destination AMP, the full 32-bit row hash plus the Table-ID is used to assign the row to a cylinder and a data block on the AMPs disk storage. The 32-bit row hash can produce over 4 billion row hash values. Page 5-22 Storing and Accessing Data Rows Which AMP has the Row? PARSER SQL with primary index values and data. PI value = 197190 Hashing Algorithm For example: Assume PI value is 197190 Table ID Hashing Algorithm Row Hash HBN PI values and data 000A1F4A HBN Message Passing Layer (Hash Maps) AMP 0 AMP 1 ... ... AMP x Hash Maps AMP n - 1 AMP n AMP # Data Table Summary Row ID Row Hash Row Data Uniq Value x '00000000' RH Data x'000A1F4A' 0000 0001 x 'FFFFFFFF' Storing and Accessing Data Rows 38 The MPL accesses the Hash Map using Hash Bucket Number (HBN) of 000A1. Bucket # 000A1 contains the AMP number that has this hash value – effectively the AMP with this row. HBN – Hash Bucket Number Page 5-23 Hashing Down to the AMPs The rows of all tables are distributed across the AMPs according to their Primary Index value. The Primary Index value goes into the hashing algorithm and the output is a 32-bit Row Hash. The high order bits (16 or 20) are referred to as the “bucket number” and are used to identify a hash map entry. This entry, in turn, is used to identify the AMP that will be targeted. The remaining 12 or 16 bits are not used to locate the AMP. The entire 32-bit Row Hash is used by the selected AMP to locate the row within its disk space. Hash maps are uniquely configured for each size of system, thus a 96 AMP system will have a hash map different from a 64 AMP system, but another 64 AMP system will have the same map (if the have the same number of bits in their HBN). Each hash map is simply an array that associates Hash Bucket Number (HBN) values or bucket numbers with specific AMPs. The Hash Bucket Number (prior to Teradata 12.0) has also been referred to as the DSW or Destination Selection Word. When a system grows, new AMPs are typically added. This requires a change to the hash map to reflect the new total number of possible target AMPs. Page 5-24 Storing and Accessing Data Rows Hashing Down to the AMPs Index value(s) Hashing Algorithm Row Hash Hash Bucket Number Hash Map AMP # { { { { Storing and Accessing Data Rows The hashing algorithm is designed to insure even distribution of unique values across all AMPs. Different hashing algorithms are used for different international character sets. 
A Row Hash is the 32-bit result of applying a hashing algorithm to an index value. The Hash Bucket Number is represented by the high order bits (usually 20 on newer systems) of the Row Hash. A Hash Map is uniquely configured for each system. It is a array of entries (buckets) which associates bucket numbers with specific AMPs. Two systems with the same number of AMPs will have the same Hash Map (if both have the same number of bits in their HBN). Changing the number of AMPs in a system requires a change to the Hash Map. Page 5-25 A Hashing Example The facing page shows an example of how the hashing algorithm would produce a 32-bit row hash value on the primary index value of 197190. The hash value is divided into two parts. The first 20 bits in this example are the Hash Bucket Number. These bits are also simply referred to as the Hash Bucket. The hash bucket points to a particular hash map entry, which in turn points to one AMP. The entire Row Hash along with the Table ID references a particular logical location on that AMP. Page 5-26 Storing and Accessing Data Rows A Hashing Example Orders Order Number Customer Number PK UPI 197185 197186 197187 197188 197189 197190 197191 197192 197193 197194 2005 3018 1035 1035 1001 2087 1012 3600 5650 1009 Order Date SELECT * FROM Orders WHERE order_number = 197190; Order Status 197190 2012-04-10 2012-04-10 2012-04-11 2012-04-11 2012-04-11 2012-04-11 2012-04-12 2012-04-12 2012-04-13 2012-04-13 C O O C O C C C O O Hashing Algorithm 000A1 F4A 32 bit Row Hash Hash Bucket Number * Remaining 12 bits 0000 0000 0000 1010 0001 1111 0100 1010 0 0 0 A 1 * Assumes 20-bit hash bucket numbers. Storing and Accessing Data Rows Page 5-27 The Hash Map A hash map is simply an array of entries where each entry is two bytes long. The hash map is loaded into memory and is used by Teradata software. Each entry contains an AMP number for the system on which Teradata is implemented. The hash bucket number (or bucket number) is an offset into the hash map to locate a specific entry (or AMP). For systems using a 16-bit hash bucket number, the hash map has 65,536 entries. For systems using a 20-bit hash bucket number, the hash map has 1,048,576 entries (approximately 1 million entries). To determine the destination AMP for a Primary Index operation, the hash map is checked by BYNET software using the row hash information. A message is placed on the BYNET to be sent to the target AMP using point-to-point communication. In the example, the HBN entry 000A1 (hexadecimal) contains an entry that identified AMP 13. AMP 13 will be the recipient of the message from the Message Passing Layer. The facing page identifies a portion of an actual primary hash map for a 26 AMP system. An example of hash functions that can be used in SQL follows: SELECT HASHROW (197190) AS "Hash Value" ,HASHBUCKET (HASHROW (197190)) AS "Bucket Num" ,HASHAMP (HASHBUCKET (HASHROW (197190))) AS "AMP Num" ,HASHBAKAMP (HASHBUCKET (HASHROW (197190))) AS "AMP Fallback Num" ; *** Query completed. One row found. 4 columns returned. *** Total elapsed time was 1 second. Hash Value 000A1F4A Page 5-28 Bucket Num 161 AMP Num 13 AMP Fallback Num 0 Storing and Accessing Data Rows The Hash Map 197190 000A1F4A Hashing Algorithm 32 bit Row Hash Hash Bucket Number * Remaining 12 bits 0000 0000 0000 1010 0001 1111 0100 1010 0 0 0 A * With 20-bit hash bucket numbers, the hash map has 1,048,576 entries. 1 With 16-bit hash bucket numbers, the hash map only has 65,536 entries. 
HASH MAP 0007 0008 0009 000A 000B 000C 0 1 2 3 4 5 6 7 8 9 A B C D E F 24 21 21 08 25 16 25 22 21 13 06 12 19 20 22 23 15 09 12 20 14 14 08 25 19 22 21 24 24 12 19 23 21 09 24 16 25 25 22 11 12 14 23 21 23 11 24 04 20 11 22 23 12 09 20 23 22 23 09 09 23 10 24 15 10 13 24 10 25 23 05 25 20 12 11 07 24 13 20 22 12 25 24 14 21 13 25 23 25 25 13 24 11 13 24 03 AMP 13 197190 2087 2012-04-11 C Portion of actual hash map (20-bit hash bucket numbers) for a 26 AMP system. AMPs are shown in decimal format. Storing and Accessing Data Rows Page 5-29 Hash Maps for Different Systems The diagrams on the facing page show a graphical representation of a Primary Hash Map for an 8 AMP system and a Primary Hash Map for a 16 AMP system. A data value which hashes to “000028CF” will be directed to different AMPs on different systems. For example, this hash value will be associated with AMP 7 on an 8 AMP system and AMP 15 on a 16 AMP system. Note: These are the actual partial hash maps for 8 and 16 AMP systems. Page 5-30 Storing and Accessing Data Rows Hash Maps for Different Systems Row Hash (32 bits) Hash Bucket Number Remaining bits PRIMARY HASH MAP – 8 AMP System 0000 0001 0002 0003 0004 0005 0 1 2 3 4 5 6 7 8 9 A B C D E F 07 07 01 07 04 01 06 07 00 06 04 00 07 02 05 03 05 05 06 04 05 03 07 04 07 01 03 06 05 03 04 00 02 06 06 02 05 05 04 02 07 06 06 04 03 02 07 05 05 03 01 01 03 01 05 02 00 00 02 00 06 03 06 01 03 06 07 00 04 07 00 06 04 06 00 07 06 07 The integer value 337772 hashes to: Portions of actual hash maps with 1,048,576 hash buckets. Storing and Accessing Data Rows 07 01 04 07 01 07 03 02 01 05 02 05 PRIMARY HASH MAP – 16 AMP System 00002 8CF 8 AMP system – AMP 07 16 AMP system – AMP 15 06 05 02 00 04 05 0000 0001 0002 0003 0004 0005 0 1 2 3 4 5 6 7 8 9 15 13 10 15 15 01 14 14 10 15 04 00 15 14 13 13 05 05 15 10 14 14 07 04 13 15 11 06 09 08 14 08 11 08 06 10 12 11 12 13 09 10 14 11 12 14 07 05 13 15 11 13 15 08 15 09 11 13 15 08 A B C 15 10 14 14 03 06 12 12 12 14 08 09 11 09 13 07 15 07 D E F 12 09 14 08 15 06 14 13 12 07 06 11 13 10 12 15 02 05 Page 5-31 Identifying Rows Can two different PI values come out of the hashing algorithm with the same row hash value? The answer is “Yes”. There are two ways that can happen. First, two different primary index values may happen to hash identically. This is called a hash synonym. Secondly, if a non-unique primary index is used; duplicate NUPI values will produce the same row hash. Page 5-32 Storing and Accessing Data Rows Identifying Rows A row hash is not adequate to uniquely identify a row. Consideration #1 1254 A Row Hash = 32 bits = 4.2 billion possible values Because there is an infinite number of possible data values, some data values will have to share the same row hash. Consideration #2 A Primary Index may be non-unique (NUPI). Different rows will have the same PI value and thus the same row hash. 7769 Data values input Hash Algorithm 40A70 3BE 40A70 3BE (John) 'Smith' (Dave) 'Smith' Hash Synonyms NUPI Duplicates Hash Algorithm 2482A D73 2482A D73 Rows have same hash Conclusion A row hash is not adequate to uniquely identify a row. Storing and Accessing Data Rows Page 5-33 The Row ID In order to differentiate each row in a table, every row is assigned a unique Row ID. The Row ID is a combination of the row hash value plus a uniqueness value. The AMP appends the uniqueness value to the row hash when it is inserted. The Uniqueness Value is used to differentiate between PI values that generate identical row hashes. 
The first row inserted with a particular row hash value is assigned a uniqueness value of 1. Each new row with the same row hash is assigned an integer value one greater than the current largest uniqueness value for this Row ID. If a row is deleted or the primary index is modified, the uniqueness value can be reused. Only the Row Hash portion is used in Primary Index operations. The entire Row ID is used for Secondary Index support that is discussed in a later module. In summary, Row Hash is a 32-bit value. Up to and including Teradata V2R6.2, the Message Passing Layer looks at the high-order 16 bits (previously called “DSW” Destination Selection Word). This is used to index into the Hash Map to determine which AMP gets the row or is used to retrieve a row. Once the AMP has been determined, the entire 32-bits of the Row Hash are passed to the AMP. The AMP uses the entire 32-bit Row Hash to store/retrieve the row. Since there are only 4 billion permutations of Row Hash, you can get duplicates. NUPI Duplicates also cause duplicate Row Hashes, therefore the Row Hash is not sufficient to uniquely identify a row in a table. Therefore, the AMP adds another 32-bit number (called a uniqueness value) to the Row Hash. This total 64-bit number (32-bit Row Hash + 32-bit Uniqueness Value) is called the Row ID. This number uniquely identifies a row in a table. Page 5-34 Storing and Accessing Data Rows The Row ID To uniquely identify a row, we add a 32-bit uniqueness value. The combined row hash and uniqueness value is called a Row ID. Row ID Row Hash (32 bits) Each stored row has a Row ID as a prefix. Rows are logically maintained in Row ID sequence. Uniqueness Id (32 bits) Row ID Row Data Row ID Row Data Row Hash Unique ID 3B11 5032 3B11 5032 3B11 5032 3B11 5033 3B11 5034 3B11 5034 : 0000 0000 0000 0000 0000 0000 Storing and Accessing Data Rows 0001 0002 0003 0001 0001 0002 : Emp_No 1018 1020 1031 1014 1012 1021 : Last_Name Reynolds Davidson Green Jacobs Chevas Carnet : First_Name Jane Evan Jason Paul Jose Jean : Page 5-35 Storing Rows (1 of 2) Rows are stored in a data block in Row ID sequence. As rows are added to a table with the same row hash, the uniqueness value is incremented by one in order to provide a unique Row ID. Assume Last_Name is a NUPI and that all rows in this example hash to the same AMP. The ‘John Smith’ row is assigned to AMP 3 based on the bucket number portion of the row hash. Because it is the first row with this row hash, a uniqueness id of 1 is assigned. The ‘Sam Adams’ row has a different row hash and thus is also assigned a uniqueness value of 1. The bucket number, although different, also points to AMP 3 in the hash map. Page 5-36 Storing and Accessing Data Rows Storing Rows (1 of 2) Assumptions: Last_Name is defined as a NUPI. All rows in this example hash to the same AMP. Add a row for 'John Smith' 'Smith' Hash Algorithm 2482A D73 Hash Map Row ID Row Hash Unique ID 2482A D73 0000 0001 AMP #3 Row Data Last_Name First_Name Smith John Etc. Add a row for 'Sam Adams' 'Adams' Hash Algorithm 782B7 E4D Hash Map Row ID Row Hash Unique ID 2482A D73 0000 0001 782B7 E4D 0000 0001 Storing and Accessing Data Rows AMP #3 Row Data Last_Name First_Name Smith Adams John Sam Etc. Page 5-37 Storing Rows (2 of 2) The ‘Fred Smith’ row hashes to the same row hash as ‘John Smith’ because it is a NUPI duplicate. It is therefore assigned a uniqueness id of 2. The ‘Dan Jones’ row also hashes to the same row hash because it is a hash synonym. It is thus assigned a uniqueness id of 3. 
Note: In reality, the last names of Smith and Jones DO NOT hash to the same value. This is simply an example that illustrates how the uniqueness ID is used when a hash synonym does occur. Page 5-38 Storing and Accessing Data Rows Storing Rows (2 of 2) Add a row for 'Fred Smith' - (NUPI Duplicate) 'Smith' Hash Algorithm 2482A D73 Hash Map Row ID Row Hash Unique ID 2482A D73 0000 0001 2482A D73 0000 0002 782B7 E4D 0000 0001 AMP #3 Row Data Last_Name First_Name Smith Smith Adams John Fred Sam Etc. Add a row for 'Dan Jones' - (Hash Synonym) 'Jones' Hash Algorithm 2482A D73 Hash Map Row ID AMP #3 Row Data Row Hash Unique ID Last_Name First_Name 2482A D73 2482A D73 2482A D73 782B7 E4D 0000 0001 0000 0002 0000 0003 0000 0001 Smith Smith Jones Adams John Fred Dan Sam Etc. Given the row hash, what other information would be needed to find the 'Dan Jones' row? … The 'Fred Smith' row? Storing and Accessing Data Rows Page 5-39 Locating a Row on an AMP Using a PI To locate a row, the AMP file system searches through a memory-resident structure called the Master Index. An entry in the Master Index will indicate that if a row with this Table ID and row hash exists, then it must be on a specific disk cylinder. The file system will use the cylinder number to locate the Cylinder Index and search through the designated Cylinder Index. There it will find an entry that indicates that if a row with this Table ID and row hash exists, it must be in one specific data block on that cylinder. The file system then searches the data block until it locates the row(s) or returns a No Rows Found condition code. Table-id Row-hash Master Index Table-id Row-hash Cylinder Index Row-hash PI Value Page 5-40 Data Block Data Row Storing and Accessing Data Rows Locating a Row On An AMP Using a PI Locating a row on an AMP requires three input elements: 1. The Table ID 2. The Row Hash of the PI 3. The PI value itself Table ID M a s t e r Cyl 1 Index Cyl 2 Index I n d e x Row Hash START WITH: AMP #3 APPLY TO: Table Id Row Hash Master Index Cylinder # Table Id Row Hash Cylinder Index Row Hash PI Value Data Block Storing and Accessing Data Rows Cyl 3 Index Cyl 4 Index Cyl 5 Index Cyl 6 Index Cyl 7 Index PI Value DATA BLOCK Data Row Row Data FIND: Cylinder # Data Block Address Data Row Page 5-41 Module 5: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 5-42 Storing and Accessing Data Rows Module 5: Review Questions Answer the following either as True or False as these apply to Primary Indexes: True or False 1. UPI and NUPI equality value accesses are always a one-AMP operation. True or False 2. UPI and NUPI indexes allow NULL in a primary index column. True or False 3. UPI, NUPI, and NOPI tables allow duplicate rows in the table. True or False 4. Only UPI can be used as a Primary Key implementation. Fill in the Blanks 5. The output of the hashing algorithm is called the _____ _____. 6. To determine the target AMP, the Message Passing Layer must lookup an entry in the Hash Map based on the _______ _______ _______. 7. A Row ID consists of a row hash plus a ____________ value. 8. A uniqueness value is required to produce a unique Row ID because of ______ ___________ and ________ ___________. 9. Once the target AMP has been determined for a PI search, the _______ ________ for that AMP is accessed to determine the cylinder that may hold the row. 10. The Cylinder Index points us to the address and length of the data ________. 
Storing and Accessing Data Rows Page 5-43 Notes Page 5-44 Storing and Accessing Data Rows Module 6 Secondary Indexes and Table Scans After completing this module, you will be able to: Define Secondary Indexes. Distinguish between the implementation of unique and non-unique secondary indexes. Define Full Table Scans and what causes them. Describe the operation of a Full Table Scan in a parallel environment. Teradata Proprietary and Confidential Secondary Indexes and Table Scans Page 6-1 Notes Page 6-2 Secondary Indexes and Table Scans Table of Contents Secondary Indexes ....................................................................................................................... 6-4 Choosing a Secondary Index........................................................................................................ 6-6 Unique Secondary Index (USI) Access........................................................................................ 6-8 Non-Unique Secondary Index (NUSI) Access .......................................................................... 6-10 Comparison of Primary and Secondary Indexes ........................................................................ 6-12 Full Table Scans ......................................................................................................................... 6-14 Module 6: Review Questions ..................................................................................................... 6-16 Secondary Indexes and Table Scans Page 6-3 Secondary Indexes A secondary index is an alternate path to the data. Secondary Indexes are used to improve performance by allowing the user to avoid scanning the entire table. A Secondary Index is like a Primary Index in that it allows the user to locate rows. It is unlike a Primary Index in that it has no influence on the way rows are distributed among AMPs. A database designer typically chooses a secondary index because it provides faster set selection. Primary Index requests require the services of only one AMP to access rows, while secondary indexes require at least two and possibly all AMPs, depending on the index and the type of operation. A secondary index search will typically be less expensive than a full table scan. Secondary indexes add overhead to the table, both in terms of disk space and maintenance; however they may be dropped when not needed, and recreated whenever they would be helpful. Page 6-4 Secondary Indexes and Table Scans Secondary Indexes There are 3 general ways to access a table: Primary Index access (one AMP access) Secondary Index access (two or all AMP access) Full Table Scan (all AMP access) • A secondary index provides an alternate path to the rows of a table. • A secondary index can be used to maintain uniqueness within a column or set of columns. • A table can have from 0 to 32 secondary indexes. • Secondary Indexes: – – – – Do not effect table distribution. Add overhead, both in terms of disk space and maintenance. May be added or dropped dynamically as needed. Are chosen to improve table performance. Secondary Indexes and Table Scans Page 6-5 Choosing a Secondary Index Just as with primary indexes, there are two types of secondary indexes – unique (USI) and non-unique (NUSI). Secondary Indexes may be specified at table creation or at any time during the life of the table. It may consist of up to 64 columns, however to get the benefit of the index, the query would have to specify a value for all 64 values. 
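For example, when a table's preferred access column is a NUPI rather than its Primary Key, the Primary Key can still be protected by defining it as a USI. The sketch below applies this idea to the Orders table used in the previous module; the column data types shown are assumed for illustration only.

CREATE TABLE Orders
  (Order_Number     INTEGER NOT NULL
  ,Customer_Number  INTEGER NOT NULL
  ,Order_Date       DATE
  ,Order_Status     CHAR(1))
PRIMARY INDEX (Customer_Number);               -- NUPI chosen for access and join performance

CREATE UNIQUE INDEX (Order_Number) ON Orders;  -- USI enforces uniqueness of the PK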
Unique Secondary Indexes (USI) have two possible purposes. They can speed up access to a row which otherwise might require a full table scan without having to rely on the primary index. Additionally, they can be used to enforce uniqueness on a column or set of columns. This is sometimes the case with a Primary Key which is not designated as the Primary Index. Making it a USI has the effect of enforcing the uniqueness of the PK. Non-Unique Secondary Indexes (NUSI) are usually specified in order to prevent full table scans. However, a NUSI does activate all AMPs – after all, the value being sought might well exist on many different AMPs (only Primary Indexes have same values on same AMPs). If the optimizer decides that the cost of using the secondary index is greater than a full table scan would be, it opts for the table scan. All secondary indexes cause an AMP local subtable to be built and maintained as column values change. Secondary index subtables consist of rows which associate the secondary index value with one or more rows in the base table. When the index is dropped, the subtable is physically removed. Page 6-6 Secondary Indexes and Table Scans Choosing a Secondary Index A Secondary Index may be defined ... – at table creation – following table creation (CREATE TABLE) (CREATE INDEX) – may be up to 64 columns USI NUSI If the index choice of column(s) is unique, it is called a USI. If the index choice of column(s) is nonunique, it is called a NUSI. Unique Secondary Index Non-Unique Secondary Index Accessing a row via a USI is a 2 AMP operation. Accessing row(s) via a NUSI is an all AMP operation. CREATE UNIQUE INDEX CREATE INDEX (Employee_Number) ON Employee; (Last_Name) ON Employee; Notes: • Creating a Secondary Index cause an internal sub-table to be built. • Dropping a Secondary Index causes the sub-table to be deleted. Secondary Indexes and Table Scans Page 6-7 Unique Secondary Index (USI) Access The facing page shows the two AMP accesses necessary to retrieve a row via a Unique Secondary Index access. After the row hash of the secondary index value is calculated, the hash map points us to AMP 1 as containing the subtable row for this USI value. After locating the subtable row in AMP 1, we find the row-id of the base row we are seeking. This base row id (which includes the row hash) again allows the hash map to point us to AMP 3 which contains the base row. Secondary index access uses the complete row-id to locate the row, unlike primary index access, which only uses the row hash portion. The Customer table below is the table used in the example. It is only a partial listing of the rows. 
Customer Table Cust Name NUPI USI 37 98 74 95 27 56 45 Page 6-8 Phone White Brown Smith Peters Jones Smith Adams 555-4444 333-9999 555-6666 555-7777 222-8888 555-7777 444-6666 Secondary Indexes and Table Scans Unique Secondary Index (USI) Access Message Passing Layer Create USI AMP 0 CREATE UNIQUE INDEX (Cust) ON Customer; AMP 1 USI Subtable Access via USI RowID 244, 1 505, 1 744, 4 757, 1 SELECT * FROM Customer WHERE Cust = 56; Cust 74 77 51 27 RowID 884, 1 639, 1 915, 9 388, 1 AMP 2 USI Subtable RowID 135, 1 296, 1 602, 1 969, 1 Cust 98 84 56 49 100 Cust 31 40 45 95 RowID 638, 1 640, 1 471, 1 778, 3 RowID 175, 1 489, 1 838, 4 919, 1 Cust 37 72 12 62 RowID 107, 1 717, 2 147, 2 822, 1 778 7 Message Passing Layer USI Value = 56 Hashing Algorithm Table ID RowID 288, 1 339, 1 372, 2 588, 1 USI Subtable Row Hash Unique Val 100 Customer Table ID = 100 USI Subtable RowID 555, 6 536, 5 778, 7 147, 1 Table ID PE AMP 3 AMP 0 AMP 1 AMP 2 AMP 3 Base Table Base Table Base Table Base Table Row Hash USI Value 602 56 RowID Cust USI 107, 1 37 536, 5 84 638, 1 31 640, 1 40 Name Phone NUPI White 555-4444 Rice 666-5555 Adams 111-2222 Smith 222-3333 RowID Cust USI 471, 1 45 555, 6 98 717, 2 72 884, 1 74 Name Adams Brown Adams Smith Phone NUPI 444-6666 333-9999 666-7777 555-6666 RowID Cust Name USI 147, 1 49 Smith 147, 2 12 Young 388, 1 27 Jones 822, 1 62 Black Phone NUPI 111-6666 777-4444 222-8888 444-5555 RowID Cust USI 639, 1 77 778, 3 95 778, 7 56 915, 9 51 Name Jones Peters Smith Marsh Phone NUPI 777-6666 555-7777 555-7777 888-2222 to MPL Secondary Indexes and Table Scans Page 6-9 Non-Unique Secondary Index (NUSI) Access The facing page shows an all-AMP access necessary to retrieve a row via a Non-Unique Secondary Index access. After the row hash of the secondary index value is calculated, the Message Passing Layer will automatically activate all AMPs per instructions of the Parsing Engine. Each AMP locates the subtable rows containing the qualifying value and row hash. These subtable rows contain the row-id(s) for the base rows, which are guaranteed to be on the same AMP as the subtable row. This reduces activity in the MPL and essentially makes the query an AMP-local operation. Because each AMP may have more than one qualifying row, it is possible for the subtable row to have multiple row-ids for the base table rows. The Customer table below is the table used in the example. It is only a partial listing of the rows. 
Customer Table Cust 37 98 74 95 27 56 45 Page 6-10 Name Phone NUSI NUPI White Brown Smith Peters Jones Smith Adams 555-4444 333-9999 555-6666 555-7777 222-8888 555-7777 444-6666 Secondary Indexes and Table Scans Non-Unique Secondary Index (NUSI) Access Message Passing Layer Create NUSI CREATE INDEX (Name) ON Customer; AMP 0 AMP 1 NUSI Subtable Access via NUSI SELECT * FROM Customer WHERE Name = 'Adams'; RowID 432,8 448,1 567,3 656,1 Name Smith White Adams Rice RowID 640,1 107,1 638,1 536,5 AMP 2 NUSI Subtable RowID 432,3 567,2 852,1 Name Smith Adams Brown RowID 884,1 471,1 717,2 555,6 AMP 3 NUSI Subtable RowID 432,1 448,4 567,6 770,1 Name Smith Black Jones Young RowID 147,1 822,1 338,1 147,2 NUSI Subtable RowID 155,1 396,1 432,5 567,1 Name Marsh Peters Smith Jones RowID 915, 9 778, 3 778, 7 639, 1 PE Customer Table Table ID = 100 NUSI Value = 'Adams' AMP 0 AMP 1 AMP 2 AMP 3 Hashing Algorithm Base Table Table ID Row Hash Value 100 567 Adams RowID Cust Name NUSI 107,1 37 White 536,5 84 Rice 638,1 31 Adams 640,1 40 Smith Phone NUPI 555-4444 666-5555 111-2222 222-3333 Base Table RowID Cust Name NUSI 471,1 45 Adams 555,6 98 Brown 717,2 72 Adams 884,1 74 Smith Phone NUPI 444-6666 333-9999 666-7777 555-6666 Base Table RowID Cust Name NUSI 147,1 49 Smith 147,2 12 Young 388,1 27 Jones 822,1 62 Black Phone NUPI 111-6666 777-4444 222-8888 444-5555 Base Table RowID Cust Name NUSI 639,1 77 Jones 778,3 95 Peters 778,7 56 Smith 915,9 51 Marsh Phone NUPI 777-6666 555-7777 555-7777 888-2222 to MPL Secondary Indexes and Table Scans Page 6-11 Comparison of Primary and Secondary Indexes The table on the facing page compares and contrasts primary and secondary indexes: Primary indexes are required; secondary indexes are optional. All tables must have a method of distributing rows among AMPs -- the Primary Index. A table can only have one primary index, but it can have up to 32 secondary indexes. Both primary and secondary indexes can have up to 64 columns. Secondary indexes, like primary indexes, can be either unique (USI) or non-unique (NUSI). The secondary index does not affect the distribution of rows. Rows are only distributed according to the Primary Index values. Secondary indexes can be created and dropped dynamically. In other words, Secondary Indexes can be added as needed. In fact, in some cases it is a good idea to wait and see how the database is used and then add Secondary Indexes to facilitate that usage. Both primary and secondary indexes affect system performance. However, Primary and Secondary Indexes affect performance for different reasons. A poorly-chosen PI results in “lumpy” data distribution which makes some AMPs do more work than others and slows the system. Secondary Indexes affect performance because they require subtables. Both indexes allow rapid retrieval of specific rows. Both primary and secondary indexes can be created using multiple data types. Secondary indexes are stored in separate subtables; primary indexes are not. Because secondary indexes require separate subtables, extra I/O is needed to maintain those subtables. 
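One way to confirm which access path the Optimizer actually chooses is to EXPLAIN the two queries used earlier in this module (a sketch only – the exact EXPLAIN wording varies by release and by table demographics):

EXPLAIN SELECT * FROM Customer WHERE Cust = 56;        /* USI access – the plan typically shows a two-AMP retrieval by unique index      */
EXPLAIN SELECT * FROM Customer WHERE Name = 'Adams';   /* NUSI access – the plan typically shows an all-AMP retrieval by way of an index */

If the Optimizer estimates that using the NUSI would cost more than scanning the table, the plan shows an all-AMP full table scan instead.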
Page 6-12 Secondary Indexes and Table Scans Comparison of Primary and Secondary Indexes Index Feature Primary Secondary Yes* No Number per Table 1 0 - 32 Max Number of Columns 64 64 Unique or Non-unique Both Both Affects Row Distribution Yes No Created/Dropped Dynamically No Yes Improves Access Yes Yes Multiple Data Types Yes Yes Separate Physical Structure No Sub-table Extra Processing Overhead No Yes May be ordered by value No Yes (NUSI) May be Partitioned Yes No Required? * Not required with NoPI table in Teradata 13.0 Secondary Indexes and Table Scans Page 6-13 Full Table Scans A full table scan is another way to access data without using any Primary or Secondary Indexes. In evaluating an SQL request, the Parser examines all possible access methods and chooses the one it believes to be the most efficient. The coding of the SQL request along with the demographics of the table and the availability of indexes all play a role in the decision of the Parser. Some coding constructs, listed on the facing page, always cause a full table scan. In other cases, it might be chosen because it is the most efficient method. In general, if the number of physical reads exceeds the number of data blocks then the optimizer may decide that a full-table scan is faster. With a full table scan, each data block is found using the Master and Cylinder Indexes and each data row is accessed only once. As long as the choice of Primary Index has caused the table rows to distribute evenly across all of the AMPs, the parallel processing of the AMPs can do the full table scan quickly. The file system keeps each table on as few cylinders as practical to help reduce the cost full table scans. While full table scans are impractical and even disallowed on some systems, the Teradata Database routinely executes ad hoc queries with full table scans. Page 6-14 Secondary Indexes and Table Scans Full Table Scans Every row of the table must be read. All AMPs scan their portion of the table in parallel. • Fast and efficient on Teradata due to parallelism. Full table scans typically occur when either: • An index is not used in the query • An index is used in a non-equality test Customer Cust_ID USI Cust_Name Cust_Phone NUPI Examples of Full Table Scans: SELECT * FROM Customer WHERE Cust_Phone LIKE '858-485-_ _ _ _ '; SELECT * FROM Customer WHERE Cust_Name = 'Koehler'; SELECT * FROM Customer WHERE Cust_ID > 1000; Secondary Indexes and Table Scans Page 6-15 Module 6: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 6-16 Secondary Indexes and Table Scans Module 6: Review Questions Fill each box with either Yes, No, or the appropriate number. USI Access NUSI Access FTS # AMPs # rows Parallel Operation Uses Hash Maps Uses Separate Sub-table Reads all data blocks of table Secondary Indexes and Table Scans Page 6-17 Notes Page 6-18 Secondary Indexes and Table Scans Module 7 Teradata System Architecture After completing this module, you will be able to: Identify characteristics of various components. Specify the difference between a TPA and non-TPA node. Teradata Proprietary and Confidential Teradata System Architecture Page 7-1 Notes Page 7-2 Teradata System Architecture Table of Contents Teradata Database Releases ......................................................................................................... 
7-4 Teradata Version 1 ................................................................................................................... 7-4 Teradata Version 2 ................................................................................................................... 7-4 Teradata Database Architecture ................................................................................................... 7-6 Teradata Database – Multiple Nodes ........................................................................................... 7-8 MPP Systems ............................................................................................................................. 7-10 Example of 3 Node Teradata Database System ......................................................................... 7-12 Example: 5650 and 6844 Disk Arrays ................................................................................... 7-12 Teradata Cliques ........................................................................................................................ 7-14 BYNET ...................................................................................................................................... 7-16 BYNET Communication Protocols ........................................................................................... 7-18 Vproc Inter-process Communication ......................................................................................... 7-20 Examples of Teradata Database Systems ................................................................................... 7-22 Example of 5650 Cabinets ......................................................................................................... 7-24 What makes Teradata’s MPP Platforms Special? ...................................................................... 7-26 Summary .................................................................................................................................... 7-28 Module 7: Review Exercises...................................................................................................... 7-30 Teradata System Architecture Page 7-3 Teradata Database Releases The facing page identifies various Teradata releases that have been available since 1984. This page identifies some historical information about Teradata Version 1 systems. Teradata Version 1 Teradata Database Version 1 platforms were first available to customers in 1984. This first platform was the original Database Computer, DBC/1012 from Teradata. In 1991, the NCR Corporation introduced the 3600. Both of these systems are older technologies and both of these systems used a proprietary 16-bit operating system known as TOS (Teradata Operating System). All AMPs and PEs were dedicated hardware processors that were connected together using a message-passing layer known as the Ynet. Both platforms supported channel-attached (Bus and Tag) and LAN-attached host systems. DBC/1012 Architecture – this system was a dedicated relational database management system. Two specific components of the DBC/1012 were the IFP and the COP, both of which were effectively hardware Parsing Engines. The acronyms IFP and COP still appear in the Data Dictionary even today. Interface Processor (IFP) – the IFP was the Parsing Engine of the DBC/1012 for channel-attached systems. For current systems (e.g., 5550), PEs that are assigned to a channel are identified in the Data Dictionary with a type of IFP. 
Communications Processor (COP) – the COP is the Parsing Engine of the DBC/1012 for network-attached systems. For current systems (e.g., 5550), PEs that are assigned to a LAN (or network) are identified in the Data Dictionary with a type of COP. 3600 Architecture – this system included hardware AMPs and PEs as well as multipurpose Application Processors executing UNIX MP-RAS. Application Processors (APs) also provided the channel and LAN connectivity. UNIX applications were executed on APs while Teradata was executed on PEs and AMPs. All processing units were connected via the Ynet. Teradata Version 2 Starting with Teradata Database Version 2 Release 1 (available in January 1996), the Teradata Database became an open database system. No longer was Teradata software dependent on a proprietary Operating System (TOS) and proprietary hardware. Rather, it was an application that initially executed under UNIX. By porting the Teradata Database to a general-purpose operating system platform, a variety of processing options against the Teradata database became possible, all within a single system. OLTP (as well as OLCP and OLAP) applications became processing options in addition to standard DSS. Page 7-4 Teradata System Architecture Teradata Database Releases Teradata Releases Version 1 Release 1 Release 2 Release 3 Release 4 Release 5 Version 1 was a combination of hardware and software. Version 2 Release 1 Release 2 Release 3 Release 4 Release 5 Release 6 Version 2 is an implementation of Teradata PEs and AMPs as software vprocs (virtual processors). Teradata 12.0 Teradata 13.0 Teradata 13.10 Teradata 14.0 Teradata System Architecture For example, if a customer needed additional AMPs, the hardware and software components for an AMP had to be purchased, installed, and configured. V1 Platforms DBC/1012 3600 Year Available 1984 1991 Teradata is effectively a database application that executes under an operating system. Platforms 5100 Year Available 1996 (UNIX MP-RAS only) 5650 (requires 12.0 or later) 6650, 6680 6690 2010 2011 2012 Page 7-5 Teradata Database Architecture Teradata is effectively an application that runs under an operating system (SUSE Linux, UNIX MP-RAS, or Windows Server 2003). PDE software provides the Teradata Database software with the capability to run under a specific operating system. Parallel Database Extensions (PDE) software is an interface layer on top of the operating system (Linux, UNIX MP-RAS, or Windows Server 2003). PDE provides the Teradata Database with the ability to: Run the Teradata Database in a parallel environment Execute vprocs Apply a flexible priority scheduler to Teradata Database sessions Debug the operating system kernel and the Teradata Database using resident debugging facilities AMPs and PEs are implemented as “virtual processors - vprocs”. They run under the control of PDE and their number is software configurable. AMPs are associated with “virtual disks – vdisks” which are associated with logical units (LUNs) within a disk array. The versatility of Teradata Database is based on virtual processors (vprocs) that eliminate dependency on specialized physical processors. Vprocs are a set of software processes that run on a node under Teradata Parallel Database Extensions (PDE) within the multitasking environment of the operating system. Page 7-6 Teradata System Architecture Teradata Database Architecture Teradata Processing Node (e.g., 6650 node) Operating System (e.g., SUSE Linux) PDE and BYNET S/W (MPL) PE vproc Teradata Gateway Software (LANs) ...
PE vproc AMP vproc AMP vproc AMP vproc AMP vproc AMP vproc AMP vproc AMP vproc ... AMP vproc Vdisk Vdisk Vdisk Vdisk Vdisk Vdisk Vdisk ... Vdisk • Teradata executes on a 64-bit operating system (e.g., SUSE Linux). – Utilizes general purpose SMP/MPP hardware. – Parallel Database Extensions (PDE) is unique per OS that Teradata is supported on. • AMPs and PEs are implemented as virtual processors (Vprocs). • “Shared Nothing” Architecture – each AMP has its own memory, manages its own disk space, and executes independently of other AMPs. Teradata System Architecture Page 7-7 Teradata Database – Multiple Nodes A customer may choose to implement Teradata on a small, single node SMP system for smaller database requirements and to eventually grow incrementally to a multiple terabyte system. A single-node SMP platform may also be used as low cost development systems. Under the single-node (SMP) version of Teradata, PE and AMP vproc still communicate with each other via PDE and BYNET software. All vprocs share the resources of CPUs and memory within the SMP node. As a customer’s Teradata database needs grow, additional nodes will probably be needed. A multi-node system running the Teradata Database is referred to as an MPP (Massive Parallel Processing) system. The Teradata Database application is considered a Trusted Parallel Application (TPA). The Teradata Database is the only TPA application available at this time. Nodes in a system configuration may or may not be connected to the BYNET. Examples of nodes and their purpose include: TPA (Trusted Parallel Application) node – executes Teradata Database software. HSN (Hot Standby Node) – is a spare node in the clique (not running Teradata) used in event of a node failure. Non-TPA (NOTPA) node – is an application node that does not executes Teradata Database software. Hot standby nodes allow spare nodes to be incorporated into the production environment. The Teradata Database can use spare nodes to improve availability and maintain performance levels in the event of a node failure. A hot standby node is a node that: is a member of a clique does not normally participate in Teradata Database operations can be brought in to participate in Teradata Database operations to compensate for the loss of a node in the clique Configuring a hot standby node can eliminate the system-wide performance degradation associated with the loss of a node. A hot standby node is added to each clique in the system. When a node fails, all AMPs and all LAN-attached PEs on the failed node migrate to the node designated as the hot standby node. The hot standby node becomes a production node. When the failed node returns to service, it becomes the new hot standby node. Configuring hot standby nodes eliminates: Page 7-8 Restarts that are required to bring a failed node back into service. Degraded service when vprocs have migrated to other nodes in a clique. Teradata System Architecture Teradata Database – Multiple Nodes BYNET TPA Node 1 TPA Node 2 Operating System (e.g., Linux) PDE and BYNET PE vproc AMP vproc AMP vproc AMP vproc Vdisk Vdisk Vdisk Operating System (e.g., Linux) Gateway Software ... PDE and BYNET PE vproc ....... PE vproc AMP vproc AMP vproc AMP vproc AMP vproc Vdisk Vdisk Vdisk Vdisk Gateway Software ... PE vproc ....... AMP vproc Vdisk Teradata is a linearly expandable database – as your database grows, additional nodes may be added – effectively becoming an MPP (Massive Parallel Processing) systems. 
• Teradata software makes a multi-node system look like a single-Teradata system. Examples of types of nodes that connect to the BYNET. • TPA (Trusted Parallel Application) node – executes Teradata Database software. • HSN (Hot Standby Node) – spare node in the clique (not running Teradata) used in event of a node failure. • Non-TPA (NOTPA) node – application node that does not executes Teradata Database software. Teradata System Architecture Page 7-9 MPP Systems When multiple SMP nodes (simply referred to as nodes) are connected together to form a larger configuration, we refer to this as an MPP (Massively Parallel Processing) system. The connecting layer (or system interconnect) is called the BYNET. The BYNET is a combination of hardware and software that allows multiple vprocs on multiple nodes to communicate with each other. Because Teradata is a linearly expandable database system, as additional nodes and vprocs are added to the system, the system capacity scales in a linear fashion. The BYNET Version 1 can support up to 128 SMP nodes. The BYNET Version 2 can support up to 512 nodes. The BYNET Version 3 can support up to 1024 nodes and BYNET Version 4 can support up to 4096 nodes. Acronyms that may appear in diagrams throughout this course: PCI – Peripheral Component Interconnect EISA – Extended Industry Standard Architecture PBCA – PCI Bus Channel Adapter PBSA – PCI Bus ESCON Adapter EBCA – EISA Bus Channel Adapter Page 7-10 Teradata System Architecture MPP Systems The BYNET consists of redundant switches that interconnect multiple nodes. BYNET Switch TPA Node BYNET Switch TPA Node HSN Node Multiple nodes make up Massively Parallel Processing (MPP) system. A clique is a group of nodes connected to and sharing the same storage. Teradata System Architecture : : : : : : : : Page 7-11 Example of 2+1 Node Teradata System The facing page contains an illustration of a simple three-node (2+1) Teradata Database system. Each node has its own Vprocs to manage, while communication among the Vprocs takes place via the BYNETs. The PEs are not shown in this example. Each node is an SMP from a configuration standpoint. Each node has its own CPUs, memory, UNIX and PDE software, Teradata Database software, BYNET software, and access to one or more disk arrays. Nodes are the building blocks of MPP systems. A system size is typically expressed in terms of number of nodes. AMPs provide access to user data stored within tables that are physically stored on disk arrays. Each AMP is associated with a Vdisk. Each AMP sees its Vdisk as a single disk. Teradata (AMP software) organizes its data on its disk space (Vdisk) using a Teradata “File System” structure. A Vdisk may be actually composed of multiple Pdisks - Physical disk. A Pdisk is assigned to physical drives in a disk array. Example: 6650 and Internal 6844 Disk Arrays The facing page contains an example of a 3-node (2+1) clique sharing two 6844 disk arrays. Each node has Fibre Channel adapters and Fibre Channel cables (point-to-point connections) to connect to the disk arrays. Page 7-12 Teradata System Architecture Example of 2+1 Node Teradata System SMP001-7 AMPs 0 1 SMP002-6 AMPs ……. 29 30 SMP002-7 31 ……. 59 Hot Standby Node AMP 0 Vdisk 0 600 GB Pdisk 0 600 GB Pdisk 1 600 GB : : : : : : : : 600 GB MaxPerm = 1.08 TB* * Actual space is app. 90%. 
120 disks 120 disks 2+1 node clique sharing 240 drives; 30 AMPs/node; Linux System Teradata System Architecture Page 7-13 Teradata Cliques A clique is a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays. The illustration on the facing page shows a 6-node system consisting of two cliques, each containing three nodes. Because all disk arrays are available to all nodes in the clique, the AMP vprocs will still have access to the rows they are responsible for. Page 7-14 Teradata System Architecture Teradata Cliques BYNET Switch TPA Node : : : : • • • • HSN TPA Node BYNET Switch TPA Node : : : : HSN TPA Node : : : : : : : : A clique is a defined set of nodes that share a common set of disk arrays. All nodes in a clique must be able to access all Vdisks for all AMPs in the clique. A clique provides protection from a node failure. If a node fails, all vprocs will migrate to the remaining nodes in the clique (Vproc Migration) or to a Hot Standby Node (HSN). Teradata System Architecture Page 7-15 BYNET There are two physical BYNETs, BYNET 0 and BYNET 1. Both are fully operational and provide fault tolerance in the event of a BYNET failure. The BYNETs automatically handle load balancing and message routing. BYNET reconfiguration and message rerouting in the event of a component failure is also handled transparently to the application. Page 7-16 Teradata System Architecture BYNET BYNET 0 Node Node HSN BYNET 1 Node Node HSN Node Node HSN The BYNET is a dual redundant, bi-directional interconnect network. • All nodes are connected to both BYNETs. This example shows three (2+1) cliques. BYNET Features: • • • • • • Enables multiple nodes to communicate with each other. Automatic load balancing of message traffic. Automatic reconfiguration after fault detection. Fully operational dual BYNETs provide fault tolerance. Scalable bandwidth as nodes are added. Even though there are two physical BYNETs to provide redundancy and bandwidth, the Teradata Database and TCP/IP software only see a single network. Teradata System Architecture Page 7-17 BYNET Communication Protocols Using communication-switching techniques, the BYNET allows for point-to-point, multicast, and broadcast communications among the nodes, thus supporting a monumental increase in throughput in very large databases. This technology allows Teradata users to grow massively parallel databases without fear of a communications bottleneck for any database operations. Although the BYNET software supports the multi-cast protocol, Teradata only uses this protocol with Group AMPs operations. This is a Teradata feature starting with release V2R5. Teradata software will use the point-to-point protocol whenever possible. When an all-AMP operation is needed, Teradata software uses the broadcast protocol to send messages to the different SMPs. The BYNET is linearly scalable for point-to-point communications. For each new node added to a system with BYNET V4, an additional 960 MB of additional bandwidth is added to each BYNET, thus providing scalability as the system grows. Scalability comes from the fact that multiple point-to-point circuits can be established concurrently. With the addition of another node, more circuits can be established concurrently. For broadcast and multicast operations with BYNET V4, the bandwidth is 960 MB per second per BYNET. 
BYNET V1 (old implementation) had a bandwidth of 10 MB per second per direction per BYNET for a node. Page 7-18 Teradata System Architecture BYNET Communication Protocols BYNET 0 PE PE BYNET 1 PE PE Hot Standby Node AMP ... AMP AMP ... AMP Point-to-Point (one-to-one) One vproc communicates with one vproc (e.g., 1 PE to 1 AMP). Scalable bandwidth: • BYNET v2 – 60 MB x 2 (bi-directional) x 2 BYNETs = 240 MB per node • BYNET v3 – 93.75 MB x 2 (bi-directional) x 2 BYNETs = 375 MB per node • BYNET v4 – 240 MB x 2 (bi-directional) x 2 BYNETs = 960 MB per node Multi-Cast (one-to-many) One vproc communicates to a subset of vprocs (e.g., Group AMP operations). Broadcast (one-to-all) One vproc communicates to all vprocs (e.g., 1 PE to all AMPs). Not scalable. Teradata System Architecture Page 7-19 Vproc Inter-process Communication The “message passing layer” is a combination of two pieces of software and hardware– the PDE and the BYNET device drivers and software and the BYNET hardware. Communication among vprocs in an MPP system may be either inter-node or intra-node. When vprocs within the same node communicate they do not require the physical transport services of the BYNET. However, they do use the highest levels of the BYNET software even though the messages themselves do not leave the node. When vprocs must communicate across nodes, they must use the physical transport services of the BYNET requiring movement of the data. Any broadcast messages, for example, will go out to the BYNET, even for the AMPs and PEs that are in the same node. Communication among vprocs in a single SMP system occurs with the PDE and BYNET software, even though a physical BYNET does not exist in a single-node system. Page 7-20 Teradata System Architecture Vproc Inter-process Communication Single-Node System PDE and BYNET s/w vproc vproc vproc vproc vproc vproc Teradata Database MPP Systems BYNET PDE and BYNET s/w vproc vproc vproc vproc vproc vproc vproc vproc vproc vproc vproc vproc Teradata Database Node 1 Teradata System Architecture PDE and BYNET s/w Teradata Database Node 2 Page 7-21 Examples of Teradata Database Systems The facing page identifies various SMP servers and MPP systems that are supported for the Teradata Database. The following dates indicate when these systems were generally available to customers (GCA – General Customer Availability). – – – – – – – – – – – – – – – – – – 5100M 4700/5150 4800/5200 4850/5250 4851/4855/5251/5255 4900/5300 4950/5350 4980/5380 5400E/5400H 5450E/5450H 5500E/5500C/5500H 2500/5550H 2550 1550 2555/5555C/H 1600/5600C/H 2650/5650C/H 6650C/H, 6680 January, 1996 (not described in this course) January, 1998 (not described in this course) April, 1999 June, 2000 July, 2001 March, 2002 December, 2002 August, 2003 March, 2005 April, 2006 March, 2007 January, 2008 October, 2008 December, 2008 March, 2009 February, 2010 July, 2010 (Internal release; Official release Oct 2010) 2011 The Teradata Database is also available on non-Teradata platforms. The Teradata Database is available on the Intel-based mid-range platforms running Microsoft Windows 2003 or Linux. For example, Dell provides processing nodes that are used in some of the Teradata appliance systems. 
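Regardless of which platform a system runs on, the Teradata Database release and the number of AMPs can be confirmed from SQL (a sketch – the exact view name and output layout may differ slightly by release):

SELECT InfoKey, InfoData
FROM   DBC.DBCInfo;                        /* returns the RELEASE and VERSION strings for the system       */

SELECT HASHAMP() + 1 AS Number_of_AMPs;    /* HASHAMP() with no argument returns one less than the AMP count */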
Page 7-22 Teradata System Architecture Examples of Teradata Database Systems Examples of systems used with the Teradata Database include: Active Enterprise Data Warehouse Systems 5200/525x 5300/5350/5380 5400/5450 5500/555x/56xx 6650/6680/6690 – – – – – up to 2 nodes/cabinet up to 4 nodes/cabinet up to 10 nodes/cabinet up to 9 nodes/cabinet up to 4 nodes/cabinet with associated storage The basic building block is the SMP (Symmetric Multi-Processing) node. Common characteristics of these systems: • MPP systems that use the BYNET interconnect • Single point of operational control – AWS or SWS • Rack-based systems – each technology is encapsulated in its own chassis Key differences: • Speed and capacity of SMP nodes and systems • Cabinet architecture • BYNET interface cards, switches and speeds *BYNET V4 – up to 4096 nodes Teradata System Architecture Page 7-23 6650 Cabinets The facing page contains two pictures of rack-based cabinets. This represents a two cabinet 3+1 6650 clique. 54xx, 55xx, and 56xx systems also used a rack-based cabinet. The rack was initially designed for the 54xx systems and has been improved on with later systems such as 55xx and 56xx systems. This redesign allows for better cooling and maintenance and has a black and gray appearance. This design is also used with the LSI disk array cabinets. The 56xx cabinet is a different cabinet and is approximately 4” deeper than the 55xx cabinets. An older style rack or cabinet is used for the 4700, 4800, 4850, 4851, 4855, 4900, 4950, 4980, 5200, 5250, 5251, 5255, 5300, 5350, and 5380 systems. This cabinet was similar in size and almond in color. The approximate external dimensions of this rack or cabinet are: Height – 77” Width – 24” (inside rails are 19” apart and this is often referred to as a 19” wide rack) Depth – 40” (the 56xx/66xx cabinet is 44” deep) This industry-standard rack is referred to as a 40U rack where a U is a unit of measure of height of 1.75” or 4.445 cm. The system or processor cabinet includes a Server Management (SM) chassis which is often referred to as the CMIC (Chassis Management Interface Controller). This component is part of the server management subsystem and interfaces with the AWS or SWS. Page 7-24 Teradata System Architecture 6650 Cabinets Secondary SM Switch Secondary SM Switch Drive Trays (16 HD) Drive Trays (16 HD) 6844 Array Controllers 6844 Array Controllers TPA Node HSN TPA Node TPA Node TMS Node (opt.) BYA32S-1 BYA32S-0 SM – CMIC (1U) TMS Node (opt.) Primary SM Switch Primary SM Switch AC Box AC Box AC Box AC Box 6650 6650 Teradata TPA Node PE . . AMP ... AMP Teradata uses industry standard rack-based cabinets. HSN – Hot Standby Node Teradata System Architecture TMS Node (opt.) SM – CMIC (1U) 3+1 6650 Clique Page 7-25 What makes Teradata’s MPP Platforms Special? The facing page lists the major features of Teradata’s MPP systems. Acronyms: PUT – Parallel Upgrade Tool AWS – Administration Workstation SWS – Service Workstation – utilizes Server Management Web Services (SMWeb) for the 56xx. Page 7-26 Teradata System Architecture What Makes Teradata’s MPP Platforms Special? Key features of Teradata’s MPP systems include: • Teradata Database software – allows the Teradata Database to execute on multiple nodes and act as a single instance. • Scalable BYNET Interconnect – as you add nodes, you add bandwidth. • Operating system software (e.g., Linux) for a node is only aware of the resources within the node and only has to manage those resources. 
• AWS/SWS – single point of operational control and scalable server management. • PUT (Parallel Upgrade Tool) – simplifies installation/upgrade of software across many nodes. • Redundant (availability) components. Examples include: – – – – – Hot Standby Nodes Two BYNETs Two Disk Array Controllers within a Disk Array Dual AC capability for increased availability N+1 Power Supplies within a processing node and disk arrays Teradata System Architecture Page 7-27 Summary The facing page summarizes the key points and concepts discussed in this module. Page 7-28 Teradata System Architecture Summary • Teradata Database is a software implementation of Teradata. – AMPs and PEs are implemented as virtual processors (Vprocs). • The Teradata Database utilizes a “Shared Nothing” Architecture – each AMP has its own memory and manages its own disk space. – Teradata is called a Trusted Parallel Application (TPA). • Multiple nodes may be configured to provide a Massively Parallel Processing (MPP) system. • A clique is a defined set of nodes that share a common set of disk arrays. • The Teradata Database is a linearly expandable RDBMS – as your database grows, additional nodes may be added. Teradata System Architecture Page 7-29 Module 7: Review Exercises Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 7-30 Teradata System Architecture Module 7: Review Questions Complete the following. 1. Each AMP has its own memory and manages its own disk space and executes independently of other AMPs. This is referred to as a _________ _________ architecture. 2. The software component that allows the Teradata Database to execute in different operating system environments is the __________. 3. A physical message passing interconnect is called the _____________. 4. A clique provides protection from a _________ failure. 5. If a node fails, all vprocs will migrate to the remaining nodes in the clique. This feature is referred to as ___________ _____________. 6. The _______ or _______ provides a single point of operational control for Teradata MPP systems. 7. A _________ node is part of a system configuration, is connected to the BYNET, and executes the Teradata Database software. 8. A _________ node is part of a system configuration, connects to the BYNET, and is used to execute application software other than Teradata Database software. 9. A _________ node is part of a system configuration, connects to the BYNET, and is used as a spare node in the event of a node failure. Teradata System Architecture Page 7-31 Notes Page 7-32 Teradata System Architecture Module 8 Data Protection After completing this module, you will be able to: Explain the concept of FALLBACK tables. List the types and levels of locking provided by Teradata. Describe the Recovery, Transient and Permanent Journals and their function. List the utilities available for archive and recovery. Teradata Proprietary and Confidential Data Protection Page 8-1 Notes Page 8-2 Data Protection Table of Contents Data Protection Features .............................................................................................................. 8-4 Disk Arrays .................................................................................................................................. 8-6 RAID Technologies ..................................................................................................................... 
8-8 RAID 1 – Mirroring ................................................................................................................... 8-10 RAID 10 – Striped Mirroring................................................................................................. 8-10 RAID 1 Summary ...................................................................................................................... 8-12 Cliques ....................................................................................................................................... 8-14 Large Cliques ......................................................................................................................... 8-14 Teradata Vproc Migration .......................................................................................................... 8-16 Hot Standby Nodes (HSN) ......................................................................................................... 8-18 Large Cliques ......................................................................................................................... 8-18 Performance Degradation with Node Failure ............................................................................ 8-20 Restarts ............................................................................................................................... 8-20 Fallback ...................................................................................................................................... 8-22 Fallback Clusters ........................................................................................................................ 8-24 Fallback and RAID Protection ................................................................................................... 8-26 Fallback and RAID 1 Example .................................................................................................. 8-28 Fallback and RAID 1 Example (cont.) ................................................................................... 8-30 Fallback and RAID 1 Example (cont.) ................................................................................... 8-32 Fallback and RAID 1 Example (cont.) ................................................................................... 8-34 Fallback and RAID 1 Example (cont.) ................................................................................... 8-36 Fallback vs. non-Fallback Tables Summary .............................................................................. 8-38 Clusters and Cliques................................................................................................................... 8-40 Locks .......................................................................................................................................... 8-42 Locking Modifier ....................................................................................................................... 8-44 ACCESS................................................................................................................................. 8-44 NOWAIT ............................................................................................................................... 8-44 Rules of Locking ........................................................................................................................ 
8-46 Access Locks.............................................................................................................................. 8-48 Transient Journal ........................................................................................................................ 8-50 Recovery Journal for Down AMPs ............................................................................................ 8-52 Permanent Journal ...................................................................................................................... 8-54 Archiving and Recovering Data ................................................................................................. 8-56 Module 8: Review Questions ..................................................................................................... 8-58 Data Protection Page 8-3 Data Protection Features Disk Arrays – Disk arrays provide RAID 1, RAID 5, or RAID S data protection. If a disk drive fails, the array subsystem provides continuous access to the data. Systems with disk arrays are configured with redundant Fibre adapters, buses, and array controllers to provide highly available access to the data. Clique – a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to another available node in the clique. All nodes in the clique must have access to the same disk arrays. Locks – Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks. There are four types of locks: Exclusive, Write, Read, and Access. Fallback – protects your data by storing a second copy of each row of a table on an alternative “fallback AMP”. If an AMP fails, the system accesses the fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all of the table data is still available. Users may continue to use Fallback tables without any loss of available data. Down-AMP Recovery Journal – started automatically when the system has a failed or down AMP. Its purpose is to log any changes to rows which reside on the down AMP. Transient Journal – exists to permit the successful rollback of a failed transaction. Transactions are not committed to the database until an End Transaction request has been received by the AMPs, either implicitly or explicitly. Until that time, there is always the possibility that the transaction may fail in which case the participating table(s) must be restored to their pre-transaction state. Permanent Journal – provides selective or full database recovery to a specified point in time by keeping either before-image or after-images of rows in a journal. It permits recovery from unexpected hardware or software disasters. ARC and NetVault/NetBackup – ARC command scripts provide the capability to backup and restore the Teradata database. The NetVault and NetBackup utilities provide a GUI based front-end for creation and execution of ARC command scripts. Page 8-4 Data Protection Data Protection Features Facilities that provide system-level protection Disk Arrays – RAID data protection (e.g., RAID 1) – Redundant SCSI and/or Fibre Channel buses and array controllers Cliques and Vproc Migration – SMP or O.S. 
failures - Vprocs can migrate to other nodes within the clique. Facilities that provide Teradata DB protection Fallback – provides data access with a “down” AMP Locks – provides data integrity Transient Journal – automatic rollback of aborted transactions Down AMP Recovery Journal – fast recovery of fallback rows for AMPs Permanent Journal – optional before and after-image journaling ARC – Archive/Restore facility NetVault and NetBackup – provide tape management and ARC script creation and scheduling capabilities Data Protection Page 8-5 Disk Arrays Disk arrays utilize a technology called RAID (Redundant Array of Independent Disks) Spanning the entire spectrum from personal computers to mainframes, disk arrays (utilizing RAID technology) offer significant improvements in availability, reliability and maintainability of information storage, along with higher performance. Yet the concept behind disk arrays is relatively simple. A disk array subsystem consists of controller(s) which drive a set of disks. Typically, a disk array is configured to represent a number of logical volumes (or disks), each of which appears to be a physical disk to the user. A logical volume can be configured to reside on multiple physical disks. The fact that a logical volume is located on 1 or more disks is transparent to the user. There is one immediate advantage of having the data spread across a number of individual separate disks which arises from the redundant manner in which the data can be stored in the disk array. The remarkable benefit of this feature is that if any single disk in the array fails, the unit continues to function without loss of data. This is possible because redundancy information is stored separate from the data. The redundancy information, as will be explained, can be a copy of the data or other information that can be used to reconstruct any data that was stored on a failed disk. Secondly, performance increases for specific applications are possible as the effective seek time for finding records on a given disk can potentially be reduced by allowing multiple simultaneous accesses of different blocks on different disks. Alternatively, with a different architecture, the rate at which data is transferred to and from the disk array can be increased significantly over that of a single disk utilizing parallel reads and writes of the data spread across the disks in the array. This function is referred to as “striping the data”. Finally, disk array subsystem maintenance is typically simplified because it is possible to replace (“hot swap”) individual disks and other components while the system continues to function. You no longer have to bring down the system to replace a disk. Page 8-6 Data Protection Disk Arrays Utilities Applications Host Operating System DAC DAC Why Disk Arrays? • High availability through data mirroring or data parity protection. • Better I/O performance through implementation of RAID technology at the hardware level. • Convenience – automatic disk recovery and data reconstruction when mirroring or data parity protection is used. Data Protection Page 8-7 RAID Technologies RAID is an acronym for Redundant Array of Independent Disks. The term was coined in 1988 in a paper describing array configuration and application by researchers and authors Patterson, Gibson and Katz of the University of California at Berkeley. The word redundant implies that data, functions and/or components have been duplicated in the array’s architecture. 
Duplication of data, functions, and hardware ensures that even in the event of a failed drive or other components, data is not lost and is continuously available. The industry currently has agreed upon six RAID configuration levels and designated them as RAID 0 through RAID 5. The physical configuration is dictated to some extent by the choice of RAID level; however, RAID conventions specify more precisely how data is stored on disk. RAID 0 RAID 1 RAID 2 RAID 3 RAID 4 RAID 5 Data striping Disk mirroring Parallel array, hamming code Parallel array with parity Data parity protection, dedicated parity drive Data parity protection, interleaved parity With Teradata, the RAID 1 is most commonly used. RAID 5 (data parity protection) is also available with some arrays. There are other RAID technologies that are defined by specific vendors or are accepted in the data processing industry. For example, RAID 10 or RAID 1+0 (or RAID 0+1) is considered to be “striped mirroring”. RAID level classifications do not imply superiority of one mode over another. Each mode has its rightful application. In fact, these modes of operation can be combined within a single system configuration, within product limitations, to obtain maximum flexibility and performance. The advantages of RAID 1 (compared to RAID 5) include: Superior Performance Mirroring provides the best read and write throughput. Maximizes the performance capabilities of controllers and disk drives. Best performance when a drive has failed. Less reconstruction impact when a drive has failed. Superior Availability Less susceptible to a double disk failure in a RAID drive group. Faster reconstruction of a failed drive - shorter vulnerability period during reconstruction. Superior Price/Performance - the performance advantage of RAID 1 outweighs the additional cost for typical Teradata warehouses. Page 8-8 Data Protection RAID Technologies RAID – Redundant Array of Independent Disks RAID technology provides data protection at the disk drive level. With RAID 1 and RAID 5 technologies, access to the data is continuous even if a disk drive fails. RAID technologies available with Teradata: RAID 1 Disk mirroring, used with NetApp (LSI Logic) and EMC2 Disk Arrays. RAID 5 Data parity protection, interleaved parity, RAID 5 provides more capacity, but less performance than RAID 1. For Teradata: RAID 1 Most useful with typical Teradata data warehouses (e.g., Active Data Warehouses). Most frequently used RAID technology. RAID 5 Most useful when creating archival data warehouses that require less expensive storage and where performance is not as important. Not frequently used with Teradata systems (not covered in this class). Data Protection Page 8-9 RAID 1 – Mirroring RAID 1 is data mirroring protection. The RAID 1 technology requires each primary data disk to have a companion disk or mirror. The contents of the primary disk and the mirror disk are identical. When data is written on the primary disk, a write also occurs on the mirror disk. The mirroring process is invisible to the user. For this reason, RAID 1 is also called transparent mirroring. With RAID solutions, mirroring is managed by the controller, which provides a higher level of performance. Performance is improved because data can be read from either the primary (data) drive or the mirror. The controller decides which read/write assembly (drive actuator) is closest to the requested data. If the primary data disk fails, the mirror disk can be accessed without data loss. 
There is a minor performance penalty if a drive fails because the array controller can read from either drive if both drives are available. If either disk fails, the disk array controller can copy the data from the remaining drive to a replacement drive while normal operations continue. RAID 10 – Striped Mirroring When user data is to be written to the array, the controller instructs the array to write a block of data to one drive pair to the defined stripe depth. Subsequent data blocks are written concurrently to contiguous sectors in the next drive pair to the defined stripe depth. In this manner, data are striped across the array of drives, utilizing multiple drives and actuators. With LSI Logic arrays, striped mirroring is automatic when you create a drive group (with RAID 1 technology) that has multiple mirrored pairs of disks. If an application (e.g., Teradata Database) uniformly distributes data, striped mirroring (RAID 10 or 1+0) and mirroring (RAID 1) will have similar performance. If an application (database) partitions data, striped mirroring (RAID 10) can lead to performance gains over mirroring (RAID 1) because array controllers equally spread I/O’s between channels in the array. Striped Mirroring is NOT necessary with Teradata. Page 8-10 Data Protection RAID 1 – Mirroring • 2 Drive Groups each with 1 mirrored pair of disks • Operating system sees 2 logical disks (LUNs) or volumes • If LUN 0 has more activity, more disk I/Os occur on the first two drives in the array. Disk Array Controller LUN 0 LUN 1 Mirror 1 Disk 3 Mirror 3 Block A0 Block A0 Block B0 Block B0 Block A1 Block A1 Block B1 Block B1 Block A2 Block A2 Block B2 Block B2 Block A3 Block A3 Block B3 Block B3 Disk 1 2 Drive Groups each with 1 pair of mirrored disks Notes: • If the physical drives are 600 GB each, then each LUN or volume is effectively 600 GB. • If both logical units (or volumes) are assigned to an AMP, then the AMP will have approximately 1.2* TB assigned to it. * Actual MaxPerm space will be a little less. Data Protection Page 8-11 RAID 1 Summary RAID 1 characteristics include: Data is fully replicated Easy to understand technology Follows a traditional approach Transparent to the operating system Redundant drive is affected only by write operations RAID 1 advantages include: High I/O rate (small logical block size) Maximum data availability Minor performance penalty with single drive failure No performance penalty in write intensive environments RAID 1 disadvantage is: Only 50% of total disk space is available for user data. Therefore, RAID 1 has 50% overhead in disk space usage. Summary RAID 1 provides high data availability and performance, but storage costs are high. Striped mirroring is not necessary with Teradata. RAID 1 for Teradata - most useful with typical Teradata data warehouses (e.g., Active Data Warehouses). RAID 5 for Teradata - most useful when creating archival data warehouses that require less expensive storage and where performance is not as important. 
Page 8-12 Data Protection RAID 1 Summary Characteristics • data is fully replicated • striped mirroring is possible with multiple pairs of disks in a drive group • transparent to operating system Advantages (compared to RAID 5) • • • • • Provides maximum data availability Mirroring provides the best read and write throughput Maximizes the performance capabilities of controllers and disk drives Minimal performance issues when a drive has failed Less reconstruction impact when a drive has failed Disadvantage • 50% of disk space is used for mirrored data Summary • RAID 1 provides best data availability and performance, but storage costs are higher. • Striped Mirroring is NOT necessary with Teradata. Data Protection Page 8-13 Cliques A clique is a set of Teradata nodes that share a common set of disk arrays. In the event of node failure, all vprocs can migrate to available nodes in the clique. All nodes in the clique must have access to the same disk arrays. The illustration on the facing page shows a three-node clique. In this example, each AMP has 24 AMP vprocs. In the event of node failing, the remaining nodes will attempt to absorb all vprocs from the failed node. Large Cliques A large clique is usually a set of 8 Teradata nodes that share a common set of disk arrays via a set of Fibre Channel switches. In the event of a node failure, AMP vprocs can migrate to the other available nodes in the clique. In this case, work is distributed among 7 nodes and the performance degradation is approximately 14%. After the failed node is recovered/repaired and rebooted, a second restart of Teradata is needed to reuse the node that had failed. The restart will redistribute the AMPs to the recovered node. Acronyms: DAC – Disk Array Controller Page 8-14 Data Protection Cliques Clique – a set of SMPs that share a common set of disk arrays. SMP001-2 0 1 …. SMP001-4 SMP001-3 23 DAC-A 24 DAC-B 25 DAC-A …. DAC-B 48 47 DAC-A 49 …. 71 DAC-B Example of a 2650 clique (3 nodes, no HSN) – 24 AMPs/node. Data Protection Page 8-15 Teradata Vproc Migration If a TPA node (running Teradata) fails, Teradata restarts and the AMP vprocs that were executing on the failed node are started on other nodes within the clique. PE vprocs that are assigned to channel connections do not migrate to another node. PE vprocs that are assigned to gateway connections may or may not (depending on configuration) migrate to another node within the clique. If a node fails, the vprocs from the failed node are distributed between the remaining nodes in the clique. The vconfig.out file determines the node on which vprocs will start if all of the nodes in the clique are available. The following is from a “Get Config” command following the failure of SMP001-4. DBS LOGICAL CONFIGURATION ----------------------------------------------Vproc Number -----0* 1 2 3 : 22 23 24 25 26 : 46 47 48 49 50 : 59 60 61 62 : 71 Page 8-16 Rel. 
Vproc# -----1 2 3 4 : 23 24 1 2 3 : 23 24 25 26 27 : 36 25 26 27 : 36 Node ID -----1-02 1-02 1-02 1-02 : 1-02 1-02 1-03 1-03 1-03 : 1-03 1-03 1-02 1-02 1-02 : 1-02 1-03 1-03 1-03 : 1-03 Movable ------Yes Yes Yes Yes : Yes Yes Yes Yes Yes : Yes Yes Yes Yes Yes : Yes Yes Yes Yes : Yes Crash Count ----0 0 0 0 : 0 0 0 0 0 : 0 0 0 0 0 : 0 0 0 0 : 0 Vproc State ------ONLINE ONLINE ONLINE ONLINE : ONLINE ONLINE ONLINE ONLINE ONLINE : ONLINE ONLINE ONLINE ONLINE ONLINE : ONLINE ONLINE ONLINE ONLINE : ONLINE Config Config Status Type ------------Online AMP Online AMP Online AMP Online AMP : : Online AMP Online AMP Online AMP Online AMP Online AMP : : Online AMP Online AMP Online AMP Online AMP Online AMP : : Online AMP Online AMP Online AMP Online AMP : : Online AMP Cluster/ Host No. -------0 1 2 3 : 22 23 24 25 26 : 46 47 48 49 50 : 59 60 61 62 : 71 RcvJrnl/ Host Type --------On On On On : On On On On On : On On On On On : On On On On : On Data Protection Teradata Vproc Migration Clique – a set of SMPs that share a common set of disk arrays. SMP001-2 …. 48 0 1 …. SMP001-4 SMP001-3 60 59 23 DAC-A 24 DAC-B …. 25 DAC-A 71 …. DAC-B Node Fails 47 DAC-A DAC-B This example illustrates vproc migration without the use of Hot Standby Nodes. After vproc migration, the two remaining nodes each have 36 AMPs. • After failed node is repaired, a second restart is needed for failed node to rejoin the configuration. Data Protection Page 8-17 Hot Standby Nodes (HSN) A Hot Standby Node (HSN) is a node that is part of a clique and the hot standby node is not configured (initially) to execute any Teradata vprocs. If a node in the clique fails, the AMPs from the failed node move to the hot standby node. The performance degradation is 0%. When the failed node is recovered/repaired and restarted, it becomes the new hot standby node. A second restart of Teradata is not needed. Characteristics of a hot standby node are: A node that is a member of a clique. Does not normally participate in the trusted parallel application (TPA). Can be brought into the TPA to compensate for the loss of a node in the clique. Hot Standby Nodes are positioned as a performance continuity feature. Large Cliques A large clique can also utilize a Hot Standby Node (Node). For example, an 8-node large clique with a Hot Standby Node would consist of 7 nodes running Teradata and 1 Hot Standby Node. The performance degradation would be 0% for an all-AMP operation when a node fails in the clique. This configuration is often referred to as a 7+1 configuration. Large Clique configurations have not been supported since the introduction of the 5500. Page 8-18 Data Protection Hot Standby Nodes (HSN) HSN Node 1 A A A X Node 2 Node 3 HSN Node 4 Node 5 Node 6 Node 7 Node 8 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A : : : : : : A: A: A: : : : : : : : : : A A A A A A A A A A A A A A A A A A A A A A A Disk Array Disk Array Disk Array Disk Array Disk Array Disk Array Disk Array Disk Array Disk Array Disk Array 1. Performance Degradation is 0% as AMPs are moved to the Hot Standby Node. 2. When Node 1 is recovered, it becomes the new Hot Standy Node. Data Protection This example illustrates vproc migration using a Hot Standby Node. Page 8-19 Performance Degradation with Node Failure The facing page displays 2 examples of the performance degradation with all-AMP operations that occur when a node fails. 
Note: WL - Workload
The top example illustrates two 3-node cliques and the performance degradation of 50% for an all-AMP operation when a node fails in one of the cliques.
From a simple perspective, if you have 3 nodes in a clique and you lose a node, you would logically think 33% performance degradation. In reality, the performance cost or degradation is 50%. Assume 3 nodes, 72 AMPs, and you execute an all-AMPs query. This query uses 240 CPU seconds per node to complete the query. The 3 nodes use a total of 720 CPU seconds to do the work. Another way to look at it is that each AMP needs 10 CPU seconds, or 72 AMPs x 10 CPU seconds equals 720 CPU seconds of work to be done.
A node fails and now there are 2 nodes to complete the query. There are still 72 AMPs and the query still needs 720 CPU seconds to complete, but now there are only 2 nodes. Each node will need about 360 CPU seconds to complete the query. Each node has about 50% more work to do. This is why it is considered a 50% performance cost.
Another way of looking at a query is from the response time back to the user. From a user perspective, let’s assume that response time back to the user with all 3 nodes normally active is 1 minute (60 seconds) of wall clock time. The wall clock response time with only 2 active nodes is 90 seconds. (Since there are fewer nodes, the query is going to take longer to complete.) From the user perspective, the response time is 50% longer (30/60). It is true that if you take 67% of 90, you will get 60 and you may think that the degradation is 33%. However, 90 seconds is not the normal response time. The normal response time is 60 seconds and the exception is 90 seconds, therefore the performance is worse by 50%. The percentage is calculated from the “normal”.
The bottom example illustrates two cliques, each with three TPA nodes and a hot standby node, and the performance degradation of 0% for an all-AMP operation when a node fails in the clique. This configuration is often referred to as a 3+1 configuration.
Restarts
In the first (top) example, after the failed node is recovered/repaired and rebooted, a second restart of Teradata is needed to reuse the node that had failed. The restart will redistribute the AMPs to the recovered node.
With a hot standby node, when the failed node is recovered/repaired and restarted, it becomes the new hot standby node within the clique. A second restart of Teradata is not needed.

Page 8-20 Data Protection

Performance Degradation with Node Failure
2 Cliques without HSN nodes (3 nodes) – performance degradation of 50% with node failure.
Before the failure: Workload = 6.0, Clique WL = 3.0, Node WL = 1.00 – Clique 1 Nodes 1, 2, 3 and Clique 2 Nodes 4, 5, 6 each carry 1.0.
After the failure (Node 1 marked X): Workload = 6.0, Clique WL = 3.0, Node WL = 1.5 – Clique 1 Nodes 2 and 3 each carry 1.5; Clique 2 Nodes 4, 5, 6 each carry 1.0.
When a node fails, Teradata restarts. After the node is repaired, a second restart of Teradata is required to allow the node to rejoin the configuration.
2 Cliques each with a HSN (3+1 nodes) – performance degradation of 0% with node failure.
Workload = 6.0 Clique WL = 3.0 Node WL = 1.00 ----------------------Workload = 6.0 Clique WL = 3.0 Node WL = 1.0 Clique 1 Node 1 Clique 1 Node 2 Clique 1 Node 3 Clique 1 Clique 2 Node 4 Clique 2 Node 5 Clique 2 Node 6 1.0 1.0 1.0 HSN Clique 2 HSN 1.0 1.0 1.0 X Clique 1 Node 2 Clique 1 Node 3 Clique 1 (Node 1) Clique 2 Node 4 Clique 2 Node 5 Clique 2 Node 6 1.0 1.0 1.0 1.0 1.0 1.0 Clique 2 HSN When a node fails, Teradata restarts. After the node is repaired, it becomes the new Hot Standby Node. A second restart of Teradata is not required. Data Protection Page 8-21 Fallback Fallback protects your data by storing a second copy of each row of a table on an alternative “fallback AMP”. If an AMP fails, the system accesses the fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all of the table data is still available. Users may continue to use Fallback tables without any loss of available data. When a table is created, or any time after its creation, the user may specify whether or not the system should keep a fallback copy. If Fallback is specified, it is automatic and transparent to the user. Fallback guarantees that the two copies of a row will always be on different AMPs. Therefore, if either AMP fails, the alternate row copy is still available on the other AMP. Certainly there is a benefit to protecting your data. However, there are costs associated with that benefit. They are: twice the disk space for storage and twice the I/O for Inserts, Updates, and Deletes. (However, the Fallback option does not require any extra I/O for SELECT operations and the fallback I/O will be performed in parallel with the primary I/O.) The benefits of Fallback include protecting your data from hardware (disk) failure, protecting your data from software (node) failure, automatic recovery and minimum recovery time after repairs or fixes are complete. A hardware (disk) or software (vproc) failure causes an AMP to be taken off-line until the problem is corrected. During this period, Fallback tables are fully available to users. When the AMP is brought back on-line, the associated Vdisk is refreshed to reflect any changes during the off-line period. Page 8-22 Data Protection Fallback A Fallback table is fully available in the event of an unavailable AMP. A Fallback row is a copy of a “Primary row” which is stored on a different AMP. Cluster Cluster AMP 0 Primary rows 2 AMP 1 Benefits of Fallback • Permits access to table data during 11 6 3 AMP off-line period. • Adds a level of data protection Fallback rows 12 5 7 5 1 beyond disk array RAID. • Automatic restore of data changed during AMP off-line. AMP 2 5 12 2 AMP 3 7 6 • Critical for high availability 5 applications. 1 11 3 Cost of Fallback • Twice the disk space for table storage. • Twice the I/O for Inserts, Updates and Deletes. Loss of any two AMPs in a cluster causes RDBMS to halt! Data Protection Page 8-23 Fallback Clusters A cluster is a group of AMPs that act as a single fallback unit. Clustering has no effect on the distribution of the Primary rows of a table. The Fallback row copy however, will always go to a different AMP in the same cluster. The cluster size is set when Teradata is configured and the only choice for new systems is 2AMP clusters. Years ago, AMP clusters ranged from 2 to 16 AMPs per cluster and were commonly set as groups of 4 AMPs. Starting with 5450 systems, all clusters are defined as 2 AMP clusters. 
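As a concrete illustration of the point above that fallback is specified when a table is created, or at any time after its creation, the following is a minimal SQL sketch (the Payroll database, the Employee table, and its columns are hypothetical and used only for illustration):

CREATE TABLE Payroll.Employee, FALLBACK
  (Employee_Number   INTEGER NOT NULL,
   Last_Name         CHAR(20),
   Salary_Amount     DECIMAL(10,2))
UNIQUE PRIMARY INDEX (Employee_Number);

/* Fallback can also be added or removed after the table exists;
   Teradata builds or drops the fallback row copies accordingly. */
ALTER TABLE Payroll.Employee, NO FALLBACK;
ALTER TABLE Payroll.Employee, FALLBACK;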
Should an AMP fail, the primary and fallback row copies stored on that AMP cannot be accessed. However, their alternate copies are available through the other AMPs in the same cluster. The loss of an AMP in a cluster has no effect upon other clusters. It is possible to lose one AMP in each cluster and still have full access to all Fallback-protected table data. If both AMPs fail in a cluster, then Teradata halts. While an AMP is down, the remaining AMPs in the cluster must do their own work plus the work of the down AMP. A small cluster size (e.g., 2 AMP cluster) reduces the chances of have 2 down AMPs in a single cluster which would cause a non-operational configuration. With today’s new systems, a typical cluster size of 2 AMPs provides the best option to maximize availability. Page 8-24 Data Protection Fallback Clusters • A Fallback cluster is a defined set of 2 AMPs across which fallback is implemented. • Loss of one AMP in the cluster permits continued table access. • Loss of two AMPs in the cluster causes the RDBMS to halt. Cluster 0 Cluster 1 Cluster 2 Cluster 3 AMP 0 AMP 1 AMP 2 AMP 3 Primary rows 62 8 Fallback rows 41 66 7 34 22 50 5 78 58 93 20 88 2 AMP 4 Primary rows Fallback rows 41 66 62 8 Data Protection AMP 5 7 19 4 45 14 1 38 17 37 72 AMP 6 58 93 20 88 34 22 50 5 2 78 4 AMP 7 45 19 17 37 72 14 1 38 Page 8-25 Fallback and RAID Protection RAID 1 mirroring and RAID 5 data parity protection provide protection in the event of a disk drive failure. Fallback provides another level of data protection beyond disk mirroring or data parity protection. Examples of other failures that Fallback provides protection against include: Page 8-26 Multiple drive failures in the same drive group An array is not available (e.g., both disk array controllers fail in the disk array) An AMP is not available (e.g., a software problem) Data Protection Fallback and RAID Protection • RAID 1 Mirroring or RAID 5 Data Parity Protection provides protection in the event of disk drive failure. – Provides protection at a hardware level – Teradata is unaware of the RAID technology used • Fallback provides an additional level of data protection and provides access to data when an AMP is not available (not online). • Additional types of failures that Fallback protects against include: – Multiple drives fail in the same drive group, – Disk array is not available • Both disk array controllers fail in a disk array • Two of the three power supplies fail in a disk array – AMP is not available (e.g., software or data error) • The combination of RAID 1 and Fallback provides the highest level of availability. Data Protection Page 8-27 Fallback and RAID 1 Example The next set of pages contains an example of how Fallback and RAID 1 Mirroring work together. Page 8-28 Data Protection Fallback and RAID 1 Example This example assumes that RAID 1 Mirroring is used and the table is fallback protected. AMP 0 AMP 1 AMP 2 AMP 3 Vdisk Primary rows 62 8 27 34 22 50 15 78 99 19 39 28 Fallback rows 15 78 99 19 39 28 62 8 27 34 22 50 Primary RAID 1 Mirrored Pair of Physical Disk Drives Fallback Primary Fallback Data Protection 62 8 27 15 78 99 Primary 62 8 27 15 78 99 Primary Fallback Fallback 34 22 50 19 38 28 Primary 34 22 50 19 38 28 Primary Fallback Fallback 15 78 99 62 8 27 Primary 15 78 99 62 8 27 Primary Fallback Fallback 19 39 28 34 22 50 19 39 28 34 22 50 Page 8-29 Fallback and RAID 1 Example (cont.) The example of how Fallback and RAID 1 Mirroring work together is continued. 
In this example, one disk drive has failed in the first drive group. Is Fallback needed? No. As a matter of fact, Teradata doesn’t even realize that the drive has failed. The disk array continues to provide access to the data directly from the second disk drive in the drive group. The disk array controller will send a “fault” or error message to the AWS. Page 8-30 Data Protection Fallback and RAID 1 Example (cont.) Assume one disk drive fails. Is Fallback needed in this example? AMP 0 AMP 1 AMP 2 AMP 3 Vdisk Primary rows 62 8 27 34 22 50 15 78 99 19 39 28 Fallback rows 15 78 99 19 39 28 62 8 27 34 22 50 Primary RAID 1 Mirrored Pair of Physical Disk Drives Fallback Primary Fallback Data Protection 62 8 27 15 78 99 Primary 62 8 27 15 78 99 Primary Fallback Fallback 34 22 50 19 38 28 Primary 34 22 50 19 38 28 Primary Fallback Fallback 15 78 99 62 8 27 Primary 15 78 99 62 8 27 Primary Fallback Fallback 19 39 28 34 22 50 19 39 28 34 22 50 Page 8-31 Fallback and RAID 1 Example (cont.) The example of how Fallback and RAID 1 Mirroring work together is continued. In this example, assume two disk drives have failed – one in the first drive group and one in the third drive group. Is Fallback needed? No. Like before, Teradata doesn’t even realize that the drives have failed. The disk array continues to provide access to the data directly from the second disk drive each of the drive groups. The disk array controller will send “fault” or error messages to the AWS. Page 8-32 Data Protection Fallback and RAID 1 Example (cont.) Assume two disk drives have failed. Is Fallback needed in this example? AMP 0 AMP 1 AMP 2 AMP 3 Vdisk Primary rows 62 8 27 34 22 50 15 78 99 19 39 28 Fallback rows 15 78 99 19 39 28 62 8 27 34 22 50 Primary RAID 1 Mirrored Pair of Physical Disk Drives Fallback Primary Fallback Data Protection 62 8 27 15 78 99 Primary 62 8 27 15 78 99 Primary Fallback Fallback 34 22 50 19 38 28 Primary 34 22 50 19 38 28 Primary Fallback Fallback 15 78 99 62 8 27 Primary 15 78 99 62 8 27 Primary Fallback Fallback 19 39 28 34 22 50 19 39 28 34 22 50 Page 8-33 Fallback and RAID 1 Example (cont.) The example of how Fallback and RAID 1 Mirroring work together is continued. In this example, assume two disk drives have failed – both failed drives are in the first drive group. Is Fallback needed? Yes, if you need to access the data in this table. When multiple disk drives fail in a drive group, the data (Vdisk) is not available and the AMP goes into a FATAL state. At this point, Teradata does realize that an AMP is not available and Teradata restarts. The disk array controller will send “fault” or error messages to the AWS. The AWS will also get “fault” messages indicating that Teradata has restarted. Page 8-34 Data Protection Fallback and RAID 1 Example (cont.) Assume two disk drives have failed in the same drive group. Is Fallback needed? AMP 0 AMP 1 AMP 2 AMP 3 Vdisk Primary rows 62 8 27 34 22 50 15 78 99 19 39 28 Fallback rows 15 78 99 19 39 28 62 8 27 34 22 50 Primary RAID 1 Mirrored Pair of Physical Disk Drives Fallback Primary Fallback Data Protection 62 8 27 15 78 99 Primary 62 8 27 15 78 99 Primary Fallback Fallback 34 22 50 19 38 28 Primary 34 22 50 19 38 28 Primary Fallback Fallback 15 78 99 62 8 27 Primary 15 78 99 62 8 27 Primary Fallback Fallback 19 39 28 34 22 50 19 39 28 34 22 50 Page 8-35 Fallback and RAID 1 Example (cont.) The example of how Fallback and RAID 1 Mirroring work together is continued. 
In this example, assume three disk drives have failed – two failed drives are in the first drive group and one failed drive is in the third drive group. Is Fallback needed? Yes, if you need to access the data in this table. When multiple disk drives fail in a drive group, the data (Vdisk) is not available and the AMP goes into a FATAL state. However, the third AMP is still operational and online. Page 8-36 Data Protection Fallback and RAID 1 Example (cont.) Assume three disk drive failures. Is Fallback needed? Is the data still available? AMP 0 AMP 1 AMP 2 AMP 3 Vdisk Primary rows 62 8 27 34 22 50 15 78 99 19 39 28 Fallback rows 15 78 99 19 39 28 62 8 27 34 22 50 Primary RAID 1 Mirrored Pair of Physical Disk Drives Fallback Primary Fallback Data Protection 62 8 27 15 78 99 Primary 62 8 27 15 78 99 Primary Fallback Fallback 34 22 50 19 38 28 Primary 34 22 50 19 38 28 Primary Fallback Fallback 15 78 99 62 8 27 Primary 15 78 99 62 8 27 Primary Fallback Fallback 19 39 28 34 22 50 19 39 28 34 22 50 Page 8-37 Fallback vs. non-Fallback Tables Summary Fallback tables have a major advantage in terms of availability and recoverability. They can withstand an AMP failure in each cluster and maintain full data availability. A second AMP failure in any cluster results in a system halt. A manual restart of the system is required in this circumstance. Non-Fallback tables are affected by the loss of any one AMP. The table continues to be accessible, but only for those AMPs that are still on-line. A one-AMP Primary Index access is possible, but a full table scan is not. Fallback tables are easily recovered after a failure due to the availability of Fallback rows. Non-Fallback tables may only be restored from external medium in the event of a disaster. Page 8-38 Data Protection Fallback vs. non-Fallback Tables Summary FALLBACK TABLES One AMP Down AMP AMP AMP AMP - Data fully available Two or more AMPs Down AMP AMP AMP AMP - If different cluster, data fully available - If same cluster, Teradata halts One AMP Down AMP AMP AMP Two or more AMPs Down AMP AMP AMP Non-FALLBACK TABLES Data Protection AMP AMP - Data partially available; queries that avoid down AMP succeed. - If different cluster, data partially available; queries that avoid down AMP succeed. - If same cluster, Teradata halts Page 8-39 Clusters and Cliques As you know, a cluster is a group of AMPs that act as a single fallback unit. A clique is a set of Teradata nodes that share a common set of disk arrays. Clusters provide data access protection in the event of an AMP failure (usually because of a Vdisk failure). Cliques provide protection from SMP node failures. The best availability for Teradata is to spread clusters across different cliques. The “Default Cluster” function of the CONFIG utility does this automatically. The example on the facing page illustrates a 4+2 node system. Each clique consists of 3 nodes (2 TPA plus one Hot Standby Node – HSN) connected to a set of disk arrays with 240 disks. This example assumes each node is configured with 30 AMPs. Page 8-40 Data Protection Clusters and Cliques SMP001-7 Clique 0 0 1 … SMP002-6 29 30 SMP003-7 Clique 1 60 61 … 31 … SMP002-7 59 SMP004-6 89 90 91 … Hot-Standby Node 240 Disks in Disk Arrays for Clique 0 SMP004-7 119 Hot-Standby Node 240 Disks in Disk Arrays for Clique 1 Cluster 0 – AMPs 0 and 60 Cluster 1 – AMPs 1 and 61 To provide the highest availability, the goal is to interleave clusters across cliques and cabinets. 
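The fallback versus non-fallback distinction summarized above is recorded in the data dictionary, so it can be checked with a simple query. This is a hedged sketch using the standard DBC view (the Payroll database name is hypothetical):

SELECT   TableName, ProtectionType        /* 'F' = Fallback, 'N' = No Fallback */
FROM     DBC.TablesV
WHERE    DatabaseName = 'Payroll'
AND      TableKind = 'T'                  /* restrict to data tables */
ORDER BY TableName;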
Data Protection Page 8-41 Locks Locking prevents multiple users who are trying to change the same data at the same time from violating the data's integrity. This concurrency control is implemented by locking the desired data. Locks are automatically acquired during the processing of a request and released at the termination of the request. In addition, users can specify locks. There are four types of locks: Exclusive, Write, Read, and Access. Exclusive locks are only applied to databases or tables, never to rows. They are the most restrictive type of lock; all other users are locked out. Exclusive locks are used rarely, most often when structural changes are being made to the database. Write locks enable users to modify data while locking out all other users except readers not concerned about data consistency (Access lock readers). Until a Write lock is released, no new read or write locks are allowed. Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which no modification of the data is permitted. Access locks can be specified by users who are not concerned about data consistency. The use of an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on large tables that are updated only by small singlerow changes. Access locks are sometimes called “stale read” locks, i.e. you may get ‘stale data’ that hasn’t been updated. Three levels of database locking are provided: Database Table Row Hash - locks all objects in the database - locks all rows in the table or view - locks all rows with the same row hash The type and level of locks are automatically chosen based on the type of SQL command issued. The user has, in some cases, the ability to upgrade or downgrade the lock. For example, if an SQL UPDATE command is executed without a WHERE clause, a WRITE lock is placed on the table. If an SQL UPDATE command is executed with a WHERE clause that specifies a Primary Index value, then a row hash lock is used. Page 8-42 Data Protection Locks There are four types of locks: Exclusive – prevents any other type of concurrent access Write – prevents other reads, writes, exclusives Read – prevents writes and exclusives Access – prevents exclusive only Locks may be applied at three levels: Database – applies to all tables/views in the database Table/View – applies to all rows in the table/views Row Hash – applies to all rows with same row hash Lock types are automatically applied based on the SQL command: Data Protection SELECT – applies a Read lock UPDATE – applies a Write lock CREATE TABLE – applies an Exclusive lock Page 8-43 Locking Modifier This option precedes an SQL statement and locks a database, table, view, or row hash. The locking modifier overrides the default usage lock that Teradata places on a database, table, view, or row hash in response to a request. Note: The DROP TABLE access right is required on the table in order to upgrade a READ or WRITE LOCK to an EXCLUSIVE LOCK. ACCESS Access locks have many advantages. This allows quick access to data, even if other requests are updating the data. They also have minimal effect on locking out others – when you use an access lock; virtually all requests are compatible with your lock except exclusive locks NOWAIT If a resource is locked and an application does not want to wait for that lock to be released, the Locking Modifier NOWAIT option can be used. 
The NOWAIT option indicates that if the lock cannot be obtained, then the statement will be aborted. This option is used in situations where it is not desirable to have a statement wait for resources, possibly also tying up resources in the process of waiting.
Example:
LOCKING TABLE tablename FOR WRITE NOWAIT UPDATE ….. ;
*** Failure 7423 Object already locked and NOWAIT. Transaction Aborted. Statement# 1, Info =0
The user is informed with a 7423 error status code that indicates the lock could not be placed due to an existing, conflicting lock.

Page 8-44 Data Protection

Locking Modifier
The locking modifier overrides the default usage lock that Teradata places on a database, table, view, or row hash in response to a request. Certain locks can be upgraded or downgraded:
LOCKING ROW FOR ACCESS SELECT * FROM Table_A;
An “Access Lock” allows the user to access (read) an object that has a READ or WRITE lock associated with it. In this example, even though an access row lock was requested, a table level access lock will be issued because the SELECT causes a full table scan. Note: A "Locking Row" request must be followed by a SELECT.
LOCKING TABLE Table_B FOR EXCLUSIVE UPDATE Table_B SET A = 2011;
This request asks for an exclusive lock, effectively upgrading the lock.
LOCKING TABLE Table_C FOR WRITE NOWAIT UPDATE Table_C SET A = 2012;
The NOWAIT option is used if you do not want your transaction to wait in a queue. NOWAIT effectively says to abort the transaction if the locking manager cannot immediately place the necessary lock. Error code 7423 is returned if the lock request is not granted.

Data Protection Page 8-45

Rules of Locking
As the facing page illustrates, a new lock request must wait (queue) behind other incompatible locks that are either in queue or in effect. The new Read lock must wait until the write lock ahead of it is released before it goes into effect. In the second example, the second Read lock request may occupy the same position in the queue as the Read lock that was already there. When the current Write lock is released, both requests may be given access concurrently. This only happens when locks are compatible.
When an SQL statement provides row hash information, a row hash lock will be used. If multiple row hashes within the table are affected, a table lock is used.

Page 8-46 Data Protection

Rules of Locking
Rule: Lock requests are queued behind all outstanding incompatible lock requests for the same object.

LOCK LEVEL HELD    LOCK REQUEST:  ACCESS    READ      WRITE     EXCLUSIVE
NONE                              Granted   Granted   Granted   Granted
ACCESS                            Granted   Granted   Granted   Queued
READ                              Granted   Granted   Queued    Queued
WRITE                             Granted   Queued    Queued    Queued
EXCLUSIVE                         Queued    Queued    Queued    Queued

Example 1 – New READ lock request goes to the end of queue. New request Lock queue Current lock READ WRITE READ New lock queue READ WRITE Current lock READ
Example 2 – New READ lock request shares slot in the queue. New request Lock queue Current lock READ READ WRITE New lock queue READ Current lock WRITE READ

Data Protection Page 8-47

Access Locks
Access locks have many advantages. They allow quick access to data, even if other requests are updating the data. They also have minimal effect on locking out others – when you use an access lock, virtually all requests are compatible with yours.
When doing large aggregations of numbers, it may be inconsequential if certain rows are being updated during the summation, particularly if one is only looking for approximate totals. Access locks are ideal for this situation.
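For example, a minimal sketch of the access-lock technique just described (the table, view, and column names are hypothetical):

LOCKING TABLE Sales_History FOR ACCESS
SELECT   Store_Id, SUM(Sale_Amt) AS Total_Sales
FROM     Sales_History
GROUP BY Store_Id;

/* The same modifier is commonly coded into views so that every query
   submitted through the view reads with an access ("stale read") lock. */
REPLACE VIEW V_Sales_History AS
  LOCKING TABLE Sales_History FOR ACCESS
  SELECT * FROM Sales_History;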
Looking at Example 3, what happens to the Write lock request when the Read lock goes away? Looking at the chart, it will be “Granted” since Write and Access are considered compatible.
Another example not shown on the facing page: Assume user1 is in ANSI mode and has updated a row, but hasn't entered COMMIT yet. The actual row in the table is updated on disk; the before-image is located in the TJ of the WAL log in case user1 decides to ROLLBACK. If user2 accesses this row with an access lock, the updated row on disk is returned even though it is locked and not committed yet. Assume user1 issues a ROLLBACK, then the before-image in the TJ is used to rollback the row on disk. If user2 selects the row a second time, user2 will get the row (original) that is now on disk.

Page 8-48 Data Protection

Access Locks
Rule: Lock requests are queued behind all outstanding incompatible lock requests for the same object.

LOCK LEVEL HELD    LOCK REQUEST:  ACCESS    READ      WRITE     EXCLUSIVE
NONE                              Granted   Granted   Granted   Granted
ACCESS                            Granted   Granted   Granted   Queued
READ                              Granted   Granted   Queued    Queued
WRITE                             Granted   Queued    Queued    Queued
EXCLUSIVE                         Queued    Queued    Queued    Queued

Example 3 – New ACCESS lock request granted immediately. New request Lock queue Current lock New lock queue Current locks ACCESS WRITE READ WRITE READ ACCESS
Advantages of Access Locks
• Permit quicker access to table in multi-user environment.
• Have minimal ‘blocking’ effect on other queries.
• Very useful for aggregating large numbers of rows.
Disadvantages of Access Locks
• May produce erroneous results if used during table maintenance.

Data Protection Page 8-49

Transient Journal
The Transient Journal exists to permit the successful rollback of a failed transaction. Transactions are not committed to the database until an End Transaction request has been received by the AMPs, either implicitly or explicitly. Until that time, there is always the possibility that the transaction may fail in which case the participating table(s) must be restored to their pre-transaction state.
The Transient Journal maintains a copy of all before images of all rows affected by the transaction. In the event of transaction failure, the before images are reapplied to the affected tables, the images are deleted from the journal, and a rollback operation is completed. In the event of transaction success, at the point of transaction commit, the before images for the transaction are discarded from the journal.
In summary, if a transaction fails (for whatever reason), the before images in the transient journal are used to return the data (in the tables involved in the transaction) to its original state.

Page 8-50 Data Protection

Transient Journal
Transient Journal – provides transaction integrity
• A journal of transaction “before images” (UNDO rows) maintained within WAL.
• Provides for automatic rollback in the event of TXN failure.
• Is automatic and transparent.
• “Before images” are reapplied to table if a transaction fails.
• “Before images” are discarded upon transaction completion.
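The successful and failed transaction walk-throughs that follow correspond to an explicit (Teradata session mode) transaction such as the following sketch; the bank tables, columns, and account number are hypothetical:

BT;                                     /* BEGIN TRANSACTION */
UPDATE Bank.Checking_Acct
SET    Balance = Balance + 100
WHERE  Account_Nbr = 1001;              /* before-image of the row written to the TJ */
UPDATE Bank.Savings_Acct
SET    Balance = Balance - 100
WHERE  Account_Nbr = 1001;              /* before-image of the row written to the TJ */
ET;                                     /* END TRANSACTION – before-images discarded */

If either UPDATE fails before the ET is processed, the before-images are reapplied and the entire transaction is rolled back.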
Successful TXN BEGIN TRANSACTION UPDATE Row A – Before image Row A recorded (Add $100 to checking) UPDATE Row B – Before image Row B recorded (Subtract $100 from savings) END TRANSACTION – Discard before images Failed TXN BEGIN TRANSACTION UPDATE Row A UPDATE Row B (Failure occurs) (Rollback occurs) (Terminate TXN) Data Protection – Before image Row A recorded – Before image Row B recorded – Reapply before images – Discard before images Page 8-51 Recovery Journal for Down AMPs After the loss of any AMP, a Down-AMP Recovery Journal is started automatically. Its purpose is to log any changes to rows which reside on the down AMP. Any inserts, updates, or deletes affecting rows on the down AMP, are applied to the Fallback copy within the cluster. The AMP that holds the Fallback copy logs the Row ID in its Recovery Journal. This process continues until such time as the down AMP is brought back on-line. As part of restart activity, the Recovery Journal is read and changed rows are applied to the recovered AMP. When the journal has been exhausted, it is discarded and those tables that are fallback-protected are fully recovered. Page 8-52 Data Protection Recovery Journal for Down AMPs Recovery Journal is: Automatically activated when an AMP is taken off-line. Maintained by the other AMPs in a cluster. Totally transparent to users of the system. While AMP is off-line Journal is active. Table updates continue as normal. Journal logs Row IDs of changed rows for down-AMP. When AMP is back on-line Restores rows on recovered AMP to current status. Journal discarded when recovery complete. AMP 0 AMP 1 AMP 2 AMP 3 Vdisk Primary rows 62 8 27 34 22 50 5 78 19 14 1 38 Fallback rows 5 78 19 19 38 8 62 8 27 50 27 78 Recovery Journal Data Protection TableID/RowID – 62 TableID/RowID – 5 Page 8-53 Permanent Journal The purpose of the Permanent Journal is to provide selective or full database recovery to a specified point in time. It permits recovery from unexpected hardware or software disasters. The Permanent Journal also has the effect of reducing the need for full table backups which can be costly both in time and resources. The Permanent Journal is an optional journal and its features must be customized to the specific needs of the installation. The journal may capture before images (for rollback), after images (for rollforward), or both. Additionally, the user must specify if single images (default) or dual images (for fault-tolerance) are to be captured. A Permanent Journal may be shared by multiple tables or multiple databases. The journal captures images concurrently with standard table maintenance and query activity. The cost in additional required disk space may be calculated in advance to ensure adequate disk reserve. The journal is periodically dumped to external media, thus reducing the need for full table backups – in effect, only the changes are backed up. Page 8-54 Data Protection Permanent Journal The Permanent Journal is an optional, user-specified, system-maintained journal which is used for recovery of a database to a specified point in time. The Permanent Journal: • Is used for recovery from unexpected hardware or software disasters. • May be specified for ... – One or more tables – One or more databases • • • • • • • Permits capture of Before Images for database rollback. Permits capture of After Images for database rollforward. Permits archiving change images during table maintenance. Reduces need for full table backups. Provides a means of recovering NO FALLBACK tables. 
Requires additional disk space for change images. Requires user intervention for archive and recovery activity. Data Protection Page 8-55 Archiving and Recovering Data The purpose of the ARC utility is to allow for the archiving and restoring of database objects which may have been damaged or lost. There are several scenarios where restoring objects from external media may be necessary. Restoring of non-Fallback tables after a disk failure. Restoring of tables which have been corrupted by batch processes which may have left the data in an ‘uncertain’ state. Restoring of tables, views or macros which have been accidentally dropped by the user. Miscellaneous user errors resulting in damaged or lost database objects. Teradata’s Backup and Recovery (BAR) architecture provides solutions from Teradata Partners. Two examples are: NetVault – from BakBone software NetBackup – from Symantec (Veritas NetBackup by Symantec) The ASF2 utility is an older utility that provides an X Windows based front-end for creation and execution of ARC command scripts. It is designed to run on UNIX MP-RAS. Page 8-56 Data Protection Archiving and Recovering Data ARC • • • • • The Archive/Restore utility (arcmain) Runs on IBM, UNIX MP-RAS, Windows 2003, and Linux systems Archives and restores data from/to Teradata Database Restores or copies data from archive media Permits data recovery to a specified checkpoint using Permanent Journals Backup and Recovery (BAR) • Example of BAR choices from different Teradata Partners – NetBackup - Veritas NetBackup by Symantec – Tivoli Storage Manager – utilizes TARA • • • • Provides Windows front end for ARC Easy creation of scripts for archive/recovery Provides job scheduling and tape management functions BAR was previously referred to as Open Teradata Backup (OTB) Data Protection Page 8-57 Module 8: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 8-58 Data Protection Module 8: Review Questions Match the item to a lettered description. ____ 1. Database locks ____ 2. Table locks ____ 3. Row Hash locks ____ 4. FALLBACK ____ 5. Cluster ____ 6. Recovery journal ____ 7. Transient journal ____ 8. ARC ____ 9. NetBackup/Tivoli ____ 10. Permanent journal ____ 11. Disk Array Data Protection a. b. c. d. e. f. g. h. i. j. k. Provides for TXN rollback in case of failure Teradata Backup and Recovery applications Protects all rows of a table Logs changed rows for down AMP Provides for recovery to a point in time Applies to all tables and views within Multi-platform archive utility Lowest level of protection granularity Protects tables from AMP failure Protects database from a physical drive failure Group of AMPs used by Fallback Page 8-59 Notes Page 8-60 Data Protection Module 9 Introduction to MPP Systems After completing this module, you will be able to: Specify a major difference between a 6650 and a 6690 system. Specify a major difference between a 2650 and a 2690 system. Define the purpose of the major subsystems that are part of an MPP system. Specify the names of the Teradata (TPA) nodes in a 6690 cabinet. Teradata Proprietary and Confidential Introduction to MPP Systems Page 9-1 Notes Page 9-2 Introduction to MPP Systems Table of Contents Teradata Systems ......................................................................................................................... 
9-4 SMP Architecture ......................................................................................................................... 9-6 Hyper-Threading and Multi-Core CPUs ...................................................................................... 9-8 Comparing Performance of Servers ........................................................................................... 9-10 Cabinet or Rack Pictures ............................................................................................................ 9-12 Teradata 6650 Systems .............................................................................................................. 9-14 Teradata 6650 Cabinets .............................................................................................................. 9-16 Adding SSD to a 6650 (Future) ................................................................................................. 9-18 Teradata 6650 Configuration Examples..................................................................................... 9-20 Teradata 6690 Systems .............................................................................................................. 9-22 Teradata 6690 Cabinets .............................................................................................................. 9-24 Teradata Extended Nodes .......................................................................................................... 9-26 Making Sense of the Different Platforms................................................................................... 9-28 Linux Coexistence Combinations .............................................................................................. 9-30 Teradata Appliance Introduction................................................................................................ 9-32 Teradata 2650/2690 Appliances ................................................................................................. 9-34 Teradata 2650/2690 Cabinets..................................................................................................... 9-36 Appliance Configuration Examples ........................................................................................... 9-38 What is the BYNET™? ............................................................................................................. 9-40 BYNET 32 Switches .................................................................................................................. 9-42 BYNET 64 Switches .................................................................................................................. 9-44 BYNET Expansion Switches ..................................................................................................... 9-46 BYNET Expansion to 1024 Nodes ............................................................................................ 9-46 Server Management with SWS .................................................................................................. 9-48 Node Naming Conventions ........................................................................................................ 9-50 Summary .................................................................................................................................... 9-52 Module 9: Review Questions ..................................................................................................... 
9-54 Introduction to MPP Systems Page 9-3

Teradata Systems
As the competitive needs of businesses change, the system architecture changes over time. To be best-in-class, an information processing system in today's environment will typically have the following characteristics.
• Utilization of multiple processors in multiple nodes to achieve acceptable performance.
• Easily scalable in both processing power and data storage capacity with adherence to all industry-standard interfaces.
• Be capable of handling very large databases, rapidly processing complex queries, maintaining data security, and being accessible to the total enterprise.
• Support on-line transaction processing as well as decision support applications.
In today’s global and highly competitive markets, computing systems (especially enterprise servers) need to be available to the world 24 hours a day.
TPerf (Traditional Performance) is a power metric that has been used in a rigorous and consistent manner for each generation of the Teradata platform since the model 5100. It is a metric for how fast a node can process data. TPerf is maximized when there is a balance between CPU and I/O bandwidth.
When used to compare different Teradata configurations, the TPerf metric is similar to other throughput metrics, such as rows/second or transactions/second that a node processes where actual data volumes in terms of bytes are not reflected in the metric. Data capacity is not a function of a general or design center TPerf used by sales and engineering to compare Teradata systems, that is, this metric assumes there is a constant database volume in place when comparing one system to another.
TPerf is a power metric that measures the throughput performance of the TPerf workload. It is not a response time metric for specific queries and operations. Response time depends on a number of factors in the Teradata architecture in addition to the ones that TPerf gauges, i.e., CPU power and I/O performance. Other factors influencing response time include, but are not limited to:
1. Parallelism provided by the number of AMPs
2. Concurrency (competition among queries)
3. Workload mix
4. Workload management
TPerf is analogous to the pulling Power of a train locomotive. The “Load” is the work the Node operates on. The data space is analogous to the freight cars in a train. You would need twice as big a locomotive to pull twice as many cars. To have the same performance with twice as much data and load on a system, you would need a system with a TPerf that is twice (2x) as large.

Page 9-4 Introduction to MPP Systems

Teradata Systems
Examples of systems used with the Teradata Database include:
• 5400/5450 – up to 10 nodes/cabinet
• 5500/555x/56xx – up to 9 nodes/cabinet
• 6650/6680/6690 – up to 4 nodes/cabinet with associated storage
• 15xx/16xx/25xx/26xx – various Appliance systems
The basic building block is the SMP (Symmetric Multi-Processing) node. The power of these nodes will be measured by TPerf – Traditional Performance.
• The Teradata metric for total power of a node or system.
• Determined by measuring system elements and calculating the performance with a representative set of workloads.
Key differences:
• Speed and capacity of SMP nodes and systems
• Cabinet architecture
• BYNET interface cards, switches and speeds
*BYNET V4 – up to 4096 nodes

Introduction to MPP Systems Page 9-5

SMP Architecture
The SMP or “processing node” is the basic building block for Teradata systems.
The processing node contains the primary processor logic (CPUs), memory, and I/O functionality. Teradata is supported on non-Teradata SMP servers with 4 or fewer physical CPU sockets. A Teradata license can only be purchased for an SMP server with up to 4 physical CPUs. The server might have 4 hyper-threading CPUs which look like 8 logical CPUs to the operating system. The server may have two quad-core CPUs which appear to the operating system as 8 CPUs.
Basic definitions of the CPUs used with Teradata servers:
• Hyper-Threading CPUs – one physical CPU (chip) socket, but with 2 control (context) areas – makes 1 CPU look like 2 logical CPUs.
• Dual-core CPUs – one physical CPU (chip) socket, but with two control (context) areas and 2 execution cores – makes 1 CPU look like 2 physical CPUs.
• Quad-core CPUs – one physical CPU (chip) socket, but with four control (context) areas and 4 execution cores – makes 1 CPU look like 4 physical CPUs.
• Quad-core CPUs with Hyper-Threading – one physical CPU (chip) socket, but with 8 control (context) areas and 4 execution cores – makes 1 CPU look like 4 physical CPUs or 8 logical CPUs.
• Six-core CPUs with Hyper-Threading – one physical CPU (chip) socket, but with 12 control (context) areas and 6 execution cores – makes 1 CPU look like 6 physical CPUs or 12 logical CPUs.
5400/5450 nodes have 2 physical chips using Hyper-Threading, effectively 4 logical CPUs. 5500H nodes have 2 dual-core chips, effectively 4 CPUs. 5555C nodes have 1 quad-core chip, effectively 4 CPUs. 5550H and 5555H nodes have 2 quad-core chips, effectively 8 CPUs. 5600H nodes have 2 quad-core chips using hyper-threading, effectively 16 CPUs per node. 2650, 2690, 5650H, 6650H, 6680, and 6690 nodes have 2 six-core chips using hyper-threading, effectively 24 CPUs per node.

Page 9-6 Introduction to MPP Systems

SMP Architecture
SMP (Symmetrical Multi-Processing) Node – basic building block of MPP systems.
• Hyper-Threading CPUs – one CPU socket (chip) with 1 execution core and 2 control (context) areas – makes 1 CPU chip look like 2 logical CPUs.
• Dual-core CPUs – one CPU socket with 2 execution cores – makes 1 chip look like 2 physical CPUs.
• Quad-core CPUs – one CPU socket with 4 execution cores – makes 1 chip look like 4 physical CPUs.
• Quad-core CPUs with Hyper-Threading – one chip socket with 4 execution cores each with 2 control areas – makes 1 CPU chip socket look like 8 logical CPUs.
• Six-core CPUs with Hyper-Threading – one chip socket with 6 execution cores each with 2 control areas – makes 1 CPU chip socket look like 12 logical CPUs.
Other names include node, compute node, processing node, 24-way node, etc.
Key hardware components of a node include:
• CPUs and cache memory
• Memory
• System Bus
• I/O Subsystem
[Diagram: processors (CPUs) and memory connected over the system bus to the I/O subsystem and Fibre Channel adapter.]

Introduction to MPP Systems Page 9-7

Hyper-Threading and Multi-Core CPUs
The facing page illustrates the concept of Hyper-Threading and Multi-Core CPUs. With Hyper-Threading, 2 physical CPUs appear to the Operating System as 4 logical or virtual CPUs. With Dual-Core, 2 physical CPUs appear to the Operating System as 4 physical CPUs. The SMP’s BIOS automatically tells the Operating System that there are 4 CPUs. The Operating System will schedule work as though there are actually 4 CPUs in either case.
The reason for a performance gain with Hyper-Threading is as follows.
When one of the logical processors (control unit) is setting up its data and instruction registers from cache or memory, the execution unit can be executing instructions from the other logical processor. In this way, the execution unit doesn’t have to wait for one of the control units to set up its data and instruction registers – it is effectively kept busy a larger percentage of the time. Some of the benefits of Hyper-Threading include: No software changes required Symmetric Improved CPU Efficiency The reason for a performance gain with Dual-Core CPUs is that there are two control areas and two execution units. One CPU socket is really two physical CPUs. Quad-Core CPUs provide even more processing power with one CPU socket providing four physical CPUs. With Quad-Core, 2 physical CPUs appear to the Operating System as 8 physical CPUs. The SMP’s BIOS effectively tells the Operating System that there are 8 CPUs. With Quad-Core and Hyper-Threading, 2 physical CPUs appear to the Operating System as 16 CPUs. The SMP’s BIOS effectively tells the Operating System that there are 16 CPUs. Notes: Page 9-8 The Operating System schedules work across logical or physical CPUs. The Windows Task Manager or UNIX “pinfo” command actually identifies the CPUs (e.g., 8 with quad-core) for which work can be scheduled. Introduction to MPP Systems Hyper-Threading and Multi-Core CPUs x Control Unit (context area) – Data Registers and Instruction Registers Execution Unit – physical execution of instructions Without Hyper-Threading With Hyper-Threading With Dual-Core CPUs Operating System Operating System Operating System 1 2 1 3 2 4 1 3 With Quad-Core CPUs and H-T With Six-Core CPUs and H-T Operating System Operating System Introduction to MPP Systems 2 4 Page 9-9 Comparing Performance of Servers TPerf is a metric for total Power of a Node or system TPerf = Traditional Performance Analogous to the pulling Power of a train locomotive. The “Load” is the work the Node operates on. The data space is analogous to the freight cars in a train. You would need twice as big a locomotive to pull twice as many cars. To have the same performance with twice as much data and load on a system, you would need a system with a TPerf that is twice (2x) as large. Acronym: H-T is Hyper-Threading Teradata’s Design Center establishes typical system configurations for different Teradata system models. For example, one design center configuration for a 6650 system is cliques of 3+1 nodes, 42 AMPs per node, and two 600 GB mirrored disks for each node. The design center power rating is called TPerf-dc. The process for deriving design center TPerf for a Teradata platform consists of five steps: 1) A diverse se of performance tests is executed on the platform design center configuration for a Teradata platform model. 2) The CPU and IO resource usage and throughput are measured. 3) An analytical model is used to calculate the CPU and IO resource usage of a weighted blend of workloads. 4) The blended workload is compared against the resource capabilities provided by the design center platform configuration. 5) The TPerf metric is then calculated. This design center TPerf (TPerf-dc) represents system throughput potential, in other words, how much work could be done in a given amount of time given a high concurrency workload for that design center hardware configuration. Any system with the same configuration and the same workload mix used in the model will deliver overall performance that matches the level indicated by the TPerf-dc metric. 
TPerf-dc usually does not describe the throughput potential for deployed configurations of Teradata systems. The reality is that business demands require a wide variety of Teradata system configurations to meet specific performance and pricing needs and no customer workload is the same as that for the TPerf-dc model. TPerf-dc plays only a small part in any attempt to estimate response time expectations for the design center configuration and TPerf workload – all the other factors listed above must be considered.

Page 9-10 Introduction to MPP Systems

Comparing Performance of Servers
[Bar chart: relative TPerf by platform generation, from 1.00 for the 5100 to approximately 130 for the 6690. Bars are annotated with socket/core counts (2, 4, 8, and 12 cores), Hyper-Threading (H-T), the Linux operating system, and memory sizes (32 GB, 96 GB) as these features were introduced across the 5250 through 6690 models.]

Page 9-11 Cabinet or Rack Pictures

The Rack Cabinet is an industry standard, 40U rack frame used to house Teradata processing nodes and/or disk arrays.
Measurements
The “U” in the 40U rack term represents a unit of vertical measurement for placement of chassis in the rack. [1U = 4.445 cm (1.75 in.)] This diagram illustrates the depth of the older cabinet, which was 40”.
Teradata systems use an industry standard rack mount architecture and individual chassis that conform to industry standards. Each chassis occupies a specific number of U spaces in the rack. Examples of types of chassis that can be placed in a rack or cabinet include:
• Processing Node (54xx, 55xx, 56xx, and 66xx nodes) – 2U
• BYNET Switch (BYA32S) – 1U
• Server Management Chassis (CMIC) – 1U
The 55xx and 66xx systems use a rack that is 44” deep (4” deeper than previous rack).
Older systems (e.g., 5650) used a separate Teradata SWS (Service Workstation) for operational maintenance of the system. The last SWS was a Dell PowerEdge T710 Server and was available as deskside or rack mount server. Newer systems (e.g., 6690) utilize a VMS (Virtualized Management Server) which consolidates CMIC, SWS, and Teradata Viewpoint functions into a single chassis.

Page 9-12 Introduction to MPP Systems

Cabinet or Rack Pictures
Notes:
• Cabinet Size = 24" W X 77" H X 44" D without doors and side panels
• Improved cable management
– Larger exit hole in base
– Supports inter-rack cabling
[Diagram: node chassis and processor/storage cabinet.]

Introduction to MPP Systems Page 9-13

Teradata 6650 Systems
The Teradata Active Enterprise Data Warehouse 6650 platform is scalable from one to 4,096 Teradata nodes, and can handle more than 15 petabytes of data to support the complex workloads in an active warehouse environment.
The 6650 processing nodes are the newest release of Teradata Servers which supports the Teradata Warehouse solution. These nodes are similar to the 5650 processing nodes, utilizing the Intel Westmere™ six-core CPUs with hyper-threading enabled.
The Teradata Active Enterprise Data Warehouse platform is made up of a combination of cabinet types, depending on the system configuration:
• Processing/storage cabinet
• BYNET cabinet
• Teradata Managed Server (TMS) cabinet
The 6650 provides high availability via the following features:
Hot standby nodes (HSN): One node in a clique can be configured as a hot standby node. Eliminates the degradation of database performance in the event of a node failure in the clique. Tasks assigned to the failed node are completely redirected to the hot standby node.
Hot spare disks: One or more disks per array can be configured as hot spare disks. In the event of a disk failure on a RAID mirrored pair, the contents of the failed disk are copied into a hot spare disk from the mirrored surviving disk to repair the RAID pair. When the failed drive is replaced, a copy back operation occurs to restore data to the replaced drive. Fallback: Data protection can be provided at the table level by automatically storing a copy of each permanent data row of a table on a different or “fallback” AMP. If an AMP fails, the Teradata Database can access the fallback copy and continue operation. The design center recommendations has a different number of AMPs and associated storage per AMP varies depending on the configuration. Page 9-14 1+1 clique – 48 AMPs/node; 192 disks per node 2+1 clique – 30 AMPs/node; 120 disks per node 3+1 clique – 42 AMP/node; 84 disks per node Introduction to MPP Systems Teradata 6650 Systems Features of the 6650 system include: • The Teradata 6650 platform is the first release of unified Node/Storage within a single cabinet in the Active Enterprise Data Warehouse space. • The 6650 is designed to reduce floor space utilization. – The UPS/batteries are not used with the 6650 – In the event of site wide power loss, data integrity is provided by WAL. • The 6650 utilizes up to two Intel® 2.93 GHz six-core CPUs – Two models – 6650C and 6650H • 6650C nodes utilize 1 socket with one six-core CPU and 48 GB of memory • 6650H nodes utilize 2 sockets with two six-core CPUs and 96 GB of memory • 6650C can be used to co-exist with previous generations and 6650H will co-exist with future Active EDW Platform mixed storage offerings • The 6650 can be configured in 1+1, 2+1, and 3+1 cliques. – A 6650 clique consists of either one or two processing/storage cabinets. Each cabinet contains processing nodes and a disk array. • The 6650 can be upgraded to use SSD drives. – 6650 is an SSD Ready platform and prepares for the introduction of Solid State Drives (SSD) in the Active EDW space. Introduction to MPP Systems Page 9-15 Teradata 6650 Cabinets The facing page illustrates various 6650 cabinet configurations. The 66xx and later systems utilize an industry standard rack mount cabinet which provide for excellent air flow and cooling. Similar to previous rack-based systems, this rack contains individual subsystem chassis that are housed in standard rack frames. Subsystems are self-contained, and their configurations — either internal or within a system — are redundant. The design ensures overall system reliability, enhances its serviceability, and enables time and cost efficient upgrades. The key chassis in the rack/cabinet is the node chassis. The SMP node chassis is 2U in height. A Hot Standby Node is required with 6650 systems. For 6650 systems, a clique has a maximum of three TPA nodes with one HSN node. Cabinet Build Conventions The placement of the hardware components in a cabinet follows these general cabinet build conventions: Page 9-16 A 6650 clique consists of either one or two processing/storage cabinets. Each cabinet contains processing nodes and a disk array. The following clique configurations are available: – A two-cabinet 3+1 clique. The first cabinet contains two processing nodes and one disk array. The second cabinet contains one processing node, one hot standby node, and one disk array. – A two-cabinet 2+1 clique. The first cabinet contains one processing node and one disk array. 
The second cabinet contains one hot standby node, one processing node, and one disk array.
– A two-cabinet 1+1 clique. The first cabinet contains one processing node and one disk array. The second cabinet contains one hot standby node and one disk array.
– A one-cabinet 1+1 clique. The cabinet contains one processing node, one hot standby node, and one disk array.
There is 1 CMIC in the first cabinet of each two-cabinet clique. If a system only has one clique, then there is a CMIC in the second cabinet.

Introduction to MPP Systems

Teradata 6650 Cabinets
6650 Characteristics
• Integrated Cabinet with nodes and arrays in same cabinet.
• NetApp array with 2 controllers and 8 drive trays.
– 300, 450, or 600 GB drives
• With 2+1 clique, each AMP is typically assigned to 4 disks (2 mirrored pairs).
– Usually 30 AMPs/node
• With 3+1 clique, each AMP is typically assigned to 2 disks (1 mirrored pair).
– Usually 42 AMPs/node
• No UPSs in cabinet.
[Cabinet diagram: a 6650H 3+1 Clique across 2 cabinets – each cabinet with a secondary SM switch, eight drive trays (16 HDDs each), 6844 array controllers (4U), TPA and HSN nodes, TMS nodes, BYA32S BYNET switches, an SM – CMIC (1U), a primary SM switch, and PDUs.]

Introduction to MPP Systems Page 9-17

Adding SSD to a 6650 (Future)
The facing page illustrates a future option to add Solid State Disks (SSD) to a 6650 cabinet.

Page 9-18 Introduction to MPP Systems

Adding SSD to a 6650 (Future)
SSD Upgrade Steps
• Place SSD arrays in positions 3, 4, and 5.
• Upgrade to 13.10 if not on 13.10.
• Enable TVS.
• Reconfig (no backup/restore required).
If TMS, Channel Servers and/or BYNET switches are installed, they can be moved to another cabinet to make room for the SSD storage.
SSD Arrays use SAS based controllers and 400 GB SSD. Each tray has its own controllers and SSD drives.
[Cabinet diagram: 6650H cabinet after the SSD upgrade – secondary SM switch, remaining HDD drive trays (16 HDDs each), 6844 array controllers (4U), TPA nodes, SSD arrays, TMS nodes, BYA32S BYNET switches, SM – CMIC (1U), primary SM switch, and PDUs.]

Introduction to MPP Systems Page 9-19

Teradata 6650 Configuration Examples
The facing page includes two examples of 6650 cliques. Typically, a 6650 node in a 3+1 clique will be configured with 42 AMPs, 2 disks per AMP, and 96 GB of memory. Current configurations of the 6650 include:

6650H – Design Center
Clique 3+1 H – Drive Options: 300GB, 450GB, or 600GB; HDDs per Node: 84; HDDs per Clique: 252; CPU COD Available: Yes; Allows upgrade to SSD per Node: 28; Disks per AMP: 2; AMPs per Node: 42
Clique 2+1 H – Drive Options: 300GB, 450GB, or 600GB; HDDs per Node: 120; HDDs per Clique: 240; CPU COD Available: Yes; Allows upgrade to SSD per Node: 28; Disks per AMP: 2; AMPs per Node: 30
Clique 1+1 H – Drive Options: 300GB, 450GB, or 600GB; HDDs per Node: 192; HDDs per Clique: 192; CPU COD Available: Yes; Allows upgrade to SSD per Node: 28; Disks per AMP: 2; AMPs per Node: 48

These configurations provide an effective future SSD upgrade path while maintaining optimum AMPs per node for a 6650H.
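The AMP counts and per-AMP space implied by these design center configurations can be confirmed on a running system with a couple of simple queries; this is a hedged sketch using standard DBC views and built-in functions (no system-specific names are assumed):

SELECT HASHAMP() + 1 AS Number_of_AMPs;    /* highest AMP number + 1 */

SELECT   COUNT(DISTINCT Vproc)                          AS Number_of_AMPs,
         SUM(MaxPerm) / 1E12                            AS Total_MaxPerm_TB,
         SUM(MaxPerm) / (1E9 * COUNT(DISTINCT Vproc))   AS MaxPerm_GB_per_AMP
FROM     DBC.DiskSpaceV;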
6650C – Design Center (drive options of 300, 450, or 600 GB; CPU COD is available for all configurations; each configuration allows an upgrade to SSD – 14 per node):
– 3+1 C clique: 42 HDDs per node; 126 HDDs per clique; 2 disks per AMP; 21 AMPs per node
– 2+1 C clique: 60 HDDs per node; 120 HDDs per clique; 4 disks per AMP; 15 AMPs per node
– 1+1 C clique: 96 HDDs per node; 96 HDDs per clique; 4 disks per AMP; 24 AMPs per node
These configurations provide an effective future SSD upgrade path while maintaining optimum AMPs per node for a 6650C.
For both 6650H and 6650C, if CPU Only Capacity on Demand is active, it should be removed to take full advantage of the increased I/O now available. Following the Optimum Performance Configurations will allow the customer to avoid a data reload and maintain their system's AMPs-per-node ratio, thereby reducing the impact of an upgrade.
Page 9-20
Introduction to MPP Systems
Teradata 6650 Configuration Examples
[Slide graphic: a 6650H 2+1 clique (Node 1, Node 2, HSN, and TMS nodes sharing two 120-disk arrays of 600 GB drives) and a 6650H 3+1 clique (Nodes 1–3, HSN, and TMS nodes sharing two 126-disk arrays of 600 GB drives). Note: Each disk array will typically have additional global hot spare drives.]
6650H (2+1 nodes/clique): 30 AMPs/node; 60 AMPs/clique; 120 disks per node; 240 disks per clique; each Vdisk – 4 disks (RAID 1); each Vdisk – 1.08 TB*; clique – 60 AMPs x 1.08 TB = 65 TB*
6650H (3+1 nodes/clique): 42 AMPs/node; 126 AMPs/clique; 84 disks per node; 252 disks per clique; each Vdisk – 2 disks (RAID 1); each Vdisk – 540 GB*; clique – 126 AMPs x 540 GB = 68 TB*
* Actual MaxPerm space is approximately 90%.
Introduction to MPP Systems
Page 9-21
Teradata 6690 Systems
The Teradata 6690 platforms utilize Solid State Drives (SSD) and Hard Disk Drives (HDD) within a single cabinet in the Active Enterprise Data Warehouse space. They require Teradata Virtual Storage (TVS), and SSD and HDD storage is maintained within the same drive tray.
The Teradata Active Enterprise Data Warehouse platform is made up of a combination of cabinet types, depending on the system configuration:
Processing/storage cabinet
BYNET cabinet
Teradata Managed Server (TMS) cabinet
Note: A Service Workstation (SWS) is installed in one TMS cabinet. A system may have additional TMS cabinets.
6690 nodes are based on the 6650 processing nodes. Characteristics include:
Page 9-22
Up to two Intel Westmere six-core CPUs – 12 MB L2 cache with Hyper-threading – Small performance increase over 5650; 6680H (126 TPerf)
450 GB OS drives support 96 GB memory
300 GB dump drive for restart performance
Introduction to MPP Systems
Teradata 6690 Systems
Features of the 6690 system include:
• The Teradata 6690 platforms utilize Solid State Drives (SSD) and Hard Disk Drives (HDD) within a single cabinet in the Active Enterprise Data Warehouse space.
 – Requires Teradata Virtual Storage (TVS).
 – SSD and HDD storage is maintained within the same drive tray.
• The 6690 is designed to reduce floor space utilization (similar to the 6650).
 – The UPS/batteries are not used with the 6690 cabinet.
 – Data integrity in the event of site-wide power loss is provided by WAL.
• A 6690 node uses Intel six-core Westmere CPUs with hyper-threading enabled. The 6690 has a faster CPU (3.06 GHz versus 2.93 GHz) than the previous 6680 node.
 – These systems can be configured in 1+1 or 2+1 cliques.
 – A 6690 clique is contained within 1 processing/storage cabinet.
• No co-existence with Active Warehouse 5xxx and not planned for with 6650 systems.
 – The 6690 is ideal for new customers and/or floor sweeps.
– The 6690 will co-exist with future Active EDW Platform mixed storage offerings. Introduction to MPP Systems Page 9-23 Teradata 6690 Cabinets Each Teradata 6690 cabinet can be configured in a 1+1 or 2+1 clique configuration. A processing/storage cabinet contains one clique. A cabinet with a 2+1 clique contains two processing nodes, one hot standby node, and four disk arrays. A cabinet with a 1+1 clique contains one processing node, one hot standby node, and four disk arrays. Virtualized Management Server (VMS) The VMS is available with the 2690 Appliance and the 6690 Enterprise Warehouse Server. Characteristics of the VMS include: • 1U Server that VIRTUALIZES system and cabinet management software onto a single server • Teradata System VMS – provides complete system management functionality – – – – Cabinet Management Interface Controller (CMIC) Service Workstation (SWS) Teradata Viewpoint (single system only) Automatically installed on base/first cabinet • The VMS allows full rack solutions without an additional cabinet for traditional Viewpoint and SWS • Eliminates need for expansion racks reducing customers’ floor space and energy costs • For multi-system monitoring and management traditional Teradata Viewpoint is required. Page 9-24 Introduction to MPP Systems Teradata 6690 Cabinets 6690 Characteristics • Integrated Cabinet with nodes and SSD and HDD arrays in same cabinet. • Each NetApp drive tray can hold up to 24 SSD and/or HDD drives. – SSD drives are 400 GB. – HDD drives (10K RPM) are 600 GB. – Possible maximum of 360 disks in the cabinet. • One NetApp tray has 2 controllers and supports 2 additional expansion trays. • 6690 feature – Virtualized Management Server (VMS) Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Expansion Tray Controllers Expansion Tray * Expansion Tray Controllers Up to 24 SAS Drives Expansion Tray Up to 24 SAS Drives Expansion Tray Up to 24 SAS Drives Controllers VMS (1U) HSN TPA Node TPA Node Up to 24 SAS Drives • No UPSs in cabinet. • There is no room for BYNET switches in this Up to 24 SAS Drives 2+1 Clique in a single cabinet Up to 24 SAS Drives Up to 24 SAS Drives Expansion Tray Up to 24 SAS Drives Expansion Tray Up to 24 SAS Drives Controllers PDU PDU 6690 Introduction to MPP Systems * Up to 24 SAS Drives – Consolidated CMIC, SWS, Teradata Viewpoint cabinet. Therefore, BYNET switches are located in a separate cabinet. Expansion Tray * Not present in a 1+1 Configuration Page 9-25 Teradata Extended Nodes Additional specialized nodes are available to Teradata 55xx, 56xx, and 66xx systems. The various type and possible uses are listed on the facing page. General Notes: All TPA nodes (Teradata Nodes running the Teradata Database) must execute the same Operating System. Non-TPA Nodes and/or Managed Servers, can execute the same or a different Operating System; this is the "mixed OS support". A Non-TPA Node is a Teradata Server (Node) that is BYNET connected, but does not run the Teradata Database. A Non-TPA Node can communicate to the Teradata Database through TCP/IP emulation across the BYNET. A Managed Server is a Teradata Server (Node) that resides in the Teradata System Cabinet (rack mounted) and is connected through a dedicated Ethernet network to the Teradata Database Instance. The purpose of both Non-TPA Nodes and Managed Server Nodes is flexibility. These nodes can be used similar to external application servers for BAR, ETL/ELT, BI, etc. 
Some of the advantages of Non-TPA or Managed Server nodes include a single point of management/maintenance, "pre-built" dedicated network to Teradata Database, and they can often be installed into existing Cabinets, minimizing additional footprint in the data center. Page 9-26 Introduction to MPP Systems Teradata Extended Nodes Examples of extended node types: • Hot Standby Nodes (HSN) – – – BYNET connected spare node that is part of a clique and is used in the event of a node failure. Located in same cabinet as other nodes; managed by SWS • Channel Server (used as interface between Teradata and mainframe (e.g., IBM) – – – – BYNET connected Maximum of 3 ESCON and/or FICON adapters – allows host channel connections Node with 1 Quad-core CPU and 24 GB of memory – improves Teradata performance by offloading the channel workload Located in same cabinet as other nodes; managed by SWS • Teradata Managed Server (TMS) Nodes – – – Not BYNET connected Dell server integrated in processor cabinet for use with Teradata applications • Can be utilized as a Viewpoint, SAS, BAR, Ethernet, TMSM, Data Mover, etc. node Located in same cabinet as other nodes; managed by SWS • Non-TPA Nodes – – – BYNET connected Can be used to execute application software (e.g., ETL) Located in same cabinet as other nodes; managed by SWS Introduction to MPP Systems Page 9-27 Making Sense of the Different Platforms The facing page attempts to provide some perspective of the different platforms. The 4400, 4800, 4850, 5200, and 5250 nodes are based on the Intel Eclipse chassis and Aspen baseboard technology. These nodes are often referred to as Eclipse nodes. The 4455, 4851, 4855, 5251, and 5255 nodes are based on the Intel Koa baseboard technology. These nodes may be referred to as Koa nodes. The 4470, 4900 and 5300 nodes are based on the INTEL Dodson baseboard technology and may be referred to as Dodson nodes. The 4475, 4950 and 5350 nodes are based on the INTEL Hodges baseboard technology and may be referred to as Hodges nodes. The 4480, 4980, and 5380 nodes are based on the INTEL Harlingen baseboard technology and may be referred to as Harlingen nodes. The 5400 and 5450 nodes are based on the INTEL Jarrell baseboard technology and may be referred to as Jarrell nodes. The 155x, 25xx, and 55xx nodes are based on the INTEL Alcolu baseboard technology and may be referred to as Alcolu nodes. The following dates indicate when these systems were generally available to customers (GCA – General Customer Availability). 
– – – – – – – – – – – – – – – – – – – Page 9-28 5100M 4700/5150 4800/5200 4850/5250 4851/4855/5251/5255 4900/5300 4950/5350 4980/5380 5400E/5400H 5450E/5450H 5500E/5500C/5500H 2500/5550H 2550/2555/5555C/H 1550 1600/2580/5600C/H 5650C/H 6650C/H and 6680 2690 6690 January, 1996 (not described in this course) January, 1998 (not described in this course) April, 1999 June, 2000 July, 2001 March, 2002 December, 2002 August, 2003 March, 2005 April, 2006 March, 2007 January, 2008 October, 2008 (2550) and March, 2009 (2555/5555) December, 2008 March, 2010 July, 2010 April, 2011 October, 2011 February, 2012 Introduction to MPP Systems Making Sense of the Different Platforms Model CPU BYNET 2003 2004 5350/5380 (2 – 512 nodes) Intel Xeon 2.8/3.06 GHz BYNET V2.1 2005 2006 5400/5450H (1–1024 nodes) Intel Xeon 3.6/3.8 GHz BYNET V3.0 2007 5500H (1–1024 nodes) Intel Dual-core Xeon CPUs 2.66 GHz BYNET V3.1 2008 2009 5550/5555H (1–1024 nodes) Two Intel Quad-core Xeon CPUs 2.33 GHz BYNET V3.1/V3.2 2010 5600/5650H (1–4096 nodes) Two Intel quad or six-core CPUs 2.66/2.93 GHz BYNET V4.0 2011 6650H/6680/6690 (1–4096 nodes) Two Intel six-core CPUs 2.93/3.06 GHz BYNET V4.0 Introduction to MPP Systems Page 9-29 Linux Coexistence Combinations The facing page illustrates possible Linux coexistence combinations. Page 9-30 Introduction to MPP Systems Linux Coexistence Combinations Coexistence systems contain a mixture of node and storage generations that operate as a single MPP system running the same software. Goal is to have Parallel Efficiency: 5400E/5400H – Xeon 3.6 GHz 5450E/5450H – Xeon 3.8 GHz Utilization of one set of cliques at 100% and the other sets of cliques as close to 100% as possible. Conversion to 64-bit Linux is required if the nodes are not already running 64-bit Linux. 5500C/H – 2/4 core Xeon 2.66 GHz 5550H – 8 core Xeon 2.66 GHz This is done by balancing the workload between the nodes. May need to leverage larger Linux memory. 5555C/H – 4/8 core Xeon 2.33 GHz 5600C/H – 4/8 core Nehalem 2.66 GHz 5650C/H – 6/12 core Westmere 2.93 GHz 6650C/H – 6/12 core Westmere 2.93 GHz 6680/6690 – 12 core Westmere 2.93/3.06 GHz Introduction to MPP Systems 66xx systems can coexist with future systems. Page 9-31 Teradata Appliance Introduction A Teradata appliance is a Teradata server which is optimized specifically for high DSS performance. The first Teradata appliance was the 2500 introduced in 1Q2008. Characteristics of the Teradata appliances include: Delivered Ready to Run – Integrated system fully staged and tested – Includes a robust set of tools and utilities Rapid Time to Value – System live within hours Competitive Price Point – Capacity on Demand available if needed Easy Data and Application Migration to a Teradata EDW/ADW What is an Appliance? An appliance is an instrument or device designed for a particular use. The typical characteristics of an appliance are: Combination of hardware and software designed for a specific function – for example, the 25xx hardware/software is optimized for fast table scans & “Deep Dive” Analytics. Fixed/limited function – designed specifically for Decision Support workloads, the hardware is not configured or optimized for ADW. Fixed capacity/configuration - have a fixed configuration and limited upgrade paths. Ease of installation – fully staged and the integrated design greatly reduces the number of cabinet interconnect cables. Simple to operate – appliances are Teradata system! They have all the Server Management and capabilities used in the MPP systems. 
Teradata Load ‘N Go Services make is easy to quickly implement a new system Load data from operational systems Five easy steps completed in about one month – Step 1 - Build the base database structure – Step 2 - Easy Set-Up Options – Step 3 - Build and test the load scripts using the TPT Wizard – Step 4 - Conduct the initial load – Step 5 - Document and turn load/reload process over to customer No transformations or consolidation into an enterprise data model Users have access to data quickly Enabling new business insights The firmware in the disk array controllers for 25xx systems has been specifically optimized for scan-based workloads. The disk array controller pre-fetches entire cylinder to cache when a cylinder index is accessed by Teradata. Page 9-32 Introduction to MPP Systems Introduction to Teradata Appliances • What is an Appliance? – An appliance is an device designed for a specific function. – Fixed/limited function and fixed capacity/configuration. – Easy to install and simple to operate. • Data Warehouse Appliance – Teradata nodes and storage is integrated into a single cabinet. – Delivered ready to run with rapid time to value. – System live within hours, fully staged and tested. • Powerful – Purpose-built for high analytical performance. – Optimized for fast file scans and heavy “deep dive” analytics. • Cost-Effective – Competitive price point. – Easy data and application migration to a Teradata Enterprise Data Warehouse. • Ideal for Entry Level Data Warehouses, Analytical Sand Boxes, and Test and Development Systems. Introduction to MPP Systems Teradata 2500 Page 9-33 Teradata 2650/2690 Appliances Teradata 2650 Appliance The Data Warehouse Appliance 2650 can have up to 9 nodes in a cabinet. The nodes utilize the Intel Westmere six-core CPU with hyper-threading and 96 GB of memory per node. The Data Warehouse Appliance 2650 comes standard with the BYNET over Ethernet switch. For scalability requirements beyond 275TB you can configure BYNET V4, but special approval is required. Teradata 2690 Appliance The Data Warehouse Appliance 2690 can have up to 8 nodes in a cabinet. The nodes utilize the Intel Westmere six-core CPU (3.06 GHz) with hyper-threading and 96 GB of memory per node. Cliques consist of 2 nodes and no HSN. The Data Warehouse Appliance 2690 comes standard with the BYNET over Ethernet switch. Page 9-34 Introduction to MPP Systems Teradata 2650/2690 Appliances Teradata appliances utilize a fully integrated cabinet design with nodes and disk arrays in the same cabinet. Two examples of appliances are: Teradata 2650 Systems • Nodes use 2 Intel Six-core Westmere CPUs at 2.93 GHz; 96 GB of memory per node • 24 AMPs per node – 24 SAS 300 or 600 GB drives, or 12 SAS 2 TB drives per node • A 2650 cabinet can house up to 9 nodes. – Cliques are in 3 node configurations (no HSN); Cabinets can have 1, 2, or 3 cliques. Teradata 2690 Systems • Nodes use 2 Intel Six-core Westmere CPUs at 3.06 GHz; 96 GB of memory per node • Each node has 2 hardware compression boards • 24 AMPs per node – 24 SAS 300, 600, or 900 GB drives per node (2.5" drives @ 10K RPM) • A 2690 cabinet can house up to 8 nodes. – Cliques are in 2 node configurations (not HSN); a cabinet can have between 1 and 4 cliques. • Utilizes VMS (Virtualized Management Server) – Consolidated CMIC, SWS, Teradata Viewpoint Introduction to MPP Systems Page 9-35 Teradata 2650/2690 Cabinets With the 2650, you can have 3 cabinet type configurations: a 1/3 cabinet, 2/3 cabinet and full cabinet. 
With a full cabinet you have 9 nodes. The disk drives that are supported are 300GB or 600GB drives or 108 2 TB 3.5” drives. To also help improve loading you can configure the system with 10GB Ethernet copper or fiber ports. Cliques are configured with 3 nodes and no HSN. A 1/3 cabinet is designed for lower CPU/Node density per cabinet. A 2/3 cabinet is designed for medium CPU/Node density per cabinet. It is a good solution for mid-size capacity options and provides flexible solutions for adding an integrated SWS or TMS. A fully populated 2650 cabinet is designed for high CPU/Node density per cabinet. It is a good solution for a high capacity system driving high CPU utilization. With the 2690, a cabinet can be configured with up to 4 cliques in 2 node clique configurations (no HSN). A full cabinet will have 8 nodes. The disk drives that are supported are 300GB, 600GB, or 900GB drives. One important new feature of the 2690 is hardware compression. • • • With automatic block level compression, customers can get as much as 3x the customer data space, so the amount of available storage has tripled. System level scan rate (what’s also known as effective scan rate), has increased 3x as well because with compression because 3x more data is scanned. Also, hot cache memory, which is where frequently used results are stored until not needed, has tripled as well because the data/results being stored are compressed. With compression, the system can be pushed higher because compression CPU work has been moved out of the nodes, and that CPU is available for Teradata work. The Teradata Virtualized Management Server is a standard feature on the Data Warehouse Appliance 2690. This 1U managed server rack mounts in the appliance node cabinet and essentially consolidates all Teradata management functionality into one server. The VMS contains the following functionality: • • • Teradata Viewpoint, single system: Teradata Viewpoint is the robust web portal that manages workloads, queries, and systems. SWS: The Teradata SWS is the software that monitors the physical cabinet. This includes the nodes, disks, and connectivity. CMIC: The CMIC monitors all the disk controllers and cabling The VMS is a key reason why full racks can be shipped without having to have a separate expansion cabinet for this functionality. Some considerations include: • Traditional Viewpoint is still available, but it is priced and licensed differently. Please see the Teradata Viewpoint OCI for more information. Also note that VMS Viewpoint can only monitor one system, not multiple • If more than one node cabinet is required, the expansion cabinet will also have a VMS but will only contain the CMIC software as the others aren’t needed. Page 9-36 Introduction to MPP Systems Teradata 2650/2690 Cabinets Disk Array (Dual Array Controllers; 72 Drives) Disk Array Disk Array Disk Array (Dual Array Controllers; 48 Drives) Disk Array Disk Array 2 Node Clique 2 Node Clique 2690 Node 2690 Node 2690 Node 2650 Node Disk Array 3 Node Clique 3 Node Clique 3 Node Clique 2650 Node 2650 Node 2650 Node Disk Array 2650 Node 2650 Node Disk Array 2650 Node 2650 Node 2650 Node 2650 Node Dual AC Box Fully loaded 2650 Cabinet Introduction to MPP Systems Nodes are numbered 2 -10. 2 Node Clique 2 Node Clique 2690 Node 2690 Node 2690 Node 2690 Node Dual AC Box Nodes are numbered 2 -9. Fully loaded 2690 Cabinet Page 9-37 Appliance Configuration Examples The examples on the facing page show a typical AMP and Disk configurations for 2650 and 2690 systems. 
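As a rough illustration of how the appliance configurations on the facing page scale, the following sketch (illustrative only; the names are assumptions, not a Teradata tool) totals AMPs, disks, and memory for 2650 and 2690 cliques and fully populated cabinets, using the per-node figures quoted in this module.

# Illustrative sketch (assumed names): totals for fully populated 2650 and 2690
# appliance cabinets, using the per-node figures quoted in this module.

APPLIANCES = {
    # model: (nodes per full cabinet, nodes per clique, AMPs per node,
    #         disks per node, memory GB per node)
    "2650": (9, 3, 24, 24, 96),
    "2690": (8, 2, 24, 24, 96),
}

for model, (nodes, clique_nodes, amps, disks, mem_gb) in APPLIANCES.items():
    print(f"{model}: {clique_nodes * amps} AMPs and {clique_nodes * disks} disks per clique; "
          f"full cabinet = {nodes * amps} AMPs, {nodes * mem_gb} GB memory")
# Expected: 2650 -> 72 AMPs / 72 disks per clique; 216 AMPs, 864 GB per cabinet
#           2690 -> 48 AMPs / 48 disks per clique; 192 AMPs, 768 GB per cabinet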
Notes: 2650 systems utilize SAS disks (Serial Attached SCSI) – 300 GB and 600 GB disk drives 2650 systems can utilize 2 TB SATA disks (Serial Advanced Technology Attachment) 2690 systems can utilize 300, 600, or 900 GB SAS disk drives. Page 9-38 Introduction to MPP Systems Appliance Configuration Examples 2690 2650 24 Disks – 600 GB 24 Disks – 600 GB • 300 or 600 GB SAS Disks 2.5" – 216 in cabinet 24 Disks – 600 GB 24 Disks – 600 GB • 2 TB Disks 3.5" – 108 in cabinet 24 Disks – 600 GB 24 Disks – 600 GB 24 Disks – 600 GB Node – Westmere CPUs Node – Westmere CPUs Node – Westmere CPUs Node – Westmere CPUs • 3 Node Cliques share 3 • • • drive trays 96 GB Memory / Node 24 AMPs / Node 72 AMPs /Clique Node – Westmere CPUs 24 Disks – 600 GB 24 Disks – 600 GB Node – Westmere CPUs Node – Westmere CPUs 24 Disks – 600 GB • 300, 600, or 900 GB SAS Disks 2.5" – 192 in cabinet 2690 Clique • 2 Node Cliques share 2 drive trays • 96 GB Memory / Node • 24 AMPs / Node • 48 AMPs /Clique • Includes hardware compression. 24 Disks – 600 GB • 24 Disks / Node (RAID 1) • 72 Disks / Clique 24 Disks – 600 GB 24 Disks – 600 GB • 24 Disks / Node (RAID 1) • 48 Disks / Clique Node – Westmere CPUs Node – Westmere CPUs Node – Westmere CPUs 24 Disks – 600 GB Node – Westmere CPUs 2650 Clique Node – Westmere CPUs Node – Westmere CPUs 24 Disks – 600 GB Node – Westmere CPUs 24 Disks – 600 GB 24 Disks – 600 GB 2690 Disk Options 2650 Disk Options 2650 Cabinet with 9 nodes • 216 AMPs • 864 GB memory in cabinet (Up to 9 Nodes in a cabinet) Introduction to MPP Systems Node – Westmere CPUs Node – Westmere CPUs Node – Westmere CPUs 2690 Cabinet with 8 nodes • 192 AMPs • 768 GB memory in cabinet (Up to 8 Nodes in a cabinet) Page 9-39 What is the BYNET™? The BYNET (BanYan Network) provides high performance networking capabilities for MPP systems. The BYNET is a dual-redundant, bi-directional, multi-staged network based on a Banyan network topology. The BYNET enables multiple processing nodes (SMP nodes) to communicate in a high speed, loosely-coupled fashion. BYNET communication occurs in a point-to-point, multi-cast, or broadcast fashion. A connection request contains an address or routing tag for the intended receiving node or group of nodes. Once the connection is made, a circuit is established for the duration of the connection. The BYNET works much like a telephone network where many callers can establish connections, including conference calls. The BYNET interconnect provides a peak bandwidth of x Megabytes (MB) per second for each node per direction connected to a network. V1 – 10 MB V2 – 60 MB V3 – 93.75 MB V4 – 240 MB For example, a BYNET v4 network provides 240 MB x 2 (bi-directional) x 2 (BYNETs) = 960 MB/sec per node. A 10-node 5600 system with a dual BYNET network has the potential raw capability of 9600 MB (or 9.6 GB) per second total bandwidth for point–to– point connection. However, the total available broadcast bandwidth is 960 MB per second for a dual network system of any size. Other features of the BYNET network include: Guaranteed delivery - a message from a node is guaranteed to be delivered without error to the receiving node(s); multiple levels of error checking and acknowledgment are used to ensure this. Fault tolerant - multiple connection paths are available in each network; dual network feature provides an active backup network should one network be lost. Flexible network usage - nodes communicate in point-to-point or broadcast fashion. 
Self-configuring - the BYNET automatically determines network topology at startup; enables ease of installation. Self-diagnosis and automatic fault recovery - automatically detects and reports errors; reconfigures routing of connections to avoid inoperable processing nodes. Load balancing - traffic is automatically and dynamically distributed throughout the networks. Page 9-40 Introduction to MPP Systems What is the BYNET? What is the BYNET (BanYan NETwork)? • High speed interconnect (network) for processing nodes in MPP systems. The BYNET is a dual redundant network. • BYNET works much like a telephone network where many callers (nodes) can establish connections, including conference calls. • BYNET Version 3 Switches – 375 MB/Sec per node • BYNET Version 4 Switches – 960 MB/Sec per node BYNET Switch (v1, v2, v3, or v4) BYNET Switch (v1, v2, v3, or v4) BYNET Switch Examples • • • • BIC Open BYNET SW BYNET 4 switch (v2.1 – 240 MB/sec) BYNET 32 switch (BYA32S) – Can execute at v3 or v4 speed BYNET 64 switch (v3.0 – 12U switches) BYNET 64 switch (v4.0 – 5U switches) BIC ... SMP Introduction to MPP Systems Open BYNET SW BIC (BYNET Interface Card) Examples (these can run at v3 or v4 speeds) • • BIC2SX – used with 54xx nodes BIC2SE – used with 5500 nodes and later SMP Page 9-41 BYNET 32 Switches The facing page contains of an example of a BYNET 32 switches. Examples of other BYNET switches are listed below. This is not an inclusive list. BYNET 4 Switch Version 2 (BYA4G) – a PCI card designed to interconnect up to 4 SMPs. This switch is a BYNET v2 switch (60 MB/sec.) designed for 485x systems. The BYA4G is a PCI card that is placed into a PCI slot of an SMP. BYNET 4 Version 2.1 Switch (BYA4M) – PCI card designed to interconnect up to 4 SMPs. This switch is a BYNET v2.1 switch (60 MB/sec.) designed for 4900 systems. The BYA4M is a PCI card that is placed into a PCI slot of an SMP. BYNET 4 Switch Version 2.1 (BYA4MS) – PCI card designed to interconnect up to 4 SMPs. This BYNET V2.1 switch (60 MB/sec.) was designed for 4980 systems. The BYA4MS has a shorter form factor – S is for shorter. BYNET 32 Switch (BYA32S) – this switch can run at v3 or v4 speeds depending on the system and type of BICs. Up to 16 TPA nodes and 16 NOTPA nodes can be connected to this switch. This 1U chassis switch resides in a Base or System Cabinet. Includes an Ethernet Interface for BYNET status & error reporting and chassis management. Note on BYNET cables: Page 9-42 There is a physical difference between BYNET v2 and BYNET v3/v4 cables. The BYNET v3/v4 cables have a “Quick Disconnect” connector whereas the BYNET v2 cables have a “Micro D” connector with 2 screws. The number of wires inside the cables is the same. Introduction to MPP Systems BYNET 32 Switches BYNET 0 BYNET 1 BYA32S Switch TPA 1 TPA 2 ... BYA32S Switch TPA 16 NOTPA 17 NOTPA 18 ... NOTPA 32 BYNET 32 switch (BYA32S) is a 1U chassis used in an processor rack. • This 32-port switch can execute at v3 or v4 speeds. • Up to 16 TPA nodes can be connected. • An additional 16 HSN, Channel, or non-TPA nodes can be connected. Introduction to MPP Systems Page 9-43 BYNET 64 Switches For configurations greater that 16 TPA/HSN nodes, BYNET 64 switches must be used. BYNET 64 Node Switch Version 2 (BYA64GX chassis) – this switch is actually composed of 8 BYA8X switch boards in the BYA64GX chassis. Each BYA8X switch board allows up to 8 SMPs to interconnect (i.e., 8 switches x 8 SMPs each = 64 SMPs). 
The BYA64GX is actually a backpanel that allows the 8 BYA8X switch boards to interconnect. This 12U chassis resides in either the BYNET V2 64 Node Switch cabinet or the BYNET V2 64/512 Node Expansion Cabinet. Note: BYA8X switch board (in BYA64GX chassis): This is Stage A base switch board. Each board supports 8 links to nodes. The BYA64GX chassis can contain a maximum of 8 BYA8X switches, allowing for 64 links to nodes. In systems greater than 64 nodes, the BYA8X switch boards also connect the BYA64GX chassis to BYB64G chassis through X-port connectors, one on each BYA8X board. BYNET Switch Cabinets Even though the BYNET switch cabinets are different for BYNET v2, v3, and v4. However, the basic purpose is the same - the purpose is to house BYNET 64 switches. The BYNET 64 Node Switch Cabinet (shown on facing page) can be used for configurations from 2 through 64 nodes and must be used for configurations greater than 16 nodes. All nodes in the configuration are interconnected from the BYNET (V2 or V3) node interface to the BYNET (V2 or V3) 64 Node Switch chassis (BYA64GX). Two BYNET (V2 or V3) 64 Node Switch Cabinets are required for the base dual redundant BYNET V2 networks. The BYNET 512 Node Expansion Cabinet Version 2 (or 3) (not shown) is for used for configurations that begin with 64 nodes or less and has expanded beyond 64 node maximum configuration supported by the BYNET BYA64GX chassis (in the BYNET 64 Node Switch Cabinet). Above 64 nodes, the BYNET BYB64G chassis (effectively a 512 node switch chassis) is used to interconnect multiple BYNET 64 node switch chassis. The simple configuration rules are: Each group of 2 to 64 nodes requires two BYNET V2 64 node switch chassis; a minimum of two is required for dual redundancy. For configurations with greater than 64 nodes, each BYNET V2 64 node switch chassis must have a complimentary BYNET V2 512 node switch chassis. Page 9-44 Introduction to MPP Systems BYNET 64 Switches A BYNET 64 Switch is a separate chassis located inside a BYNET rack or cabinet. • BYNET v3 64 Switches (BYA64GX) – 12U in height – 375 MB/sec per node for both BYNET channels • BYNET v4 64 Switches (BYA64S) – 5U in height – 960 MB/sec per node for both BYNET channels Two BYNET switch racks are needed to house these two BYNET 64 switches. BYA64S Switch (v4) Node 2 ... BYNET 64 Node Switch Chassis – 5U BYA64S (v4) BYA64S (v4) BYNET 1 BYNET 0 Node 1 BYNET 64 Node Switch Chassis – 5U BYA64S Switch (v4) Node 64 Nodes connect to BYA switches. Introduction to MPP Systems BYNET Switch Rack BYNET Switch Rack Page 9-45 BYNET Expansion Switches With BYNET v3, the BYA64GX and BYC64G switches are physically identical. What makes them different is the firmware that is loaded onto the BYNET switch and how they are cabled together. The base chassis is the same as the BYNET version 2 base chassis. This includes including sheet metal and backpanel, power supplies, power control board, and fans. The v3 BYA8QX switch is new within the BYA64GX and BYC64G switches. BYNET V3 64-node Switch Chassis The BYNET V3 64-node Switch Chassis are used in 5400H systems with greater than 16 nodes. Each switch chassis resides in its own cabinet or co-reside with a BYNET V3 1024node Expansion Switch Chassis. Each BYNET V3 64-node Switch Chassis provides the BYNET switching for its own BYNET V3 fabric. Therefore, for redundancy; two 64-node Switch Chassis are needed. In systems with greater than 64 nodes, two BYNET 64-node switches are needed for every 64 nodes. 
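The sizing rules above lend themselves to a small worked example. The Python sketch below is a simplified, illustrative reading of those rules (the function names are assumptions, not part of any BYNET configuration tool): peak per-node bandwidth is the link rate times two directions times two networks, one 64-node switch is needed per fabric for every 64 nodes, and the number of expansion switches per fabric is the next power of two at or above that count.

# Illustrative sketch (assumptions, not a configuration tool): rough BYNET sizing
# arithmetic based on the rules described in this module.
import math

BYNET_MB_PER_SEC = {"v1": 10, "v2": 60, "v3": 93.75, "v4": 240}

def peak_node_bandwidth(version):
    """Peak MB/sec per node: link rate x 2 directions x 2 networks (dual BYNET)."""
    return BYNET_MB_PER_SEC[version] * 2 * 2

def switches_per_fabric(nodes):
    """64-node switches and expansion switches needed in ONE fabric, per the
    rules quoted above (simplified reading)."""
    base = math.ceil(nodes / 64)                       # one 64-node switch per 64 nodes
    expansion = 0 if nodes <= 64 else 2 ** math.ceil(math.log2(base))
    return base, expansion

print(peak_node_bandwidth("v4"))        # 960 MB/sec per node
print(10 * peak_node_bandwidth("v4"))   # 9600 MB/sec point-to-point for a 10-node system
print(switches_per_fabric(128))         # (2, 2) -> totals of 4 and 4 across both fabrics
print(switches_per_fabric(300))         # (5, 8) -> 257-512 nodes need 8 expansion/fabric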
BYNET V3 1024-node Expansion Switch Chassis The BYNET V3 1024-node Expansion Switch Chassis (marketing name) is used in 5400H systems with greater than 64 nodes. The 1024-node switch resides in its own cabinet or coresides with a BYNET 64-node switch. The total number of 1024-node switch chassis needed in a system is a power of 2 based on the number of nodes. For systems with 65 - 128 nodes, two 1024-node switches are needed per BYNET fabric (total of 4). For systems with 129 – 256 nodes, four 1024-node switches are needed per BYNET fabric (total of 8). For systems with 257 – 512 nodes eight 1024-node switches are needed per BYNET fabric (total of 16). BYNET Expansion to 1024 Nodes BYNET v3/v4 support configurations up to 1024 nodes. For BYNET v3, in order to interconnect more than 512 nodes additional BYOX and BYCLK hardware is needed. Page 9-46 Introduction to MPP Systems BYNET Expansion Switches This example shows both BYNETs and connects 128 nodes. BYNET 0 BYNET 1 BYC Switch BYC Switch BYC Switch BYC Switch BYA Switch BYA Switch BYA Switch BYA Switch Node 1 Node 2 ... Node 64 Node 65 Node 66 ... Node 128 BYA64S (v4) BYA64S (v4) BYA64S (v4) BYA64S (v4) BYC64S (v4) BYC64S (v4) BYC64S (v4) BYC64S (v4) • The BYNET v4 Expansion switch (BYC) is a separate 5U chassis located inside the BYNET rack or cabinet. • The BYNET v3 Expansion switch (BYC – not shown) is a 12U chassis. To support 128 nodes with BYNET v3 switches, 4 BYNET switch BYNET racks are needed. Introduction to MPP Systems BYNET Switch Rack BYNET Switch Rack Page 9-47 Server Management with SWS The SWS (Service Workstation) provides a single operational view for Teradata MPP Systems and the environment to configure, monitor, and manage the system. The SWS effectively is the central console for MPP systems. The SWS is one part of the Server Management subsystem that provides monitoring and management capabilities of MPP systems. Prior to the SWS, other server management environments were: 1st Generation Server Management (3600) – Server Management (SM) processing, storage and display occurred on AWS. 2nd Generation Server Management (5100, 48xx/52xx, 49xxx/53xx) – most SM processing occurs on CMICs and Management Boards. The AWS still provides all the storage and display. 3rd Generation Server Management (54xx systems and beyond) – most SM processing occurs on CMICs and Management Boards. The SWS still provides all the storage and display. The Server Management subsystem uses industry standard parts, a Server Management Node and Ethernet switches to implement an Ethernet based Server Management solution. This new Server Management is referred to a Third Generation Server Management (SM3G). One of the reasons for the new Server Management subsystem is to better adhere to industry standards. Ethernet-based management is now the industry standard for chassis vendors. Virtualized Management Server (VMS) The Teradata Virtualized Management Server is a standard feature on the Data Warehouse Appliance 2690 and the 6690. This 1U managed server rack mounts in the appliance node cabinet and essentially consolidates all Teradata management functionality into one server. The VMS contains the following functionality: • • • Teradata Viewpoint, single system: Teradata Viewpoint is the robust web portal that manages workloads, queries, and systems. SWS: The Teradata SWS is the software that monitors the physical cabinet. This includes the nodes, disks, and connectivity. 
CMIC: The CMIC monitors all the disk controllers and cabling The VMS is a key reason why full racks can be shipped without having to have a separate expansion cabinet for this functionality. Some considerations include: • Traditional Viewpoint is still available, but it is priced and licensed differently. Please see the Teradata Viewpoint OCI for more information. Also note that VMS Viewpoint can only monitor one system, not multiple • If more than one node cabinet is required, the expansion cabinet will also have a VMS but will only contain the CMIC software as the others aren’t needed. Page 9-48 Introduction to MPP Systems Server Management with SWS For 1600, 56xx, and 66xx systems: • The SWS (Service Workstation) is a Linux workstation that is Dual Ethernet LANs dedicated to system servicing and maintenance. – May be deskside or rack mounted • Server Management WEB (SMWeb) services provides operational & maintenance control via Internet access. Option for 1650, 2690, and 6690 systems: • VMS (Virtualized Management Server) – consolidated CMIC, SWS, Teradata Viewpoint SMWeb services provide the ability to: • connect to AWS type display • • • • connect to nodes power on/off/reset manage alerts obtains h/w or s/w status information Array Controllers BYNET BYNET HSN SMP 15 SMP 14 HSN SMP 12 SMP 11 HSN SMP 9 SMP 8 SM (CMIC) Collective #1 Introduction to MPP Systems Array Controllers Array Controllers HSN SMP 15 SMP 14 HSN SMP 12 SMP 11 HSN SMP 9 SMP 8 SM (CMIC) Array Controllers Collective #2 Page 9-49 Node Naming Conventions The examples on the facing page show AWS naming conventions for cabinets or racks. Each chassis consists of a number of internal components (processors, fans, power supplies, management boards, etc.). The chassis numbering for 52xx/53xx cabinets starts at 1 from the top of the cabinet to bottom of the cabinet. The chassis numbering for 54xx and 55xx cabinets starts at 1 from the bottom of the cabinet to the top of the cabinet. 54xx/55xx Chassis Numbering Conventions A standard chassis numbering convention is used for the 54xxE, 54xxH/LC, 55xxC/H, Storage, and BYNET cabinets. The chassis numbers are not defined by hardware, but only by convention and numbering defined in the CMIC configuration file. Chassis numbers begin with one and go up to 22. Chassis numbers are assigned to the position; chassis numbers begin for each type of chassis as defined below and are not skipped if a chassis is not installed. All Cabinets In all cabinets, chassis 1 is the bottom UPS, the numbering continues upward until all UPS(s) are assigned. Up to 5 UPS(s) can exist in the cabinet. Node Cabinets Chassis 6 - CMIC in the Node cabinets Chassis 7 through 16 – Nodes; the bottom node chassis starts 7 and continues up to 16. The chassis number is assigned to the position, if no node is installed the chassis number is skipped. If only 8 TPA nodes in a rack, then nodes are numbered 9 to 16. Chassis 17 and 18 – BYA32Gs Chassis 19 through 22 – FC switches (if present) Storage Cabinets Chassis 4 – SM Chassis (CMIC) in a Storage cabinet (if present) Chassis 5 and 6 – Disk Array Controller chassis; lower disk array is 5, the upper is 6. Disk Array names are DAMCxxx-y-z where xxx is collective number, y is cabinet number, and z is chassis number. BYNET Cabinets Chassis 4 and 5 – BYNET 64 switches (Chassis 4 - BYC64, Chassis 5 - BYA64) 54xx/55xx Collective Conventions A collective is made up of the node and disk array cabinets that are part of the same server management group (usually the same clique). 
Include the first BYNET Cabinet to the first Node Cabinet Collective Include the second BYNET Cabinet to the second Node Cabinet Collective Include the third BYNET Cabinet to the third Node Cabinet Collective, etc Remember, only one BYNET Cabinet may be configured in any 54xx Collective The SM3G Collectives are defined in software using the CMIC Configuration Utility. The CMIC Configuration Records (CMICConfig.xml) contain configuration information for all the chassis in a CMIC’s collective. All SM3G chassis must reside on the same Primary and Secondary management networks. Page 9-50 Introduction to MPP Systems Node Naming Conventions 6650 6650 6690 Secondary SM Switch Secondary SM Switch Up to 24 SAS Drives Drive Tray (16 HD) Drive Tray (16 HD) Up to 24 SAS Drives Drive Tray (16 HD) Drive Tray (16 HD) Up to 24 SAS Drives 1U BYNET Switch 1U BYNET Switch Drive Tray (16 HD) Drive Tray (16 HD) Up to 24 SAS Drives HSN SMP001-15 Drive Tray (16 HD) Drive Tray (16 HD) Up to 24 SAS Drives Drive Tray (16 HD) Drive Tray (16 HD) Up to 24 SAS Drives SMP001-14 HSN SMP001-12 Drive Tray (16 HD) Drive Tray (16 HD) Drive Tray (16 HD) Drive Tray (16 HD) Drive Tray (16 HD) Drive Tray (16 HD) 6844 Array Controllers (4U) 6844 Array Controllers (4U) 5650 1st E'net Switch – P 1st E'net Switch – S 16 15 14 13 12 11 10 9 8 Up to 24 SAS Drives SMP001-11 HSN SMP001-9 SMP001-8 7 SM – CMIC UPS UPS UPS UPS UPS Introduction to MPP Systems SMP002-7 HSN 6 SMP002-6 SMP003-6 5 TMS Node TMS Node BYA32S-1 BYA32S-0 SM – CMIC (1U) Primary SM Switch PDU PDU TMS Node SM – CMIC (1U) Primary SM Switch PDU PDU Up to 24 SAS Drives Up to 24 SAS Drives 10 9 8 VMS (1U) HSN SMP004-9 SMP004-8 Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives PDU PDU Page 9-51 Summary The facing page summarizes the key points and concepts discussed in this module. Page 9-52 Introduction to MPP Systems Summary Data Mart Appliance Extreme Data Appliance Data Warehouse Appliance Extreme Performance Appliance Active Enterprise Data Warehouse Purpose Test/ Development or Smaller Data Marts Analytics on Extreme Data Volumes from New Data Types Data Warehouse or Departmental Data Marts Extreme Performance for Operational Analytics Enterprise Scale for both Strategic and Operational Intelligence EDW/ADW Possible Uses Departmental Analytics, Entry level EDW Analytical Archive, Deep Dive Analytics Strategic Intelligence, Decision Support, Fast Scan Operational Intelligence, Lower Volume, High Performance Active Workloads, Real Time Update, Tactical and Strategic response times Introduction to MPP Systems Page 9-53 Module 9: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 9-54 Introduction to MPP Systems Module 9: Review Questions 1. What is a major difference between a 6650 system as compared to a 6690 system? _____________________________________________________________________ 2. What is a major difference between a 2650 node and a 2690 node? _____________________________________________________________________ 3. What does the acronym represent and briefly define the purpose of the following subsystems? BYNET _____________________________________________________________________________ SWS _____________________________________________________________________________ 4. Specify the names of the two TPA nodes in 6690 cabinet #2. 
__________ ____________ Play the numbers games – match the number to a definition. 1. 2. 3 8 a. b. Typical # of AMPs per node in a 6650 3+1 clique Maximum number of nodes that can be in a 2690 cabinet 3. 4. 24 42 c. d. Maximum number of drives in one NetApp 6844 disk array Number of nodes in a 2650 clique 5. 128 6. 900 e. f. Large disk drive size (GB) for a 2690 disk array Typical # of AMPs in a 2690 node Introduction to MPP Systems Page 9-55 Notes Page 9-56 Introduction to MPP Systems Module 10 How Teradata uses MPP Systems After completing this module, you will be able to: Identify items that are placed into FSG cache. Identify a purpose for the WAL Depot and the WAL Log. Describe the fundamental relationship between Linux, logical units, and disk array controllers. Describe the fundamental relationship between Vdisks, Pdisks, LUNs, and partitions. Teradata Proprietary and Confidential How Teradata uses MPP Systems Page 10-1 Notes Page 10-2 How Teradata uses MPP Systems Table of Contents Teradata and the Processing Node ............................................................................................. 10-4 FSG Cache ............................................................................................................................. 10-4 Memory and the Teradata Database ........................................................................................... 10-6 5555H Example...................................................................................................................... 10-6 SMP Memory – Summary ......................................................................................................... 10-8 Determining FSG Cache ........................................................................................................ 10-8 O.S. Managed Memory and FSG Cache .................................................................................. 10-10 WAL – Write Ahead Logic ...................................................................................................... 10-12 WAL Concepts ......................................................................................................................... 10-14 Linux Vproc Number Assignment ........................................................................................... 10-16 Disk Arrays from a O.S. Perspective ....................................................................................... 10-18 Logical Units and Partitions ..................................................................................................... 10-20 EMC2 Notes ......................................................................................................................... 10-20 Teradata and Disk Arrays......................................................................................................... 10-22 Teradata 6650 (2+1) Logical View .......................................................................................... 10-24 Teradata 6650 (3+1) Logical View .......................................................................................... 10-26 Example of 1.2 TB Vdisk (pre-TVS) ....................................................................................... 10-28 Teradata File System Concepts ................................................................................................ 10-30 Teradata Vdisk Size Limits ...................................................................................................... 
10-30 Teradata 13.10 Large Cylinder Support ................................................................................... 10-32 When to Use This Feature ................................................................................................ 10-32 Full Cylinder Read ................................................................................................................... 10-34 Summary .................................................................................................................................. 10-36 Module 10: Review Questions ................................................................................................. 10-38 How Teradata uses MPP Systems Page 10-3 Teradata and the Processing Node The example on the facing page illustrates a 5650H processing node running Linux and Teradata. Memory will initially be allocated for the operating system and Teradata vprocs. PDE will calculate how much memory to allocate to itself for FSG (File Segment Cache) based on memory not being used by the operating system and the Teradata vprocs. PDE software will manage the FSG memory space. Practical experience (for most environments) indicates that the operating system (e.g., Linux) may need more than this initial allocation during startup. For these reasons, PDE is not assigned all of the remaining memory for FSG cache, but a percentage (e.g., 90%) of the remaining memory. Also note that LAN and Channel adapters (PBSA) also require memory for network and channel activity. For example, each channel adapter uses memory buffers up to 500 MB in size. For 56xx systems, LAN and Channel Adapters not utilized within a TPA node. These are implemented in “Extended Node Types”. FSG Cache FSG Cache is primarily used by the AMPs to access memory resident database segments. When the Teradata Database needs to read a database block, it checks FSG Cache first. Page 10-4 How Teradata uses MPP Systems Teradata and the Processing Node FSG (File Segment Cache) – managed by PDE PE vproc PE vproc AMP AMP AMP AMP AMP AMP AMP vproc vproc vproc vproc vproc vproc vproc Teradata TPA S/W PDE Vproc GTW Vproc TVS Vproc RSG Vproc - optional (Parallel Database Extensions) (Gateway) (Teradata Virtual Storage) (Relay Services Gateway) Linux Process Control Memory Mgmt. Memory CPUs Pentium Westmere Six-Core – 3.06 GHz I/O Mgmt. (Device Drivers) BIC2SE QFC Eth. BYNET Disk Arrays LANs Pentium Westmere Six-Core – 3.06 GHz QFC - Quad Fibre Channel How Teradata uses MPP Systems Page 10-5 Memory and the Teradata Database The example on the facing page assumes a 5650H node with 96 GB of memory executing the Teradata Database. This example assumes 42 AMPs, 2 PEs, PDE, GTW, RSG, and TVS vprocs for a total of 48 vprocs in this node. This means that memory will have to be allocated for the 48 vprocs. The operating system, device drivers, and Teradata vprocs for a 6650H Linux node with 96 GB of memory will use approximately 18 GB of memory. PDE will use a FSG Cache Percent (CTL parameter) to calculate how much memory to allocate to itself for FSG (File Segment Cache) based on the available memory (96 GB – 18 GB). Practical experience (for most environments) indicates that the operating system (e.g., Linux) may need more than this initial allocation during startup. Parsing Engines and AMPs will typically use more than their initial allocation of memory (80 MB). For example, redistribution buffers for an AMP may use an additional 130 MB of memory for a total of 210 MB of memory per AMP. 
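As an illustration of the numbering convention just described, the sketch below (hypothetical helper names; the real numbers are assigned by the configuration utilities, e.g., PUT) generates vproc numbers for each vproc type by counting up or down from the documented Teradata 13.10 starting values.

# Illustrative sketch (assumed names, not a Teradata utility): the Teradata 13.10
# vproc numbering convention -- each vproc type counts up or down from its own base.

VPROC_RANGES_13_10 = {
    # type: (starting number, step)
    "AMP": (0, +1),        # 0, 1, 2, ...
    "PE":  (16383, -1),    # 16383, 16382, ...
    "GTW": (8192, +1),     # 8192, 8193, ...
    "TVS": (10238, -1),    # 10238, 10237, ...  (VSS in earlier releases)
    "PDE": (16384, +1),    # 16384, 16385, ...
    "RSG": (9215, +1),     # optional: 9215, 9216, ...
}

def vproc_numbers(vproc_type, count):
    """Return the first `count` vproc numbers assigned to a given vproc type."""
    start, step = VPROC_RANGES_13_10[vproc_type]
    return [start + i * step for i in range(count)]

print(vproc_numbers("AMP", 4))   # [0, 1, 2, 3]
print(vproc_numbers("PE", 2))    # [16383, 16382]
print(vproc_numbers("TVS", 2))   # [10238, 10237]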
For these reasons, PDE is not assigned all of the remaining memory for FSG cache, but a percentage of the remaining memory. The default of 90% for FSG Cache Percent works for most 66xx systems. 90% of 78 GB (96-18) = 70.2 GB of FSG cache. This can be verified by using the ctl utility hardware function, it can be determined that 42 AMPs have an average of 1.669 GB of memory. 42 x 1.669 = 70.1 GB of FSG cache. 5555H Example Assume a 5555H node with 32 GB of memory executing the Teradata Database. Assume that a typical 5555H node will have 25 AMPs, 2 PEs, PDE, GTW, RSG, and TVS vprocs for a total of 31 vprocs. This means that memory will have to be allocated for the 31 vprocs. The operating system, device drivers, and Teradata vprocs for a 5555H Linux node with 32 GB of memory may use as much as 5.8 GB of memory. PDE will use a FSG Cache Percent (CTL parameter) to calculate how much memory to allocate to itself for FSG (File Segment Cache) based on the available memory (32 GB – 5.8 GB). The 5.8 GB is based on the Design Center recommendation for a 5555H node with 32 GB of memory. For these reasons, PDE is not assigned all of the remaining memory for FSG cache, but a percentage of the remaining memory. The default of 80% for FSG Cache Percent works for most 5555 systems. Page 10-6 How Teradata uses MPP Systems Memory and the Teradata Database Example of 6650 (Linux) node with 2 PEs and 42 AMPs and 96 GB of memory: 10% of remaining space – 8 GB available as free space Memory O.S., Device Drivers, and space for vprocs ≈ 18 GB 96 GB – 18 GB 78 GB FSG Cache 90% FSG Cache ≈ 70 GB Free Memory ≈ 8 GB Examples of objects that are memory resident: Hash Maps Configuration Maps Master Indexes RTS – Request-to-Steps Cache D/D – Data Dictionary Cache How Teradata uses MPP Systems FSG (File Segment Cache) (Examples of use – Data Blocks & Cylinder Indexes) Managed by PDE Software 90% of remaining space – 70 GB available for FSG PE Vproc RTS D/D Cache PDE ... AMP Vproc AMP Vproc ......... Master Index Hash Maps Configuration Maps Master Index GTW VSS RSG Operating System and Device Drivers Ex. 96 GB Memory Page 10-7 SMP Memory – Summary Practical experience (for most environments) indicates that Linux and Teradata vprocs need more memory than initially allocated during normal processing periods. Prior to V2R5, it was recommended that at least 20 MB to 40MB of additional free memory be available for each AMP. With 32-bit systems, it is recommended that a node have at least 60 – 80 MB of free memory available for each AMP. With 64-bit systems, each AMP may use up to 210 MB of memory. This would be an additional 130 MB of memory per AMP. This is accomplished by not giving 100% of the remaining memory to FSG. It is always recommended that the FSG Cache Percent be set to a value less than 100%. The default of 90% for FSG Cache Percent works well for most 56xx and 66xx configurations. 80% usually works well for 5555 configurations. Determining FSG Cache The “ctl” utility can be used to determine how much FSG cache memory is actually available to a node. Using the “ctl” utility, the hardware command will report the amount of FSG cache for each AMP. The values below represent the average amount of FSG memory per AMP. Examples are shown below. For a 5555H node with 25 AMPs, the report will indicate 838,016 KB/per AMP. 838,016 KB/AMP x 25 = 20,950,500 KB or approximately 21 GB of FSG cache. For a 2555H node with 36 AMPs, the report will indicate 582,016 KB/per AMP. 
582,016 KB/AMP x 36 = 20,952,576 KB or approximately 21 GB of FSG cache. For a 5600H node with 40 AMPs, the report will indicate 1,753,856 KB/per AMP. 1,753,856 KB/AMP x 40 = 70,154,240 KB or approximately 70 GB of FSG cache. For a 5650H node with 47 AMPs, the report will indicate 1,472,000 KB/per AMP. 1,472,000 KB/AMP x 47 = 69,184,000 KB or approximately 69.2 GB of FSG cache. For a 6650H node with 42 AMPs, the report will indicate 1,669,120 KB/per AMP. 1,669,120,000 KB/AMP x 42 = 70,103,040 KB or approximately 70.1 GB of FSG cache. Page 10-8 How Teradata uses MPP Systems SMP Memory – Summary Based on the configuration and FSG Cache Percent value, PDE will determine the amount of memory to allocate for FSG cache. However, vprocs (especially AMPs) will use more than their initial memory allocations during normal processing (e.g., redistribution buffers, aggregations buffers, hash join buffers, etc.). Some basic guidelines for AMPs are: 64-bit systems – assume 210 MB per AMP 7.8 GB Memory managed by O.S. 90% – 70.2 GB 80% – 62.4 GB FSG Cache O.S., Device drivers, and Teradata Vprocs 18.0 GB Managed by PDE FSG software. Memory managed by O.S. Ex. 96 GB Memory FSG – pool of memory managed by PDE and each AMP uses what it needs. ctl Parameter – FSG Cache Percent – for 66xx, the design center recommendation is 90% and this works for most configurations. How Teradata uses MPP Systems Page 10-9 O.S. Managed Memory and FSG Cache The facing page lists examples of how Operating System managed memory (free memory) and FSG cache is used. Memory managed and used by the operating system and the vprocs is sometimes called “free memory”. The main code (on a TPA node) that uses free memory is the operating system and Teradata vprocs A brief description of Teradata Vprocs: AMP Access module processors perform database functions, such as executing database queries. Each AMP owns a portion of the overall database storage. GTW Gateway vprocs provide a socket interface to Teradata Database on Windows and Linux systems. On MP-RAS systems, the same functionality is provided by gateway software running directly on the system nodes within the PDE vproc. Node (or Base) PDE vproc - the node vproc handles PDE and operating system functions not directly related to AMP and PE work. Node vprocs cannot be externally manipulated, and do not appear in the output of the Vproc Manager utility. PE Parsing engines perform session control, query parsing, security validation, query optimization, and query dispatch. RSG Relay Services Gateway provides a socket interface for the replication agent, and for relaying dictionary changes to the Teradata Meta Data Services (MDS) utility. TVS Manages Teradata Database storage. AMPs acquire their portions of database storage through the TVS (previous releases named this VSS) vproc. When Teradata needs to read a database block, it checks FSG Cache first. Examples of how FSG Cache is used Page 10-10 Permanent data blocks Cylinder Indexes Spool data blocks Transient Journals Permanent Journals Synchronized scan (sync scan) data blocks How Teradata uses MPP Systems O.S. Managed Memory and FSG Cache Memory managed by the O.S. is referred to as “free memory”. • Teradata Vprocs – – – – – – AMP – includes AMP worker tasks PE – Session control, Parser, Optimizer, Dispatcher PDE (Parallel Database Extensions) – messaging, FSG space management, etc. 
GTW (Gateway) – Logon Security, Session Context, Connection to Client RSG (Relay Services Gateway) – Optional; Replication Gateway, MDS auto-update TVS (Teradata Virtual Storage) – manages Teradata Virtual Storage • Administrative and/or user programs such as: – kernel resources and administrative program text and data – message buffers (ex., TCP/IP) Memory managed by PDE is called FSG cache. FSG cache is primarily used by the AMPs to access memory resident database segments. • When Teradata needs to read a database block, it checks FSG Cache first. – – – – – Permanent data blocks Cylinder Indexes Spool data blocks Journal blocks; Transient Journal and/or Permanent Journals Synchronized scan (sync scan) data blocks How Teradata uses MPP Systems Page 10-11 WAL – Write Ahead Logic WAL (Write Ahead Logic) is a recoverability/reliability feature that can possibly provide performance improvements in the area of database writes. In general, I/O increases with WAL and, therefore, it may reduce throughput for I/O bound workloads. However, the overall performance is expected to be better with WAL since the benefit of CPU improvement outweighs the I/O cost. There is some additional CPU cost for maintaining the WAL log so WAL may reduce throughput for CPU-bound workloads, but is minimal. Simple example: Assume Teradata Mode, an implicit transaction, and you are doing an UPDATE of a single row in a block that has 300 rows in it. 1. 2. 3. 4. 5. Data block is read into FSG Cache. UNDO row is written to WAL Log (effectively a before-image or TJ type row) The data block in memory is changed and is marked as changed (not immediately written to disk - called deferred write). REDO row is written to the WAL Log (effectively an after-image) - writing a single REDO row is faster than writing a complete block The lock is released and the user gets a transaction completed message. Note the updated block is still in memory and hasn't been written to disk yet. Note: Other users might be doing updates on rows in the same block and there might be multiple updates to the same block in memory. 6. At some point (maybe a half-second second later), the block needs to be written to disk. This is a deferred write and is done in the background. 6A. If the updated block has not changed size, then it can be written back-in-place. Before physically writing the block back-in-place, the updated block is first written to the WAL depot. After the datablock is successfully written to the WAL Depot, it is then physically written back-in-place. Why is the block effectively written twice back to disk? A write operation can fail (called interrupted write) and this can corrupt a block on disk and potentially corrupt all 300 rows. This is a very rare occurrence, but can happen. The WAL Log only has 1 row (REDO row) of the row that has changed. Therefore, by writing the block first to the WAL Depot before writing back-in-place, Teradata ensures that a good copy of the entire datablock is written back-to-disk. The WAL Depot is ONLY used for blocks that haven't changed size - effectively write back-in-place operations. This is an extra internal I/O, but it provides data integrity and protection from interrupted write operations. 6B. If the block has changed size in memory (e.g., block expands to an additional sector), then the updated block is written to a new location on disk - it is not written to the WAL Depot. 
If there is an interrupted write, the original block has not been touched and the REDO rows along with the original data block can be used for recovery. WAL can batch up modifications from multiple transactions and apply them with a single disk I/O, thereby saving I/O operations. WAL will help improve throughput for I/O-bound workloads. Obviously, Load utilities such as FastLoad and MultiLoad don't need to use WAL. Other functions such as FastPath operations use the WAL subsystem differently. Page 10-12 How Teradata uses MPP Systems WAL – Write Ahead Logic WAL – Write Ahead Logic • Available with all Teradata systems – PDE (UNIX MP-RAS) and OpenPDE (Windows and Linux) • Replaced Buddy Backup in PDE (UNIX MP-RAS) Teradata systems WAL is a primarily an internal recoverability/reliability feature that also provides performance improvements in the area of database writes. • All modifications are represented in a log and the log is forced to disk at key times. • Data Blocks updated in Memory, but not written immediately to disk • In place of the data block written to disk, the before image (UNDO row) and after image (REDO row) are written to a WAL buffer which is written to the WAL log on disk. • WAL can batch up modifications from multiple transactions and apply them with a single disk I/O, thereby saving I/O operations. WAL will help improve throughput for I/O-bound workloads. • Updated data blocks will be eventually aged out and written to disk. Note: There are numerous DBS Control parameters to specify space allocations for WAL. How Teradata uses MPP Systems Page 10-13 WAL Concepts WAL has its own file system software and uses a fixed number of cylinders for the WAL Depot (varies by vdisk size and DBSControl parameters) and a dynamic number of cylinders for the WAL Log itself. The WAL Depot consists of two types of slots: Large Depot slots Small Depot slots The Large Depot slots are used by aging routines to write multiple blocks to the Depot area with a single I/O. The Small Depot slots are used when individual blocks that require Depot protection are written to the Depot area by foreground tasks. The number of cylinders allocated to the Depot area is fixed at startup based on the settings of several internal DBS Control flags. The number of Depot area cylinders allocated is per pdisk, so their total number depends on the number of Pdisks in your system. Sets of Pdisks belong to a subpool, and the system assigns individual AMPs to those subpools. Because it does not assign Pdisks to AMPs, the system calculates the average number of Pdisks per AMP in the entire subpool from the vconfig GDO when it allocates Depot cylinders, rounding up the calculated value if necessary. The result is then multiplied by the specified values to obtain the total number of depot cylinders for each AMP. Using this method, each AMP is assigned the same number of Depot cylinders. The concept is to disperse the Depot cylinders fairly evenly across the system. This prevents one pdisk from becoming overwhelmed by all the Depot writes for your system. WAL (Write Ahead Logic) is a transaction logging scheme maintained by the File System in which a write cache for disk writes of permanent data is maintained using log records instead of writing the actual data blocks at the time a transaction is processed. Multiple log records representing transaction updates can then be batched together and written to disk with a single I/O thus achieving a large savings in I/O operations and enhancing system performance as a result. 
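The single-row UPDATE walkthrough above can be condensed into a small simulation. The sketch below is purely illustrative Python (not Teradata file system code, and all names are invented); it shows why deferred writes plus small WAL records save I/O when several transactions touch rows in the same 300-row block.

```python
# Illustrative sketch only (not Teradata file system code); names are invented.
# Each single-row transaction writes small UNDO/REDO records to the WAL Log,
# but the 300-row data block itself is written once, later, in the background.

wal_log = []        # small UNDO/REDO records (cheap writes)
block_writes = 0    # count of full data block writes to disk

block = {"rows": {i: "row %d" % i for i in range(300)}, "dirty": False}

def update_row(row_id, new_value):
    wal_log.append(("UNDO", row_id, block["rows"][row_id]))   # before-image
    block["rows"][row_id] = new_value                          # change in memory only
    wal_log.append(("REDO", row_id, new_value))                # after-image
    block["dirty"] = True                                      # deferred write pending

def age_out_block():
    """Background task: write the dirty block to disk a single time."""
    global block_writes
    if block["dirty"]:
        block_writes += 1          # one block write covers all the updates above
        block["dirty"] = False

for i in range(10):                # ten separate single-row transactions
    update_row(i, "updated %d" % i)
age_out_block()

print(len(wal_log), "small WAL records;", block_writes, "data block write")
# -> 20 small WAL records; 1 data block write (instead of 10 full block writes)
```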
The amount of space used for the WAL Log is dynamic. WAL contains before-images (TJ) and after-images (Redo) for transactions. For example, the number of TJ images is very dependent on the type of transaction. Updating every row in a large table places a lot of TJ images into WAL. Note: Prior to the V2R6.2 release, Teradata systems running under UNIX MP-RAS systems utilized a facility referred to as “buddy backup”. Page 10-14 How Teradata uses MPP Systems WAL Concepts WAL Depot • Fixed number of cylinders allocated AMP Cylinders to each AMP. • Used for Write-in-Place operations. • Teradata first writes data to WAL WAL Depot Depot and if successful, then writes to disk. • WAL Logic will attempt to group a number of blocks to write to WAL Depot. WAL Log WAL Log • Dynamic number of cylinders used by each AMP. • Used for new block allocations on disk. Data Cylinders • Contains before-images (UNDO) and after-images (REDO) for transactions – used with both Write-in-Place or new block allocations. (Perm, Spool, Temporary, Permanent Journals) • Updated data blocks will be eventually aged out and written to disk. How Teradata uses MPP Systems Allocation of cylinders is not contiguous. Page 10-15 Linux Vproc Number Assignment The facing page describes how Vprocs are assigned numbers. With OpenPDE systems, gateway software is implemented in as separate vproc (named GTW) from the PDE vproc. With MP-RAS systems (PDE), gateway software is incorporated into the PDE vproc. Within a multi-node single clique system, it is possible for one of the nodes to have a second TVS vproc. This may seem like an anomaly, but this is normal. For example, assume a 3+1 single clique system: In order for fallback to be most effective, a single clique is divided into two subpools of storage and AMPs which reference that storage. Fallback is then setup as a cluster size of two between the two subpools of AMPs. An Allocator (part of TVS vproc) only deals with a single sub-pool of storage. Since in this case we are dividing up two subpools into three nodes, one of the nodes has about half of its storage in one subpool and half of its storage in the other subpool. Therefore, that node needs to have two Allocator vprocs, one for each sub-pool of storage. Any system with more than one clique has only one sub-pool per clique and this anomaly goes away. A single node system (which is of course a single clique) also has two sub-pools for the same reason. With Teradata 13.10 (and previous releases), vproc number ranges are: AMPs – 0, 1, 2, … PEs – 16383, 16382, 16381, … GTW – 8192, 8193, 8194, … VSS – 10238, 10237, 10236, … PDE – 16384, 16385, 16386, … RSG – 9215, 9216, 9217, … (Optional) When a system is configured with PUT, the installer is presented with an option to choose large vproc numbers if configuring a system with more than 8,192 AMPs. Therefore, starting with Teradata 14.0, optional vproc number ranges are: AMPs – 0, 1, 2, … PEs – 30719, 30718, 30717, … GTW – 22528, 22529, 22530, … TVS – 28671, 28670, 28669, … PDE – 30720, 30721, 30722, … RSG – 26623, 26622, 26621, … (Optional) Page 10-16 How Teradata uses MPP Systems Linux Vproc Number Assignment Each Teradata Vproc is assigned a unique Vproc number in the system. 
For example: Typical Vproc assignments: AMP Vproc #s (start at 0 and increment by 1) • First AMP 0 • Second AMP 1 • Third AMP 2 PE Vproc #s (start at 16383 and decrement by 1) • First PE 16383 • Second PE 16382 • Third PE 16381 Optional Vproc assignments starting with Teradata 14.0: Appear in DD/D and utilities such as Teradata Administrator, Viewpoint etc. AMP Vproc #s (start at 0 and increment by 1) • First AMP 0 • Second AMP 1 • Third AMP 2 PE Vproc #s (start at 30719 and decrement by 1) • First PE 30719 • Second PE 30718 • Third PE 30717 How Teradata uses MPP Systems Page 10-17 Disk Arrays from a O.S. Perspective The Operating System is used to read and/or write data to/from an individual disk. Disk arrays trick the operating system into thinking it is writing to a single disk. A disk array LUN looks to the operating system like a single disk. When the operating system gets ready to do a read or a write, the disk array controller steps in and says, “I’ll handle that for you”. The operating system says, “I am writing to a single disk and its address is c10t0d0s1”. The operating system does not directly read or write to a disk in a disk array environment. The operating system communicates with the disk array controller. The operating system actually reads or writes the data from a logical unit (often referred to as a LUN or a Volume). A logical unit (LUN) or Volume is a logical disk and not a physical disk. The operating system does not know (or care) if a LUN or Volume is RAID 0, RAID 1, or RAID 5. The operating system does not know if the drive group is one disk, two disk, or four disks. The operating system does not know if the data is spread across one disk or four disks. The operating system simply sees the logical unit as a single disk. The standard operating system utilities that are used to manage, configure, and utilize a physical disk are also used to manage, configure, and utilize a logical disk or LUN. With the Teradata Database, the PUT utility is used to configure the disk array. The array controller performs the actual input/output operations to its disks. The array controller is responsible for handling the different RAID technologies. Page 10-18 How Teradata uses MPP Systems Disk Arrays from an O.S. Perspective A logical unit (LUN) or Volume is a single disk to the operating system. – The operating system does not know or care about the specific RAID technology being used for a LUN or Volume. – The operating system uses LUNs to communicate with disk array controllers. – It is possible to divide a LUN into one or more partitions (or slices for MP-RAS). Operating System LUN 0 Disk 1 in Array Disk 2 in Array LUN 1 Disk 3 in Array Disk 4 in Array …… …… LUN 59 Disk 119 in Array Disk 120 in Array The operating system (e.g., Linux) thinks it is reading and writing to 60 logical disks. How Teradata uses MPP Systems Page 10-19 Logical Units and Partitions A logical unit (just like a physical disk) can be divided into multiple partitions (or slices with MP-RAS). A partition is a portion of a logical unit. A partition is typically used in one of two ways. Used to hold the Linux file system on SMP node internal disks. Provides a raw data storage area (raw disk partition) that is used by Teradata. EMC2 Notes EMC2 DMX disk arrays are configured with 4-way Hyper (disk slice) Meta volumes which are seen as LUNs at the host or SMP level. Each drive is divided into 4 equal size pieces (effectively slices within the array). 
4 slices (across 4 disks) make a LUN that is presented to the operating system. Meta volumes are used to reduce the number of LUNs and minimize registry entries in a Windows system. Acronym: FS – File System Page 10-20 How Teradata uses MPP Systems Logical Units and Partitions With Linux, a logical unit (LUN) or Volume can be divided into one or more partitions. • With MP-RAS systems, the portions of a LUN are referred to as slices. How are partitions typically used by Teradata? • Provides raw data storage area (raw disk partition) for Teradata. • A Pdisk (Teradata) is a name that is assigned to a partition (slice) within a LUN. LUN LUN Single Partition - raw disk space Multiple Partitions - each is raw disk space - each is assigned to a different Pdisk One Pdisk Multiple Pdisks How Teradata uses MPP Systems Page 10-21 Teradata and Disk Arrays The Teradata Database has long been recognized as one of the most effective database platforms for the storage and management of very large relational databases. The Teradata Database implementation executes as an application under the operating system. The two key pieces of software that make up the Teradata Database are the PE software and the AMP software. Users access the Teradata Database by issuing SQL commands - usually from channel-attached hosts or LAN-attached workstations. The user request is handled by Channel Driver or Gateway software and is passed to a Parsing Engine (PE) which processes the SQL request. PE software manages the user session, interprets (parses) the SQL request, creates an execution plan, and dispatches the steps of that plan to the AMP(s). AMPs provide access to user data stored within tables that are physically stored on disk arrays. Each AMP is associated with a Vdisk. Each AMP sees its Vdisk as a single disk. Teradata Database (AMP software) organizes its data on its disk space (Vdisk) using a Teradata Database “File System” structure. A “master index” is used to locate “cylinder indexes” which are used to locate data blocks that contain data rows. A Vdisk is actually composed of multiple slices (also called Pdisks - Physical disk) that are part of a LUN (Logical Unit) in a disk array. The operating system (e.g., Linux) and the array controllers work at the LUN level. A logical unit (just like a physical disk) can be divided into multiple slices. A slice is a portion of a logical unit. An AMP is assigned to a Vdisk. A Vdisk is composed of one or more Pdisks. In Linux, a Pdisk is assigned to a partition within a LUN. The PUT utility is used to define a Teradata Database configuration. Page 10-22 How Teradata uses MPP Systems Teradata and Disk Arrays Teradata Pdisk = Linux/Windows Partition User AMP File System Software Vdisk Pdisk 0 Single Disk Pdisk 1 PE TVS PDE O.S. How Teradata uses MPP Systems Disk Array Controller Logical Disks LUN 0 LUN 1 Pdisk 0 Pdisk 1 Page 10-23 Teradata 6650 (2+1) Logical View The facing page illustrates a logical view of the Teradata Database using a 6650 2+1 clique. The design center configuration for a 6650H (Linux) 2+1 clique is as follows: 4 Drives per AMP 30 AMPs per Node Each virtual AMP is assigned to a virtual disk (Vdisk). AMP 0 is assigned to Vdisk 0 which consists of 2 mirrored pairs of disks. Each AMP has a Vdisk with 592,021 cylinders. Note: The actual MaxPerm space that is available to an AMP is slightly less than the physical disk space because of file system overhead. Approximately 90 - 91% of the physical disk space is actually available as MaxPerm space.
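As a quick arithmetic check of the clique configurations described in this module, the disk counts follow directly from nodes x AMPs per node x drives per AMP. The snippet below simply re-derives the figures quoted for the 6650H 2+1 and 3+1 cliques; it is illustrative Python, not output of any Teradata utility.

```python
# Arithmetic check of the clique disk counts quoted in this module.

def disks_in_clique(tpa_nodes, amps_per_node, drives_per_amp):
    return tpa_nodes * amps_per_node * drives_per_amp

# 6650H 2+1 clique: 2 TPA nodes, 30 AMPs/node, 4 drives (2 mirrored pairs) per AMP
print(disks_in_clique(2, 30, 4))   # 240 disks across the two arrays

# 6650H 3+1 clique: 3 TPA nodes, 42 AMPs/node, 2 drives (1 mirrored pair) per AMP
print(disks_in_clique(3, 42, 2))   # 252 disks across the two arrays
```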
Page 10-24 How Teradata uses MPP Systems Teradata 6650 (2+1) Logical View 6650H Node AMP AMP 0 1 Vdisk 0 AMPs 2 – 28 6650H Node AMP AMP AMP 29 30 31 Vdisk 29 Vdisk 1 120 Disks Vdisk 30 AMPs 32 – 58 AMP 59 Vdisk 59 Vdisk 31 120 Disks Two Disk Arrays with 240 Disks – Logical View Typical configuration is to assign each AMP with two mirrored pairs of disks. How Teradata uses MPP Systems Page 10-25 Teradata 6650 (3+1) Logical View The facing page illustrates a logical view of the Teradata Database using a 6650 3+1 clique. The design center configuration for a 6650H (Linux) 3+1 clique is as follows: 2 Drives per AMP 42 AMPs per Node Each virtual AMP is assigned to a virtual disk (Vdisk). AMP 0 is assigned to Vdisk 0 which consists of 1 mirrored pair of disks. Each AMP has a Vdisk with 295,922 cylinders. Note: The actual MaxPerm space that is available to an AMP is slightly less than the physical disk space because of file system overhead. Approximately 90 - 91% of the physical disk space is actually available as MaxPerm space. Page 10-26 How Teradata uses MPP Systems Teradata 6650 (3+1) Logical View 6650H Node AMP 0 Vdisk 0 AMP 1 Vdisk 1 6650H Node AMPs 2 – 40 AMP AMP AMP 6650H Node AMP AMP AMP 83 84 85 AMPs 44 – 82 41 Vdisk 41 126 Disks 42 Vdisk 42 43 Vdisk 43 Vdisk 83 Vdisk 84 Vdisk 85 AMP AMPs 86 – 124 125 Vdisk 125 126 Disks Two Disk Arrays with 252 Disks – Logical View Typical configuration is to assign each AMP with one mirrored pair of disks. How Teradata uses MPP Systems Page 10-27 Example of 1.2 TB Vdisk (pre-TVS) A Vdisk effectively represents a set of disks in a disk array. In this example, a Vdisk represents a rank of 4 disks in a disk array that is configured to use RAID 1 technology. If the disk array has 600 GB disks and RAID 1 protection is used, then one rank of disks (4 disks) has 1.2 TB of available disk space. 4 disks x 600 GB x .50 (parity is 50%) = 1.2 TB* If the Vdisk is configured (assigned) with four 600 GB disks (RAID 1), then the associated AMP has 1.2 TB of perm disk space available to it. The facing page contains a typical example of a 1.2 TB Vdisk. It would contain 592,021 cylinders; each cylinder is 3872 sectors in size. A cylinder is approximately 1.9 MB in size (3872 x 512 bytes). With 600 GB disk drives, 592,021 cylinders are numbered from 0 to 592,020. Cylinder 0 contains control information used by the AMP and does not contain user data. If 73 GB disk drives are used, the AMP's Vdisk will be as follows: Total number of cylinders – 71,853 First Pdisk – 35,924 cylinders (numbered 0 through 35,923) Second Pdisk – 35,929 cylinders (numbered 35,924 through 71,852) If 146 GB drives are used, then the Vdisk will be as following: Total number of cylinders – 144,482 First Pdisk – 72,237 cylinders (numbered 0 through 72,236) Second Pdisk – 72,245 cylinders (numbered 72,237 through 144,481) If 300 GB drives are used, then the Vdisk will be as following: Total number of cylinders – 290,072 First Pdisk – 145,037 cylinders (numbered 0 through 145,036) Second Pdisk – 145,035 cylinders (numbered 145,037 through 290,071) The configuration of LUNs/partitions and the assignment Pdisks/Vdisks to AMPs is done through the PUT utility. As mentioned previously, the actual space that is available to an AMP is slightly less that the numbers used above because of file system overhead. The actual MaxPerm space is approximately 90-91% of the physical disk space. In the example on the facing page, each AMP will have approximately 1080 GB of MaxPerm space, not 1200GB. 
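The cylinder counts and space limits quoted above all derive from the same cylinder geometry (3872 sectors of 512 bytes, roughly 1.9 MB per cylinder). The short sketch below is only a worked restatement of that arithmetic in Python; the constants are taken from the text and the variable names are invented for illustration.

```python
# Worked restatement of the cylinder arithmetic quoted in this module.

SECTOR_BYTES = 512
CYL_SECTORS  = 3872                          # standard cylinder size (pre-13.10)
CYL_BYTES    = CYL_SECTORS * SECTOR_BYTES    # 1,982,464 bytes, ~1.9 MB per cylinder

# A nominal 1.2 TB Vdisk (4 x 600 GB drives, RAID 1) holds 592,021 cylinders:
print(round(592_021 * CYL_BYTES / 1e12, 2), "TB of cylinder space")   # ~1.17 TB

# MaxPerm is roughly 90% of the physical space:
print(round(1.2e12 * 0.90 / 1e12, 2), "TB MaxPerm")                   # ~1.08 TB

# Maximum space one AMP can address (releases up to Teradata 13.0):
max_std = 700_000 * CYL_SECTORS * SECTOR_BYTES
print(max_std, "bytes =", round(max_std / 1024**4, 2), "TB")          # 1,387,724,800,000 bytes, ~1.26 TB

# With large cylinders (23,232 sectors each, Teradata 13.10+), the limit is
# roughly six times larger, which the module rounds to approximately 7.2 TB.
print(round(6 * 1.2, 1), "TB")
```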
Page 10-28 How Teradata uses MPP Systems Example of 1.2 TB Vdisk (pre-TVS) Teradata’s File System software divides the Vdisk into logical cylinders. Typically, each cylinder is 3872 sectors in size. (Diagram: four 600 GB physical disks in two RAID 1 mirrored pairs form two LUNs for one AMP's 1.2 TB Vdisk; Pdisk 0 holds cylinders 0 – 296,011 and Pdisk 1 holds cylinders 296,012 – 592,020.) How Teradata uses MPP Systems Page 10-29 Teradata File System Concepts Each AMP has its own disk space managed by the Teradata Database file system software. The file system software groups physical blocks into logical cylinders. One AMP can address/manage up to 700,000 cylinders. Each cylinder has a cylinder index (CI). Although an AMP can address this large number of cylinders, typically an AMP will only be responsible for a much smaller number of cylinders. For example, an AMP that manages 292 GB of disk space will have 144,482 cylinders. When an AMP is initialized (booted), it reads the Cylinder Indexes and creates an in-memory Master Index to the Cylinder Indexes. Notes: Teradata Database V2R5 to 13.0 – each Cylinder Index is 12 KB in size. The cylinder size is still 3872 sectors. Teradata uses both of the cylinder indexes as alternating cylinder indexes for write (INSERT, UPDATE, and DELETE) operations for all of the supported operating systems. Teradata Vdisk Size Limits For Teradata releases up to Teradata 13.0, the maximum amount of space that one AMP can access is based on the following calculation: 700,000 logical cylinders x 3872 sectors/cylinder x 512 bytes/sector This equals 1,387,724,800,000 bytes or approximately 1.26 TB, where a TB is 1024^4 bytes. Page 10-30 How Teradata uses MPP Systems Teradata File System Concepts The cylinder size is 3872 sectors. For a 1.2 TB Vdisk, there are 592,021 cylinders. Note: However, the amount of actual MaxPerm space is approximately 90% of the actual disk space because of overhead (cylinder indexes, etc.). The maximum disk space an AMP can address is: 700,000 cylinders x 3872 sectors/cylinder x 512 bytes/sector = 1.26 Terabytes. MaxPerm per AMP = 1.2 TB x .90 ≈ 1.08 TB. (Diagram: the Master Index in AMP memory holds an entry for each Cylinder Index – up to approximately 700,000 – and the Cylinder Index in each data cylinder locates that cylinder's data blocks with rows; the size of the Cylinder Index space is 24K.) How Teradata uses MPP Systems Page 10-31 Teradata 13.10 Large Cylinder Support Prior to Teradata 13.10, the maximum space available to an AMP is approximately 1.2 TB. This feature increases the maximum space available to an AMP to approximately 7.2 TB. Benefits of this feature are listed on the facing page. Only systems that are newly initialized (sysinit) with Teradata 13.10 will have large cylinders enabled. Existing systems that are upgraded to 13.10 will have to be initialized (sysinit) in order to utilize large cylinders. A cylinder contains Perm, Spool, Temporary, Permanent Journal, or WAL data, but NOT a combination. For an existing system, large cylinders result in fewer cylinders that are available for different types of data. Fewer cylinders can result in low cylinder conditions occurring more quickly and possibly more Mini-CylPacks. If the larger cylinder size is used on an existing system where each AMP has much less space than 1.2 TB, then the number of available cylinders will be much less. For example: Assume a 5650 system with disk arrays populated with 600 GB drives.
An AMP will typically have 4 drives assigned to it (2 sets of mirrored disks). Therefore, the AMP will have approximately 1200 GB of available space. This space is divided into approximately 592,000 cylinders. – Note: The actual MAXPERM space available to an AMP in this example is approximately 1080 GB (90% of 1200 GB). If this system is configured with large cylinders, then the system will only have approximately 99,000 cylinders. Large cylinders consume more physical space, resulting in fewer overall cylinders. When to Use This Feature A customer should consider enabling Large Cylinders if: The initial system will be sized above the current 1.2 TB per AMP limit. It is possible that future expansion would cause the per AMP limit of 1.2 TB to be exceeded. The customer anticipates the need to utilize larger row sizes (e.g., 1 MB rows) in a future release. A customer should NOT enable Large Cylinders if: AMPs on the system are sized considerably less than 1.2 TB with no plans to expand beyond that limit. Large cylinders consume more physical space, resulting in fewer overall cylinders. Page 10-32 How Teradata uses MPP Systems Teradata Large Cylinder Support Starting with Teradata 13.10, this feature increases the maximum space available to an AMP to approximately 7.2 TB. Benefits of this feature are: • To utilize larger disk drives, AMPs must be able to address more storage space. – Example: 2650 systems utilizing 2 TB disk drives • Customers that require a large capacity of storage space have the option of increasing their storage per AMP, rather than increasing the number of AMPs. • The maximum row size will most likely increase in future releases. Larger cylinders are more space efficient for storing large rows. – The maximum row size (~64 KB) is unchanged in 14.0 If large cylinders are enabled for a Teradata 13.10 or 14.0 system, then the maximum space that an AMP can access is 6 times greater, or approximately 7.2 TB: max # of cylinders (~700,000) x #sectors in a cylinder (23,232) x sector size (512 bytes) ≈ 7.2 TB. • Each cylinder index has increased to 64 KB to accommodate more blocks in a large cylinder. Only newly initialized (sysinit) systems can have large cylinders enabled. • Existing systems upgraded to 13.10 have to be initialized (sysinit) in order to utilize large cylinders. How Teradata uses MPP Systems Page 10-33 Full Cylinder Read Full Cylinder Read allows retrieval operations to run more efficiently by reading a list of cylinder-resident data blocks with a single I/O operation. This reduces I/O overhead from once per data block to once per cylinder. A data block is a disk-resident structure that contains one or more rows from the same table and is the smallest I/O unit for the Teradata Database file system. Data blocks are stored in physical disk sectors or segments, which are grouped in cylinders. Full Cylinder Read improves the performance of systems with both fine-grained operations and decision support workloads. It eliminates the tradeoffs for short queries and concurrent updates versus strategic queries. Performance may benefit from Full Cylinder Read during operations such as: Full-table scan operations under conditions such as: Large select, Merge insert/select and Merge delete, Aggregation: Sum, Average, Minimum/Maximum, Count; Join operations that involve many data blocks, such as merge joins, product joins, and inner/outer joins. Starting with Teradata 13.10, this feature no longer needs to be tuned using a Cylinder Slots/AMP setting.
This allows for more extensive use of cylinder read without the need to reserve portions of the FSG cache for Cylinder Read when Cylinder Read is not being used. Prior to Teradata 13.10, it was necessary to specify the number of cylinder slots per AMP that would be available. The default number of cylinder slots per AMP is: 6 on 32-bit systems with model numbers lower than 5380. 6 on 32-bit coexistence systems with “older nodes.” An “older node” is a node for a system with a model number lower than 5380. 8 on the 32-bit systems with model numbers at 5380 or higher. 8 on 64-bit systems. Teradata Database Customer Support sets the CR flag to ON and uses the ctl (control) utility to modify the number of cylslots. Memory allocated to cylinder slots can only be used for cylinder reads. The benefit of cylinder reads is likely to outweigh the reduction in generic FSG cache. Page 10-34 How Teradata uses MPP Systems Full Cylinder Read The Full Cylinder Read feature allows data to be retrieved with a single cylinder (large) read, rather than individual reads of blocks. CYLINDER Data Block DB DB DB DB DB ~1.9 MB Enables efficient use of disk & CPU performance resources for the following table scan operations under specific conditions. Examples include: – – – – large selects and aggregates: sum, avg, min, max, count joins: merge joins, product joins, inner/outer joins merge delete merge insert/select into empty or populated tables full table update/deletes With Teradata 13.10, it is no longer necessary to specify a number of cylinder slots to make available for this feature. – This 13.10 enhancement allows for more extensive use of cylinder reads without the need to reserve portions of the FSG cache for the Cylinder Read feature. – Prior to 13.10, the number of cylinder slots was set using the ctl utility. The default was 8 for 64bit operating systems. How Teradata uses MPP Systems Page 10-35 Summary The facing page summarizes the key points and concepts discussed in this module. Page 10-36 How Teradata uses MPP Systems Summary • Memory managed and used by the operating system and the vprocs is sometimes called “free memory”. • PDE software manages FSG Cache. – FSG Cache is primarily used by the AMPs to access memory resident database segments. • The operating system and Teradata does not know or care about the RAID technology being used. • A LUN or Volume looks like a single disk to the operating system. – With Linux or Windows, a LUN or Volume is considered a partition and the raw partition is assigned to a Teradata Pdisk. How Teradata uses MPP Systems Page 10-37 Module 10: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 10-38 How Teradata uses MPP Systems Module 10: Review Questions 1. Which two are placed into FSG cache? a. Hash maps b. Master Index c. Cylinder Indexes d. Permanent data blocks 2. What is the WAL Depot used for? a. UNDO Rows b. New data blocks c. Master Index updates d. Write-in-place data blocks 3. Which two are placed into the WAL Log? a. REDO Rows b. UNDO Rows c. New data blocks d. Master Index updates e. Write-in-place data blocks 4. Describe the fundamental relationship between Linux, logical units, and disk array controllers. ________________________________________________________________________________ 5. Describe the fundamental relationship between AMPs, Vdisks, Pdisks, Partitions, and LUNs. 
________________________________________________________________________________ ________________________________________________________________________________ How Teradata uses MPP Systems Page 10-39 Notes Page 10-40 How Teradata uses MPP Systems Module 11 Teradata Virtual Storage After completing this module, you will be able to: List two benefits of Teradata Virtual Storage. List the two operational modes of TVS. Identify the difference between temperature and performance. Identify typical data that is identified as hot data. Teradata Proprietary and Confidential Teradata Virtual Storage Page 11-1 Notes Page 11-2 Teradata Virtual Storage Table of Contents Teradata Virtual Storage ............................................................................................................ 11-4 Teradata Virtual Storage Concepts ............................................................................................ 11-6 Allocation Map and Statistics Overhead ................................................................................ 11-6 TVAM .................................................................................................................................... 11-6 Teradata Virtual Storage Terminology ...................................................................................... 11-8 Teradata Virtual Storage Components ................................................................................... 11-8 TVS Operational Modes .......................................................................................................... 11-10 Expanding Data Storage Concepts ........................................................................................... 11-12 Multi-Temperature Concepts ................................................................................................... 11-14 Storage Performance vs. Data Temperature............................................................................. 11-16 Teradata with Hybrid Storage .................................................................................................. 11-18 What Goes Where? .................................................................................................................. 11-20 Multi-Temperature Data Example ........................................................................................... 11-22 Teradata 6690 Cabinets ............................................................................................................ 11-24 Virtualized Management Server (VMS) .............................................................................. 11-24 HHD to SSD Drive Configurations.......................................................................................... 11-26 Summary .................................................................................................................................. 11-28 Module 11: Review Questions ................................................................................................. 11-30 Teradata Virtual Storage Page 11-3 Teradata Virtual Storage Teradata Virtual Storage (TVS) is designed to allow the Teradata Database to make use of new storage technologies. It will allow you to store data that is accessed more frequently on faster devices and data that is accessed less frequently on slower devices. It will also allow Teradata to make use of solid state drives (SSD), for example, whenever the technology is available at a competitive price. 
Solid state refers to the use of semiconductor devices. Teradata Virtual Storage is responsible for: pooling clique storage and allocating cylinders from the storage pool to individual AMPs tracking where data is stored on the physical media maintaining statistics on the frequency of data access and on the performance of physical storage media These capabilities allow Teradata Virtual Storage to provide the following benefits: Storage optimization, data migration, and data evacuation Teradata Virtual Storage maintains statistics on frequency of data access (“data temperature”) and on the performance (“grade”) of physical media. This allows the Teradata Virtual Storage product to intelligently place more frequently accessed data on faster physical storage. As data access patterns change, Teradata Virtual Storage can move (“migrate”) storage cylinders to faster or slower physical media within each clique. This can improve system performance over time. Teradata Virtual Storage can migrate data away from a physical storage device in order to prepare for removal or replacement of the device. This process is called “evacuation.”. Complete data evacuation requires a system restart, but Teradata Virtual Storage supports a “soft evacuation” feature that allows much of the data to be moved while the system remains online. This can minimize system down time when evacuations are necessary. Lower Barriers to System Growth Device management features of Teradata Virtual Storage provide the ability to pool storage within each clique. Each storage device (pdisk) can be shared, if necessary, by all AMPs in the clique. If the number of storage devices is not a multiple of the number of AMPs in the clique, the extra storage will be shared. Consequently, storage can be added to the system in smaller increments, as needs and opportunities arise. Page 11-4 Teradata Virtual Storage Teradata Virtual Storage What is Teradata Virtual Storage (TVS)? • TVS (Teradata 13.0) is a change to the way in which Teradata accesses storage. • Purpose is to manage a Multi-Temperature Warehouse. • Pools all of the cylinders within a clique's disk space and allocates cylinders from this storage pool to individual AMPs. Advantages include: • Simplifies adding storage to existing cliques. – Improved control over storage growth. You can add storage to the clique-storage-pool versus to every AMP. – Allows sharing of storage devices among AMPs. • Enables mixing drive sizes / speeds / technologies – Enables the “mixing” of storage devices (e.g., spinning disks, Solid-State Disks – SSD). • Enables non-intrusive migration of data. – The most frequently accessed data (hot data cylinders) can migrate to the high performing cylinders and infrequently accessed data (cold data cylinders) can migrate to the lower performing cylinders. Teradata Virtual Storage Page 11-5 Teradata Virtual Storage Concepts The facing page illustrates the conceptual differences with and without Teradata Virtual Storage One of benefits of Teradata Virtual Storage is the ease of adding storage to an existing system. Before Teradata Virtual Storage, Existing systems have integral number of drives / AMP Today adding storage requires an additional drive per AMP – means 50% or 100% increase in capacity With Teradata Virtual Storage, you can add any number of drives. Added drives are shared by all AMPs These new disks may have different capacities and / or performance than those disks which already reside in the system. 
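One way to picture the pooled storage described above is that TVS sums the cylinders on all of the pdisks in the clique (subpool) and gives each AMP an equal share as its maximum allocation, so added drives are automatically shared by every AMP in the clique. The sketch below is a conceptual illustration of that division only, with invented numbers; it is not TVS code.

```python
# Conceptual illustration of pooled clique storage under TVS (not actual TVS code).

def per_amp_max_cylinders(pdisk_cylinder_counts, amps_in_clique):
    """All cylinders on the clique's pdisks form one pool; each AMP may
    allocate up to an equal share of that pool."""
    total = sum(pdisk_cylinder_counts)
    return total // amps_in_clique

# Example: a clique with 8 equal pdisks, then 2 larger drives added later.
original = [290_000] * 8
expanded = original + [580_000] * 2          # added storage is shared by all AMPs

print(per_amp_max_cylinders(original, 30))   # per-AMP share before the expansion
print(per_amp_max_cylinders(expanded, 30))   # larger per-AMP share after adding drives
```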
Cylinders IDs (with TVS) are unique in the system and are 8 bytes in length as compared to 4 bytes in length before TVS (Teradata 12.0 and before). Allocation Map and Statistics Overhead The file system requires space on each pdisk for its allocation map and statistics areas. The number of cylinders required depends on the pdisk size as specified in the vconfig GDO. TVAM TVAM is a support utility to control and monitor Teradata Virtual Storage. TVAM … Page 11-6 Includes “-optimize” command to cause forced migration Includes “evacuate” and “un-join” command to enable removing a drive Teradata Virtual Storage Teradata Virtual Storage Concepts Pre-TVS – AMPs own storage TVS owns storage AMPs don't know physical location of a cylinder and it can change. AMP AMP AMP AMP TVS Extent (Cylinder) Driver Pdisk Pdisk Pdisk Pdisk Pdisk Pdisk Pdisk Pdisk Cylinders were addressed by drive # and cylinder #. All of the cylinders in clique are effectively in a pool that is managed by that TVS vproc. Cylinders are assigned a unique cylinder id (virtual id) across all of the pdisks. Teradata Virtual Storage Page 11-7 Teradata Virtual Storage Terminology The facing page lists and defines some of the key terms used with Teradata Virtual Storage (TVS). A subpool is a set of pdisks. There is typically one subpool per clique. Single clique systems have 2 subpools, so we can spread the AMP clusters across the subpools to achieve fallback. It is very important to understand that TVS is configured on a clique by clique basis. For a multi-clique system each clique typically has one subpool. This is where we configure the AMP clusters across the cliques. No two AMPs in the same cluster should be configured in the same clique. TVS will take all the cylinders it finds in the clique (actually the subpool), and will divide that by the number of AMPs in the clique. This is the maximum that each AMP can allocate and is communicated back to the AMP so that it can size its master index. Each AMP can allocate cylinders as it needs cylinders up to that maximum. If some AMPs allocate more or less than other AMPs at any given time, it does not cause a problem because the space is not over-subscribed and no AMP can allocate more than its maximum. Teradata Virtual Storage Components 1. The DBS to File System interface is unchanged. 2. The file system calls SAM (Storage Allocation Manager) at startup to obtain the list of allocated extents which can be used to rebuild or verify the MI (Master Index) and WMI (WAL Master Index). SAM also reports the maximum size of the vdisk for this AMP. 3. The file system makes calls on the SAM library interface in order to allocate and free the extents (cylinders) of storage. SAM provides an opaque handle for each allocated extent virtualizing the storage from the file system’s point of view. 4. SAM sends messages to this AMP’s Allocator to allocate/free the extents requested. Usually this will be on the same node as the AMP, but may require a node hop if the AMP and Allocator (part of VSS Vproc) have migrated due to a node failure. 5. The Allocator keeps the Extent Driver apprised of all virtual to physical extent mappings. Sometimes this requires a message to the node agent on another node in case of vproc migration. The Allocator keeps a copy of its translation maps on physical storage in case of node crash. It performs these I/Os directly. 6. The file system uses extent handles when communicating FSGids to FSG. 7. 
FSG passes the extent handle as the disk address for disk I/O to the Extent Driver. 8. The Extent Driver translates the handle to a physical extent address before passing the request to the disk driver. Page 11-8 Teradata Virtual Storage Teradata Virtual Storage Terminology TVS Software • Consists of TVS (previously named VSS) vproc which includes Allocator and Migrator code • Includes Extent Driver (in kernel) which interfaces between PDE and operating system Cylinder (or Extent) • Allocation unit of disk space (currently 3872 sectors) from TVS to an AMP Pdisk • Partition/slice of a physical disk; associated with a TVS vproc Subpool • Group of disks (effectively a set of pdisks) assigned to a TVS vproc • Fallback clustering is across subpools; a single AMP is associated with a specific subpool • 1 subpool/clique except for single-clique systems which have 2 subpools Storage Performance • TVS profiles the performance of all the disk drives (e.g., spinning disks versus SSD) • With spinning disks, outer zones have a faster transfer rate than inner zones on a disk. Temperature • Frequency of access of a cylinder (Hot – Cold). TVS gathers metrics on data access frequency. Migration • Movement of cylinders between disks or between locations within a disk. Teradata Virtual Storage Page 11-9 TVS Operational Modes Teradata Virtual Storage operates in one of two modes: Teradata Traditional – Mimics prior Teradata releases Intelligent Placement (a.k.a., 1D) – Data temperature based placement Teradata Traditional (TT) mode Characteristics: When using configurations modeled with the standard interface, the TVS software is used in Teradata Traditional (TT) Mode. TT mode is available for all operating systems. In TT mode, TVS software uses similar placement algorithms as pre-TVS Teradata software. There is no migration of hot data to fast locations in TT mode Use Teradata Traditional mode when No mixing of array models in a clique AND No mixing of disk sizes in an array AND All Pdisks are the same size (homogeneous storage) AND Performance capability of all Pdisks is equal (within 5%) AND Migration is not desired AND The number of Pdisks is an integer multiple of the AMPs in the clique. This is not a strict requirement. In this case, any fractional Pdisks will go unused in TT mode. Intelligent Placement (1D – 1 Dimensional) mode Characteristics Intelligent Placement is only available for Linux 64-bit operating systems. This mode is used when any of the following are true: Mixing of array models in a clique Mixing of disk sizes in an array Pdisks in a clique are different sizes When TVS software is used in Intelligent Placement (1D) Mode: TVS software uses advanced data placement algorithms Migration of hot data to fast locations and cold data to slower locations can be enabled in 1D mode. 
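Taken together, these characteristics and the "use when" criteria listed in this module amount to a simple decision rule for picking an operational mode. The function below restates that rule as illustrative Python; it is not part of PUT or any Teradata configuration tool, and the parameter names are invented.

```python
# Illustrative restatement of the mode-selection criteria (not a Teradata tool).

def choose_allocation_mode(pdisk_sizes_equal, perf_within_5_pct,
                           pdisks_multiple_of_amps, migration_desired,
                           linux_64bit):
    """Return 'Teradata Traditional' or 'Intelligent Placement (1D)'."""
    if (pdisk_sizes_equal and perf_within_5_pct
            and pdisks_multiple_of_amps and not migration_desired):
        return "Teradata Traditional"
    # Mixed sizes/performance, uneven pdisk counts, or a desire for migration
    # call for Intelligent Placement, which requires a 64-bit Linux system.
    return "Intelligent Placement (1D)" if linux_64bit else "Teradata Traditional"

print(choose_allocation_mode(True, True, True, False, True))   # Teradata Traditional
print(choose_allocation_mode(False, True, True, True, True))   # Intelligent Placement (1D)
```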
Use Intelligent Placement (1D) when – Pdisks are multiple sizes OR – Performance capability of any Pdisks is different (> 5%) OR – The number of Pdisks of each size in the clique is not a multiple of the number of amps in the clique OR – Migration is desired Page 11-10 Teradata Virtual Storage TVS Operational Modes Teradata Virtual Storage (Storage Allocation) operates in one of two modes: • Teradata Traditional – works like prior Teradata releases • Intelligent Placement (a.k.a., 1D) – data temperature based placement Operational Mode Teradata Traditional Intelligent Placement Operating System Support All Linux Mixed Disk and/or Mixed Array No Yes Small Growth Increments No Yes Data Migration No Yes Disk Evacuation No Yes Note: Evacuation is used to migrate all allocated extents (cylinders) from the selected storage to different devices. This may be done when a disk goes bad or if a disk is to be removed from the storage pool. Teradata Virtual Storage Page 11-11 Expanding Data Storage Concepts When adding non-shared storage to a clique on a system with Teradata Virtual Storage Services (TVS), the number of devices (Pdisks) added should be a multiple of the number of AMPs in the clique. The allocation method can be either 1D migration or Teradata Traditional (TT). When adding shared storage to a clique on a system with Teradata Virtual Storage Services (TVS), the storage added will be shared among all AMPs. The allocation method is 1D migration only. In addition to utilizing the existing storage capacity more efficiently with temperature based placement, TVS simplifies the ability to alter the storage capacity of a Teradata system. As previously mentioned, database generations prior to Teradata Database 13.0 typically allocated the entire capacity of each drive in the system to a single AMP. That design parameter, coupled with the need for each AMP to have identical amounts of available storage meant that system-wide storage capacity could only be increased by adding enough new disks (of the same size) to provide each AMP in the system with an entire new drive. Thus, a system with 100 AMPs would typically have a minimum increment of growth of 100 drives (actually 200 drives using RAID-1) which have identical performance and capacities to the existing drives. With TVS, storage capacity can be added in a much more flexible manner. Instead of requiring each drive to be dedicated to a single AMP, TVS can subdivide a device into equivalent groups of cylinders which can then be allocated to each AMP, allowing even single drive pairs to be equally shared by all of the AMPs in a clique. This “fine grained virtualization” enables the growth of storage capacity using devices with differing capacities and speeds. For multi-clique or co-existence systems, new storage capacity would still have to be added to each clique in an amount that is proportional to the number of AMPs in each clique. Page 11-12 Teradata Virtual Storage Expanding Data Storage Concepts Storage can be added to a clique and is shared between all the AMPs within the clique. Expanded storage within a clique is treated as "cold storage". 
AMP 0 …… AMP 1 AMP 24 TVS Extent (Cylinder) Driver Added storage to the clique Pdisk 0 Pdisk 1 Pdisk 2 Pdisk 3 Physical Disk Physical Disk Physical Disk Physical Disk …… Pdisk 48 Pdisk 49 Pdisk 50 Pdisk 51 Physical Disk Physical Disk Physical Disk Physical Disk Mirrored Disk Mirrored Disk Mirrored Disk Mirrored Disk …… Mirrored Disk Mirrored Disk Mirrored Disk Teradata Virtual Storage Mirrored Disk Page 11-13 Multi-Temperature Concepts The facing page identifies two key areas associated with Multi-Temperature environments. With today’s disk technology, the user experiences faster data transfer with data that is located on the outer zones of a disk drive. This is because there is more data accessed per disk revolution. Data located on the inner zones experience slower data transfer because more disk revolutions are needed to access the same amount of data. Teradata Virtual Storage can track data temperature over time and can move data to appropriate region. A Multi-Temperature Warehouse has the ability to prioritize the use of system resources based on business rules while maximizing utilization of storage with ever increasing capacity Teradata Virtual Storage enhances performance with multi-temperature data placement. Page 11-14 Teradata Virtual Storage Multi-Temperature Concepts Two related concepts for Multi-Temperature Data: Performance of Storage • Differences between different drives on different controllers (spinning disk vs. SSD) • Differences between different areas within a drive – Outer zones on a disk have fastest transfer rates. Data Access Pattern (Frequency) or Temperature is determined by: • Frequency of access • Frequency of updates • Data maintenance Faster Data Transfer (more data per revolution) • Depth of history Slower Data Transfer (less data per revolution) Teradata Virtual Storage Page 11-15 Storage Performance vs. Data Temperature For the purposes of describing the performance characteristics of storage devices, we’ll use terms like “fast”, “medium” and “slow” to describe the relative response times (grade) of the physical cylinders that comprise each device. The important thing to keep in mind is that temperatures (hot, warm, and cold) refer to data access and grade (fast, medium, slow) refer to the speed of physical devices. PUT executes the TVS Profiler on one node per clique. This is done during initial install while the system is essentially idle. The TVS Profiler measures and records the cylinder response times of one disk of each size (i.e., 146 GB, 300 GB, 450 GB, 600 GB, etc.) and type (i.e., Cheetah-5). These metrics are then used for all disks of the same size and type in the clique throughout the life of the system. This can also be done using the TVAM utility. Data temperature values are maintained by the TVS Allocator vproc. They are viewable at an extent level via the new TVAM (Teradata Virtual Administration Menu) command and at a system, AMP, and table level via the Ferret utility SHOW command. The data temperatures are calculated as a function of both frequency of access and “recency” of access and are measured relative to the temperatures of the rest of the user data stored in the warehouse. This concept of “recency” is important because it allows the data temperatures to cool off as time passes so that even if a table or partition has a lot of historical access, the temperature of that data appears lower than data that has the same amount of access during a more recent time period. 
This trait of data becoming cooler over time is commonly referred to as data aging. But just because data is older doesn’t imply that it will only continue to get cooler. In fact, cooler data can become warm/hot again as access increases. For example, sales data from this quarter may remain hot until several months in the future as current quarter/previous quarter comparisons are run in different areas of the business. After 6 months, that sales data may cool off until nearly a year later when it becomes increasingly queried (i.e. becomes hotter) by comparison against the new current quarter’s sales data. Teradata Virtual Storage enhances performance with multi-temperature data placement. Page 11-16 Teradata Virtual Storage Storage Performance vs. Data Temperature Storage Performance relative response times – (e.g., fast, medium, slow). • Profiles the performance of all the disk drives (e.g., SSD versus spinning disks) • Identifies performance zones (usually 10) on each spinning disk drive Data Access Frequency – referred to as "Data Temperature" (e.g., hot, warm, cold). • TVS records information about data access (called Profiling and Metric Collection) – How long it takes to access data (I/O response times) – How often data is accessed (effectively the data temperature) TVS places data for optimal access based upon storage performance, type of data (WAL, Depot, Spool, etc.) and the results of metric collection. • Initial Data Placement • Migration of data based upon data temperature Three types of Data Migration: • Background Process During Queries – moves 10% of data in about a one week • Optimize Storage Command (Database Off-Hours) - moves 10% of data in about eight hours – Ignores other work – just runs “flat out” • Anticipatory Migration to Make Room in Fast Reserve, Fast or Warm Storage for Hotter Data (when needed) Teradata Virtual Storage Page 11-17 Teradata with Hybrid Storage The facing page illustrates an example of a Teradata system with both HDD and SDD drives. Page 11-18 Teradata Virtual Storage Teradata with Hybrid Storage Mix of Fast SSD and HDD Spinning Drives Node Node AMPs AMPs HSN Node AMPs Node AMPs HSN Node Node AMPs AMPs HSN SSD > 300 MB/Sec HDD (Spinning) 15 MB/Sec Teradata Virtual Storage Page 11-19 What Goes Where? Virtualization is the Key to Managing Multi-Temperature Data. 
TVS “knows” the performance characteristics of the different storage devices TVS virtualizes location information for the database and stores the translation information in its meta-data TVS collects usage information and adds it to the meta-data TVS uses these metrics to assign a temperature to each cylinder (extent) of data – Each time cylinder is accessed it heats up – Over time all cylinders cool off TVS migrates data from one performance zone to another as appropriate to match temperature to storage performance Initial Data Placement Based on several factors and controls File System indicates to TVS expected temperature of the data, and its use (permanent tables, spool, WAL, other temp tables) TVS allocates from appropriate performance device – SSDs for hot data – Inside cylinders of HDD for cold data – All the rest of HDD for all else (warm) Initial Data Temperature for Permanent Data Page 11-20 All new cylinders are assigned an initial temperature Defaults for each type are specified in DBSControl When loading into empty tables, if don’t want the default then temperature can be specified by the user via Querybanding the Session doing the loading When adding data to existing tables, new data assumes the temperature of the data already in the table at the location it is inserted. – Possible to forcibly change temperature of existing table or part of table via Ferret – this is not a recommended management tool – Changing temperature does not move data, just make it subject to normal migration – Over time, temperature will return to ambient Teradata Virtual Storage What Goes Where? Migration is done at the Cylinder level. Depot, WAL, and Spool cylinders are allocated as HOT • 20% of Fast storage (SSD) is reserved for this Depot, WAL, and Spool. • This region is called the Fast Reserve. – This does not limit the total amount of WAL or Spool. • When Fast Reserve is full, use Fast or even Medium for WAL and Spool allocations. • These cylinders types are not subject to “cooling off”, their temperature is static. Loading perm data into an empty table defaults to HOT • This is assumed to be a short-lived staging table. • If not, this default can be changed with DBSControl. • Another option is to specify Initial Data Temperature with Query Band. – SET QUERY_BAND = 'TVSTemperature=COLD;' UPDATE FOR SESSION; Note: The UPDATE option is used so that this query band statement will not completely replace, but rather supplement, any other query band name:value pairs already specified for the session. Teradata Virtual Storage Page 11-21 Multi-Temperature Data Example The facing page illustrates an example of using a Multi-Temperature Warehouse. Example of Multi-Temperature with a PPI Table: If this is time based (e.g., DATE), then rows of the table are physically grouped by DATE and the groups ordered by DATE, even though hash ordered within the partition for each DATE value. Because the rows are logically grouped together, they reside in a set of cylinders Based on usage patterns, all the cylinders of a partition will have same temperature. As usage drops, partition cools off, eventually its cylinders get migrated out of FAST to MEDIUM, then eventually to SLOW storage. Newly loaded partition will assume temperature of previous latest (probably HOT). While TVS can monitor data temperatures, it can’t change or manipulate the temperature of your data because data temperatures are primarily dictated by the workloads that are run on the warehouse. 
That is, the more queries that are run against a particular table (or tables) the higher its temperature(s). The only way to change a table’s temperature is to alter the number of queries that are run against it. For technical accuracy, TVS temperature is measured at a cylinder level not a data level. The facing page illustrates the result of data migration with Teradata Virtual Storage. Teradata Virtual Storage enables you to more easily, more cost effectively, mix data with different levels of importance on the same system Advantages of Teradata Virtual Storage: Allows incremental growth of storage Provides lower cost method for adding “Cold” data to Enterprise Warehouse w/o Performance Penalty to Bread and Butter Workload Enhances multi-generation co-existence Page 11-22 Teradata Virtual Storage Multi-Temperature Data Example DSS History DSS Current Tactical Hot Warm Heavily Accessed Operational Intelligence Shallow History Cool Regulatory Compliance Trending Analysis Deep History Hybrid Storage Data Usage Model The closer this model fits your data, the more useful the Hybrid system will be. Teradata Virtual Storage Page 11-23 Teradata 6690 Cabinets Each Teradata 6690 cabinets can be configured in a 1+1 or 2+1 clique configuration. A processing/storage cabinet contains one clique. A cabinet with a 2+1 clique contains two processing nodes, one hot standby node, and four disk arrays. A cabinet with a 1+1 clique contains one processing node, one hot standby node, and four disk arrays. Virtualized Management Server (VMS) The VMS is available with the 6690 Enterprise Warehouse Server. Characteristics of the VMS include: • 1U Server that VIRTUALIZES system and cabinet management software onto a single server • Teradata System VMS – provides complete system management functionality – – – – Cabinet Management Interface Controller (CMIC) Service Workstation (SWS) Teradata Viewpoint (single system only) Automatically installed on base/first cabinet • The VMS allows full rack solutions without an additional cabinet for traditional Viewpoint and SWS • Eliminates need for expansion racks reducing customers’ floor space and energy costs • For multi-system monitoring and management traditional Teradata Viewpoint is required. Page 11-24 Teradata Virtual Storage Teradata 6690 Cabinets 6690 Characteristics • Integrated Cabinet with nodes and SSD and HDD arrays in same cabinet. • Each NetApp Drive Tray can hold up to 24 SSD and/or HDD drives. – SSD drives are 400 GB. – HDD drives (10K RPM) are 600 GB. – Possible maximum of 360 disks in the cabinet. • The primary requirement for planning a 6690 system is the completion of the Data Temperature assessment. • There is a range of configurations to meet the requirements of most customers’ data temperature assessments. Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives VMS (1U) HSN TPA Node TPA Node 2+1 Clique in a single cabinet Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives Up to 24 SAS Drives PDU PDU 6690 Teradata Virtual Storage Page 11-25 HHD to SSD Drive Configurations The facing page lists possible hybrid system configurations. Page 11-26 Teradata Virtual Storage HHD to SSD Drive Configurations • There are four preset HDD to SSD configurations (ratios of SSD:HDD per node) which vary slightly between 1+1 and 2+1 cliques. 
Solid State Devices (SSD) Hard Disk Drives (HDD) # of SSD per Clique/per Node # of HDD per Clique/per Node 1+1 Configuration 16* 60 15:60 16* 120 15:120 18* 160 15:160 20 80 20:80 2+1 Configuration 30/15 120/60 30/15 240/120 30/15 320/160 40/20 160/80 • PUT requires specific and even SSD numbers in a clique, thus the difference between a 1+1 and 2+1 disks per node (16 or 18 vs. 15). 18 includes GHS drives. • 6690 nodes are typically configured with 30 AMPs per node. Teradata Virtual Storage Page 11-27 Summary The facing page summarizes the key points and concepts discussed in this module. Page 11-28 Teradata Virtual Storage Summary • TVS is a change to the way in which Teradata accesses storage. • Advantages include: – Simplifies adding storage to existing cliques – Enables mixing drive sizes / speeds / technologies – Enables non-intrusive migration of data • Purpose is to manage a Multi-Temperature Warehouse. • Two related concepts for Multi-Temperature Data – Performance of Storage – Data Access Pattern – Frequency • Pools all of the cylinders within a clique's disk space and allocates cylinders from this storage pool to individual AMPs. Teradata Virtual Storage Page 11-29 Module 11: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 11-30 Teradata Virtual Storage Module 11: Review Questions 1. List two capabilities of using Teradata Virtual Storage. ____________________________________________________________ ____________________________________________________________ 2. List the two operational modes of Teradata Virtual Storage. _______________________________ _______________________________ 3. Which choice is associated with data temperature? a. b. c. d. Skewed data Frequency of access Solid State Disk Drives Inner tracks versus outer tracks on a spinning disk 4. Which data level is migrated from hot to cold storage? a. Row b. Block c. Cylinder d. Subtable 5. Which two types of data are typically considered to be HOT data? a. b. c. d. WAL DBC tables Spool data History data Teradata Virtual Storage Page 11-31 Notes Page 11-32 Teradata Virtual Storage Module 12 Physical Database Design Overview After completing this module, you should be able to: Understand the stages of database design. List and describe the input requirements for database design. List and describe the outputs and objectives for database design. Describe the differences between a Logical, Extended, and Physical Data Model. Teradata Proprietary and Confidential Physical Database Design Overview Page 12-1 Notes Page 12-2 Physical Database Design Overview Table of Contents The Stages of Database Development........................................................................................ 12-4 Example of Data Model – ER Diagram ..................................................................................... 12-6 Customer Service Logical Model............................................................................................... 12-8 Relational Terms Review ......................................................................................................... 12-10 Domains ................................................................................................................................... 12-12 Attributes .................................................................................................................................. 
12-14 Entities and Relationships ........................................................................................................ 12-16 Decomposable Data ................................................................................................................. 12-18 Normal Forms .......................................................................................................................... 12-20 Normalization........................................................................................................................... 12-22 Normalization Example ........................................................................................................... 12-24 Denormalizations ..................................................................................................................... 12-34 Derived Data ............................................................................................................................ 12-36 Pre-Joins ................................................................................................................................... 12-38 Exercise 1: Choose Indexes ..................................................................................................... 12-40 Tables Index Selection ............................................................................................................. 12-42 Database Design Components.................................................................................................. 12-44 Extended Logical Data Model ................................................................................................. 12-46 Physical Data Model ................................................................................................................ 12-48 The Principles of Index Selection ............................................................................................ 12-50 Transactions and Parallel Processing ....................................................................................... 12-52 Module 12: Review Questions ................................................................................................. 12-54 Physical Database Design Overview Page 12-3 The Stages of Database Development Four core stages are identified as being relevant to any database design task. They are: Requirement Analysis involves eliciting the initial set of information and processing requirements from users. Logical Modeling determines the contents of a database independent of a particular physical implementation’s exigencies. – Conceptual Modeling transforms the user requirements into a number of individual user views normally expressed as entity-relationship diagrams. – View Integration combines these individual user views into a single global schema expressed as key tables. The logical model is implemented by taking the conceptual model as input and transforming it into the data model supporting the target relational database management system (RDBMS). The result is the relational data model. Activity Modeling determines the volume, usage, frequency, and integrity analysis of a database. This process also consists of placing any constraints on domains and entities in addition to addressing any legal and ethical issues including referential integrity. Physical Modeling transforms the logical model into a definition of the physical model suitable for a specific software and hardware configuration. 
In relational terms, this is usually some schema expressed in a dialect of the data definition language of SQL. Outputs from these stages are shown on the right of the facing page and are as follows: Business Information Model (BIM) – shows major entities and their relationships – also referred to as “Business Model” – BIM acronym – also used for “Business Impact Model” Logical Data Model (LDM) - should be in Third Normal Form (3NF) – BIM plus all tables, minor entities, PK – FK relationships – constraints and attributes (columns) Extended Logical Data Model (ELDM) – LDM plus demographics and frequencies Physical Data Model (PDM) – ELDM plus index selections and any denormalizations Page 12-4 Physical Database Design Overview The Stages of Database Development Project Initiation Requirements Analysis Initial Training and Research Data Models Project Analysis (typically output from a stage) Logical Modeling Logical Database Design – Conceptual Modeling and View Integration Business Information Model (BIM) and/or Logical Data Model (LDM) Activity Modeling Activity Modeling – Volume – Usage – Frequency – Integrity Extended Logical Data Model (ELDM) Physical Modeling Physical Database Design & Creation Physical Data Model (PDM) Application Development and Testing Production Release Physical Database Design Overview Page 12-5 Example of Data Model – ER Diagram The Customer Service database is designed to handle information pertaining to phone calls by customers to Customer Service employees. The CALL table is the central table in this database. On the facing page is the Entity-Relationship (E-R) diagram of the Customer Service database. This type of model depicts entities and the relationships between them. The E-R diagram provides you with a high-level perspective. ERD Convention Overview The following conventions are generally used in ER diagramming. Symbols in this module are consistent with the ERwin data modeling tool’s conventions. Convention (FK) Example Independent entity. An independent entity does not depend on another entity for its identification. It should have a single-column PK. PK attribute appears above the horizontal line. Dependent entity. A dependent entity depends on one or more other entities for its identification. It generally has multiple columns in its PK, one or more of which is also an FK. All PK attributes appear above the horizontal line. A Foreign Key. An attribute in the entity that is the PK in another, closely related entity. FK columns are shown above or below the horizontal dividing line in all entities, depending on the nature of the relationship. For 1:1 and 1:M relationships, their FKs are below the horizontal line. For M:M relationships the FKs participating in the PK are above the horizontal line. One-to-Zero, One, or Many occurrences (1:0-1-M). Solid lines indicate a relationship (join path) between two entities. The dot identifies the child end of a parent-child relationship between two entities. The dotted line indicates that the child does not depend on the parent for identification. One-to-At least One or More occurrences (1:1-M) One-to-Zero, or at most One occurrence (1:0-1) Zero or One-to-Zero, One, or Many occurrences (0-1:0-1-M). The diamond shape on the originating end indicates the relationship is optional. Physically, this means that a NULL value can exist for an occurrence of any row of the entity positioned at the terminating end (filled dot) of the relationship. Many-to-Many occurrences (M:M). 
A many-to-many relationship, also called a nonspecific relationship, represents a situation where an instance in one entity relates to one or more instances in a second entity and an instance in the second entity also relates to one or more instances in the first entity. Indicates that each parent instance must participate in one and only one sub-type as shown in the LDM. Indicates that parent instances may or may not participate in one of the sub-types as shown in the LDM. Page 12-6 Physical Database Design Overview Example of Data Model – ER Diagram LOCATION_EMPLOYEE EMP# (FK) LOC# (FK) LOCATION LOC# LINE1_ADDR LINE2_ADDR LINE3_ADDR CITY STATE ZIP_CODE CNTRY LOCATION_PHONE LOC# (FK) AREA_CODE PHONE DESCR DEPARTMENT DEPT# MGR_EMP# (FK) DEPT_NAME BUDGET_AMOUNT CUSTOMER CUST# SALES_EMP# (FK) CUST_NAME PARENT_CUST_# PARENT_CUST# (FK) EMPLOYEE EMP# DEPT# (FK) JOB_CODE (FK) LAST_NAME FIRST_NAME HIRE_DATE SALARY_AMOUNT SUPV_EMP# (FK) JOB JOB_CODE DESCR HOURLY_BILLING_RATE HOURLY_COST_RATE EMPLOYEE_PHONE EMP# (FK) AREA_CODE PHONE DESCR CONTACT CONT# CONT_NAME AREA_CODE PHONE EXT LAST_CALL_DATE COMMENT SYSTEM SYS# LOC# (FK) INSTALL_DATE RECONFIG_DATE COMMENT CALL_TYPE CALL_TYPE_CODE DESCR CALL_PRIORITY CALL_PRIORITY_CODE DESCR PART_CATEGORY PART_CAT DRAW# PRICE_AMOUNT DESCR SYSTEM_LOG SYS# (FK) ENTERED_DATE ENTERED_TIME ENTERED_BY_USERID LINE# COMMENT_LINE CALL_DETAIL CALL# (FK) ENTERED_BY_USERID ENTERED_DATE ENTERED_TIME LINE# COMMENT_LINE Physical Database Design Overview CALL_STATUS CALL_STATUS_CODE DESCR CALL CALL# PLACED_BY_EMP# (FK) PLACED_BY_CONT# (FK) CALL_PRIORITY_CODE (FK) TAKEN _BY_EMP# CUST# (FK) CALL_DATE CALL_TIME CALL_STATUS_CODE (FK) CALL_TYPE_CODE (FK) CALLER_AREA_CODE CALLER_PHONE CALLER_EXT SYS# (FK) PART_CAT (FK) ORIG_CALL# (FK) CALL_EMPLOYEE EMP# (FK) CALL# (FK) CALL_STATUS_CODE (FK) ASSIGNED_DATE ASSIGNED_TIME FINISHED_DATE FINISHED_TIME LABOR_HOURS Page 12-7 Customer Service Logical Model While the E-R diagram (previous page) was very helpful, it lacked the detail necessary for broad user acceptance. How many columns are in the CALL table? What is the Primary Key (PK) of the CALL table? The logical model of the Customer Service database is depicted on the facing page. It shows many more table-level details than the E-R diagram does. You can see the individual column names for every table. In addition, there are codes to indicate PKs and Foreign Keys (FKs), as well as columns which are System Assigned (SA) or which allow No NULLS (NN) or No Duplicates (ND). Sample data values are also depicted. This is the type of model that comes about as a result of Relational Data Modeling. This example most closely represents a “Logical Data Model” or LDM. Page 12-8 Physical Database Design Overview Customer Service Logical Model (ERA Methodology Diagram) CALL TAKEN PLACED PLACED BY BY ORIG CALL CALL BY CALL# EMP# CUST# CONT# EMP# CALL# DATE TIME PK,SA FK,NN FK NN FK FK FK NN 1 4 1002 030215 1004 CALL STATUS CODE FK,NN 0905 CALL CALL CALLER TYPE PRIORITY AREA CALLER CALLER SYS# CODE CODE CODE PHONE EXT FK,NN FK,NN FK 1 CALL DETAIL CALL PRIORITY ENTERED BY ENTERED CALL# DATE USERID ENTERED LINE# TIME CALL PRIORITY CODE COMMENT LINE PK PK 030215 LJC 1 1625 TOP FK 1 1004 FK,NN,NC NN NN 1 891215 0905 NN 8010 CSB NN 408 7654321 DEPARTMENT BUDGET AMOUNT 403 932000 EDUC MGR EMP# FK,NN 1005 JOB 412101 NN,ND F.E. 
HOURLY COST RATE FK,NN 1 030212 LAST CALL DATE COMMENT NN 415 1234567 27 SUPV EMP# EMP# PK,SA FK 1001 415 CALL TYPE CUST PARENT SALES CUST# NAME CUST# EMP# 030321 LOC EMP AREA CODE PHONE DESCR LOC# NN PK PK NN,ND FK FK,NN 4 TDAT 3 1023 CALL TYPE CODE PK H EMP# PK 1234567 OFFICE FK FK 1 1001 PART CATEGORY PART CAT DRAW# DESCR PRICE DESCR AMT NN,ND PK NN,ND HDWR 1 A7Z348 1.27 NN CLIP SYSTEM LOG JOB DEPT# CODE FK FK 1003 401 412101 LAST NAME NN NOMAR FIRST HIRE NAME DATE JOE 890114 BIRTH DATE SALARY AMOUNT 450824 50000.00 ENTERED ENTERED ENTERED BY USERID SYS# DATE TIME LINE# COMMENT LINE PK FK LOCATION HOURLY JOB BILLING CODE DESCR RATE PK LOC# EMPLOYEE DEPT DEPT# NAME PK NN,ND INSTALL RECONFIG DATE DATE COMMENT FK 1001 8.5 CUSTOMER CONT AREA CONT# NAME CODE PHONE EXT LOC# LOCATION PHONE AREA EMP# CODE PHONE DESCR PK FK 1625 891215 CONTACT PK PK 547 EMPLOYEE PHONE CALL STATUS ASSIGNED ASSIGNED FINISHED FINISHED LABOR CALL# EMP# CODE TIME DATE TIME HOURS DATE FK SYS# When the CALL EMPLOYEE PK SYSTEM CALL STATUS CODE DESCR PK NN,ND 1 OPEN NN,ND FK 5 CALL STATUS DESCR 1 FK 1 1 4 H PART CAT LOC# PK,SA 1 547 LINE1 CUST# ADDR FK,NN NN 4 100 N. Physical Database Design Overview LINE2 ADDR LINE3 ADDR CITY NN ATLANTA ZIP STATE CODE NN GA 030212 1738 LJC 1 We added CNTRY NN 30096 USA Page 12-9 Relational Terms Review Relational theory uses the terms Relations, Tuples, and Attributes. Most people are more comfortable with the terms Tables, Rows, and Columns. Additional Relational terminology (such as Domains) will be discussed more completely on the following pages. Acronyms: PK – Primary Key FK – Foreign Key SA – System Assigned UA – User Assigned NN – No NULLS ND – No Duplicates NC – No Changes It is very important that you start with a well-documented relational model in Third Normal Form. This model is used as the basis for an ELDM. Knowledge of the hardware and software environment is crucial to doing physical database design for Teradata. The final PDM should be optimized for site-specific implementation. It is also crucial that you think in terms of the large volume of data that is usually stored in Teradata databases. When working with such large-scale databases, extraneous I/Os can have a great impact on performance. By understanding how the Teradata Database System works, you can make constructive physical design decisions that have a positive impact on your system’s performance. Page 12-10 Physical Database Design Overview Relational Terms Review Operational File Systems Relational Theory Logical Models & RDBMS systems File Record Relation Tuple Entity or Table Row Field Attribute Column Table A two-dimensional representation of data composed of rows and columns. Row One occurrence in a relational table – a record. Column The smallest category of data in the model – a field or attribute. Domain The definition of a pool of valid values from which column values are drawn. EMPLOYEE EMP# LAST NAME FIRST NAME PK, SA NN NN 01029821 Smith John Physical Database Design Overview MI NETWORK ID FK, ND, NN A JS129101 Page 12-11 Domains The following statements are true for domains and their administration in relational database management systems: A domain defines the SET of all possible valid values, which may appear in all columns based within that domain. A domain value is a fundamental non-decomposable unit of data. A domain must have a domain name and a domain data type. 
Valid domain data types are: INTEGER DECIMAL CHARACTER DATE TIME BIT STRING Any integer value Whole and fractional values Alpha-numeric values Valid Gregorian calendar dates 24 hour notation Digitized data (e.g. photos, x-rays) Domain Values A domain defines the conceptual SET, or range, of all valid values that may appear in any column based upon that domain. Sometimes domains are restricted to specific values. For example: Would you ever want negative employee numbers? Has there ever been, or will there ever be, an employee with the employee number of ZERO? Page 12-12 Physical Database Design Overview Domains Domain – the definition of a pool of valid values from which column values are drawn. Employee_Number, INTEGER > 0 53912 -123 Dept_Number, INTEGER > 1000 -12308 43156 123 123 3718 123456 3.14159 9127 4095 1023 3718 0 123456 3.14159 0 Question: Does an Employee_Number of 3718 and a Dept_Number of 3718 represent the same value? Physical Database Design Overview Page 12-13 Attributes Types of Attributes The types of attributes include: Primary Key (PK): Uniquely identifies each row in a table Foreign Key (FK): Identifies the relationship between tables Non-Key Attributes: All other attributes that are not part of any key. They are descriptive only, and do not define uniqueness (PK) or relationship (FK). Derived Attributes: An attribute whose value can be calculated or otherwise derived from other existing attributes. Example: NetPay is derived by calculating GrossPay - TaxAmt. Derived Attribute Issues The attributes from which derived attributes are calculated are in the design, so carrying the derived attribute in addition creates redundant data. Derived attributes may be identified and defined in order to validate that the model can in fact deduce them, but they are not shown in the ER Diagram, because carrying redundant data goes against relational design theory and principles. There are several good reasons to avoid carrying redundant data: The data must be maintained in two places, which involves extra work, time and expense. There is a risk (likelihood) of the copies getting out of sync with each other, causing data inconsistency. It takes more physical storage. Page 12-14 Physical Database Design Overview Attributes Types of Attributes • Primary Key (PK): Uniquely identifies each row in a table • Foreign Key (FK): Identifies the relationship between tables • Non-Key Attributes: All other attributes that are not part of any key. They are descriptive only, and do not define uniqueness (PK) or relationship (FK). • Derived Attributes: An attribute whose value can be calculated or otherwise derived from other existing attributes. Example: Count of current employees in a department. A SUM of Employee table meets this requirement. Derived Attribute Issues • Carrying a derived attribute creates redundant data. • Reasons to avoid carrying redundant data: – The data must be maintained in two places which possibly causes data inconsistency. – It takes more physical storage Physical Database Design Overview Page 12-15 Entities and Relationships The entities and their relationships are shown in table form on the facing page. The naming convention used for the tables and columns makes it easy to find the PK of any FK. Acronyms: PK – Primary Key FK – Foreign Key SA – System Assigned UA – User Assigned NN – No NULLS ND – No Duplicates NC – No Changes Relationship Descriptions Many-to-many relationships are usually implemented by an associative table (e.g., Order_Part table). 
Examples of relationships are shown below. 1:1 and 1:M Relationships (PK) Country Has Customer Has Employee Generates Generates Receives Location Generates Generates Receives Has Individual Order Requisitions individual (FK) LOCATIONs LOCATIONs ORDERs SHIPMENTs SHIPMENTs ORDERs SHIPMENTs SHIPMENTs PARTs PARTs M : M Relationships Order/Part Category Show kinds of PARTs on an ORDER before it is filled (Direct.) Order/Shipment Page 12-16 Shows which PARTs belong to which ORDERs and SHIPMENTs after the ORDER is filled (INDIRECT). Physical Database Design Overview Entities and Relationships There are three types of relationships: 1:1 Relationships are rare. Ex. One employee has only one Network ID and a Network ID is only assigned to one Employee. EMPLOYEE EMPLOYEE NUMBER PK, SA 30547 21289 1:1 1:M EMPLOYEE NETWORK L_NAME ID FK, ND, NN SMITH BS100421 NOMAR JN450824 M:M NETWORK_USERS NETWORK VIRTUAL ID FLAG PK, UA BS100421 JN450824 Y SecurID ND 231885 348145 1:M and M:M Relationships are common. CUSTOMER CUST CUST ID NAME PK, SA 1001 MALONEY 1002 JONES CUST ADDRESS Examples: 1:M – A Customer can place many orders. M:M – An Order can have many parts on it. The same part can be on many Orders. An “associative” table is used to resolve M:M relationships. 100 Brown St. 12 Main St. ORDER ORDER # PK, SA ORDER DATE CUST ID 1 2 3 2005-12-24 2006-01-23 2006-02-07 FK, NN 1001 1002 1001 Physical Database Design Overview ORDER_ITEM ORDER ITEM # ID PK FK FK 1 1 2 6001 6200 6001 ITEM QTY 3 1 5 ITEM ITEM ID PK ITEM DESC RETAIL PRICE 6001 6200 Paper Printer 15.00 300.00 Page 12-17 Decomposable Data Data may be either decomposable or atomic. Decomposable data can be broken down into finer, smaller units while atomic data is already at its finest level. There is a Relational Rule that “Domains must not be decomposable.” If you normalize your relational design and create your tables based on domains, you will have columns that do not contain decomposable data. In practice, you may have columns that contain decomposable data. This will not cause excessive problems if those columns are not used for access. You should create a column for any individual character or number that is used for access. A good example of decomposable data is a person’s name: Name can be broken down into last name and first name. Last name and first name are good examples of atomic data since they really can’t be broken down into meaningful finer units. There are several benefits to designing your system in this manner. You should get increased performance because there will be fewer Full Table Scans due to partial value index searches. Also, if the columns are NUSI’s, you will increase the chance of using NUSI Bit Mapping. Finally, you will simplify the coding of your SQL queries. Remember that storage and display are separate issues. Page 12-18 Physical Database Design Overview Decomposable Data RELATIONAL RULE: Domains must not be decomposable. • Atomic level data should be defined. • Continue to normalize through the lifetime of the system. • Columns with multiple domains should be decomposed to the finest level of ANY access. • Create a column for an individual character or number if it is used for access. • Storage and display are separate issues. The GOAL: • Eliminate FTS (Full Table Scans) on partial value index searches. • Simplify SQL coding. 
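As a minimal sketch of the decomposition guideline above (the table and column names here are illustrative only and are not part of the course database): a value buried inside a composite code forces a partial-value search, while an atomic column supports direct, indexable access.

  -- Composite column: the region is the first character of acct_code, so access
  -- by region requires a partial-value search (typically a full table scan).
  SELECT *
  FROM   Account_Staging
  WHERE  SUBSTRING(acct_code FROM 1 FOR 1) = 'W';

  -- Atomic column: region_code is carried separately, so the same access is a
  -- simple equality predicate and can use an index (e.g., a NUSI).
  SELECT *
  FROM   Account
  WHERE  region_code = 'W';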
Physical Database Design Overview Page 12-19 Normal Forms Normalization is a set of rules and a methodology for making sure that the attributes in a design are carried in the correct entity to map accurately to reality, eliminate data redundancy and minimize update anomalies. Stated simply: One Fact, One Place! 1NF, 2NF and 3NF are progressively more refined and apply to non-key attributes regarding their dependency on PK attributes. 4NF and 5NF apply to dependencies between or among PK attributes. For most models, normalizing to 3NF meets the business requirements. Normalization provides a rigorous, relational theory based way to identify and eliminate most data problems: Provides precise identification of unique data values Creates data structures which have no anomalies for access and maintenance functions Later in the module, we will discuss the impact of denormalizing a model and the effect it may have (good or bad) on performance. By implementing a model that is in Third Normal Form (3NF), you might gain the following Teradata advantages. Usually more tables – therefore, more primary index choices – – Possibly fewer full table scans More Data control Fewer Columns per Row – usually smaller rows – – – – Better user isolation from the data Better application separation from the data Better blocking Less transient and permanent journaling space These advantages will be discussed in Physical Design and Implementation portion of this course. Page 12-20 Physical Database Design Overview Normal Forms Once you’ve identified the attributes, the question is which ones belong in which entities? • A non-key attribute should be placed in only one entity. • This process of placing attributes in the correct entities is called normalization. First Normal Form (1NF) • Attributes must not repeat within a table. No repeating groups. Second Normal Form (2NF) • An attribute must relate to the entire Primary Key, not just a portion. • Tables with a single column Primary Key (entities) are always in Second Normal form. Third Normal Form (3NF) • Attributes must relate to the Primary Key and not to each other. • Cover up the PK and the remaining attributes must not describe each other. Physical Database Design Overview Page 12-21 Normalization The facing page illustrates violations of First, Second and Third Normal Form. First Normal Form (1NF) The rule for 1NF is that attributes must not repeat within a table. 1NF also requires that each row has a unique identifier (PK). In the violation example, there are six columns representing sales amount. Second Normal Form (2NF) The rule for 2NF is that attributes must describe the entire Primary Key, not just a portion. In the violation example, the ORDER DATE column describes only the ORDER portion of the Primary Key. Third Normal Form (3NF) The rule for 3NF is that attributes must describe only the Primary Key and not each other. In the violation example, the JOB DESCRIPTION column describes only the JOB CODE column and not the EMPLOYEE NUMBER (Primary Key) column. Fourth (4NF) and Fifth (5NF) Normal Forms 4NF and 5NF are covered here only for your information. The vast majority of models never apply these levels. Essentially these Normal Forms are designed to impose the same level of consistency within a PK composed of more than two columns as the first 3NFs impose on attributes outside the PK. Entities with more than two columns in the PK often contain no non-key attributes. 
If nonkey attributes do exist, 4NF and 5NF violations are unlikely because bringing the model into 3NF compliance precludes them. Usually 4NF and 5NF violations occur when the definition of the information to be represented is ambiguous (e.g. the user has either not really understood what they are asking for, or they have failed to state it clearly enough for the designer to understand it). 4NF and 5NF really represent two flip sides of the same issue: The PK must contain the minimum number of attributes that accurately describe all of the business rules. Formal Definitions: 4NF: The entity’s PK represents a single multi-valued fact that requires all PK attributes be present for proper representation. Attributes of a multi-valued dependency are functionally dependent on each other. 5NF: The entity represents, in its key, a single multi-valued fact and has no unresolved symmetric constraints. A 4NF entity is also in 5NF if no symmetric constraints exist. Page 12-22 Physical Database Design Overview Normalization Normalization is a technique for placing non-key attributes in tables in order to: – Minimize redundancy – Provide optimum flexibility – Eliminate update anomalies SALES HISTORY First Normal Form (1NF) attributes must not repeat within a table. Second Normal Form (2NF) attributes must describe the entire Primary Key, not just a portion. Third Normal Form (3NF) attributes must describe only the Primary Key and not each other. Physical Database Design Overview FIGURES FOR LAST SIX MONTHS EMP NUMBER PK, SA 2518 SALES SALES SALES SALES SALES SALES 32389 21405 18200 27590 29785 ORDER PART ORDER PART NUMBER NUMBER PK FK FK 100 1234 100 2537 EMPLOYEE EMPLOYEE EMPLOYEE NUMBER NAME PK, SA 30547 SMITH 21289 NOMAR ORDER DATE 2005-02-15 2005-02-15 JOB CODE FK 9038 9038 35710 QUANTITY 200 100 JOB DESCRIPTION INSTRUCTOR INSTRUCTOR Page 12-23 Normalization Example The facing page contains an illustration of a simple order form that a customer may use. It is possible to simply convert this data file into a relational table, but it would not be in Third Normal Form. Dr. Codd Mnemonic Every non-key attribute in an entity must depend on: The KEY The WHOLE key And NOTHING BUT the Key -- E.F. Codd Page 12-24 - 1st Normal Form (1NF) - 2nd Normal Form (2NF) - 3rd Normal Form (3NF) Physical Database Design Overview Normalization Example One of the order forms a customer uses is shown below. 
Order # _______ Order # _______ Customer ID Customer ID Customer Name Customer Name Customer Address Customer Address Customer City Customer City Item Item ID ID ______ ______ ______ ______ ______ ______ ______ ______ Order Date ______ Order Date ______ __________ __________ __________________________ __________________________ ____________________________________ ____________________________________ ____________ State _______ Zip _______ ____________ State _______ Zip _______ Item Item Description Description _____________________ _____________________ _____________________ _____________________ _____________________ _____________________ _____________________ _____________________ Item Item Item(s) Item Item Item(s) Price Quantity Total Price Price Quantity Total Price _______ ______ ________ _______ ______ ________ _______ ______ ________ _______ ______ ________ _______ ______ ________ _______ ______ ________ _______ ______ ________ _______ ______ ________ Order Total ________ Order Total ________ Repeats Physical Database Design Overview A listing of the fields is: Order # Order Date Customer ID Customer Name Customer Address Customer City State Zip Item ID Item Description Item Price Item Quantity Item(s) Total Price Order Total Page 12-25 Normalization Example (cont.) The tables on the facing page represent the normalization to 1NF for the previous order form example. Recall that the rule for 1NF is that attributes must not repeat within a table. Negative effects of violating 1NF include: Page 12-26 Places artificial limits on the number of repeating items (attributes) Sorting on the attribute becomes very difficult Searching for a particular value of the attribute is more complex Physical Database Design Overview Normalization Example (cont.) A modeler chooses to remove the repeating groups and creates two tables as shown below. Order Table Order-Item Table Order # Order Date Customer ID Customer Name Customer Address Customer City State Zip Order Total Order # Item ID Item Description Item Price Item Quantity Item(s) Total Price This places the data in first normal form. Physical Database Design Overview Page 12-27 Normalization Example (cont.) The tables on the facing page represent the normalization to 2NF for the previous order form example. Recall that the rule for 2NF is that attributes must describe the entire Primary Key, not just a portion. Negative effects of violating 2NF include: Page 12-28 More disk space may be used Redundancy is introduced Updating is more difficult Can also comprise the integrity of the data model Physical Database Design Overview Normalization Example (cont.) A modeler checks that attributes describe the entire Primary Key. Order Table Order-Item Table Item Table Order # Order Date Customer ID Customer Name Customer Address Customer City State Zip Order Total Order # Item ID Item Price (sale) Item Quantity Item(s) Total Price Item ID Item Description Item Price (retail) This places the data in second normal form. As an option, the item price may be kept at the Order-Item level in the event a discount or different price is given for the order. The Item table may identify the retail price. The Order Total and Item(s) Total Price are derived data and may or may not be included. Physical Database Design Overview Page 12-29 Normalization Example (cont.) The tables on the facing page represent the normalization to 3NF for the previous order form example. Recall that the rule for 3NF is that attributes must describe only the Primary Key and not each other. 
Negative effects of violating 3NF include: Page 12-30 More disk space may be used Redundancy is introduced Updating is more costly Physical Database Design Overview Normalization Example (cont.) A modeler checks that attributes only describe the Primary Key. Order Table Order-Item Table Item Table Order # Order Date Customer ID Order Total Order # Item ID Item Price (sale) Item Quantity Item(s) Total Price Item ID Item Description Item Price (retail) Customer Table Customer ID Customer Name Customer Address Customer City State Zip These tables are now in third normal form. If the item sale price is always the same as the retail price, then the item price only needs to be kept in the item table. The Order Total and Item(s) Total Price are derived data and may or may not be included. Physical Database Design Overview Page 12-31 Normalization Example (cont.) The facing page completes this example and illustrates the tables in a logical format showing PK-FK relationships. Page 12-32 Physical Database Design Overview Normalization Example (cont.) The tables are shown below in 3NF with PK-FK relationships. ORDER ORDER # PK, SA 1 2 ORDER DATE CUSTOMER ID 2005-02-27 2005-04-24 FK 1001 1002 ORDER_ITEM ORDER ITEM # ID PK FK FK 1 1 2 5001 5002 5001 CUSTOMER CUST CUST ID NAME PK, SA 1001 MALONEY 1002 JONES SALE PRICE ITEM QUANTITY 15.00 300.00 15.00 2 1 1 ITEM ITEM ID PK 5001 5002 CUST ADDRESS CUST CITY CUST CUST STATE ZIP 100 Brown St. Dayton 12 Main St. San Diego OH CA ITEM DESCRIPTION RETAIL PRICE PS20 Electric Pencil Sharpener MFC140 Multi-Function Printer 15.00 300.00 45479 92127 Note that Items Total Price & Order_Total are not shown in this model. How are Items Total Price & Order_Total handled? Physical Database Design Overview Page 12-33 Denormalizations This course recommends that the corporate database tables that represent the company's business be maintained in Third Normal Form (3NF). Due to the large volume of data normally stored in a Teradata system, denormalization may be necessary to improve performance. If you do denormalize, make sure that you are aware of all the trade-offs involved. It is also recommended that, whenever possible, you keep the normalized tables from the Logical Model as an authoritative source and add additional denormalized tables to the database. This module will cover the various types of denormalizations that you may choose to use. They are: Derived Data Repeating Groups Pre-Joins Summary and/or Temporary Tables Partitioning (Horizontal or Vertical) Complete the Logical Model before choosing to use these denormalizations. There are a few costs in normalizing your data. Typically, the advantages of having a data model in 3NF outweigh the costs of normalizing your data. Costs of normalizing to 1NF include: you use more disk space you have to do more joins Costs of normalizing to 2NF when already in 1NF include: you have to do more joins Costs of normalizing to 3NF when already in 2NF include: you have to do more joins A customer may choose to implement a semantic layer between the data tables and the end users. The simplest definition of a semantic layer is as the view layer that uses business terminology and does presentation. The semantic layer can also be viewed as a logical construct to support a presentation layer which may interface directly with some end-user access methodology. The "semantic layer" may change column names, derive new column values, perform aggregation, or whatever else the presentation layer needed to support the users. 
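As a simple sketch of such a semantic layer (the view name and column names below are assumed for illustration, based on the normalized Order-Item and Item tables from the preceding example): derived values such as the items total price can be calculated in the view rather than stored in the base tables.

  CREATE VIEW Order_Line_V AS
  SELECT  oi.Order_Num                   AS Order_Number,
          oi.Item_ID                     AS Item_Number,
          i.Item_Description,
          oi.Sale_Price * oi.Item_Qty    AS Items_Total_Price
  FROM    Order_Item oi
  INNER JOIN Item i
          ON i.Item_ID = oi.Item_ID;

End users query the view using business-oriented names while the underlying tables remain in Third Normal Form.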
Page 12-34 Physical Database Design Overview Denormalizations Denormalize only when all of the trade-offs of the following are known. Examples of denormalizations are: • Derived data • Pre-Joins • Repeating groups • Partitioning (Horizontal or Vertical) • Summary and/or Temporary tables Make these choices AFTER completing the Logical Model. • Keep the Logical Model pure. • Keep the documentation of the physical model up-to-date. Denormalization may increase or decrease system costs. • • • • • It may be positive for some applications and negative for others. It generally makes new applications harder to implement. Any denormalization automatically reduces data flexibility. It introduces the potential for data problems (anomalies). It usually increases programming cost and complexity. Note: Only a few denormalization examples are included in this module. Other techniques will be discussed throughout the course. Physical Database Design Overview Page 12-35 Derived Data Attributes whose values can be determined or calculated from other data are known as Derived Data. Derived Data can be either integral or stand-alone, examples of which are shown on the facing page. You should notice that integral Derived Data requires no additional I/O and no denormalization. Stand-alone Derived Data, on the other hand, requires additional I/O and may require denormalization. Creating temporary tables to hold Derived Data is a good strategy when the Derived Data will be used frequently and is stable. Handling Derived Data Storing Derived Data is a normalization violation that breaks the rule against redundant data. Whenever you have stand-alone Derived Data, you must decide whether to calculate it or store it. This decision should be based on the following demographics: number of tables and rows involved access frequency column data value volatility and column data value change schedule All above demographics are determined through Activity Modeling – also referred to as Application and Transaction Modeling. The following table gives you guidelines on what approach to take depending on the value of the demographics. Guidelines apply when you have a large number of tables and rows. In cases where you have a small number of tables and rows, calculate the Derived Data on demand. Access Frequency High Change Rating High Update Frequency Dynamic High High Scheduled High High Low Low Dynamic Scheduled Low ? ? Recommended Approach Denormalize the model or use Temporary Table Use Temporary Table or produce batch report Use Temporary Table Use Temporary Table or produce batch report Calculate on demand Note that, in general, using summary/temporary tables is preferable to denormalization. The example on the facing page shows an example of using a derived data column (Employee Count) to identify the number of employees in a department. This count can be determined by doing a count of employees from the Employee table. Page 12-36 Physical Database Design Overview Derived Data Derived data is an attribute whose value can be determined or calculated from other data. Storing a derived item is a denormalization (redundant data). Normalized DEPARTMENT DEPT DEPT NUM NAME PK, SA NN, ND UPI 1001 ENGINEERING 1002 EDUCATION EMPLOYEE EMPLOYEE NUMBER PK, SA UPI 22416 30547 82455 17435 23451 EMPLOYEE NAME NN DEPT NUM FK JONES SMITH NOMAR NECHES MILLER 1002 1001 1002 1001 1002 Carrying the count of the number of employees in a department is a normal forms violation. The number of employees can be determined from the Employee table. 
Denormalized DEPARTMENT DEPT DEPT EMPLOYEE NUM NAME COUNT PK, SA NN, ND Derived Data UPI 1001 ENGINEERING 2 1002 EDUCATION 3 Physical Database Design Overview EMPLOYEE EMPLOYEE NUMBER PK, SA UPI 22416 30547 82455 17435 23451 EMPLOYEE NAME NN DEPT NUM FK JONES SMITH NOMAR NECHES MILLER 1002 1001 1002 1001 1002 Page 12-37 Pre-Joins Pre-Joins can be created in order to eliminate Joins to small, static tables (Minor Entities). The example on the facing page shows a Pre-Join table that contains columns from both the JOB and EMPLOYEE tables above it. Although this is a violation of Third Normal Form, there are several reasons that you may want to use it: It is a good performance technique for the Teradata DBS especially when there are known queries. It is a good way to handle situations where you have tables with fewer rows than AMPs. You still have your original Minor Entity to maintain data consistency and avoid anomalies. Costs of pre-joins include: Page 12-38 Additional space is required More maintenance and I/Os are required. Physical Database Design Overview Pre-Joins To eliminate joins to a small table (possibly static), consider including their attribute(s) in the parent table. NORMALIZED DENORMALIZED JOB JOB CODE PK, SA UPI 1015 1023 JOB DESCRIPTION NN, ND PROGRAMMER ANALYST EMPLOYEE EMPLOYEE NUMBER PK, SA UPI 22416 30547 EMPLOYEE EMPLOYEE NUMBER PK, SA UPI 22416 30547 EMPLOYEE NAME JOB CODE FK JONES SMITH 1023 1015 EMPLOYEE NAME JOB CODE JOB DESCRIPTION JONES SMITH 1023 1015 ANALYST PROGRAMMER Reasons you may want Pre-Joins: • Performance technique when there are known queries. • Option to handle situations where you have tables with fewer rows than AMPs. A Join Index (Teradata feature covered later) provides a way of creating a “pre-join table”. As the base tables are updated, the Join Index is updated automatically. Physical Database Design Overview Page 12-39 Exercise 1: Choose Indexes At right is the EMPLOYEE table from the CUSTOMER_SERVICE database. The legend below explains the abbreviations you see below the column names. The following pages contain fifteen more PTS tables. Choose the best indexes for these tables. Remember, you must choose exactly one Primary Index per table, but you may choose up to 32 Secondary Indexes. Primary Keys do not have to be declared. Any Primary Key which is declared must have all columns of the PK defined as NOT NULL, and will be implemented by Teradata as a Unique index (UPI or USI). REMEMBER The Primary Key is the logical reference for the Logical Data Model. The Primary Index is the physical access mechanism for the Physical Data Model. They may be but will not always be the same. Page 12-40 Physical Database Design Overview Exercise 1: Choose Indexes The next page contains a portion of the logical model of the PTS database. Indicate the candidate index choices for all of the tables. An example is shown below. The Teradata database supports four index types: UPI (Unique Primary Index) USI (Unique Secondary Index) NUPI (Non-Unique Primary Index) NUSI (Non-Unique Secondary Index) EMPLOYEE 50,000 Rows PK/FK PI/SI SUPV JOB LAST FIRST HIRE BIRTH EMP# EMP# DEPT# CODE NAME NAME DATE DATE PK,SA UPI FK FK FK NUSI NUSI NUSI NN NN NN NN SAL AMT NN NUSI LEGEND PK = Primary Key (implies NC, ND, NN) NC = No Change ND = No Duplicates NN = No Nulls Physical Database Design Overview FK = Foreign Key SA = System Assigned Value UA = User Assigned Value Page 12-41 Tables Index Selection On the facing page, you will find some of the tables in the PTS database. 
Choose the best indexes for these tables. Remember that you must choose exactly one Primary Index per table, but you may choose up to 32 Secondary Indexes. Page 12-42 Physical Database Design Overview Tables Index Selection LOCATION PK/FK LOC# CUST# LINE1 ADDR LINE2 ADDR PK,SA FK,NN NN ORD# CUST# LOC# ORD DATE PK,SA FK,NN FK,NN NN LINE3 ADDR CITY STATE ZIP CNTRY NN PI/SI ORDER PK/FK CLOSE DATE UPD DATE UPD TIME UPD USER SA,NN SA,NN FK,NN SHIP# ORD# STAT FK,NN FK,NN SA,NN PI/SI PART PK/FK PART# PART CAT SER# LOC# PK,SA FK,NN FK,NN FK,NN SYS# UPD DATE PI/SI Physical Database Design Overview Page 12-43 Database Design Components Each System Development Phase adds to the design. As we mentioned earlier, they are: Logical Data Modeling Extended Data Modeling (also known as Application and Transaction Modeling; we will call it Activity Modeling). Physical Data Modeling First and foremost, make sure the system is designed as a function of business usage and not the reverse. Let usage drive design. Page 12-44 Physical Database Design Overview Database Design Components Data Demographics Logical Data Model Application Knowledge (CURRENT) (FUTURE) • A good logical model reduces application workload. • Thorough application knowledge produces dependable demographics. • Proper demographics are needed to make sound index choices. • Though you don’t know users’ access patterns, you will need that information in the future. For example, management may want to know why there are two copies of data. • For DSS, OLAP, and Data Warehouse systems, aim for even distribution and let Teradata parallel architecture handle the changing access needs of the users. Physical Database Design Overview Page 12-45 Extended Logical Data Model At right is the Extended Logical Data Model (ELDM), which includes data demographic information pertaining to data distribution, sizing and access. Information provided by the ELDM results from user input about transactions and transaction rates. The Delete Rules and Constraint Numbers (from a user-generated list) are provided as an aid to application programmers, but have no effect on physical modeling. The meaning and importance of the other ELDM data to physical database design will be covered in coming modules of this course. Page 12-46 Physical Database Design Overview Extended Logical Data Model TABLE NAME: Employee EXTENDED LOGICAL DATA MODEL • It provides demographics DESCRIPTION: Someone who works for our company and on payroll. ROW COUNT: of data distribution, sizing and access. • It is the main information source for creating the physical data model PK/FK SUPERVISOR EMPLOYEE EMPLOYEE DEPARTMENT JOB LAST FIRST HIRE BIRTH SALARY NUMBER NUMBER NUMBER CODE NAME NAME DATE DATE AMOUNT PK, SA FK FK FK NN N DEL RULES CONSTR# 101 N P 101 VALUE ACC FREQ 10K 0 8K 1K 200 0 0 0 0 JOIN ACC FREQ 17K 50 12K 6K 0 0 0 0 0 JOIN ACC ROWS 136K 10K 96K 50K 0 0 0 0 0 DISTINCT VALUES 50K 7K 2K 3K 40K NA NA NA NA MAXIMUM ROWS/VAL 1 30 40 4K 2K NA NA NA NA MAX ROWS NULL 0 1 18 40 0 NA NA NA NA TYPICAL ROWS/VAL 1 7 23 15 1 NA NA NA NA CHANGE RATING 0 3 2 4 1 NA NA NA NA 2431 18 OZ SAMPLE DATA Physical Database Design Overview TABLE TYPE: Entity EMPLOYEE • It maps applications and transactions to the related tables, columns and row sets. 
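A minimal sketch of how some of these demographics might be gathered for a candidate column (the query below uses the Employee/Dept_Number names from the example; it is illustrative only and is not part of the course materials):

  -- Distinct values, maximum rows per value, and typical (average) rows per
  -- value for a candidate index column.
  SELECT  COUNT(*)                AS Distinct_Values,
          MAX(rows_per_value)     AS Maximum_Rows_Per_Value,
          AVG(rows_per_value)     AS Typical_Rows_Per_Value
  FROM   (SELECT  Dept_Number,
                  COUNT(*) AS rows_per_value
          FROM    Employee
          GROUP BY Dept_Number) AS dt;

Access and join frequencies, by contrast, come from activity modeling (user input about transactions and transaction rates) rather than from the data itself.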
50,000 8326 647 WIZ Page 12-47 Physical Data Model The model at right is the Physical Data Model (PDM), which contains the same information as the ELDM except that index selections and other physical design choices such as data protection mechanisms (e.g., Fallback) have been added. A complete PDM will define all tables, indexes and views to be implemented. Due to physical design considerations, the PDM may differ from the logical model. In general, the more the PDM differs from the logical model, the less flexible it is and the more programming it requires. Page 12-48 Physical Database Design Overview Physical Data Model PHYSICAL DATA MODEL • A collection of DBMS constructs that define the tables, indexes and views to be implemented. TABLE NAME: Employee DESCRIPTION: Someone who works for our company and on payroll. FALLBACK: YES • The more it differs, the less flexible it is and the more programming it requires. PK/FK SUPERVISOR EMPLOYEE EMPLOYEE DEPARTMENT JOB LAST FIRST HIRE BIRTH SALARY NUMBER NUMBER NUMBER CODE NAME NAME DATE DATE AMOUNT PK, SA FK FK FK NN N DEL RULES CONSTR# 101 N P 101 VALUE ACC FREQ 10K 0 8K 1K 200 0 0 0 0 JOIN ACC FREQ 17K 50 12K 6K 0 0 0 0 0 JOIN ACC ROWS 136K 10K 96K 50K 0 0 0 0 0 DISTINCT VALUES 50K 7K 2K 3K 40K NA NA NA NA NA MAXIMUM ROWS/VAL 1 30 40 4K 2K NA NA NA MAX ROWS NULL 0 1 18 40 0 NA NA NA NA TYPICAL ROWS/VAL 1 7 23 15 1 NA NA NA NA 0 3 4 1 NA NA NA NA OZ WIZ CHANGE RATING Physical Database Design Overview TABLE TYPE: Entity EMPLOYEE the entities of the business function. logical model due to implementation issues. IMPLEMENTATION: 3NF ROW COUNT: 50,000 • The main tables represent • It may differ from the PERM JRNL: NO PI/SI UPI SAMPLE DATA 8326 647 2 NUSI NUSI 2431 18 Page 12-49 The Principles of Index Selection The right-hand page illustrates the many factors that impact Index selection. As you can see, they represent all three of the Database Design Components (Logical Data Model, Data Demographics and Application Knowledge). Index selection can be summarized as follows: Page 12-50 Start with a well-documented 3NF logical model. Develop demographics to create the ELDM. Make index selections based upon these demographics. Physical Database Design Overview The Principles of Index Selection There are many factors which guide the designer in choosing indexes: – – – – – – – – – – – – – – – – – – The way the system uses the index. The space the index requires. The table type. The number of rows in the table. The type of data protection. The column(s) most frequently used to access rows in the table. The number of distinct column values. The maximum rows per value. Whether the rows are accessed by values or through a Join. The primary use of the table data (Decision support, Ad Hoc, Batch Reporting, Batch Maintenance, OLTP). The number of INSERTS and when they occur. Through Throughlecture lectureand andexercises, exercises, The number of DELETEs and when they occur. this course points this course pointsout outthe the The number of UPDATEs and when they occur. importance and use of all importance and use of allthese these The way transactions are written. factors. factors. The way the transactions are parceled. The level and type of locking a transaction requires. How long a transaction hold locks. How normalized the data model is. Physical Database Design Overview Page 12-51 Transactions and Parallel Processing One additional goal of this course is to point out what causes all-AMP operations. 
In some cases, they are accidental and can be changed into one-or two-AMP operations. To have the maximum number of transactions that need only one-or two-AMPs, you require a good logical model (Third Normal Form), a good physical model (what you will learn about in this course), and good SQL coding (we will provide some examples). Page 12-52 Physical Database Design Overview Transactions and Parallel Processing Teradata does all-AMP processing very efficiently. However, one-AMP and two-AMP processing is even more efficient. It allows the existing configuration to support a greater workload. Ideal for Decision Support (DSS), Ad Hoc, Batch Processing, and some Batch Maintenance operations. TXN1 TXN2 TXN3 TXN4 AMP1 Best for OLTP, tactical transactions, and preferred for many Batch Maintenance operations. Created by a good Logical Model AND a good Physical Model AND good SQL coding. AMP2 TXN1 TXN13 TXN18 AMP1 AMP4 TXN2 TXN7 TXN12 AMP3 AMP2 AMP5 TXN3 TXN8 TXN9 AMP6 AMP7 AMP8 TXN4 TXN5 TXN6 TXN10 TXN11 TXN14 TXN15 TXN19 TXN20 TXN21 AMP5 AMP6 AMP3 AMP4 TXN16 TXN17 TXN22 AMP7 AMP8 This This course coursepoints pointsout outthe themethods methodsof ofmaximizing maximizingthe theuse use of ofone-AMP one-AMP and and two-AMP two-AMP transactions transactionsand andwhen whenall-AMP all-AMPoperations operationsare areneeded. needed. Physical Database Design Overview Page 12-53 Module 12: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 12-54 Physical Database Design Overview Module 12: Review Questions 1. Which three are benefits to creating a data model in 3NF? ____ ____ ____ a. b. c. d. e. Minimize redundancy To reduce update anomalies To improve distribution of data To improve flexibility of access To reduce number of I/Os to access data 2. Which data model would include the definition of a partitioned primary index? ____ a. b. c. d. 3. Which two factors should be considered when deciding to denormalize a table? ____ ____ a. b. c. d. 4. Logical data model Physical data model Business information model Extended logical data model Volatility Performance Distribution of data Connectivity of users Which is a benefit of implementing data types at the domain level? ____ a. b. c. d. Reduce storage space Avoid data conversion Provides consistent display of data Reduce need for secondary indexes Physical Database Design Overview Page 12-55 Notes Page 12-56 Physical Database Design Overview Module 13 Data Distribution and Hashing After completing this module, you will be able to: Describe the data distribution form and method. Describe Hashing. Describe Primary Index hash mapping. Describe the reconfiguration process. Describe a Block Layout. Describe File System Read Access. Teradata Proprietary and Confidential Data Distribution and Hashing Page 13-1 Notes Page 13-2 Data Distribution and Hashing Table of Contents Data Distribution ........................................................................................................................ 13-4 Hashing ...................................................................................................................................... 13-6 Enhanced Hashing Algorithm Starting with Teradata 13.10 ................................................. 13-6 Hash Related Expressions .......................................................................................................... 
13-8 Hashing – Numeric Data Types ............................................................................................... 13-10 Multi-Column Hashing ............................................................................................................ 13-12 Multi-Column Hashing (cont.) ............................................................................................. 13-14 Additional Hash Examples....................................................................................................... 13-16 Using Hash Functions to View Distribution ............................................................................ 13-18 Identifying the Hash Buckets ............................................................................................... 13-18 Identifying the Primary AMPs ............................................................................................. 13-18 Primary Index Hash Mapping .................................................................................................. 13-20 Hash Maps................................................................................................................................ 13-22 Primary Hash Map ................................................................................................................... 13-24 Hash Maps for Different Systems ............................................................................................ 13-26 Fallback Hash Map .................................................................................................................. 13-28 Reconfiguration ........................................................................................................................ 13-30 Row Retrieval via PI Value – Overview .................................................................................. 13-32 Names and Object IDs ............................................................................................................. 13-34 Table ID ................................................................................................................................... 13-36 Spool File Table IDs ........................................................................................................ 13-36 Row ID ..................................................................................................................................... 13-38 AMP File System – Locating a Row via PI ............................................................................. 13-40 Teradata File System Overview ............................................................................................... 13-42 Master Index Format ................................................................................................................ 13-44 Cylinder Index Format ............................................................................................................. 13-46 Data Block Layout ................................................................................................................... 13-48 Example of Locating a Row – Master Index ........................................................................... 13-50 Example of Locating a Row – Cylinder Index......................................................................... 13-52 Example of Locating a Row – Data Block............................................................................... 
13-54 Accessing the Row within the Data Block............................................................................... 13-56 AMP Read I/O Summary ......................................................................................................... 13-58 Module 13: Review Questions ................................................................................................. 13-60 Data Distribution and Hashing Page 13-3 Data Distribution Parsing Engines (PE) are assigned either to channel connections (e.g., IBM Mainframe) or to LAN connections. Data is always stored by the AMPs in 8-bit ASCII. If the input is in EBCDIC, the PE converts it to ASCII before any hashing and distribution takes place. A USER may have a COLLATION = EBCDIC, ASCII, MULTINATIONAL, or HOST. If the HOST is an EBCDIC host or COLLATION = EBCDIC, then the AMPs convert from ASCII to EBCDIC before doing any comparisons or sorts. MULTINATIONAL allows sites to create their own collation file. Otherwise, all comparisons and sorts use the ASCII collating sequence. Teradata has no concept of pre-allocated table space. The rows of all hashed tables are distributed randomly across all AMPs and then randomly within the space available on the selected AMP. Page 13-4 Data Distribution and Hashing Data Distribution Records From Client (in random sequence) 2 32 67 12 90 6 54 75 18 25 80 41 From Host Teradata ASCII EBCDIC Data distribution is dependent on the hash value of the primary index. Parsing Engine(s) Converted and Hashed Parsing Engine(s) ASCII Distributed Message Passing Layer AMP 0 2 AMP 1 AMP 2 12 80 54 18 90 41 25 67 75 32 Data Distribution and Hashing AMP 3 Formatted Stored 6 Page 13-5 Hashing Hashing is the mechanism by which Teradata utilizes the Primary Index to distribute rows of data. The Hashing Algorithm acts like a mathematical “blender”. It takes up to 64 columns of mixed data as input and generates a single 32-bit binary value called a Row Hash. The Row Hash is the logical storage locator of the row. A part of this value is used in determining the AMP to which the row is distributed. Teradata uses the Row Hash value for distribution, placement and retrieval of rows. The Hashing Algorithm is random but consistent. Although consecutive PI values do not normally produce consecutive hash values, identical Primary Index (PI) values always generate the same Row Hash (assuming that the data types hash identically). Rows with the same Row Hash are always distributed to the same AMP. Different PI values rarely produce the same Row Hash. When this does occur, they are known as Hash Synonyms or Hash Collisions. Note: Upper and lower case values hash to the same hash value. For example, ‘Jones’ and ‘JONES’ generate the same hash value. Enhanced Hashing Algorithm Starting with Teradata 13.10 This enhancement is targeted to reduce the number of hash collisions for character data stored as either Latin or Unicode, notably strings that contain primarily numeric data. Reduction in hash collisions reduces access time per AMP and produces a more balanced row distribution which in-turn improves parallelism. Reduced access time and increased parallelism translate directly to better performance. This capability is only available starting in TD 13.10. This feature is available to new systems and requires a System Initialization (sysinit) for existing systems. It is anticipated that typically this activity would be performed during technology refresh opportunities. 
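The consistency and case-insensitivity described above are easy to confirm directly with the HASHROW function (covered on the following pages). The statements below are only a sketch; the literal values are arbitrary, and Teradata SQL allows a SELECT of expressions without a FROM clause.

SELECT HASHROW('Jones') AS Hash_Mixed   /* identical PI values always hash identically,      */
      ,HASHROW('JONES') AS Hash_Upper;  /* and character data hashes without regard to case  */

SELECT HASHROW(1) AS Hash_1             /* consecutive PI values do not normally produce     */
      ,HASHROW(2) AS Hash_2;            /* consecutive (or even similar) row hash values     */

Both columns of the first query should return the same BYTE(4) value; the two columns of the second query should look unrelated.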
Page 13-6 Data Distribution and Hashing

Hashing
• The Hashing Algorithm creates a fixed length value from any length input string.
• Input to the algorithm is the Primary Index (PI) value of a row.
• The output from the algorithm is the Row Hash.
  – A 32-bit binary value.
  – Used to identify the AMP of the row and the logical storage location of the row in the AMP.
  – Table ID + Row Hash is used to locate the Cylinder and Data Block.
• Row Hash uniqueness depends directly on PI uniqueness.
  – Good data distribution depends directly on Row Hash uniqueness.
• The algorithm produces random, but consistent, Row Hashes.
  – The same PI value and data type combination always hash identically.
  – Rows with the same Row Hash will always go to the same AMP.
• Teradata has a new "Enhanced Hashing Algorithm" starting with Teradata 13.10 new systems and fresh installs (sysinit).
  – Solves the problem of too many hash synonyms when character columns contain numeric data.
  – Problem most commonly occurs with long strings of numeric data in CHAR or VARCHAR columns as either Latin or Unicode.

Data Distribution and Hashing Page 13-7

Hash Related Expressions
The Teradata Database includes extensions to Teradata SQL, known as hash functions, which allow the user to extract statistical properties from the current index, evaluate those properties for other columns to determine their suitability as a future primary index, or more effectively design the primary index of rows. These statistics also help minimize hash synonyms and enhance data distribution uniformity. Hash functions are valid within a Teradata SQL statement where other functions (like SUBSTRING or INDEX) can occur.

HASHROW — this function returns the row hash value of a given sequence of expressions in BYTE (4) data type. For example, the following statement returns the average number of rows per row hash where C1 and C2 constitute an index (or potential index) of table TabX:

SELECT COUNT(*) (FLOAT) / COUNT (DISTINCT(HASHROW (C1,C2))) FROM TabX;

HASHBUCKET — this function returns the bucket number that corresponds to a hashrow. The bucket number is an integer type. The following example returns the number of rows in each hash bucket where C1 and C2 are an index (or potential index) of table TabX:

SELECT HASHBUCKET (HASHROW(C1,C2)), COUNT(*) FROM TabX GROUP BY 1 ORDER BY 1;

Query results can be treated as a histogram of table distribution among the hash buckets.

HASHAMP and HASHBAKAMP — these functions return the identification number of the primary or fallback AMP corresponding to a hashbucket. With Teradata V2R6.2 (and before), HASHAMP accepts only integer values between 0 and 65,535 as its argument. In this example, HASHAMP is used to determine the number of primary rows on each AMP where C1 and C2 are to be the primary index of table TabX:

SELECT HASHAMP (HASHBUCKET (HASHROW (C1, C2))), COUNT(*) FROM TabX GROUP BY 1 ORDER BY 1;

Query results can be treated as a histogram of the table distribution among the AMPs. Further information on these functions and their uses can be found in the Teradata RDBMS SQL Reference.

Note the examples on the facing page. This example was captured on a 26 AMP system using a hash map with 1,048,576 entries. The row hash of the literal 'Teradata' is the same with 16-bit or 20-bit hash bucket numbers. However, the target AMP numbers are different for a system with 65,536 hash buckets as compared to the same system with 1,048,576 hash buckets.
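The queries above can be combined into a single rough "skew check" for a candidate primary index. This is a sketch only: TabX, C1, and C2 are the same placeholder names used above, and the result is simply the largest AMP row count divided by the average AMP row count (a value near 1.0 indicates even distribution).

SELECT MAX(row_count) * 1.000 / AVG(row_count) AS Skew_Ratio
FROM (SELECT HASHAMP (HASHBUCKET (HASHROW (C1, C2))) AS AMP_No   /* primary AMP for each row hash */
            ,COUNT(*) AS row_count                                /* rows landing on that AMP      */
      FROM TabX
      GROUP BY 1) AS per_amp;

A noticeably large Skew_Ratio suggests heavy NUPI duplication or hash synonyms concentrating rows on a few AMPs, and the candidate index should be reconsidered.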
Page 13-8 Data Distribution and Hashing Hash Related Expressions • The SQL hash functions are: HASHROW (column(s)) HASHAMP (hashbucket) HASHBUCKET (hashrow) HASHBAKAMP (hashbucket) • Example 1: SELECT HASHROW ('Teradata') ,HASHBUCKET (HASHROW ('Teradata')) ,HASHAMP (HASHBUCKET (HASHROW ('Teradata'))) ,HASHBAKAMP (HASHBUCKET (HASHROW ('Teradata'))) Hash Value F5C4BC93 Bucket Num 1006667 AMP Num 12 AS "Hash Value" AS "Bucket Num" AS "AMP Num" AS "AMP Fallback Num" ; AMP Fallback Num 25 AMP Numbers based on 26-AMP system with 1,048,576 hash buckets. • Example 2: SELECT HASHROW ('Teradata') ,HASHROW ('Teradata ') ,HASHROW (' Teradata') Hash Value 1 F5C4BC93 Data Distribution and Hashing AS "Hash Value 1" AS "Hash Value 2" AS "Hash Value 3" ; Hash Value 2 F5C4BC93 Hash Value 3 01989D47 Note: Literals are converted to Unicode and then hashed. Page 13-9 Hashing – Numeric Data Types The hashing algorithm will hash the same numeric value in different data types to the same value. A DATE data type and an INTEGER data type hash to the same value. An example follows: CREATE TABLE tableE (c1_int INTEGER ,c2_date DATE) UNIQUE PRIMARY INDEX (c1_int); INSERT INTO tableE (1010601, 1010601); INSERT INTO tableE (NULL, NULL); SELECT c1_int, HASHROW (c1_int), HASHROW (c2_date) from tableE; c1_int 1010601 ? HASHROW (c1_int) 1213C458 00000000 HASHROW (c2_date) 1213C458 00000000 A second example follows: CREATE TABLE tableF (c1_int INTEGER ,c2_int INTEGER ,c3_char CHAR(4) ,c4_char CHAR(4)) UNIQUE PRIMARY INDEX (c1_int, c2_int); INSERT INTO tableF (0, NULL,'0', NULL); SELECT HASHROW (c1_int) AS "Hash c1" ,HASHROW (c2_int) AS "Hash c2" ,HASHROW (c3_char) AS "Hash c3" ,HASHROW (c4_char) AS "Hash c4" FROM tableF; Hash c1 00000000 Hash c2 00000000 Hash c3 2BB7F6D9 Hash c4 00000000 Note: The BTEQ commands .SET SIDETITLES and .SET FOLDLINE were used to display the output on the bottom of the facing page. Page 13-10 Data Distribution and Hashing Hashing – Numeric Data Types • The Hashing Algorithm hashes the following numeric data types to the same hash value: – BYTEINT, SMALLINT, INTEGER, BIGINT, DECIMAL(x,0), DATE Example: SELECT CREATE TABLE tableA (c1_bint BYTEINT ,c2_sint SMALLINT ,c3_int INTEGER ,c4_bigint BIGINT ,c5_dec DECIMAL(8,0) ,c6_dec2 DECIMAL(8,2) ,c7_float FLOAT ,c8_char CHAR(10)) UNIQUE PRIMARY INDEX (c1_bint, c2_sint); FROM INSERT INTO tableA (5, 5, 5, 5, 5, 5, 5, '5'); Output from SELECT Data Distribution and Hashing HASHROW ,HASHROW ,HASHROW ,HASHROW ,HASHROW ,HASHROW ,HASHROW ,HASHROW tableA; (c1_bint) (c2_sint) (c3_int) (c4_bigint) (c5_dec) (c6_dec2) (c7_float) (c8_char) Hash Byteint Hash Smallint Hash Integer Hash BigInt Hash Dec80 Hash Dec82 Hash Float Hash Char AS "Hash Byteint" AS "Hash Smallint" AS "Hash Integer" AS "Hash BigInt" AS "Hash Dec80" AS "Hash Dec82" AS "Hash Float" AS "Hash Char" 609D1715 609D1715 609D1715 609D1715 609D1715 BD810459 E40FE360 551DCFDC Page 13-11 Multi-Column Hashing The hashing algorithm uses multiplication and addition as commutative operators for handling a multi-column index. If the data types hash the same, a multi-column index will hash the same for the same values in different columns. Note the example on the facing page. Note: The result would be the same if 3.0 and 5.0 were used as decimal values instead of 3 and 5. 
INSERT INTO tableB (5, 3.0); INSERT INTO tableB (3, 5.0); SELECT c1_int AS c1 ,c2_dec AS c2 ,HASHROW (c1_int) AS “Hash c1” ,HASHROW (c2_dec) AS “Hash c2” ,HASHROW (c1_int, c2_dec) as “Hash c1c2” FROM tableB; c1 5 3 Page 13-12 c2 3 5 Hash c1 609D1715 6D27DAA6 Hash c2 6D27DAA6 609D1715 Hash c1c2 6C964A82 6C964A82 Data Distribution and Hashing Multi-Column Hashing • The Hashing Algorithm uses multiplication and addition to create the hash value for a multi-column index. • Assume PI = (A, B) [Hash(A) * Hash(B)] + [Hash(A) + Hash(B)] = [Hash(B) * Hash(A)] + [Hash(B) + Hash(A)] • Example: A PI of (3, 5) will hash the same as a PI of (5, 3) if both c1 & c2 are equivalent data types. CREATE TABLE tableB (c1_int INTEGER ,c2_dec DECIMAL(8,0)) UNIQUE PRIMARY INDEX (c1_int, c2_dec); INSERT INTO tableB (5, 3); INSERT INTO tableB (3, 5); SELECT c1_int AS c1 ,c2_dec AS c2 ,HASHROW (c1_int) AS "Hash c1" ,HASHROW (c2_dec) AS "Hash c2" ,HASHROW (c1_int, c2_dec) as "Hash c1c2" FROM tableB; *** Query completed. 2 rows found. 5 columns returned. These two rows will hash the same and will produce a hash synonym. Data Distribution and Hashing c1 c2 5 3 3 5 Hash c1 Hash c2 Hash c1c2 609D1715 6D27DAA6 6D27DAA6 609D1715 6C964A82 6C964A82 Page 13-13 Multi-Column Hashing (cont.) As mentioned before, the hashing algorithm uses multiplication and addition as commutative operators for handling a multi-column index. If the data types hash differently, then a multi-column index will hash differently for the same values in different columns. Note the example on the facing page. Page 13-14 Data Distribution and Hashing Multi-Column Hashing (cont.) • A PI of (3, 5) will hash differently than a PI of (5, 3) if column1 and column2 are data types that do not hash the same. • Example: CREATE TABLE tableC (c1_int INTEGER ,c2_dec DECIMAL(8,2)) UNIQUE PRIMARY INDEX (c1_int, c2_dec); INSERT INTO tableC (5, 3); INSERT INTO tableC (3, 5); SELECT c1_int AS c1 ,c2_dec AS c2 ,HASHROW (c1_int) AS "Hash c1" ,HASHROW (c2_dec) AS "Hash c2" ,HASHROW (c1_int, c2_dec) as "Hash c1c2" FROM tableC; *** Query completed. 2 rows found. 5 columns returned. These two rows will not hash the same and probably will not produce a hash synonym. Data Distribution and Hashing c1 c2 5 3 3.00 5.00 Hash c1 Hash c2 Hash c1c2 609D1715 6D27DAA6 A4E56902 BD810459 0E452DAE 336B8C96 Page 13-15 Additional Hash Examples A numeric value of 0 hashes the same as a NULL. A character data type with a value of all spaces also hashes the same as a NULL. However, a character value of ‘0’ hashes to a value different than the hash of a NULL. Upper and lower case characters hash the same. The following example shows that different numeric types with a value of 0 all hash to the same hash value. CREATE TABLE tableA (c1_bint BYTEINT, c2_sint SMALLINT, c3_int INTEGER, c4_dec DECIMAL(8,0), c5_dec2 DECIMAL(8,2), c6_float FLOAT, c7_char CHAR(10)) UNIQUE PRIMARY INDEX (c1_bint, c2_sint); .SET FOLDLINE .SET SIDETITLES INSERT INTO tableA (0,0,0,0,0,0,'0'); SELECT HASHROW (c1_bint) ,HASHROW (c2_sint) ,HASHROW (c3_int) ,HASHROW (c4_dec) ,HASHROW (c5_dec2) ,HASHROW (c6_float) ,HASHROW (c7_char) FROM tableA; Hash Byteint Hash Smallint Hash Integer Hash Dec0 Hash Dec2 Hash Float Hash Char AS "Hash Byteint" AS "Hash Smallint" AS "Hash Integer" AS "Hash Dec0" AS "Hash Dec2" AS "Hash Float" AS "Hash Char" 00000000 00000000 00000000 00000000 00000000 00000000 2BB7F6D9 Note: An INTEGER value of 500 and a DECIMAL (8, 2) value of 5.00 will both have the same hash value. 
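The rules above can be restated as a single spot check. The query below is only a sketch that mirrors the claims in this section; it uses literals rather than table columns, so minor differences are possible (the note on the Hash Related Expressions example page points out that literals are converted to Unicode before hashing).

SELECT HASHROW(0)                          AS Hash_Zero       /* numeric zero                */
      ,HASHROW(CAST(NULL AS INTEGER))      AS Hash_Null       /* numeric NULL treated as 0   */
      ,HASHROW('    ')                     AS Hash_Spaces     /* all-space character value   */
      ,HASHROW('0')                        AS Hash_Char_Zero  /* character '0' is different  */
      ,HASHROW(500)                        AS Hash_Int_500    /* INTEGER 500                 */
      ,HASHROW(CAST(5.00 AS DECIMAL(8,2))) AS Hash_Dec_5_00;  /* DECIMAL(8,2) value 5.00     */

If the rules hold, the first three columns return 00000000, Hash_Char_Zero returns a different value, and the last two columns return the same value as each other.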
Page 13-16 Data Distribution and Hashing Additional Hash Examples • A NULL value for numeric data types is treated as 0. • Upper and lower case characters hash the same. Example: CREATE TABLE tableD (c1_int INTEGER ,c2_int INTEGER ,c3_char CHAR(4) ,c4_char CHAR(4)) UNIQUE PRIMARY INDEX (c1_int, c2_int); INSERT INTO tableD ( 0, NULL, 'EDUC', 'Educ' ); SELECT FROM Result: HASHROW ,HASHROW ,HASHROW ,HASHROW tableD; (c1_int) AS "Hash c1" (c2_int) AS "Hash c2" (c3_char) AS "Hash c3" (c4_char) AS "Hash c4" Hash c1 Hash c2 Hash c3 Hash c4 00000000 00000000 6ED679D5 6ED679D5 Hash of 0 Hash of NULL Hash of 'EDUC' Hash of 'Educ' Data Distribution and Hashing Page 13-17 Using Hash Functions to View Distribution The Hash Functions can be used to view the distribution of rows for a chosen Primary Index. Notes: HashRow – returns the row hash value for a given value(s) HashBucket – the grouping for a specific hash value HashAMP – the AMP that is associated with the hash bucket HashBakAMP – the fallback AMP that is associated with the hash bucket Identifying the Hash Buckets If you suspect data skewing due to hash synonyms or NUPI duplicates, you can use the HashBucket function to identify the number of rows in each hash bucket. The HashBucket function requires the HashRow of the columns that make up the Primary Index or the columns being considered for a Primary Index. Identifying the Primary AMPs The HASHAMP function can be used to determine data skewing and which AMP(s) have the most rows. The Customer table on the facing page consists of 7017 rows. Page 13-18 Data Distribution and Hashing Using Hash Functions to View Distribution Hash Functions can be used to calculate the impact of NUPI duplicates and synonyms for a PI. SELECT HASHROW (Last_Name, First_Name) AS "Hash Value" ,COUNT(*) FROM customer GROUP BY 1 ORDER BY 2 DESC; Hash Value 2D7975A8 14840BD7 HASHAMP (HASHBUCKET (HASHROW (Last_Name, First_Name))) AS "AMP #" ,COUNT(*) FROM customer GROUP BY 1 ORDER BY 2 DESC; Count(*) AMP # Count(*) 12 7 7 6 4 5 2 3 1 0 929 916 899 891 864 864 833 821 (Output cut due to length) E7A4D910 AAD4DC80 SELECT 1 1 Data Distribution and Hashing The largest number of NUPI duplicates or synonyms is 12. AMP #7 has the largest number of rows. Page 13-19 Primary Index Hash Mapping The diagram on the facing page gives you an overview of Primary Index Hash Mapping, the process by which all data is distributed in the Teradata DBS. The Primary Index value is fed into the Hashing Algorithm, which produces the Row Hash. The row goes onto the Message Passing Layer. The Hash Maps in combination with the Row Hash determines which AMP gets the row. The Hash Maps are part of the Message Passing Layer interface. Starting with Teradata Database 12.0, Teradata supports either 65,536 or 1,048,576 hash buckets for a system. The larger number of buckets primarily benefits systems with thousands of AMPs, but there is no disadvantage to using the larger number of buckets on smaller systems. The hash map is an array indexed by hash bucket number. Each entry of the array contains the number of the AMP that processes the rows in the corresponding hash bucket. The RowHash is a 32-bit result obtained by applying the hash function to the primary index of the row. On systems with: 65,536 hash buckets, the system uses 16 bits of the 32-bit RowHash to index into the hash map. 1,048,576 hash buckets, the system uses 20 bits of the 32-bit RowHash as the index. 
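Two small queries make the relationship between the hash map and the AMP configuration visible from SQL. These are sketches: HASHAMP with no argument is documented to return one less than the number of AMPs covered by the current hash maps, and the table and column in the second query (sales_history, store_id) are purely illustrative names for a candidate primary index.

SELECT HASHAMP() + 1 AS Number_of_AMPs;   /* how many AMPs the current hash map points to */

SELECT COUNT(DISTINCT HASHBUCKET(HASHROW(store_id))) AS Buckets_Used   /* distinct hash buckets occupied */
      ,COUNT(*)                                      AS Total_Rows
FROM sales_history;

If Buckets_Used is very small relative to Total_Rows, the candidate column has few distinct hash values and its rows will concentrate in a small part of the hash map regardless of how many AMPs are configured.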
Page 13-20 Data Distribution and Hashing

Primary Index Hash Mapping
Primary Index Value for a Row
* Most newer systems have hash bucket numbers that are represented in the first 20 bits of the row hash.
Hashing Algorithm
Row Hash (32 bits) = Hash Bucket Number (20 bits)* + Remaining bits (12 bits)
• With a 20-bit hash bucket number, the hash map will have 1,048,576 hash buckets.
• The hash bucket number is effectively used to index into the hash map.
Hash Map – 1,048,576* entries (memory resident)
• Older systems (before TD 12.0) use the first 16 bits of the row hash for the hash bucket number. These systems have hash maps with 65,536 hash buckets.
• This course will assume 20 bits for the hash bucket number unless otherwise noted.
Data Distribution and Hashing
Message Passing Layer (PDE and BYNET)
AMP 0  AMP 1  AMP 2  AMP 3  AMP 4  AMP 5  AMP 6  AMP 7  AMP 8  AMP 9
Page 13-21

Hash Maps
As you have seen, Hash Maps are the mechanisms that determine which AMP gets a row. They are duplicated on every TPA node in the system. There are 4 Hash Maps:
Current Configuration Primary (designates where rows are stored)
Current Configuration Fallback (designates where copies of rows are stored)
Reconfiguration Primary (designates where rows move during a system reconfiguration)
Reconfiguration Fallback (designates where copies of rows move during a reconfiguration)
Hash Maps are also used whenever there is a PI or USI operation. Hash maps are arrays of Hash Map entries. There are 65,536 or 1,048,576 Hash Map entries. Each of these entries points to a single AMP in the system. The Row Hash generated by the Hashing Algorithm contains information that designates a particular entry on a particular Hash Map. This entry tells the system which AMP should be interrupted.
Teradata Version 1 used a Hash Map with only 3643 hash buckets. Teradata Version 2 (prior to Teradata 12.0) used hash maps with 65,536 hash buckets. Starting with Teradata 12.0, the number of hash buckets in a hash map can be either 65,536 or 1,048,576. One of the important impacts of this change is that the increase provides for a more even distribution of data with large numbers of AMPs. For systems upgraded to Teradata Database 12.0, the default number of hash buckets remains unchanged at 65,536 buckets. For new systems or following a sysinit, the default is 1,048,576 buckets.
Note: The Hash Maps are stored in GDO (Globally Distributed Object) files on each SMP and are loaded into the PDE memory space when PDE software is started – usually as part of the UNIX MP-RAS, Windows 2003, or Linux startup process.

Page 13-22 Data Distribution and Hashing

Hash Maps
Hash Maps are the mechanism for determining which AMP gets a row.
• There are four (4) Hash Maps on every TPA node.
• By default, the two Current Hash Maps are loaded into PDE memory space of each TPA node when PDE software boots.
Message Passing Layer
Current Configuration Primary    Reconfiguration Primary
Current Configuration Fallback   Reconfiguration Fallback
Hash Maps have either 65,536 or 1,048,576 entries. Each entry is 2 bytes in size.
• Starting with Teradata 12.0, for new systems (or for systems that have had a sysinit), the default number of hash buckets is 1,048,576.
• The increased number of hash buckets provides for a more even distribution of data with large numbers of AMPs.
• For systems upgraded to Teradata Database 12.0, the default number of hash buckets remains unchanged at 65,536 buckets.
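Because both Current Configuration maps are reachable through SQL, a quick query can confirm that the fallback copy of a row is always assigned to a different AMP than its primary copy. This sketch reuses the customer table and name columns from the earlier distribution example as placeholders.

SELECT HASHAMP    (HASHBUCKET (HASHROW (Last_Name, First_Name))) AS Primary_AMP    /* current primary map  */
      ,HASHBAKAMP (HASHBUCKET (HASHROW (Last_Name, First_Name))) AS Fallback_AMP   /* current fallback map */
      ,COUNT(*) AS Row_Count
FROM customer
GROUP BY 1, 2
ORDER BY 1, 2;

Every returned pair should show Primary_AMP <> Fallback_AMP, because the fallback hash map always selects a different AMP in the same cluster.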
Data Distribution and Hashing Page 13-23 Primary Hash Map The diagram on the facing page is a graphical representation of a Primary Hash Map. (It serves to illustrate the concept; they really don’t look like this.) The Hash Map utilized by the system is the Current Configuration Primary Hash Map. The Fallback Hash Map IS NOT an exact copy of the Primary Hash Map. The Primary Hash Map identifies which AMP the first (Primary) copy of a row belongs to. The Fallback Hash Map is only used for Fallback protected tables and identifies a different AMP in the same "cluster" for the second (Fallback) row copy. Note: On most systems (i.e., systems since the 5450), clusters typically consist of 2 AMPs. That portion of the Row Hash that points to a particular Hash Map entry is called the Hash Bucket Number (HBN). The hash bucket number is the first 16 or 20 bits of the Row Hash depending on the size of the hash maps. The hash bucket number points to a single entry in a Hash Map. As the diagram shows, the system looks at the particular Hash Map entry specified by the hash bucket number to determine which AMP the row belongs to. The Message Passing Layer (or Communications Layer) uses only the hash bucket number portion of the Row Hash to determine which AMP gets the row when inserting a new row into a table. The AMP uses the entire 32 bit Row Hash to determine logical disk storage location of the row. Teradata builds Hash Maps in a consistent fashion. The Primary Hash Map of systems with the same number of AMP vprocs is identical assuming the same number of buckets in the hash map (65,536 or 1,048,576 hash buckets). Fallback Hash Maps may differ due to clustering differences at each site. The hash bucket number (prior to Teradata 12.0) was commonly referred to as the Destination Selection Word (DSW). Page 13-24 Data Distribution and Hashing Primary Hash Map Row Hash (32 bits) Hash Bucket Number (20 or 16 bits) Remaining bits PRIMARY HASH MAP – 14 AMP System 0000 0001 0002 0003 0004 0005 • • • 0 1 2 3 4 5 6 7 8 9 13 13 10 07 04 01 12 07 10 06 04 00 13 08 13 13 05 05 12 10 05 03 07 04 13 08 11 06 09 08 11 08 11 08 06 10 12 11 12 13 09 10 10 11 12 02 07 05 13 09 11 13 03 08 10 09 11 13 02 08 A B C 11 10 06 01 03 06 12 12 12 00 08 09 11 09 13 07 01 07 D E 12 09 04 08 00 06 13 10 12 05 02 05 F 09 13 12 07 06 11 Note: This partial hash map (1,048,576 buckets) is associated with a 14 AMP System. Assume the Hash Bucket Number is the first 20 bits of the Row Hash. The Hash Bucket Number points to one entry within the map. The referenced Hash Map entry identifies the AMP for the row hash. Data Distribution and Hashing Page 13-25 Hash Maps for Different Systems The diagrams on the facing page show a graphical representation of a Primary Hash Map for an 8 AMP system and a Primary Hash Map for a 16 AMP system. These examples assume hash maps with 1,048,576 entries. A data value which hashes to “00023 1AB” will be directed to different AMPs on different systems. For example, this hash value will be associated with AMP 5 on an 8 AMP system and AMP 14 on a 16 AMP system. Page 13-26 Data Distribution and Hashing Hash Maps for Different Systems Row Hash (32 bits) Hash Bucket Number Remaining bits PRIMARY HASH MAP – 8 AMP System 0000 0001 0002 0003 0004 0005 Portions of actual hash maps with 1,048,576 hash buckets. 
Data Distribution and Hashing 1 2 3 4 5 6 7 8 9 A B C D E F 07 07 01 07 04 01 06 07 00 06 04 00 07 02 05 03 05 05 06 04 05 03 07 04 07 01 03 06 05 03 04 00 02 06 06 02 05 05 04 02 07 06 06 04 03 02 07 05 05 03 01 01 03 01 05 02 00 00 02 00 06 03 06 01 03 06 07 00 04 07 00 06 04 06 00 07 06 07 06 05 02 00 04 05 07 01 04 07 01 07 03 02 01 05 02 05 PRIMARY HASH MAP – 16 AMP System Assume row hash of 00023 1AB 8 AMP system – AMP 05 16 AMP system – AMP 14 0 0000 0001 0002 0003 0004 0005 0 1 2 3 4 5 6 7 8 9 15 13 10 15 15 01 14 14 10 15 04 00 15 14 13 13 05 05 15 10 14 14 07 04 13 15 11 06 09 08 14 08 11 08 06 10 12 11 12 13 09 10 14 11 12 14 07 05 13 15 11 13 15 08 15 09 11 13 15 08 A B C 15 10 14 14 03 06 12 12 12 14 08 09 11 09 13 07 15 07 D E 12 09 14 08 15 06 13 10 12 15 02 05 F 14 13 12 07 06 11 Page 13-27 Fallback Hash Map The diagram on the facing page is a graphical representation of a Primary Hash Map and a Fallback Hash Map. The Fallback Hash Map is only used for Fallback protected tables and identifies a different AMP in the same “cluster” for the second (Fallback) row copy. Note: These are the actual partial primary and fallback hash maps for a 14 AMP system with 1,048,576 hash buckets. Page 13-28 Data Distribution and Hashing Fallback Hash Map Row Hash (32 bits) Hash Bucket Number Remaining bits PRIMARY HASH MAP – 14 AMP System Assume row hash of 00023 1AB 0000 0001 0002 0003 0004 0005 Data Distribution and Hashing 1 2 3 4 5 6 7 8 9 13 13 10 07 04 01 12 07 10 06 04 00 13 08 13 13 05 05 12 10 05 03 07 04 13 08 11 06 09 08 11 08 11 08 06 10 12 11 12 13 09 10 10 11 12 02 07 05 13 09 11 13 03 08 10 09 11 13 02 08 A B C 11 10 06 01 03 06 12 12 12 00 08 09 11 09 13 07 01 07 D E 12 09 04 08 00 06 F 13 10 12 05 02 05 09 13 12 07 06 11 FALLBACK HASH MAP – 14 AMP System Primary AMP – 05 Fallback AMP – 12 Notes: 14 AMP System with 2 AMP clusters; hash maps with 1,048,576 buckets. 0 0000 0001 0002 0003 0004 0005 0 1 2 3 4 5 6 7 8 9 06 06 03 00 11 08 05 00 03 13 11 07 06 01 06 06 12 12 05 03 12 10 00 11 06 01 04 13 02 01 04 01 04 01 13 03 05 04 05 06 02 03 03 04 05 09 00 12 06 02 04 06 10 01 03 02 04 06 09 01 A B C D E F 04 03 13 08 10 13 05 02 11 01 07 13 02 06 05 00 13 04 05 05 05 07 01 02 04 02 06 00 08 00 06 03 05 12 09 12 Page 13-29 Reconfiguration Reconfiguration (Reconfig) is the process for changing the number of AMPs in a system and is controlled by the Reconfiguration Hash Maps. The system constructs Reconfiguration Hash Maps by reassigning Hash Map Entries to reflect a new configuration of AMPs. This is done in a way that minimizes the number of rows (and Hash Map Entries) reassigned to a new AMP. After rows are moved, the Reconfiguration Primary Hash Map becomes the Current Configuration Primary Hash Map, and the Reconfiguration Fallback Hash Map becomes the Current Fallback Hash Map. The diagram on the right illustrates a 200 AMP to 300 AMP Reconfig for a system. The 1,048,576 Hash Map entries are distributed evenly across the 200 AMPs in the initial configuration (top illustration), with approximately 5243 entries referencing each AMP. Thus, there are 5243 Hash Map Entries pointing to AMP 1. In a 300 AMP system, each AMP will have approximately 3496 referencing the AMP. It is necessary to change 1748 (5243 - 3496) of those and divide them between the new AMPs (AMP 200 through 299). The system does the same thing for the Hash Map Entries that currently point to the other AMPs. This constitutes the Reconfiguration Primary Hash Map. 
A similar process is done for the Reconfiguration Fallback Hash Map. Once the new Hash Maps are ready, the system looks at every row on each AMP and checks to see if the Hash Bucket Number points to one of the Hash Map Entries which was changed. If so, then the row is moved to its new destination AMP. The formula used to determine the percentage of rows migrating to new AMPs during a Reconfig is shown at the bottom of the right-hand page. Divide the Number of New AMPs by the Sum of the Old and New AMPs (the number of AMPs after the Reconfig). For example, the above 200 to 300 AMP Reconfig causes 33.3% of the rows to migrate. Page 13-30 Data Distribution and Hashing Reconfiguration Existing AMPs If a 12.0 system (with 1,048,576 Hash buckets) has 200 AMPs, then each of the 200 AMPs will have approx. 5243 entries in the hash map. If upgrading to 300 AMPs, then each of the 300 AMPs will have a similar number of entries (approx. 3496) in the hash map. 0 1 2 5243 5243 5243 New AMPs ….. 199 200 5242 Empty 299 ….. Empty 1,048,576 Hash Map Entries 0 1 2 3496 3496 3496 299 …………………………………….. 3495 • The system creates new Hash Maps to accommodate the new configuration. • Old and new maps are compared – each AMP reads its rows, and moves only those that hash to a new AMP. Percentage of Number of New AMPs Rows Moved = SUM of Old + New AMPs to new AMPs = 100 300 = 1 3 = 33.3% • It is not necessary to offload and reload data due to a reconfiguration. • If the hash map size is changed (65,536 to 1,048,576), more data will be moved as part of a reconfiguration. Data Distribution and Hashing Page 13-31 Row Retrieval via PI Value – Overview The facing page illustrates the step-by-step process involved in Primary Index retrieval. The SELECT statement (shown on facing page) retrieves the row or rows where the PI is equal to a particular column value (or column values in the case of a multi-column PI). The PE parser always puts out a three-part message composed of the Table ID, Row Hash and Primary Index value. The 48 bit Table ID is looked up in the Data Dictionary, the 32 bit Row Hash value is generated by the Hashing Algorithm and the Primary Index value comes from the SQL request. The Message Passing Layer (a.k.a., Communications Layer) Interface uses the Hash Bucket Number (first 16 or 20 bits of the Row Hash) to determine which AMP to interrupt and pass on the message. The AMP uses the Table ID and Row Hash to identify and locate the proper data block, then uses the Row Hash and PI value to locate the specific row(s). The PI value is required to distinguish between Hash Synonyms. Page 13-32 Data Distribution and Hashing Row Retrieval via PI Value – Overview SELECT … FROM tablename WHERE primaryindex = values(s); Parsing Engine SQL Request Parser Hashing Algorithm 48 Bit TABLE ID Index Value Hash Bucket Number Message Passing Layer AMP File System 32 Bit Row Hash Logical Block Identifier Logical Row Identifier With a PI row retrieval, only the AMP (whose number appears in the referenced Hash Map) is accessed by the system. Data Distribution and Hashing Vdisk Data Block Page 13-33 Names and Object IDs DBC.Next is a Data Dictionary table that consists of a single row with 9 columns as shown below. One of the counters is used to assign a globally unique numeric ID to every Database, User, Role, and Profile. A different counter is used to assign a globally unique numeric ID to every Table, View, Macro, Trigger, Stored Procedure, User-Defined Function, Join Index, and Hash Index. 
DBC.Next always contains the next value to be assigned to any of these. Think of these columns as counters for ID values. You may be interested in noting that DBC.Next only contains a single, short row but it requires a Table Header on every AMP, as does any table. Columns and Indexes are also assigned numeric IDs, which are unique within their respective tables. However, column and index IDs are not assigned from DBC.Next.

DBC.Next columns     Values    Data Type
RowNum               1         CHAR(1)
DatabaseID           numeric   BYTE(4)
TableID              numeric   BYTE(4)
ProcsRowLock         numeric   BYTE(4)
EventNum             numeric   BYTE(4)
LogonSequenceNo      numeric   BYTE(4)
TempTableID          numeric   BYTE(4)
StatsQueryID         number    BYTE(4)
ReconfigID           number    INTEGER

Page 13-34 Data Distribution and Hashing

Names and Object IDs
DBC.Next (1 row): NEXT DATABASE ID, NEXT TVM ID, 6 Other Counters
• The DD keeps track of all SQL names and their numeric IDs.
• Each Database/User/Profile/Role – is assigned a globally unique numeric ID.
• Each Table, View, Macro, Trigger, Stored Procedure, User-defined Function, Join Index, and Hash Index – is assigned a globally unique numeric ID.
• Each Column – is assigned a numeric ID unique within its Table ID.
• Each Index – is assigned a numeric ID unique within its Table ID.
• The PE's RESOLVER uses the DD to verify names and convert them to IDs.
• The AMPs use the numeric IDs supplied by the RESOLVER.

Data Distribution and Hashing Page 13-35

Table ID
The Table ID is the first part of the three-part message. It is a 48-bit number supplied by the parser. There are two major components of the Table ID:
The first component of the Table ID is the Unique Value. Every table, view and macro is assigned a 32-bit Unique Value, which is assigned by the system table called DBC.Next. In addition to specifying a particular table, this value also indicates whether the table is a normal data table, Permanent Journal table or Spool file table.
The second component of the Table ID is known as the Subtable ID. Teradata stores various types of rows of a table in separate blocks. For example, Table Header rows (described later) are stored in different blocks than primary data rows, which are stored in different blocks than Fallback data rows, and so on (more examples are shown on the facing page). Each separate set of blocks is known as a subtable. The Subtable ID is a 16-bit value that tells the file system which type of blocks to search for.
The facing page lists subtable IDs in decimal value for 2-AMP clusters. The SHOWBLOCKS utility will display the block allocations by subtable and uses decimal values to represent each subtable. If a Reference Index subtable was created, it would have subtable IDs of 1536 and 2560.
For convenience, Table ID examples throughout this course only refer to the Unique Value and omit the Subtable ID. The Table ID, together with the Row ID, gives Teradata a way to uniquely identify every single row in the entire system.

Spool File Table IDs
Spool files are temporary work tables which are created and dropped as queries are executed. When a query is complete, all of the spool files that it used will be dropped automatically. Like all tables, a spool file (essentially a temporary work table) requires a Table ID (or tableid). There is a range of tableids exclusively reserved for spool files (C000 0001 through FFFF FFFF) and the system cycles through them. Eventually, the system will cycle through all the tableids for spool files and reassign spool tableids starting at C000 0001.
Page 13-36 Data Distribution and Hashing Table ID The Table ID is a Unique Value for Tables, Views, Macros, Triggers, Stored Procedures, Join Indexes, etc. that comes from DBC.Next dictionary table. Unique Value also defines the type of table: • Normal data table • Permanent journal • Global Temporary • Spool file UNIQUE VALUE 32 Bits + SUB-TABLE ID 16 Bits Sub-table ID identifies the part of a table the system is looking at. Sub-table type Table Header Data table 1st Secondary index 2nd Secondary index 1st Reference index 1st BLOB or CLOB 2nd BLOB or CLOB Archive Online Subtable Primary ID Fallback ID (shown in decimal format) 0 1024 2048 1028 2052 1032 2056 1536 2560 1792 2816 1794 2818 18440 n/a Table ID plus Row ID makes every row in the system unique. Examples shown in this manual use the Unique Value to represent the entire Table ID. Data Distribution and Hashing Page 13-37 Row ID The Row Hash is not sufficient to identify a specific row in a table. Since it is based on a Primary Index value, multiple rows can have the same Row Hash. This is due either to Hash Synonyms or NUPI Duplicates. The Row ID makes every row within a table uniquely identifiable. For a non-partitioned table, the Row ID consists of the Row Hash plus a Uniqueness Value. The Uniqueness Value is a 32-bit numeric value, designed to identify specific rows within a single Row Hash value. When there are multiple rows with the same Row Hash within a table, the first row is assigned a Uniqueness Value of 1. Additional rows with the same Row Hash are assigned ascending Uniqueness Values. For Primary Index retrievals, only the Row Hash and Primary Index values are needed to find the qualifying row(s). The Uniqueness Value is needed for Secondary Index support. Since a Row ID is a unique identifier of a row within a table, Teradata uses Row IDs as Secondary Index pointers. Although Row IDs do identify every row in a table uniquely, they do not guarantee that the data itself is unique. In order to avoid the problem of duplicate rows (permitted in Multiset tables), the complete set of data values for a row (in a Set table) must also be unique. Summary For a non-partitioned table (NPPI), the Row ID consists of the Row Hash + Uniqueness Value for a total of 8 bytes in length. Page 13-38 For a partitioned table (PPI), the Row ID actually consists of the Partition Number + Row Hash + Uniqueness Value for a total of 10 or 16 bytes in length. Data Distribution and Hashing Row ID On INSERT, Teradata stores both the data values and the Row ID. ROW ID = ROW HASH and UNIQUENESS VALUE Row Hash • Row Hash is based on Primary Index value. • Multiple rows in a table could have the same Row Hash. • NUPI duplicates and hash synonyms have the same Row Hash. Uniqueness Value • • • • • • The AMP creates a numeric 32-bit Uniqueness Value. The first row for a Row Hash has a Uniqueness Value of 1. Additional rows have ascending Uniqueness Values. Row IDs determine sort sequence within a Data Block. Row IDs support Secondary Index performance. The Row ID makes every row within a table uniquely identifiable. Duplicate Rows • Row ID uniqueness does not imply data uniqueness. Note: The Row ID for a non-partitioned table is effectively 8 bytes long. Data Distribution and Hashing Page 13-39 AMP File System – Locating a Row via PI The steps on the right-hand page outline the process that Teradata uses to locate a row. We know that rows are distributed according to their Row Hash. 
More specifically, the Hash Bucket Number points to a single entry in a Hash Map which designates a particular AMP. Once the correct AMP has been found, the Master Index for that AMP is used to identify which Cylinder Index should be referenced. The Cylinder Index then identifies the correct Data Block. A search of the Data Block locates the row or rows specified by the original threepart message. The system performs either linear or indexed searches. The diagram at the bottom of the facing page illustrates these steps in a graphical fashion. Page 13-40 Data Distribution and Hashing AMP File System – Locating a Row via PI • The AMP accesses its Master Index (always memory-resident). – An entry in the Master Index identifies a Cylinder # and the AMP accesses the Cylinder Index (frequently memory-resident). • An entry in the Cylinder Index identifies the Data Block. – The Data Block is the physical I/O unit and may or may not be memory resident. – A search of the Data Block locates the row(s). The PE sends request to an AMP via the Message Passing Layer (PDE & BYNET). Table ID Row Hash PI Value AMP Memory Master Index Cylinder Index (accessed in FSG Cache) Data Block (accessed in FSG Cache) Data Distribution and Hashing Vdisk CI Row Page 13-41 Teradata File System Overview The Teradata File System software has these characteristics: part of AMP address space unaware of other AMP or File System instances AMP Interface to disk services uses PDE FSG services The Master Index contains an entry (CID) for each allocated cylinder. (CID – Cylinder Index Descriptor) On the facing page, SRD–A represents an SRD (Subtable Reference Descriptor) for table A. DBD–A1 and DBD–A2 represent data blocks for table A. (DBD – Data Block Descriptor) On the facing page, SRD–B represents an SRD for table B. DBD–B1, etc. represent data blocks for table B. There are actually two cylinder indexes allocated for each cylinder. Each cylinder index is 12 KB in size. Therefore, there is 24 KB (48 sectors) allocated for cylinder indexes at the beginning of each cylinder. Prior to Teradata 13.10 and Large Cylinder Support, cylinders are 3872 sectors. Miscellaneous notes: Master index entries are 72 bytes long. A cylinder index is 12 KB in size for 2 MB cylinders and are 64 KB in size for 12 MB cylinders Data rows for PPI tables require an additional 2 bytes to identify the partition number and the spare byte is set to x'80' to identify the row as a PPI row. Secondary index subtable rows also have the Part # + Row Hash + Uniqueness ID) to identify data rows. Page 13-42 Data Distribution and Hashing Teradata File System Overview Master Index CID CID CID CID . .. AMP Memory CID – Cylinder Index Descriptor SRD – Subtable Reference Descriptor DBD – Data Block Descriptor VDisk Cylinder Index SRD - A DBD - A1 DBD - A2 Data Block A1 SRD - B DBD - B1 DBD - B2 Cylinder 3872 sectors Data Block B1 Data Block A2 Data Block B2 Cylinder Index SRD - B DBD - B3 DBD - B4 DBD - B5 Data Block B3 Data Block B4 Data Distribution and Hashing Data Block B5 Page 13-43 Master Index Format The first cylinder in each Vdisk contains a number of control structures used by the AMP’s File System software. Segment 0 (512 bytes) contains the Vdisk status and a number of structure pointers for the AMP. Following Segment 0 is the FIB (File System Information Block). 
The FIB contains global file system information – a key component is a status array that shows the status of cylinders (used, free, bad, etc.), and the sorted list of CIDs that are the descriptors for the cylinders currently in use. The FIB effectively contains the list of free or available cylinders. Unlike the Master Index (MI), the FIB is written to disk when cylinders are allocated, and it is read from disk when Teradata boots or when the MI needs to be rebuilt in memory. If necessary, software will allocate additional cylinders for these structures. The Master Index is a memory resident structure that contains an entry for every allocated data cylinder on that AMP. Entries in the Master Index are sorted by the lowest Table ID and Row ID that can be found on the associated cylinder. The Master Index is used to identify which cylinder a specific row can be found in. The key elements of the Master Index are: Master Index Header - 32 bytes (not shown) Cylinder Index Descriptors (CID) – one per allocated cylinder – 72 bytes in length Cylinder Index Descriptor Reference Array (not shown) – set of 4 byte pointers to the CIDs; these entries are sorted in descending order. Note: This array is similar to the row reference array at the end of a data block. Cylinders that contain no data are not listed in the Master Index. They appear in the Free Cylinder List (which is part of the FIB – File System Information Block) for the associated Vdisk. Entries in the Free Cylinder List are sorted by Cylinder Number. Each Master Index entry (or CID) contains the following data: Lowest Table ID in the cylinder Lowest Part # / Row ID value in the cylinder (associated with the lowest Table ID) Highest Table ID in the cylinder Highest Part # / Row hash (not Row ID) value in the cylinder (associated with the highest Table ID) Drive (Pdisk) and Cylinder Number Free sectors Flags The maximum size of the Master Index is based on number of cylinders available to the AMP. Page 13-44 Data Distribution and Hashing Master Index Format Characteristics • Memory resident structure specific to each AMP. • Contains Cylinder Index Descriptors (CID) – one for each allocated Cylinder (72 bytes long). • Each CID identifies the lowest Table ID / Part# / Row ID and the highest Table ID / Part# / Row Hash for a cylinder. • Range of Table ID / Part# / Row IDs does not overlap with any other cylinder. • Sorted list of CIDs. Vdisk Cylinder 0 Seg. 0 Master Index FIB (contains Free Cylinder List) CI CID 1 CID 2 CID 3 . . Cylinder CI Cylinder CI Cylinder CID n CI Cylinder Notes: • The Master index and Cylinder Index entries include the partition #’s to support partition elimination for Partitioned Primary Index (PPI) tables. • For non-partitioned tables, the partition number is 0 and the Master and Cylinder Index entries (for NPPI tables) will use 0 as the partition number in the entry. Data Distribution and Hashing Page 13-45 Cylinder Index Format Each cylinder has its own Cylinder Index (CI). The Cylinder Index contains a list of the data blocks and free sectors that reside on the cylinder. The Cylinder Index is accessed to determine which data block a row resides in. The key elements of the Cylinder Index include: Cylinder Index Header (not shown) Subtable Reference Descriptors (SRD) contain – – Table ID Range of DBDs (1st and count) Data Block Descriptors (DBD) – – – – – First Part # / Row ID Last Part # / Row Hash Sector number and size Flags Row count Free Sector Entries (FSE) – identifies free sectors in the cylinder. 
There is one FSE (for each free sector range in the cylinder. The set of FSEs effectively make up the “Free Block List” or also known as the “Free Sector List”. Subtable Reference Descriptor Array (not shown) – set of 2 byte pointers to the SRDs; these entries are sorted in descending order. Note: This array is similar to the row reference array at the end of a data block. Data Block Descriptor Array (not shown) – set of 2 byte pointers to the DBDs; these entries are sorted in descending order. Note: This array is similar to the row reference array at the end of a data block. There are two cylinder indexes allocated for each cylinder. Each cylinder index is 12 KB in size. Therefore, there is 24 KB (48 sectors) allocated for cylinder indexes at the beginning of each cylinder. The facing page illustrates a logical view of SRDs and DBDs and does not represent the actual physical implementation. For example, the SRD and DBD reference arrays are not shown. Page 13-46 Data Distribution and Hashing Cylinder Index Format Characteristics VDisk • Located at beginning of each Cylinder.. • There is one SRD (Subtable Reference Descriptor) for each subtable that has data blocks on the cylinder. • Each SRD references a set of DBD(s). A DBD is a Data Block Descriptor.. • One DBD per data block - identifies location and lowest Part# / Row ID and the highest Part # / Row Hash within a block. • FSE - Free Segment (or Sector) Entry identifies free sectors. • Note: Each Cylinder actually has Cylinder Index SRD A DBD A1 DBD A2 . SRD B DBD B1 DBD B2 . FSE FSE Cylinder Data Block A1 Data Block A2 Data Block B1 Data Block B2 Range of Free Sectors Range of Free Sectors two 12K Cylinder Indexes and the File System software alternates between them. Data Distribution and Hashing Page 13-47 Data Block Layout A Block is the physical I/O unit for Teradata. It contains one or more data rows, all of which belong to the same table. They must fit entirely within the block. The maximum block size is 255 sectors or 127.5 KB. A Data Block consists of three major sections: The Data Block Header (DB Header) The Row Heap The Row Reference Array Rows cannot be split between blocks. Each row in a DB is referenced by a separate index to the row known as the Row Reference Array. The Row Reference Array is placed at the end of the data block just before the Block Trailer. With tables that are not partitioned (Non-Partitioned Primary Index – NPPI), each row has at least 14 bytes of overhead in addition to the data values stored in that row. With tables that are partitioned (PPI), each row has at least 16 bytes of overhead in addition to the data values stored in that row. The partition number uses the additional two bytes. There are also 2 bytes of space used in the Row Reference Array for a 2-byte Reference Array Pointer. This 2-byte pointer identifies the offset of where the row starts within the block. If a row is an odd number of bytes in length, the Row Length specifies its precise length, but the system allocates whole words within the block for the row. Rows will start on an even address boundary. Teradata truly supports variable length rows. The max amount of user data that you can define in a table row is 64,243 bytes because there is a minimum of 12 bytes of overhead within the row. This gives a total of 64,255 bytes for the data row plus an additional 2 bytes for the row offset within the row reference array. Page 13-48 Data Distribution and Hashing Data Block Layout • A data block contains rows with same subtable ID. 
– Contains rows within range of Row IDs of associated DBD entry and the range of Row IDs does not overlap with any other data block. – Logically sorted set of rows. • The maximum block size is 255 sectors (127.5 KB). – Blocks can vary in size from 1 sector to 255 sectors. Data Distribution and Hashing Row 1 Row 3 Row 2 Row 4 Row Reference Array -3 -2 -1 0 Trailer (2 bytes) Header (72 bytes) • A maximum row size is 64,255 bytes. Page 13-49 Example of Locating a Row – Master Index In the example on the facing page, you can see how Teradata would use the Master Index to locate the data requested by a SELECT statement. The three-part message is Table ID=100, Row Hash=1000 and EMPNO=3755. After identifying the appropriate AMP, Teradata uses that AMP’s Master Index to locate which cylinder contains this Table ID and Row Hash. By examining the Master Index, you can see that Cylinder Number 169 contains the appropriate row, if it exists in the system. Teradata’s File System software does a binary search of the CIDs based on Table ID / Part # / Row Hash or Table ID / Part # / Row ID to locate the cylinder number that has the row(s). The CI for that cylinder is accessed to locate the data block. A user request for a row based on a Primary Index value will only have the Table ID / Part # / Row Hash. A user request for a row based on a Secondary Index (SI) will have the Table ID / Row Hash for the SI value. The SI subtable row contains the Row ID(s) of the base table row(s). Teradata software uses the Table ID / Row ID(s) to locate the base table row(s). If a table is partitioned, the SI subtable row will have the Part # and the Row ID. Free cylinders appear in the Free Cylinder List which is part of the FIB (File System Information Block) for the associated Vdisk. Summary Page 13-50 There is only one entry for each cylinder on the AMP. Cylinders with data appear on the Master Index. Cylinders without data appear on the free Cylinder List (which is located within the FIB – File System Information Block). Each index entry identifies its cylinder’s lowest Table ID / Partition # / Row ID. Index entries are sorted by Table ID, Partition #, and Lowest Row ID. Multiple tables may have rows on the same cylinder. A table may have rows on many cylinders on different Pdisks on an AMP. The Free Cylinder List is sorted by Cylinder Number. Data Distribution and Hashing Example of Locating a Row – Master Index Table ID Part # 100 0 Row Hash 1000 empno 3755 SELECT * FROM employee WHERE empno = 3755; Master Index Lowest Highest Pdisk and Table ID Part # Row ID Table ID Part # Row Hash Cylinder Number : 078 098 100 100 100 100 100 100 100 123 123 : : 0 0 0 0 0 0 0 0 0 1 2 : : 58234, 2 00107, 1 00773, 3 01361, 2 02937, 1 03662, 1 04123, 2 05974, 1 07353, 1 00343, 2 06923, 1 : Part # - Partition Number : 095 100 100 100 100 100 100 100 120 123 123 : : 0 0 0 0 0 0 0 0 0 2 3 : : 72194 00676 01361 02884 03602 03999 05888 07328 00469 01864 00231 : : 204 037 169 777 802 117 888 753 477 529 943 : Free Cylinder List Pdisk 0 Free Cylinder List Pdisk 1 : 124 125 168 170 183 189 201 217 220 347 702 : : 761 780 895 896 914 935 941 1012 1234 1375 1520 : To CYLINDER INDEX What cylinder would have Table ID = 100, Row Hash = 00598? Data Distribution and Hashing Page 13-51 Example of Locating a Row – Cylinder Index Using the example on the facing page, the File System would determine that the data block it needs is the six-sector block beginning at sector 0789. 
The Table ID and Row Hash we are looking for (100 + 1000, n) falls between the lowest and highest entries of 100 + 00998, 1 and 100 + 01010. The convention of 00998, 1 is as follows: 00998 is the Row Hash and 1 is the Uniqueness Value. Teradata’s File System software does a binary search of the SRDs based on Table ID and a binary search of the DBDs Partition #, Row Hash or Row ID to identify the data block(s) that has the row(s). A user request for a row based on a Primary Index value will include the Table ID / Part # / Row Hash. A user request for a row based on a Secondary Index (SI) will have the Table ID / Part # / Row Hash for the SI value. The SI subtable row contains the Row ID(s) of the base table row(s). Teradata software uses the Table ID / Part # / Row ID(s) to locate the base table row(s) for secondary index accesses. If a table is partitioned, the SI subtable row will have the Part # and the Row ID. The example on the facing page illustrates a cylinder that only has one SRD. All of the data blocks in this cylinder are associated with the same subtable. Summary Page 13-52 There is an entry (DBD) for each data block on this cylinder. These entries are sorted ascending on Table ID, Partition #, and Lowest Row ID. Only rows belonging to the same table and sub-table appear in a block. Blocks belonging to the same sub-table can vary in size. Blocks without data appear on the Free Sector List that is sorted ascending on sector number. Data Distribution and Hashing Example of Locating a Row – Cylinder Index Table ID Part # 100 0 Row Hash empno 1000 3755 SELECT * FROM employee WHERE empno = 3755; Cylinder Index - Cylinder #169 SRDs Table ID First DBD DBD Offset Count SRD #1 Free Block List 100 FFFF 12 Free Sector Entries DBDs Part # Lowest Row ID Part # Highest RowHash Start Sector : DBD #4 DBD #5 DBD #6 DBD #7 DBD #8 DBD #9 : : 0 0 0 0 0 0 : : 00867, 2 00938, 1 00998, 1 01010, 3 01185, 2 01290, 1 : : 0 0 0 0 0 0 : : 00902 00996 01010 01177 01258 01333 : : 1010 0093 0789 0525 0056 1138 : Sector Row Count Count : 4 7 6 3 5 5 : : 5 10 8 4 6 6 : Start Sector Sector Count : 0270 0301 0349 0470 0481 0550 : : 3 5 5 4 6 5 : This example assumes that only 1 table ID has rows on this cylinder and the table is not partitioned. Part # - Partition Number Data Distribution and Hashing Page 13-53 Example of Locating a Row – Data Block A Block is the physical I/O unit for Teradata. It contains one or more data rows, all of which belong to the same table. They must fit entirely within the block. The maximum block size is 255 sectors or 127.5 KB. A Data Block consists of three major sections: The Data Block Header (DB Header) The Row Heap The Row Reference Array Rows cannot be split between blocks. Each row in a DB is referenced by a separate “offset or pointer” to the row. These offsets are kept in the Row Reference Array. The Row Reference Array is placed near the end of the DB just before the Block Trailer. The DB Header contains control information for both the Row Reference Array and the Row Heap. The DB Header is 72* bytes of information which contains the Table ID (6 bytes). It shows which table and subtable the rows in the block are from. The Row Heap is where the rows reside in the DB. The rows may be in any physical order, are aligned on an even address boundary, and therefore have an even number of bytes allocated for them. The Reference Array Pointers (2 bytes each), which point to the first byte of a row (Row Length), are maintained in reverse Row ID sequence. 
The Reference Array pointers are used to do both binary and sequential searches. The Block Trailer (2 bytes) consists of a block version number which must match the block version number in the Data Block Header. * Notes on amount of space used by DB Headers. Page 13-54 If the DB is on a 32-bit system and has never been updated, then the DB Header is only 36 bytes long. If the DB is on a 64-bit system and has never been updated, then the DB Header is only 40 bytes long. If a data block is new or has been updated (either a 32-bit or 64-bit system), then the DB Header is 72 bytes long. The length of the block header for a compressed block is 128 bytes. Note that, in a compressed block, the header is not compressed and neither is the block trailer. Only the row data within the block is compressed. The extended block header has the normal block header at the start and then 56 extra bytes that contains information specific to the compressed block plus some extra filler bytes to allow for later additions without requiring data conversion. Data Distribution and Hashing Example of Locating a Row – Data Block Sector 789 790 791 792 793 794 Header (72) Row 1 Row 3 Row 2 Row 4 Row 6 Row 5 Row Heap Row 7 Row 8 Row R eference Array Trailer (2) • A block is the physical I/O unit. • The block header contains the Table ID (6 bytes). • Only rows for the same table reside in the same data block. – Rows are not split across block boundaries. • Blocks within a table vary in size. The system adjusts block sizes dynamically. – Blocks may be from 1 sector (512 bytes) to 255 sectors (127.5 KB). • Data blocks are not chained together. • Row Reference Array pointers are stored (sorted) in reverse sequence based on Row ID within the block. Data Distribution and Hashing Page 13-55 Accessing the Row within the Data Block Teradata’s File System software does a binary search of the Row Reference Array to locate the rows that have a matching Row Hash. Since the Row Reference Array is sorted in reverse sequence based on Row ID, the system can do a binary or linear search. The first row with a matching Row Hash has its Primary Index value compared with the Primary Index value in the request. The PI value must be checked to eliminate Hash Synonyms. The matching rows are then put into spool. If no matches are made, a message is returned that no rows are found. In the case of a Unique Primary Index (UPI), the search ends with the first row found matching the criteria. The row is then returned. In the case of a Non-Unique Primary Index (NUPI), the matching rows (same PI value and Row Hash) are put into spool. With a NUPI, the matching rows in spool are returned. The example on the right-hand page illustrates how Teradata utilizes the Primary Index data value to eliminate synonyms. This is the conclusion of the example that we have been following throughout this module. In earlier steps the Master Index was used to find that the desired row was in Cylinder 169. Then the Cylinder Index was used to find that the desired row was in the 6-sector block beginning in Sector Number 789. The diagram shows that block. The objective is to find that row with Row Hash=1000 and Index Value=3755. When the block is searched, the first row with Row Hash 1000 does not meet these criteria. Its Index Value is 1006, which means that it is a Hash Synonym. The system must continue its search to the next row, the only row that meets both criteria. The diagram on the facing page shows the logical order of rows in the block with a binary search. 
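Since the Primary Index value is what finally separates hash synonyms inside a block, it can be useful to know whether a table contains any. The query below is a sketch (employee and employee_number mirror the example in this section, and a self-join of this form can be expensive on a large table); it lists pairs of different PI values that share the same row hash.

SELECT a.employee_number          AS PI_Value_1
      ,b.employee_number          AS PI_Value_2
      ,HASHROW(a.employee_number) AS Shared_Row_Hash    /* identical for both rows of the pair */
FROM employee a
INNER JOIN employee b
   ON HASHROW(a.employee_number) = HASHROW(b.employee_number)   /* same row hash ...          */
  AND a.employee_number < b.employee_number;                     /* ... but different PI value */

Any rows returned are hash synonyms; they land on the same AMP and, very likely, in the same data block, where the PI value comparison described above is what distinguishes them.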
Page 13-56 Data Distribution and Hashing

Accessing the Row within the Data Block
• Within the data block, the Row Reference Array is used to locate the first row with a matching Row Hash value within the block.
• The Primary Index data value is used as a row qualifier to eliminate synonyms.

SELECT * FROM employee WHERE employee_number = 3755;   (PI value 3755 hashes to Row Hash 1000.)

Data Block – Sectors 789 through 794:
  Hash   Uniq   Index Value   Data Columns
   998     1       4219       Row data
   999     1       2968       Row data
   999     2       6324       Row data
  1000     1       1006       Row data
  1000     2       3755       Row data
  1002     1       6838       Row data
  1008     1       8825       Row data
  1010     1       0250       Row data

Data Distribution and Hashing Page 13-57

AMP Read I/O Summary
You have seen that a Primary Index Read requires that the Master Index, Cylinder Index, and Data Block all must be accessed. The number of I/Os involved in this process can vary. The Master Index is always resident in memory. The Cylinder Index may or may not be resident in memory, and the Data Block may or may not be resident in memory. Factors that affect the number of I/Os involved include AMP memory, cache size, and locality of reference. Often the Cylinder Index is memory resident, so that a Unique Primary Index retrieval requires only a single I/O.
Note that no matter how many rows are in the table and no matter how many inserts are made, Primary Index access never gets any more complicated than Master Index to Cylinder Index to Data Block.

Page 13-58 Data Distribution and Hashing

AMP Read I/O Summary
The Master Index is always memory resident. The AMP reads the Cylinder Index if not memory resident. The AMP reads the Data Block if not memory resident.
• The amount of FSG cache also has an impact if either of these steps requires physical I/O.
• The data block may or may not be memory resident depending on recent accesses of this data block.
• The Cylinder Index is usually memory resident, so a Unique Primary Index retrieval requires only one I/O.
(Facing-page diagram: the Message Passing Layer delivers the Table ID, Row Hash, and PI Value to the AMP; the Master Index in AMP memory points to the Cylinder Index and the Data Block, which are accessed in FSG Cache from the Vdisk.)

Data Distribution and Hashing Page 13-59

Module 13: Review Questions
Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor.

Page 13-60 Data Distribution and Hashing

Module 13: Review Questions
1. The Row Hash for a PI value of 824 is the same for the data types of INTEGER and DECIMAL(18,0). True or False. _______
2. The first 16 or 20 bits of the Row Hash is referred to as the _________ _________ _________ .
3. The Hash Map consists of entries or buckets which identify an _____ number for the Row Hash.
4. The Current Configuration ___________ Hash Map is used to locate the AMP to locate/store a row based on PI value.
5. The ____________ utility is used to redistribute rows to a new system configuration with more AMPs.
6. When creating a new table, the Unique Value of a Table ID comes from the dictionary table named DBC.________ .
7. The Row ID consists of the _______ ________ and the __________ _____ .
8. The _______ _______ contains a Cylinder Index Descriptor (CID) for each allocated Cylinder.
9. The _______ _______ contains an entry for each data block in the cylinder.
10. The ____ __________ ________ consists of a set of 2-byte pointers to the data rows in a data block.
11. The maximum block size is approximately _______ and the maximum row size is approximately _______ .
12.
The Primary Index data value is used as a row qualifier to eliminate hash _____________ . Data Distribution and Hashing Page 13-61 Notes Page 13-62 Data Distribution and Hashing Module 14 File System Writes After completing this module, you will be able to: Describe File System Write Access. Describe what happens when Teradata inserts a new row into a table. Describe the impact of row inserts on block sizes. Describe how fragmentation affects performance. Teradata Proprietary and Confidential File System Writes Page 14-1 Notes Page 14-2 File System Writes Table of Contents AMP Write I/O........................................................................................................................... 14-4 New Row INSERT – Part 1 ....................................................................................................... 14-6 New Row INSERT – Part 2 ....................................................................................................... 14-8 New Row INSERT – Part 2 (cont.).......................................................................................... 14-10 New Row INSERT – Part 3 ..................................................................................................... 14-12 New Row INSERT – Part 4 ..................................................................................................... 14-14 Alternate Cylinder Index ...................................................................................................... 14-14 Blocking in Teradata ................................................................................................................ 14-16 Block Size and Filling Cylinders ............................................................................................. 14-18 Variable Block Sizes ................................................................................................................ 14-20 Block Splits (INSERT and UPDATE) ..................................................................................... 14-22 Space Fragmentation ................................................................................................................ 14-24 Cylinder Full ............................................................................................................................ 14-26 Mini-Cylpack ........................................................................................................................... 14-28 Space Utilization ...................................................................................................................... 14-30 Teradata 13.10 Auto Cylinder Pack Feature ........................................................................ 14-30 Merge Datablocks (13.10 Feature) ........................................................................................... 14-32 Merge Datablocks (Teradata 13.10) cont. ............................................................................ 14-34 How to use this Feature .................................................................................................... 14-34 File System Write Summary .................................................................................................... 14-36 Module 14: Review Questions ................................................................................................. 14-38 Module 14: Review Questions (cont.) ................................................................................. 
14-40 File System Writes Page 14-3 AMP Write I/O The facing page illustrates how Teradata performs write operations and it outlines steps required to perform an AMP Write operation. WAL (Write Ahead Logging) is a recoverability/reliability feature that also provides performance improvements in the area of database writes. WAL is a Teradata V2R6.2 (and later) feature. WAL can batch up modifications from multiple transactions and apply them with a single disk I/O, thereby saving I/O operations. WAL will help improve throughput for I/O-bound workloads. WAL is a log-based file system recovery scheme in which modifications to permanent data are written to a log file, the WAL log. The log file contains change records (Redo records) which represent the updates. At key moments, such as transaction commit, the WAL log is forced to disk. In the case of a reset or crash, Redo records can be used to transform the old copy of a permanent data block on disk into the version that existed at the time of the reset. By maintaining the WAL log, the permanent data blocks that were modified no longer have to be written to disk as each block is modified. Only the Redo records in the WAL log must be written to disk. This allows a write cache of permanent data blocks to be maintained. WAL protects all permanent tables and all system tables but is not used to protect either the Transient Journal (TJ), since TJ records are stored in the WAL log, or any type of spool tables, including global temporary tables. The WAL log is maintained as a separate logical file system from the normal table area. Whole cylinders are allocated to the WAL log, and it has its own index structure. The WAL log data is a sequence of WAL log records and includes the following: Redo records, used for updating disk blocks and insuring file system consistency during restarts. TJ records used for transaction rollback. There is some additional CPU cost for maintaining the WAL log so WAL may reduce throughput for CPU-bound workloads. However, the overall performance is expected to be better with WAL since the benefit of I/O improvement outweighs the much smaller CPU cost. If CHECKSUM = NONE and the New Block length = Old Block length, Teradata will attempt to update-in-place for any INSERT, DELETE, or UPDATE operations. If the CHECKSUM feature is enabled for a table, any INSERT, UPDATE, or DELETE operation will cause a new data block to be allocated. The FastLoad and MultiLoad utilities always allocate new data blocks for write operations. TPump follows the same rules as an SQL INSERT, UPDATE, or DELETE. Page 14-4 File System Writes AMP Write I/O For SQL writes, Teradata uses WAL logic to manage disk write operations. • Read the Data Block if not in memory (Master Index > Cylinder Index > Data Block). • Place appropriate entries (e.g., before-images) into the Transient Journal buffer (actually a WAL buffer) and write it to the WAL log on disk. • Data blocks are updated in Memory, but not written immediately to disk. • The after-image or changed image (REDO row) is written to a WAL buffer which is written to the WAL log on disk. – WAL can batch up modifications from multiple transactions and apply them with a single disk I/O, thereby saving I/O operations. – Updated data blocks in memory will be eventually aged out and written to disk. • Make the changes to the Data Block in memory and determine the new block’s length. – If the New Block has changed size, always allocate a new Data Block. 
– If the New Block length = Old Block length, Teradata will attempt to update-in-place for any INSERT, DELETE, or UPDATE operations. These operations happen concurrently on the Fallback AMP.

File System Writes Page 14-5

New Row INSERT – Part 1
The facing page illustrates what happens when Teradata INSERTs a new row into a table. The three-part message is Table ID = 100, Partition # = 0, Row Hash = 1123, and PI Value = 7923.
The AMP uses its Master Index to locate the proper cylinder for the new row. As you can see, Cylinder #169 is where a row with Table ID = 100, Partition # = 0, and Row Hash = 1123 should be inserted. The next step is to access the Cylinder Index for Cylinder #169, as illustrated on the facing page.
Teradata's File System software does a binary search of the CIDs based on Table ID / Partition # / Row Hash to locate the cylinder number in which to insert the row. The CI for that cylinder is accessed to locate the data block.
Note: The Partition # (shown in the examples) does not exist in Teradata systems prior to V2R5.

Page 14-6 File System Writes

New Row Insert – Part 1
INSERT INTO employee VALUES (7923, . . . . );
(Facing-page slide: the three-part message – Table ID 100, Partition # 0, Row Hash 1123 for PI value 7923 – is searched for in the Master Index. Each Master Index entry carries the Lowest Table ID / Part # / Row ID, the Highest Table ID / Part # / Row Hash, and the Pdisk and Cylinder Number; the entry whose range covers Row Hash 1123 for Table ID 100 points to Cylinder 169. The Free Cylinder Lists for Pdisk 0 and Pdisk 1 are also shown. Part # = Partition Number.)
To CYLINDER INDEX

File System Writes Page 14-7

New Row INSERT – Part 2
The example on the facing page is a continuation from the previous page. Teradata has determined that the new row must be INSERTed into Cylinder #169 in this example.

Page 14-8 File System Writes

New Row Insert – Part 2
INSERT INTO employee VALUES (7923, . . . . );
(Facing-page slide: the Cylinder Index for Cylinder #169 – SRD #1 for Table ID 100 and its DBDs, each with Partition #, Lowest Row ID, Highest Row Hash, Start Sector, Sector Count, and Row Count, plus the Free Block List of free sector entries. The DBD whose row hash range covers 1123 is the 3-sector block starting at sector 0525.)
Read the block into memory (FSG cache).
To Data Block

File System Writes Page 14-9

New Row INSERT – Part 2 (cont.)
The example on the facing page is a continuation from the previous page. Teradata has determined that the new row hash value falls within the range of the data block that starts at sector 525 and is 3 sectors long.
If the block that has been read into memory (FSG Cache) has enough contiguous free bytes, then the row is inserted into this space within the block. The row reference array and the Cylinder Index are updated.
If the block that has been read into memory (FSG Cache) does not have enough contiguous free bytes, but it does have enough free bytes within the entire block, the software will defragment the block and insert the row. The row reference array and the Cylinder Index are updated. Note: The block header contains a field that indicates the total number of free bytes within the block. Also note that the Row Reference Array expands by 2 bytes to reflect the added row. If the block now has 5 rows, the Row Reference Array will increase from 8 bytes to 10 bytes in length. Acronyms: FS – Free Space RRA – Row Reference Array BT – Block Trailer Page 14-10 File System Writes New Row Insert – Part 2 (cont.) Read the block into memory (FSG cache). 1. If the block has enough free contiguous bytes, then insert row into block and update CI. 525 526 527 Block Header Row #1 Free Space RRA 525 526 Row #4 Row #3 FS Row #2 2. If the block has enough free bytes, then defragment the block and insert row into block and update CI. BT Block Header 527 Row #1 Free Space Row #4 New row to INSERT Block Header 526 527 Row #1 FS Free Space Row #4 RRA BT FS Block is defragmented and row is inserted into contiguous free space. 525 Row #4 526 Row #2 Row #3 New row Row #3 New row to INSERT Row is inserted into contiguous free space. 525 Row #2 RRA BT 527 Block Header Row #1 Row #3 New row Row #2 Row #4 FS RRA BT FS - Free Space; RRA - Row Reference Array; BT - Block Trailer File System Writes Page 14-11 New Row INSERT – Part 3 The File System then accesses the 3-sector block which starts at sector 525 and makes it available in AMP memory. The row is placed into the block, and the new block length is computed. In this example, inserting the row has caused the block to expand from 3 sectors to 4 sectors. Note that the Row Reference Array expands by 2 bytes to reflect the added row. If the block now has 5 rows, the Row Reference Array will increase from 8 bytes to 10 bytes in length. Acronyms: FS – Free Space RRA – Row Reference Array BT – Block Trailer Page 14-12 File System Writes New Row Insert – Part 3 3. If the new row is larger than the total free space within the block, then the Insert expands the block by as many sectors as needed (in memory) to hold the row. In this example, the block is expanded by one sector in memory. 525 Block Header 526 527 Row #1 Row #2 Row #4 Row #3 FS Free Space RRA BT Row #2 New row to INSERT Block Header In memory, block is expanded. Row #1 Row #3 Row #4 New row Free Space RRA BT 4. The next step is to locate the first block on the Free Block List equal to, or greater than 4 sectors. File System Writes Page 14-13 New Row INSERT – Part 4 The File System searches the Free Sector (or Block) List looking for the first Free Block whose size is equal to or greater than the new block’s requirement. It does not have to be an exact match. Upon finding a 5-sector free block starting at sector 0301, the system allocates a new 4-sector block (sectors 301, 302, 303, 304) for the new data block, leaving a free block of one sector (305) remaining. The new data block is written to disk. The old, 3-sector data block is placed onto the Free Sector List (or Free Block List). The modified CI will be copied to the buddy node (FSG Cache) and the modified CI will be written back to disk (eventually). If a transaction failure occurs (or the transaction is aborted), the Transient Journal is used to undo the changes to both the data blocks and the Cylinder Indexes. 
Before images of data rows are written to the Transient Journal. Before images of Cylinder Indexes are not written to the Transient Journal because Teradata uses the Alternate Cylinder Index for the changes. If a transaction fails, the before image in the Transient Journal is used to return the data row(s) back to the state before the transaction. Alternate Cylinder Index Starting with V2R6.2 and with WAL, space for 2 Cylinder Indexes (2 x 12 KB = 24 KB) is allocated at the beginning of every cylinder. Characteristics include: Page 14-14 Two Cylinder Indexes are used – Teradata alternates between the two Cylinder Indexes. Changes are written to an “Alternate Cylinder Index”. When a CI is changed, it is not updated in place. This provides for better I/O integrity. File System Writes New Row INSERT – Part 4 Cylinder Index - Cylinder #169 Free Block List Free Sector Entries SRDs Table ID First DBD DBD Offset Count SRD #1 100 FFFF 12 DBDs Part # Lowest Row ID Part # Highest RowHash Start Sector : DBD #5 DBD #6 DBD #7 DBD #8 : : 0 0 0 0 : : 00938, 1 00998, 1 01010, 3 01185, 2 : : 0 0 0 0 : : 00996 01010 01177 01258 : : 0093 0789 0525 0056 : Sector Row Count Count : 7 6 3 5 : : 10 8 4 6 : 1 Start Sector : 0270 0301 0349 0470 0481 0550 : Sector Count : 3 5 5 4 6 5 : Alternate Cylinder Index - Cylinder #169 Free Block List Free Sector Entries SRDs Table ID First DBD DBD Offset Count SRD #1 100 FFFF 12 DBDs Part # Lowest Row ID Part # : DBD #5 DBD #6 DBD #7 DBD #8 : : 0 0 0 0 : : 00938, 1 00998, 1 01010, 3 01185, 2 : : 0 0 0 0 : File System Writes Highest RowHash : 00996 01010 01177 2 01258 : Start Sector : 0093 0789 0301 0056 : Sector Row Count Count : 7 6 4 5 : : 10 8 3 5 6 : Start Sector : 0270 0305 0349 0470 0481 0525 0550 Sector Count : 3 2a 1 5 4 6 3 2b 5 Page 14-15 Blocking in Teradata Tables supporting Data Warehouse and Decision Support users generally have their block size set very large to accommodate more rows per block and reduce the number of block I/Os needed to do full table scans. Tables involved in online applications and heavy data maintenance generally have smaller block sizes. Extremely large rows, called Oversized Rows, are very costly. Each Oversized row requires its own block and costs one I/O every time it is touched. Oversized rows are common in non-relational data models and appear in poor relational data models. Page 14-16 File System Writes Blocking in Teradata Definitions Largest Data Block Size • The largest multi-row data block allowed. Impacts when a block split occurs. • Determined by: – – Table level attribute DATABLOCKSIZE System default - PermDBSize parameter (DBS Control) – 5555 default is 254 sectors (127 KB) Large (or typical) Row • The largest fixed length row that allows multiple rows/block. • Defined as ((Largest Block - 74 ) / 2); – Block header is 72 bytes and trailer is 2 b ytes. Oversized Row • A row that requires its own Data Block (one I/O per row): • A fixed length row that is larger than Large Row. Example: • Assume DATABLOCKSIZE = 65,024 (127 sectors x 512 bytes) – – – Largest Block = 65,024 bytes Large Row 32,475 bytes ((65,024 – 74) / 2) Oversize row > 32,476 bytes File System Writes Page 14-17 Block Size and Filling Cylinders Teradata supports a maximum block size of 255 sectors. With newer, larger, and faster systems, it typically makes sense to use a large block size for transactions that do full table or partition scans. A large block may help to minimize the number of I/Os needed to access a large amount of data. 
Therefore, it may seem that using the largest possible block size of 255 sectors would be a good choice. However, a maximum block size of 254 sectors is actually a better choice in most situations. Why? With 254 sector blocks, a cylinder can hold 15 blocks. With 255 sector blocks, a cylinder can only hold 14 blocks. Why? A cylinder consists of 3872 sectors and 48 sectors are used for the cylinder indexes. The available space for user data blocks is 3872 – 48 = 3824 sectors. 3824 ÷ 254 = 15.055 or 15 blocks 3824 ÷ 255 = 14.996 or 14 blocks 15 x 254 = 3810 sectors of a cylinder are utilized or 99.6% 14 x 255 = 3570 sectors of a cylinder are utilized or only 93.4% Assume an empty staging table and using FastLoad to load data into the table. With 255 sector blocks, the table will use 6% more cylinders to hold the data. By using a system default (PermDBSize) or data block size (DATABLOCKSIZE) of 254 sectors will effectively utilize the space in cylinders more efficiently than 255 sector blocks. The same is true if you are considering 127 or 128 sector blocks. 127 sector blocks – cylinder can hold 30 blocks – utilize 99.6% of cylinder 128 sector blocks – cylinder can hold 29 blocks – utilize 97.1% of cylinder Therefore, 127 or 254 sector blocks are typically better choices. A greater percentage of cylinder space can be utilized with these choices. Page 14-18 File System Writes Block Size & Filling Cylinders What is the difference between choosing maximum data block size of 254 or 255 sectors? • With 254 sector blocks, a cylinder can hold 15 blocks. • With 255 sector blocks, a cylinder can only hold 14 blocks. Why? • A cylinder consists of 3872 sectors and 48 sectors are used for the cylinder indexes. – 3824 254 = 15.055 or 15 blocks – 3824 255 = 14.996 or 14 blocks – 15 x 254 = 3810 sectors of a cylinder are utilized or 99.6% – 14 x 255 = 3570 sectors of a cylinder are utilized or only 93.4% • Assume an empty staging table and using FastLoad to load data into the table. With 255 sector blocks, the table will use 6% more cylinders. What about 127 and 128 sector blocks? • With 127 sector blocks, a cylinder can hold 30 blocks – utilize 99.6% of cylinder • With 128 sector blocks, a cylinder can hold 29 blocks – utilize 97.1% of cylinder Therefore, 127 or 254 sector blocks are typically better choices for PermDBSize and/or data block sizes. A greater percentage of cylinder space can be utilized with these choices. File System Writes Page 14-19 Variable Block Sizes The Teradata RDBMS supports true variable block sizes. The illustration on the facing page shows how blocks can expand to accommodate additional rows as they are INSERTed. As rows are INSERTed, the Reference Array Pointers are placed into Row ID sequence. REMEMBER Large rows require more disk space for Transient Journal, Permanent Journal, and Spool files. Page 14-20 File System Writes Variable Block Sizes • When inserting rows (ad hoc SQL or TPump), the block expands as needed to accommodate them. • The system maintains rows within the block in logical ROW ID sequence. • Large rows take more disk space for Transient Journal, Permanent Journal, and Spool files. • Blocks are expanded until they reach “Largest Block Size”. At this point, a Block Split is attempted. 1 Sector 1 Sector 2 Sector 2 Sector 2 Sector 3 Sector Block Block Block Block Block Block Row Row Row Row Row Row Row Note: Row Row Row Row Row Row Row Row Row Row Row Row Row Row Rows do NOT have to be contiguous in a data block. 
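The block-size choices discussed above are set with the table-level DATABLOCKSIZE attribute (or taken from the PermDBSize system default). The DDL below is only a sketch, not taken from the course labs: the table and column names are illustrative, and 130,048 bytes is simply 254 sectors x 512 bytes, the value the cylinder arithmetic above favors.

CREATE TABLE Sales_History ,FALLBACK
     ,DATABLOCKSIZE = 130048 BYTES          -- 254 sectors x 512 bytes
     ( sale_id      INTEGER NOT NULL
     , sale_date    DATE
     , sale_amount  DECIMAL(12,2) )
UNIQUE PRIMARY INDEX ( sale_id );

-- An existing table can be re-blocked later; IMMEDIATE rewrites the existing
-- data blocks rather than affecting only blocks written in the future:
ALTER TABLE Sales_History, DATABLOCKSIZE = 130048 BYTES IMMEDIATE;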
File System Writes Page 14-21 Block Splits (INSERT and UPDATE) Block splits occur during INSERT and UPDATE operations. Normally, when a data block expands beyond the maximum multi-row block size (Largest Block), it splits into two approximately equal-sized blocks. This is shown in the upper illustration on the facing page. If an Oversize Row is INSERTed into a data block, it causes a three-way block split (as shown in the lower illustration). This type of block split may result in uneven block sizes. With Teradata, block splits cost only one additional I/O per extra block created. There is little impact on OLTP and OLCP performance. Block splits automatically reclaim any contiguous, unused space greater than 511 bytes. Page 14-22 File System Writes Block Splits (INSERT and UPDATE) Two-Way Block Splits • When a Data Block expands beyond Largest Block, it splits into two, fairly-equal blocks. • This is the normal case. Three-Way Block Splits • An oversize row gets its own Data Block. The existing Data Block splits at the row’s logical point of existence. • This may result in uneven block sizes. New Row Oversized Row New Row Oversized Row Notes: • Block splits automatically reclaim any unused space over 511 bytes. • While it is not typical to increment blocks by one 512 sector, it is tunable as to how many sectors are acquired at a time for the system. File System Writes Page 14-23 Space Fragmentation Space fragmentation is not an issue in the Teradata database because the system collects free blocks as a normal part of routine table maintenance. If a block of sectors is freed up and is adjacent to already free sectors in the cylinder, these are combined into one entry on the free block list. As previously described, when an actual data block has to grow, it does not grow into adjacent free blocks – a new block is assigned from the free block list. The freed up data block (set of sectors) is placed on the free block (or segment) list. If there is already an entry on the free block list representing adjacent free blocks, then the freed up data block is combined with adjacent free sectors and only one entry is placed on the free block list. Using the example on the facing page, assume we are looking at a 40-sector portion of a cylinder. These sectors are physically adjacent to each other. The free block list would have 2 entries on it – one representing the 4 unused sectors and a second entry representing the 6 unused sectors. We will now consider 4 situations. First case – If the first 10-sector data block is freed up, software will not place an entry on the free block list for just these 10 sectors. Software will effectively combine these 10 sectors with the following adjacent free 4 sectors and place one entry representing the 14 free sectors on the free block list. For this 40-sector portion of a cylinder, there will be 2 entries on the free block list – one for the first 14 unused sectors and a second entry for the 6 unused sectors that are still there. Second case – If the middle 12-sector data block is freed up, software will not place an entry on the free block list for just these 12 sectors, but will effectively combine these 12 sectors with the previous adjacent 4 free sectors and with the following 6 free adjacent sectors, effectively represented by one entry for 22 free sectors. For this 40-sector portion of a cylinder, there will be one entry on the free block list showing that 22 sectors that are free. 
Third case – If the last 8-sector data block is freed up, software will not place an entry on the free block list for just these 8 sectors, but will effectively combine these 8 sectors with the previous adjacent 6 free sectors. One entry representing the 14 free sectors is placed on the free block list. For this 40-sector portion of a cylinder, there will be 2 entries on the free block list – one for the first 4 unused sectors and a second entry for the 14 unused sectors. Fourth case – If there is no entry on the free block list large enough to meet a request for a new block, Teradata’s file system software may choose to dynamically defragment the cylinder. In this case, all free sectors are combined together at the end of a new cylinder and one entry for the free space (sectors) is placed on the free block list. Defragmentation is actually done in the new cylinder and the existing cylinder is placed in the free cylinder list. Page 14-24 File System Writes Space Fragmentation • The system collects free blocks as a normal part of table maintenance. • Smaller Free Blocks become larger when adjacent blocks become free, or when defragmentation is performed on the cylinder. This 10 Sector Block 4 Unused Sectors becomes OR 14 Unused Sectors OR 10 Sector Block 8 Sector Block This example represents a 40 sector portion of a cylinder; 30 sectors have data and 10 sectors are unused. File System Writes 10 Sector Block 4 Unused Sectors 12 Sector Block 6 Unused Sectors OR 12 Sector Block 22 Unused Sectors 6 Unused Sectors 8 Sector Block 1st used block is freed up 2nd used block is freed up 10 Sector Block 12 Sector Block 12 Sector Block 8 Sector Block 14 Unused Sectors 8 Sector Block OR 3rd used block is freed up 10 Unused Sectors After Defragmentation Page 14-25 Cylinder Full A Cylinder Full condition occurs when there is no block on the Free Block List that has enough sectors to accommodate additional data during an INSERT or UPDATE. If this condition occurs, the File System goes through the steps outlined on the facing page which results in a Cylinder Migrate to an existing adjacent cylinder or to a new cylinder. As part of this process, the file system software may also choose to perform a Cylinder Defragmentation or a Mini Cylinder Pack (Mini-Cylpack) operation. A Mini-Cylpack is a background process that occurs automatically when the number of free (or available) cylinders falls below a threshold. The mini-Cylpack process is the mechanism that Teradata uses to rearrange data blocks to free cylinders. This process involves moving data blocks from a data cylinder to the logically preceding data cylinder until a whole cylinder becomes empty. Mini-Cylpack is an indication that the system does not have enough free space to handle its current workload. In the example at the bottom of the facing page, if Cylinder 37 became full, the File System would check Cylinder 204 and Cylinder 169 to see if they had enough room to perform a Cylinder Migrate. These two cylinders are logically adjacent to Cylinder 37 in the Master Index, but not necessarily physically adjacent on the disk. During the Cylinder Migrate, if data blocks were moved to Cylinder 204, they would be taken from the top of Cylinder 37. If they were moved to Cylinder 169, they would be taken from the bottom of Cylinder 37. Note: Performance tests show that defragging can cause a significant performance hit. 
Therefore, the default tuning parameters that control how often you do this are set to only defragment cylinders if there are very few free cylinders left (<= 100) and the cylinder has quite a bit of free space that isn’t usable (>= 25%). The latter indicates that, although there is significant free space on the cylinder, the free space is apparently so fragmented that a request for new sectors couldn’t be satisfied. Otherwise, it’s assumed that the cylinder is full and the overhead of defragging it wouldn’t be worth it. Page 14-26 File System Writes Cylinder Full Cylinder Full means there is no block big enough on the Free Block List. The File System does either of the following: • Cylinder Migrate to an adjacent cylinder — checks logically adjacent cylinders for fullness. If it finds room, it moves a maximum of 10 data blocks from the full cylinder to an adjacent one. • Cylinder Migrate to a new Cylinder — looks for a free cylinder, allocates one, and moves a maximum of 10 data blocks from the congested cylinder to a new one. While performing a Cylinder Migrate operation, the File System software may also do the following operations in the background. • Cylinder Defragmentation — if the total cylinder free space 25% of the cylinder size (25% is default), then the cylinder is defragmented. Defragmentation collects all free sectors at the end of a new cylinder by moving all the data blocks to the top of the new cylinder. • Mini-Cylpack — if the number of free cylinders falls below a threshold (default is 10), then a "MiniCylpack" is performed to pack data together to free up a cylinder and place it on the free cylinder list. Master Index Lowest Highest Pdisk and Table ID Part # Row ID Table ID Part # Row Hash Cylinder Number : 078 098 100 : : 0 0 0 : File System Writes : 58234, 2 00107, 1 00773, 3 : : 095 100 100 : : 0 0 0 : : 72194 00676 01361 : : 204 037 169 : Free Cylinder List Pdisk 0 Free Cylinder List Pdisk 1 : 124 125 168 : : 761 780 895 : Page 14-27 Mini-Cylpack The Mini-Cylpack is the mechanism that Teradata uses to rearrange data blocks to free cylinders. The process involves moving data blocks from a data cylinder to the logically preceding data cylinder until a whole cylinder becomes empty. A Mini-Cylpack is an indication that the system does not have enough free space to handle its current workload. Excessive numbers of Mini-Cylpacks indicate too little disk space is available and/or too much spool is being utilized during data maintenance. Spool cylinders are never “Cylpacked”. Teradata has a Free Space (a percentage) parameter that can be set to control how much free space is left in a cylinder during loading and the use of the Ferret PackDisk utility. This parameter is not used with mini-cylpacks. This parameter should be set low (close to 0%) for systems which are used solely for Decision Support as there is no data maintenance involved. In cases where there is moderate data maintenance (batch or some OLTP), the Free Space parameter should be set at approximately 25%. If heavy data maintenance is to be done (OLTP), the Free Space parameter may have to be set at approximately 50% to prevent Cylpacks from affecting OLTP response times. The Free Space parameter can be set at the system level, at a table level, and when executing the Ferret PackDisk utility. 
DBSControl – FREESPACEPERCENT (0% is the default)
CREATE TABLE – FREESPACE = integer [PERCENT] (0 – 75)
FERRET PACKDISK – FREESPACEPERCENT (or FSP) integer

The system administrator can specify a count of empty cylinders the system should attempt to maintain. Whenever a Cylinder Migrate to a new cylinder occurs, the system checks to see if the minimum number of empty cylinders still exists. If the system has dropped below the minimum, it starts a background task that begins packing cylinders. The task stops when either a cylinder is added to the Free Cylinder List or it has packed 10 cylinders. This process continues with every Cylinder Migrate to a new cylinder until the minimum count of empty cylinders is reached, or a full mini-cylpack is required.

Page 14-28 File System Writes

Mini-Cylpack
(Facing-page slide shows BEFORE and AFTER views of a cylinder pack.)
A Mini-Cylpack moves data blocks from the data cylinder(s) to logically preceding data cylinder(s) until a single cylinder is empty.
• Spool cylinders are never cylpacked.
• Mini-Cylpacks indicate that the system does not have space to handle its current workload.
• Excessive Cylpacks indicate too little disk space and/or spool utilization during data maintenance.
The Free Space parameter impacts how full a cylinder is filled with data loading and PackDisk.
• DBSControl – FREESPACEPERCENT
• CREATE TABLE – FREESPACE
• FERRET PACKDISK – FREESPACEPERCENT (FSP)

File System Writes Page 14-29

Space Utilization
The Teradata Database can use any particular cylinder to either store data or hold Spool files. A cylinder cannot be used for both data and Spool simultaneously. In sizing a system, you must make certain that you have enough cylinders to accommodate both requirements. Limiting the number of rows and columns per query helps keep Spool requirements under control, as does keeping the number of columns per row to a minimum. Both can result from proper normalization.

Teradata 13.10 Auto Cylinder Pack Feature
One new background task in Teradata 13.10 is called AutoCylPack, which attempts to combine adjacent, sparsely filled cylinders. These cylinder packs are typically executed when the system is idle. AutoCylPack is particularly useful if a customer is using temperature-based BLC, because it cleans up post-compression cylinders that are no longer holding as much data. However, this feature works with compressed as well as uncompressed cylinders. Sometimes the activity of AutoCylPack can result in seeing a little bit of wait I/O (less than 5%). File System Field 17 (DisableAutoCylPack) has a default value of FALSE, which means AutoCylPack is on and active all the time, unless you change this setting.
General notes: There are a number of background tasks running in the Teradata database, and AutoCylPack is another of these tasks. These tasks include deadlock detection, cylinder defragmentation, transient journal purging, periodic cache flushes, etc. These tasks generally consume a small amount of system resources. However, you will tend to notice them more when the system is idle.

Page 14-30 File System Writes

Space Utilization
Space being used is managed via the Master Index and Cylinder Indexes. Cylinders not being used are listed in Free Cylinder Lists. Free sectors within cylinders are listed in Free Block Lists.
Cylinders contain Perm, Spool, Temporary, Permanent Journal, or WAL data, but NOT a combination. BE SURE THAT YOU HAVE ENOUGH SPACE OF EACH.
Limiting the rows and columns per query reduces spool use.
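The table-level form of the Free Space parameter listed above can be expressed directly in DDL, and current space use can be checked in the data dictionary. This is only a sketch: the table name is hypothetical, the 25 percent value simply follows the moderate-maintenance guideline given earlier, and the DBSControl and Ferret PackDisk settings themselves are changed through those utilities rather than through SQL.

CREATE TABLE Orders_Work ,FALLBACK
     ,FREESPACE = 25 PERCENT        -- leave 25% of each cylinder unused at load time
     ( order_id    INTEGER NOT NULL
     , order_date  DATE )
PRIMARY INDEX ( order_id );

ALTER TABLE Orders_Work, FREESPACE = 25 PERCENT;   -- change the attribute later if needed

-- Check current perm and spool use by database from the dictionary:
SELECT DatabaseName
     , SUM(CurrentPerm)  AS current_perm
     , SUM(CurrentSpool) AS current_spool
FROM   DBC.DiskSpaceV
GROUP BY 1
ORDER BY 1;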
File System Writes Page 14-31

Merge Datablocks (13.10 Feature)
This Teradata Database 13.10 feature automatically searches for “small” data blocks within a table and will combine (merge) these small datablocks into a single larger block.
Over time, modifications to a table (especially with DELETEs of data rows) can result in a table having blocks that are less than 50% of the maximum datablock size. This File System feature combines these small blocks into a larger block. The benefit is simply that future full table operations (full table scans and/or full table updates) will perform faster because fewer I/Os are performed. By having larger blocks in the Teradata file system, the selection of target rows can also be more efficient.
Blocks that are 50% or greater of the maximum multi-row datablock size (63.5 KB in this example) are not considered to be small blocks. Small blocks are less than 50% of the maximum datablock size. The merge of multiple small blocks into a larger block is limited by cylinder boundaries – it does not occur between cylinders.
A maximum of 7 logically adjacent preceding blocks are merged into a target block when the target block is updated. Therefore, a maximum of 8 total blocks can be merged together.
Why are logically following blocks NOT merged together? The File System software does not know if following blocks are going to be immediately updated; skipping them reduces the performance impact during dense sequential updates.
How does a table get to the point of having many small blocks? DELETEs from this table can cause blocks to permanently shrink to a much smaller size unless a large amount of data is added again.
How have customers resolved this problem before Teradata 13.10? The ALTER TABLE command can be used to re-block a table. This technique can be time consuming and requires an exclusive table lock. This technique is still available with Teradata 13.10.
– ALTER TABLE DATABLOCKSIZE = IMMEDIATE
If this feature is enabled, the merge of small data blocks into a larger block runs automatically during full table SQL write operations. This feature can merge datablocks for the primary/fallback subtables and all of the index subtables. This feature runs automatically when the following SQL functions are executed: INSERT-SELECT, UPDATE-WHERE, and DELETE-WHERE (used on both permanent table and permanent journal datablocks), and during the DELETE phase of the Reconfig utility on source AMPs.

Page 14-32 File System Writes

Merge Datablocks (Teradata 13.10)
This Teradata Database 13.10 feature automatically searches for “small” data blocks within a table and will combine (merge) these small datablocks into a single larger block.
• Over time, modifications to a table (especially with DELETEs of data rows) can result in a table having blocks that are less than 50% of the maximum datablock size.
• Up to 8 datablocks can be merged together.
If enabled, the merge of small data blocks into a larger block runs automatically during full table SQL write operations. This feature can merge datablocks for the primary/fallback subtables and all of the index subtables.
• INSERT-SELECT
• UPDATE-WHERE
• DELETE-WHERE
How have customers resolved this problem before Teradata 13.10?
• The ALTER TABLE command can be used to re-block a table. This technique can be time consuming and requires an exclusive table lock. This technique is still available with Teradata 13.10.
ALTER TABLE DATABLOCKSIZE = IMMEDIATE;

File System Writes Page 14-33

Merge Datablocks (Teradata 13.10) cont.
How to use this Feature Defaults for this feature can be set at the system level via DBSControl settings and can be overridden with table level attributes. The CREATE TABLE and ALTER TABLE commands have options to enable or disable this feature for a specific table. The key parameter that controls this feature is MergeBlockRatio. This parameter can be set at the system level and also as a table level attribute. MergeBlockRatio has the following characteristics: Limits the resulting size of a merged block. Reduces the chances that a merged block will split again soon after it is merged, defeating the feature’s purpose. Computed as a percentage of the maximum multi-row datablock size for the associated table. Candidate merged block must be smaller than this computed size after all target row updates are completed. Source blocks are counted up as eligible until the size limit is reached (zero to 8 blocks can be merged together). The default system level percentage is 60% and can be changed. CREATE TABLE or ALTER TABLE options DEFAULT MERGEBLOCKRATIO – Default option on all CREATE TABLE statements MERGEBLOCKRATIO = integer [PERCENT] – Fixed MergeBlockRatio used for full table modification operations – Overrides the system default value NO MERGEBLOCKRATIO – Disables merges completely for the table DBSControl FILESYS Group parameters 25. DisableMergeBlocks (TRUE/FALSE, default FALSE) – Disables feature completely across the system, even for tables with a defined MergeBlockRatio as a table level attribute. – Effective immediately – does not require a Teradata restart (tpareset) 26. MergeBlockRatio (1-100%, default 60%) – Default setting for any table – this can be overridden at the table level. – Ignored when DisableMergeBlocks is TRUE (FILESYS Flag #25) – This is not stored in or copied to table header – Effective immediately without a tpareset Page 14-34 File System Writes Merge Datablocks (Teradata 13.10) cont. This feature is automatically enabled for new Teradata 13.10 systems, but must be enabled for existing systems upgraded to 13.10. Defaults for this feature are set via DBSControl settings. • • System defaults will work well for most tables. • The key parameter is MergeBlockRatio. The CREATE TABLE and ALTER TABLE commands have options to enable/disable this feature for a specific table or change the default ratio. MergeBlockRatio has the following characteristics: • • • The default system level percentage is 60%. Computed as a percentage of the maximum multi-row datablock size for the associated table. Candidate merged block must be smaller than this computed size after all target row updates are completed. CREATE TABLE or ALTER TABLE options • DEFAULT MERGEBLOCKRATIO – Default option on all CREATE TABLE statements • MERGEBLOCKRATIO = integer [PERCENT] – Fixed MergeBlockRatio used for full table modification operations • NO MERGEBLOCKRATIO – Disables merges completely for the table File System Writes Page 14-35 File System Write Summary Regardless of how large your tables get, or how many SQL-based INSERTs, UPDATEs or DELETEs are executed, the process is the same. This module has discussed in some detail the sequence of steps that Teradata’s file system software will attempt in order to complete the write operation. The facing page summarizes some of the key topics discussed in this module. Page 14-36 File System Writes File System Write Summary Teradata’s file system software automatically maintains the logical sequence of data rows within an AMP. 
• The logical sequence is based on tableid, partition #, and rowid. For write (INSERT, UPDATE, or DELETE) operations: • Read the Data Block if not present in memory. • Place appropriate entries into the Transient Journal buffer (WAL buffer). • Make the changes to the Data Block in memory and determine the new block’s length. • If the new block has changed size, allocate a new Data Block. Blocks will grow to the maximum data block size determined by the DATABLOCKSIZE table attribute, and then be split into smaller blocks. • Blocks will vary in size with Teradata. • For a table that has been updated with "ad hoc" or TPump INSERTs, UPDATEs, or DELETEs, a typical block size for the table will be approximately 75% of the maximum data block size. If the Write operation fails, the file system does a rollback using the Transient Journal. File System Writes Page 14-37 Module 14: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 14-38 File System Writes Module 14: Review Questions 1. When Teradata INSERTs a new row into a table, it first goes to the _________ to locate the proper cylinder for the new row. a. Cylinder Index b. Fallback AMP c. Free Cylinder List d. Master Index 2. When a new block is needed, the File System searches the Free Block List looking for the first Free Block whose size is equal to, or greater than the new block’s requirement. It does not have to be an exact match. a. True b. False 3. Name the condition which occurs when there is no block on the Free Block List with enough sectors to accommodate the additional data during an INSERT or UPDATE. a. Mini Cylinder Pack b. Cylinder Migrate to a new cylinder c. Cylinder Migrate to an adjacent cylinder d. Cylinder Full 4. The ______________ parameter can be set to control how completely cylinders are filled during loading and PackDisk. a. Free Space Percent b. DataBlockSize c. PermDBSize d. PermDBAllocUnit File System Writes Page 14-39 Module 14: Review Questions (cont.) Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 14-40 File System Writes Module 14: Review Questions (cont.) 5. Number the following steps in sequence from 1 to 6 that the File System software will attempt to perform in order to insert a new row into an existing data block. ____ Perform a Cylinder Migrate operation to an adjacent cylinder ____ Simply insert the row into data block if enough contiguous free bytes in the block ____ Perform a Block split ____ Perform a Cylinder Migrate operation to a new cylinder ____ Defragment the block and insert the row ____ Expand or grow the block to hold the row 6. As part of a cylinder full condition, if the number of free sectors within a cylinder is greater than 25%, what operation will Teradata perform in the background? ___________________ 7. If the number of free cylinders falls below a minimum threshold, what operation will Teradata perform in the background? ___________________ File System Writes Page 14-41 Notes Page 14-42 File System Writes Module 15 Teradata SQL Assistant After completing this module, you will be able to: Define an ODBC data source for Teradata. Submit SQL using SQL Assistant. Utilize Explorer Tree to simplify creation of queries. Use SQL Assistant to import/export a LOB. 
Teradata Proprietary and Confidential Teradata SQL Assistant Page 15-1 Notes Page 15-2 Teradata SQL Assistant Table of Contents SQL Assistant ............................................................................................................................ 15-4 Defining a Data Source .............................................................................................................. 15-6 Compatibility ..................................................................................................................... 15-6 Defining a Teradata .Net data source ................................................................................. 15-6 Defining a Data Source (cont.) .................................................................................................. 15-8 Defining an ODBC Data Source ............................................................................................ 15-8 Defining a Data Source (cont.) ................................................................................................ 15-10 ODBC Driver Setup for LOBs ............................................................................................. 15-10 Connecting to a Data Source .................................................................................................... 15-12 Main Window........................................................................................................................... 15-14 Database Explorer Tree ............................................................................................................ 15-16 Creating and Executing a Query .............................................................................................. 15-18 Creating statements (single and multi-queries) ................................................................ 15-18 Dragging Object Names to the Query Window ....................................................................... 15-20 Dragging Multiple Objects............................................................................................... 15-20 Query Options .......................................................................................................................... 15-22 To submit any part of any query .......................................................................................... 15-22 Clearing the Query Window ................................................................................................ 15-22 Formatting a Query .............................................................................................................. 15-22 Viewing Query Results ............................................................................................................ 15-24 Sorting an Answerset Locally .............................................................................................. 15-24 Formatting Answersets ............................................................................................................ 15-26 Using Query Builder ................................................................................................................ 15-28 Description of the Options ............................................................................................... 15-28 History Window ....................................................................................................................... 
15-30 General Options ....................................................................................................................... 15-32 Connecting to Multiple Data Sources ...................................................................................... 15-34 Additional Options ................................................................................................................... 15-36 Importing/Exporting Large Object Files .................................................................................. 15-38 Teradata SQL Assistant 12.0 Note ....................................................................................... 15-38 Importing/Exporting Large Object Files .................................................................................. 15-40 To Import a LOB into Teradata ............................................................................................... 15-40 Selecting from a Table with a LOB ......................................................................................... 15-42 Displaying a JPG within SQL Assistant .................................................................................. 15-44 Teradata SQL Assistant Summary ........................................................................................... 15-46 Module 15: Review Questions ................................................................................................. 15-48 Lab Exercise 15-1 .................................................................................................................... 15-50 Lab Exercise 15-1 (cont.) ..................................................................................................... 15-52 Lab Exercise 15-1 (cont.) ..................................................................................................... 15-56 Teradata SQL Assistant Page 15-3 SQL Assistant Teradata SQL Assistant is an information discovery tool designed for the Windows operating system (e.g., Windows 7). Teradata SQL Assistant retrieves data from any ODBC-compliant database server. The data can then be manipulated and stored on the desktop PC. Teradata SQL Assistant is a query tool written for relational database developers. It is intended for SQL-proficient developers who know how to formulate queries for processing on Teradata or other ODBC-compliant Databases. Used as a discovery tool, Teradata SQL Assistant catalogs submitted instructions to arrive at a derived answer. Teradata SQL Assistant stores the history of your SQL in a local Microsoft Access database table. This history is available in future executions of Teradata SQL Assistant. Teradata SQL Assistant accepts standard Teradata SQL, DDL, and DML. In addition, Teradata SQL Assistant sends native SQL to any other database that provides an ODBC driver. If the driver supports the statements, they are processed correctly. Key features of SQL Assistant include: Create reports from any Relational Database that provides an ODBC interface Export data from the database to a file on a PC Import data from a PC file directly to the database Use an import file to create many similar reports (query results or Answer sets). 
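A simple first use of the Query window is the SHOW statement mentioned above for viewing object DDL. The statements below are placeholders only; substitute a database and table that exist on your training system (the names can be dragged in from the Database Explorer Tree).

SHOW TABLE Personnel.Employee;          -- returns the CREATE TABLE text for the object

SELECT employee_number, last_name
FROM   Personnel.Employee
ORDER BY last_name;                     -- results appear in an Answerset window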
Send queries to any supported database or the same query to many different databases Create a historical record of the submitted SQL with timings and status information such as success or failure Use the Database Explorer Tree to easily view database objects Use a procedure builder that gives you a list of valid statements for building the logic of a stored procedure Limit data returned to prevent runaway queries Teradata SQL Assistant also benefits database administrators by allowing them to directly issue SHOW statements to view text for CREATE or REPLACE commands. The DBA copies the text to the Query window, uses the Replace function to change a database name, and reissues the CREATE or REPLACE to define a new object with this new name. You can also display the CREATE text by going to the shortcut menu of the Database Explorer Tree and clicking Show Definition. Page 15-4 Teradata SQL Assistant SQL Assistant Features SQL Assistant is a Windows-based utility for submitting SQL to Teradata. SQL Assistant has the following properties: • Windows-based • Two providers are available for Teradata connections: – Teradata ODBC Driver – Teradata .Net Data Provider • Can be used to access other supported ODBC-compliant databases. • Permits retrieval of previously used queries (History). – Saves information about previous query result sets. • Supports DDL, DML and DCL commands. – Query Builder feature allows for easy creation of SQL statements. • • • • Provides both import and export capabilities to files on a PC. Provides a Database Explorer Tree to easily view database objects. Does not support non-ODBC compliant syntax such as WITH BY and FORMAT. Teradata Studio Express is a newer name for SQL Assistant Java Edition. – Targeted to Java developers who are familiar with Eclipse Teradata SQL Assistant Page 15-5 Defining a Data Source Before using Teradata SQL Assistant to access the Teradata database, you must first install the Teradata ODBC driver on your PC and the .Net Data Provider for Teradata. When connecting to a Teradata database, you can use either ODBC or the .Net Data Provider for Teradata. Connection to any other database must be made through an ODBC connection. In order to use the ODBC connection, a vendor specific ODBC driver must be installed. Before you can use Teradata SQL Assistant, you will need to define a “data source”, namely the instance of the database you wish to work with. Compatibility Teradata SQL Assistant is certified to run with any Level 2 compliant 32-bit ODBC driver. The product also works with Level 1 compliant drivers, but may not provide full functionality. Consult the ODBC driver documentation to determine the driver's conformance level. Most commercially available ODBC drivers conform to Level 2. Defining a Teradata .Net data source Use the Connection Information dialog to create, edit and delete data sources for .Net for Teradata. This dialog box is also used to connect to a .Net data source. To define a Teradata .Net data source 1. Open Teradata SQL Assistant. 2. Select Teradata .Net from the provider drop down list. 3. Click the Connect icon or go to Tools > Connect. 4. Use the Connection Information dialog to choose a .Net data source. 5. Create a new data source by entering the name and server and other applicable information Note: This module will illustrate the screens defining an ODBC data source. 
The specific screens defining a Teradata .Net data source are not provided in this module, but are similar Page 15-6 Teradata SQL Assistant Defining a Data Source You can define an ODBC data source in these ways: • SQL Assistant (select Connect icon) • Select Tools > Define ODBC Data Source or • ODBC Data Source Administrator Program SQL Assistant has 2 provider options: • Teradata .Net Data Provider • ODBC Select the System DSN tab and click on Add to create a new data source. If using ODBC Administrator (not shown), select the Machine Data Source tab and click on Add to create a new data source. Teradata SQL Assistant Page 15-7 Defining a Data Source (cont.) When connecting to the Teradata database, use either the ODBC or the Teradata .Net Data Provider. Connection to any other database must be made through an ODBC connection. Defining an ODBC Data Source An ODBC-based application like Teradata SQL Assistant accesses the data in a database through an ODBC data source. After installing Teradata SQL Assistant on a workstation or PC, start Teradata SQL Assistant. Next, define a data source for each database. The Microsoft ODBC Data Source Administrator maintains ODBC data sources and drivers and can be used to add, modify, or remove ODBC drivers and configure data sources. An About Box for each installed ODBC driver provides author, version number, module size, and release date. To define an ODBC data source, do one of the following: From the Windows desktop, select … Start > Control Panel > Administrative Tools > Data Sources (ODBC) From the Windows desktop, select Start > Programs > Teradata SQL Assistant After SQL Assistant launches, select Tools > Define Data Source Use the Connect icon from SQL Assistant and complete the dialog boxes. In the “Define Data Source” dialog, decide what type of data source you wish to create: Data Source Description Explanation A User Data Source can be used only by the current Windows user An ODBC user data source stores information about how to connect to the indicated data provider. A user data source is only visible to you. A System Data Source can be used by any user defined on your PC. An ODBC system data source stores information about how to connect to the indicated data provider. A system data source is visible to all users on this machine, including NT services. Page 15-8 Teradata SQL Assistant Defining a Data Source (cont.) If using ODBC Administrator, you will be given the user/system data source screen as shown to the left. You will not get this display if defining your ODBC data source via SQL Assistant. Select Teradata as the driver and click Finish on the confirmation screen. Teradata SQL Assistant Page 15-9 Defining a Data Source (cont.) A dialog box (specific to Teradata) is used to define the Teradata system you wish to access. Select This Field... To... Name Enter a name that identifies this data source. You can also enter the name of the system or the logon you will be using. Description Enter a description. This is solely a comment field to describe the data source name you used. Name(s) or IP address(es) Enter the name(s) or IP address(es) of the Teradata Server of your Teradata system. Identify the host by either name (alias) or IP address. The setup routine automatically searches for other systems that have similar name aliases. Multiple server names may be entered by pulling the entries on separate lines within this box. 
Do not resolve alias name to IP address When this option is checked, setup routine does not attempt to resolve alias names entered into the "Name(s) and IP address(es)" box at setup time. Instead it will be resolved at connect time. When unchecked, the setup routine automatically appends COPn (where n = 1, 2, 3, ..., 128) for each alias name you enter. Use Integrated Security Select this option if will be logging on using integrated security measures. Mechanism Select from the list of mechanisms that automatically appear in this box. Leave this field blank to use the default mechanism. Parameter The authentication parameter is a password required by the selected mechanism. Username Enter a user name. Password Enter a password to be used for the connection if you intend to use Teradata SQL Assistant in an unattended (batch) mode. Entering a password here is not very secure and is normally not recommended. Default Database Enter the default database you want this logon to use. If the Default Database is not entered, the Username is used as the default. Account String You can optionally enter one of the accounts that assigned to your Username. Session Character Use the drop down menu to choose the character set. The default is ASCII. Set ODBC Driver Setup for LOBs When defining the ODBC Data Source, from the ODBC Driver Setup screen, use the Options button to display the Teradata ODBC Driver Options screen and verify that the option - Use Native Large Object Support – is checked. Page 15-10 Teradata SQL Assistant Defining a Data Source (cont.) To access LOBs with SQL Assistant, … 1) Click on the Options button. 2) Verify that "Use Native Large Object Support" option box is checked. Teradata SQL Assistant Page 15-11 Connecting to a Data Source Connecting to a data source is the equivalent of “logging on” with SQL Assistant. You may choose from any previously defined data source. When the connection is complete, the Connect icon is disabled and the Disconnect icon, to its right, is enabled. To connect to multiple data sources: 1. 2. 3. Go to the Tools > Options > General tab. Click Allow connections to multiple data sources (Query windows), Follow the procedure for connecting to a data source. Each new data source appears in the Database Explorer Tree and opens a new query window with the data source name. To disconnect from one data source, click the Query window that is connected to the data source and click the disconnect icon. Page 15-12 Teradata SQL Assistant Connecting to a Data Source 1. Click on the Connection icon to connect to Teradata. Provider options are Teradata .NET or ODBC. 2. Select a data source. 3. Complete the logon dialog box. Teradata SQL Assistant Page 15-13 Main Window The Query window is where you enter and execute a query. The results from your query are placed into one or more Answerset windows. The Answerset window is a table Teradata SQL Assistant uses to display the output of a query. The History window is a table that displays your past queries and related processing attributes. The past queries and processing attributes are stored locally in a Microsoft Access database. This gives you flexibility to work with previous SQL statements in the future. The Database Explorer Tree displays on the left side of the main Teradata SQL Assistant window. It displays an alphabetical listing of databases and objects in the connected Teradata server. You can double-click on a database name to expand the tree display for that database. 
You can use the Database Explorer Tree to reduce the time required to build a query and help reduce errors in object names. The Database Explorer Tree is optional so you can display or hide this window. Page 15-14 Teradata SQL Assistant Main Window Query Window Database Explorer Tree Answerset Window History Window Teradata SQL Assistant Page 15-15 Database Explorer Tree The Database Explorer Tree feature of Teradata SQL Assistant displays an alphabetical listing of databases and objects of the connected user. It further permits drilldown on individual objects to view, column names, indexes and parameters as they apply. This is simply done by double-clicking on a database name to expand the tree display for that database. The Database Explorer Tree displays on the left side of the main Teradata SQL Assistant window. You can use the Database Explorer Tree to reduce the time required to build a query and help reduce errors in object names. The Database Explorer Tree is optional so you can display or hide this window. Initially, the following Teradata databases are loaded into the Database Explorer Tree: The User ID that was used to connect to the database The user’s default database The database "DBC" To add additional databases: 1. 2. 3. Do one of the following: – With the Database Explorer Tree active, press Insert. – Right-click anywhere in the Database Explorer Tree, then select Add Database. Type the database name to be added. If you want the database loaded only for the current session, clear the check box. By default, the check box is selected so the database will appear in the Database Explorer Tree in future sessions. The Database Explorer Tree allows you to drill down to show: Page 15-16 Columns and indexes of tables Columns of views Parameters of macros Parameters of stored procedures Teradata SQL Assistant Explorer Tree Option • The Database Explorer Tree displays an alphabetical listing of databases and objects of the connected user. – It is not a database hierarchy, but a list of databases and objects that the user needs to access. • To refresh a database, right-click on the database/user name and select "Refresh". To add another database to the Explorer Tree, right-click on the Explorer Tree. Teradata SQL Assistant To expand an item/object, click on the + sign or doubleclick on the object name. Page 15-17 Creating and Executing a Query Queries are created by simply typing in the query text into the query window. It is not necessary to add a semi-colon at the end of a command, unless you are entering multiple commands in a single window. The query may be executed by clicking on the ‘Execute’ icon in the toolbar. This icon looks like a pair of footprints. “Execute” actually executes the statements in the query one statement after the other and optionally stops if one of the statements returns an error. Function key F5 can also be used to execute queries serially. “Execute Parallel” executes all statements at the same time - and is only valid if all the statements are Teradata SQL/DML statements. This submits the entire query as a single request, allowing the database to execute all the statements in parallel. Multiple answer sets are returned in the Answerset window. Function key F9 can also be used to execute queries in parallel. Creating statements (single and multi-queries) To allow multiple queries: 1. 2. 3. Select Tools > Options. Select the General option. Select the option “Allow Multiple Queries”. 
Once this option is selected, you may open additional tabs in the query window. Each tab can contain a separate query, and any of these queries can be executed. However, only one query can be executed at a time. You can create queries consisting of one or more statements. A semicolon is not required when you enter one statement at a time. However, a semicolon between the statements is required for two or more statements. Each statement in the query is submitted separately to the database; therefore, your query may return more than one Answerset. Page 15-18 Teradata SQL Assistant Creating and Executing a Query 1. Create a query in the Query Window. 2. To execute a query use either the “execute” or the “execute parallel” buttons. The “execute” button (or F5) serially executes all statements in the query window. The “execute parallel” button or (F9) executes all statements in the query window in a single multi-statement request. These queries are effectively executed in parallel. Create query or queries in Query Window. Teradata SQL Assistant Page 15-19 Dragging Object Names to the Query Window You can drag object names from the Database Explorer tree to the Query pane. Click and drag the object from the Explorer tree to the Query pane. The name of the object appears in the Query window. Teradata SQL Assistant includes an option (Tools > Options) that allows objects to automatically be qualified when dragging or pasting names from the Database Tree into the Query Window. For example, if this option is checked, dragging the object "MyColumn" adds the parent object "MyTable", and appears as "MyTable.MyColumn" in the Query Window. Use the Ctrl key to add a comma after the object name when it is dragged to the Query Window. Dragging Multiple Objects Use the Shift and Ctrl keys to select more than one object from the Database Explorer Tree that can be dragged to the Query window. Page 15-20 Use the Ctrl key to select additional objects. Use the Shift key to select a range of objects. Teradata SQL Assistant Dragging Object Names to the Query Window • Click and drag the object from the Database Explorer tree to the Query window. The name of the object appears in the Query window. – If the "Qualify names when dragged or pasted from the Database Tree" option (Tools > Options) is checked, then the parent name is automatically included. Hold Ctrl key – causes a comma to be included after the object – • Selecting and dragging multiple objects – The Shift and Ctrl keys can also be used to select multiple objects in the Database Explorer tree for the purpose of dragging multiple objects to the Query Window. Note: The order of selection becomes the order of columns in the SELECT. Teradata SQL Assistant Page 15-21 Query Options To submit any part of any query 1. 2. 3. 4. Select Tools > Options. Select the Query tab. Check the option “Submit only the selected Query text, when highlighted”. From the Query window, select the part of the query to submit by highlighting it. Clearing the Query Window The query window may be cleared using the “Clear Query” button on the tool bar. Formatting a Query The query formatting feature adds line breaks and indentation before certain keywords, making SQL that comes from automatic code generators or other sources more readable. To Format a Query 1. 2. Ensure a statement exists in the Query window. Do one of the following: From the Tool Bar, click the Format Query button. 
Right-click in the Query window, then click Format Query Press Ctrl+Q Select Edit > Format Query Note: Some keywords will cause a line break and possibly cause the new line to be indented. If a keyword is found to already be the first word on a line and it is already prefixed by a tab character, then its indentation level will not change. Indentation When you press the Enter key, the new line will automatically indent to the same level as the line above. If you highlight one or more lines in the query and press the Tab key, those lines are indented one level. If you press Shift-Tab, the highlighted lines are un-indented by one level. This indentation of lines will only apply if the selected text includes a line feed character. For example, you must either select at least part of two lines, or if selecting only one line, then the cursor must be at the beginning of the next line. (Note that this is always the case when you use the margin to select a line.) If no line end is included in the selected text, or no text is selected, then a tab character will simply be inserted. Page 15-22 Teradata SQL Assistant Query Options To submit any part of a query: 1. Using Tools > Options > Query Check the option “Submit only the selected Query text, when highlighted”. 2. Highlight the text in the query window and execute. To clear the text in the query window, use the “Clear Query” button. To format a query, click on the “Format Query” button. Highlighted query in Query Window. Teradata SQL Assistant Page 15-23 Viewing Query Results The results of a query execution may be seen in the Answer Set window. Large answer sets may be scrolled using the slide bars. The Answerset window is a table that displays the results from a statement. You can sort the output in a number of ways and print as bitmaps in spreadsheet format. Individual cells, rows, columns, or blocks of columns may be formatted to change the background and foreground color as well as the font style, name, and size. You can make other modifications such as displaying or hiding gridlines and column headers. The table may be resized by stretching the Answerset window using standard Windows sizing techniques. Individual columns, groups of columns, rows, or groups of rows may also be sized. Output rows may be viewed as they are being retrieved from the database. Sorting an Answerset Locally There are two ways to sort an Answerset locally: quick sort or full sort. A quick sort sorts on a single column; a full sort allows sorting by data in multiple columns. To sort an Answerset using quick sort: Right-click any column heading to sort the data by that column only. The data is initially sorted in ascending order. Right-click the same column header again reverses the sort order. Note: The output from certain statements (e.g., EXPLAIN) cannot be sorted this way. To sort an Answerset using a full sort: Do one of the following: From the Tool Bar, click the sort button, right-click in the Answerset window and select Sort, or use the Edit > Sort menu. In the Sort Answerset dialog box, all columns in the active window are presented in the Available Columns list box. Select the column name in the Available Columns list box, or use the up or down arrow keys to highlight the column name and press Enter. This moves the column name to the Sort keys list box. By default, the sort direction for this new sort column is ascending (Asc). 
If you click a column in the Sort Keys list box, or select the item using the arrow keys or mouse and press Enter, it reverses to descending sort order (Dsc). To remove a sort column from the list, double-click the column name, or use the arrow keys to highlight the column and press Delete. Page 15-24 Teradata SQL Assistant Viewing Query Results • The Answerset window is a table that displays the results from a statement. • The output can be sorted in different ways: – Quick sort (single column) – right click on the column heading – Full sort (1 or more columns) – use Edit > Sort menu or Sort button • Data can be filtered using the funnel option at the column level. Result set in Answerset Window. Teradata SQL Assistant Page 15-25 Formatting Answersets You can format the colors, font name, font style, and font size of a block of cells, individual cells, rows, columns, or the entire spreadsheet. You can also specify the number of decimal places displayed and if commas are displayed to mark thousand separators in numeric columns. You can control the Answerset and the Answerset window by setting options. To set Answerset options, select Tools > Options > Answerset tab. For example, to display Alternate Answerset Rows in Color, check and first option in the Answerset tab, and use the Choose button. Selecting this option makes it easier to see Answerset rows. The option applies the selected background color to alternating rows in the Answerset grid. The remaining rows use the standard ‘Window Background’ color. The Choose button displays the selected color. Clicking the Choose button allows you to change this color. To format the colors, font name, font style, and font size of a block of specific cells individual cells, rows, columns, you right-click on the answer set cells. Some options are listed below. To display commas: 1. Right-click in the Answerset cell you wish to change and select Format Cells. 2. Check Display 1000 separators. 3. Click OK. To display decimal places: 1. Right-click in the Answerset cell you wish to change and select Decimal Places. 2. Select a number between 0 and 4. To designate up to 14 decimal places: a. Right-click to bring up the Shortcut menu. b. Click Format Cells to bring up the Format Cells dialog. c. Under Numerics, select the desired number of decimal places. Page 15-26 Teradata SQL Assistant Formatting Answersets To set defaults for Answersets, use the Tools > Options > Answerset tab. Teradata SQL Assistant To format specific cells, right-click on a cell or use the icon. Page 15-27 Using Query Builder Query Builder provides the user with the ability to use ‘templates’ for SQL commands, which may then be modified by the user. This is a convenient way to create commands whose syntax is complex or not easily remembered. Simply find the appropriate command, then drag and drop it into the query window where it may then be customized. The Query Builder window is a floating window you can leave open when you are working within the main Teradata SQL Assistant window. To access the Query Builder tool, do one of the following: Press F2. Select Help > Query Builder. Right-click in the Query window and select Query Builder from the shortcut menu. From the drop-down list in the upper left corner, choose one of the following options. SQL Statements Select a command from the statement list in the left pane to display an example of its syntax in the right pane. 
Procedure Builder Select a stored procedure statement from the list in the left pane to display an example of its syntax in the right pane. If you create a custom.syn file, this option appears in the drop-down list. The name will be the name you specified in the first line of the custom.syn file. Select this option and the queries you defined in this file will display. Description of the Options SQL Statements When you choose the SQL Statements option, the statement list in the left pane shows each of the statement types available on the current data source. These syntax examples reflect the SQL syntax of the data source you are currently connected. For example, the Teradata syntax file is Teradata.syn. Procedure Builder When you choose the Procedure Builder option, the left pane shows a list of statements that are valid only when used in a CREATE or REPLACE procedure statement. You can create a user-defined syntax file using any text editor such as Notepad or Microsoft Word. The name of the file must be custom.syn. The format of this file is the same as the other syntax files except it has an additional line at the start of the file containing the name you wish to see in the dropdown list in the Query Builder dialog. Page 15-28 Teradata SQL Assistant Query Builder Query Builder provides the user with the ability to use 'templates' for SQL commands. 1. Select Query Builder from the Help menu or use F2. 2. Double-click on SQL statement to place sample query in Query Window. Teradata SQL Assistant Page 15-29 History Window The History window is a table that displays your past queries and related processing attributes. The past queries and processing attributes are stored locally in a Microsoft Access 2000 database. This allows the flexibility to work with previous SQL statements in the future. Clicking any cell in the SQL Statement column in the History window copies the SQL to the Query Window. It may then be optionally modified and then resubmitted. You can display or hide the History window at any time. With Teradata SQL Assistant 13, all history rows are now stored in a single History database. The History Filter dialog allows you to specify a set of filters to be applied to the history rows. The operators include >, <, =, and LIKE. The filter applies to the entire history table. When you click in the fields or boxes in the Filter dialog, the possible operators and proper format are displayed at the bottom of the dialog. You can filter your history on the following options: Date Data source User Name Statement Type – for example, SELECT or CREATE TABLE Statement Count – show only those queries tat contain this many statements Row Count Elapsed Time Show successful queries only By default, Teradata SQL Assistant records all queries that are submitted. You may change this option so Teradata SQL Assistant records only those statements that are successful, or turn off history recording altogether. The most recently executed statement appears as the first row in the History window. The data may be sorted locally after it has been loaded into the History window. New entries are added as the first row of history no matter what sort order has been applied. Page 15-30 Teradata SQL Assistant History Window A history of recently submitted queries may be recalled by activating the ‘Show History’ feature. Key options available with the History window are: • All history rows are now stored in a single History database. 
The History Filter dialog allows you to specify a set of filters to be applied to the history rows.
• You can choose to display all queries (successful or not), use a history filter to only display successful queries, or turn off history recording altogether.
Query is copied into Query Window. Click on query in History Window.
Teradata SQL Assistant Page 15-31
General Options
To set general program preferences:
1. Select Tools > Options.
2. Click the General tab.
3. Choose from the following options:
Allow multiple Queries – allows you to have multiple query windows open simultaneously. With this option selected, the New Query command opens a new tab in the Query window. The default for this setting is unchecked.
Display this string for Null data fields – enter the string you want displayed in place of Null data fields in your reports and imported/exported files. The default for this setting is "?".
Use a separate Answer window for
– Each Resultset – opens a new Answer window for each new result set
– Each Query – opens a new Answer window for each new query, but uses tabs within this window if the query returns multiple result sets. This is the default setting.
– Never – directs all query results to display in a single tabbed Answer window
Page 15-32 Teradata SQL Assistant
General Options
General options (Tools > Options > General tab) that are available include:
• Allow connections to multiple data sources.
• Allow multiple queries per connection – allows you to have multiple query windows open simultaneously. New Queries are opened in new tabs.
Data Format options include:
• Date format
• Display of NULL data values
• Decimal places to display
Teradata SQL Assistant Page 15-33
Connecting to Multiple Data Sources
You can connect to multiple data sources. The "Allow connections to multiple data sources" option must be checked under the General Options. Each new data source appears in the Database Tree and opens a new query window with the data source name. To disconnect from one data source, click the Query window that is connected to the data source and click the disconnect icon. The example on the facing page shows connections to two different systems (tdt5-1 and tdt6-1).
Page 15-34 Teradata SQL Assistant
Connecting to Multiple Data Sources
A separate query window is opened for each data source connection. Connections have been made to two systems:
• tdt5-1
• tdt6-1
Multiple queries for tdt5-1 are shown via tabs. History includes the Source name for queries.
Teradata SQL Assistant Page 15-35
Additional Options
Teradata SQL Assistant provides many other tools and options, some of which are briefly noted on the facing page.
Page 15-36 Teradata SQL Assistant
Additional Options
Additional Tools menu options include:
• Explain – runs an Explain function on the SQL statements in the Query window and displays the results in the Answerset window.
• List Tables – displays the Table List dialog box where you can enter the name of the database; the resulting list of tables or views displays in an Answerset window.
• List Columns – displays the Column List dialog box where you can list the columns in a particular table/view; the resulting list of columns displays in an Answerset window.
• Disconnect – disconnects from the current data source.
• Change Password – change your Teradata password.
• Compact History – reclaim space that may have been lost when history rows were deleted.
• Options – establish various options for queries, answersets, import/export operations, etc.
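As a simple illustration of the Explain option (the table below is the one used in the Module 15 lab and is only an example), choosing Tools > Explain for a statement in the Query window returns the same optimizer plan text that Teradata produces when the statement is prefixed with the EXPLAIN keyword:
EXPLAIN SELECT custid, SUM(totalprice) FROM Old_Orders GROUP BY 1;
Either form displays the step-by-step plan in the Answerset window rather than executing the query.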
Teradata SQL Assistant Page 15-37 Importing/Exporting Large Object Files To import and/or export LOB (Large Object) files with SQL Assistant, you need to first make sure the “Use Native Large Object Support” option is set when defining ODBC driver for Teradata. This option was discussed earlier in this module. This option is automatically selected starting with SQL Assistant 14.0. Teradata SQL Assistant 12.0 Note The following information is not needed with Teradata SQL Assistant 13 and later. With Teradata SQL Assistant 12.0 and prior versions, to import a file larger than 10 MB into a BLOB or CLOB column, you need to enable this capability within SQL Assistant. To enable importing of files larger than 10 MB into BLOB or CLOB columns: 1. 2. 3. 4. 5. Select Tools > Options, then select the Export/Import tab. Select the Import tab. Click in the Maximum Size of an Imported data file field. Press the Esc key, then set the value to the size of the largest file you wish to load, up to a maximum of 9 digits. If you do not press the Esc key before entering the data, you will be limited to a maximum of 7 digits in this field. Click OK. Note: This will be a temporary change. The next time you click OK on the Options screen the value will be reset to the first 7 digits of the number you had last set - for example, 50 MB (50,000,000) will become 5 MB (5,000,000). Page 15-38 Teradata SQL Assistant Importing/Exporting Large Object Files To import and/or export LOB (Large Object) files with SQL Assistant, you need to first make sure the “Use Native Large Object Support” option is set with the data source. Teradata SQL Assistant supports Large Objects. Large objects come in two types: • Binary – these columns may contain Pictures, Music, Word documents, PDF files, etc. • Text – these columns contain text data such as Text, HTML, XML or Rich Text (RTF). SQL Assistant > Tools > Options • To import a LOB, create a data file that contains the names of the Large Objects. • Use the Export/Import Options dialog to specify the field delimiter. • The example in this module assumes the fields in the imported file are TAB separated. Teradata SQL Assistant Page 15-39 Importing/Exporting Large Object Files To import and/or export LOB (Large Object) files with SQL Assistant, you need to first make sure the “Use Native Large Object Support” option is set when defining ODBC driver for Teradata. This option was discussed earlier in this module. To Import a LOB into Teradata First, create a data file that contains the names of the LOB(s) to be imported. By default, the data file needs to be located in the same folder as the LOB. Assume the data file to import from contains 4 fields that are separated. Second, select the IMPORT DATA function and execute an Insert statement. Example: INSERT INTO TF VALUES (?,?,?,?B); The parameter markers in this example are: ? The data for this parameter is read from the Import file. It is always a character string, and will be converted to a numeric value if necessary. ?B The data for this parameter resides in a file that is in the same directory as the Import file. The import file contains only the name of the file to be imported. The contents of the file are loaded as a binary image (e.g., BLOB). You can also use ?? in place of ?B. ?C The data for this parameter resides in a file that is in the same directory as the import file. The import file contains only the name of the file to be imported. Use this marker to load a text file into a CHAR or CLOB column. 
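As an additional sketch (the table name Documents and its two columns are hypothetical, not objects from the lab databases), a text file could be loaded into a CLOB column by importing a data file whose second field names the text file:
INSERT INTO Documents VALUES (?, ?C);
Here the ? marker reads the document id from the import file, and the ?C marker reads a file name from the import file and loads that file's contents into the CLOB column.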
Page 15-40 Teradata SQL Assistant Importing a LOB into Teradata 1. Create a data file that contains the name(s) of the LOB(s). This data file needs to be located in the same folder as the LOB. TF Manual LOB.txt 1 2 TF Student Manual TF Lab Workbook PDF PDF TF v1400 Student Manual.pdf TF v1400 Lab Workbook.pdf The various fields are separated by tabs. 2. Within SQL Assistant, from the File menu, select the "Import Data" option to turn on the Import Data function. 3. Enter an INSERT statement within the Query window. INSERT INTO TF VALUES (?, ?, ?, ?B); 4. In the dialog box that is displayed, choose the name of the file to import. For example, enter or choose "TF Manual LOB.txt". 5. From the File menu, select the "Import Data" option to turn off the Import Data function. Teradata SQL Assistant Page 15-41 Selecting from a Table with a LOB To select from a table with a LOB, simply execute a SELECT statement. If LOB column is projected, then a dialog box is displayed to enter the file name for the LOB. Note that multiple files that are exported will have sequential numbers added to the file name. In the example on the facing page, the file name was specified as TF_Manual. Therefore, the two manuals that will be created are named: TF_PDF001.pdf TF_PDF002.pdf Page 15-42 Teradata SQL Assistant Selecting from a Table with a LOB With SQL Assistant, enter the following query: SELECT * FROM TF ORDER BY 1; The following dialog box is displayed to represent the data files to export the LOBs into. Also specify the "File Type" as a known Microsoft file type extension. The answer set window will include a link to exported data files. Teradata SQL Assistant Page 15-43 Displaying a JPG within SQL Assistant The “Display as picture …” can be selected to display a JPG file within the answer set. Optionally, the “Also save picture to a file” can be selected. Note that large JPG files with display very large within the answer set window. Page 15-44 Teradata SQL Assistant Displaying a JPG within SQL Assistant SELECT * FROM Photos ORDER BY 1; Optionally, the "Display as picture …" can be selected to display a JPG file within the answer set. Teradata SQL Assistant Page 15-45 Teradata SQL Assistant Summary The Teradata SQL Assistant utility can be of great value to you. The facing page summarizes some of the key features discussed in this module. Page 15-46 Teradata SQL Assistant Teradata SQL Assistant Summary Characteristics of Teradata SQL Assistant include: • Windows-based utility that can be used to submit SQL queries to the Teradata database. • Provides the retrieval of previously used queries (History). • Saves information about previous query result sets. • Supports DDL, DML and DCL commands. – Query Builder feature allows for easy creation of SQL statements. • Provides both import and export capabilities to files on a PC. • Provides a Database Explorer Tree to easily view database objects. Teradata SQL Assistant Page 15-47 Module 15: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 15-48 Teradata SQL Assistant Module 15: Review Questions 1. Which two data interfaces are available with Teradata SQL Assistant? a. b. c. d. CLIv2 JDBC ODBC Teradata .Net 2. Separate history database files are needed to maintain queries for different data sources. a. True b. False 3. Which piece of query information is not available in the History Window? a. b. c. d. e. 
User name Query band Elapsed time Data source name Number of rows returned 4. What are two techniques to execute multiple statements as a multi-statement request? __________________________________ Teradata SQL Assistant __________________________________ Page 15-49 Lab Exercise 15-1 Check your understanding of the concepts discussed in this module by completing the lab exercise as directed by your instructor. Page 15-50 Teradata SQL Assistant Lab Exercise 15-1 Lab Exercise 15-1 Purpose In this lab, you will use Teradata SQL Assistant to define a data source and execute some simple SQL commands. What you need Teradata SQL Assistant installed on the laptop or PC Tasks 1. Define either an ODBC data source or a .NET data source using the following instructions. Complete the dialog box with the following information: Name – TFClass Description – Teradata Training for your name Name or IP Address – ________________________ (supplied by instructor) Username – ________________________ (supplied by instructor) Password – do not fill in your password (initially needed for a .NET connection) Verify the following options are set properly. Session Character Set – ASCII Options – Session Mode – System Default Use Native Large Object Support option is checked (not needed with a .NET connection) Teradata SQL Assistant Page 15-51 Lab Exercise 15-1 (cont.) Check your understanding of the concepts discussed in this module by completing the lab exercise as directed by your instructor. Page 15-52 Teradata SQL Assistant Lab Exercise 15-1 (cont.) 2. Connect to the data source your just created (TFClass) and logon with your username and password. 3. Using the Tools > Options tabs, ensure the following options are set as indicated: General – Check – Allow connections to multiple data sources General – Check – Allow multiple queries per connection Query – Check – Submit only the selected Query text, when highlighted Answerset – Check – Display alternate Answerset rows in color – choose a color Answerset – Check – Display Column Titles rather than Column Names History – Check – Display SQL text on a single line History – Check – Do not save duplicate queries in history 4. If the Explorer Tree pane is not visible, use the View > Explorer option to display the Explorer Tree. Add the following databases to the Explorer Tree: AP, DS, PD, Collaterals (Hint: Right-click on the Explorer Tree pane to use the "Add Database …" option. 5. Using the Explorer Tree, view the table objects in your database. 6. Using the Query Window, execute the following query. CREATE TABLE Old_Orders AS Orders WITH NO DATA; Does the new table object appear in the table object list? _____ If not, "refresh" the database. Teradata SQL Assistant Page 15-53 Lab Exercise 15-1 (cont.) Check your understanding of the concepts discussed in this module by completing the lab exercise as directed by your instructor. Use the following SQL to determine a count of rows in a table. SELECT COUNT(*) FROM tablename; Step 8 Hint: Your Old_Orders table should have 2400 rows. If not, check the dates you used in your queries. Page 15-54 Teradata SQL Assistant Lab Exercise 15-1 (cont.) 7. Using the Query window, execute the following query. INSERT INTO Old_Orders SELECT * FROM DS.Orders WHERE orderdate BETWEEN '2008-07-01' AND '2008-09-30'; Use the "Format Query" option to format the query. How many rows are in the Old_Orders table? _______ 8. Using the History window, recall the query from step #7 and modify it to add orders from '2008-10-01' through '2008-12-31'. 
How many rows are in the Old_Orders table? _______
9. Execute the following query by using the drag and drop object feature of SQL Assistant.
SELECT custid, SUM(totalprice) FROM Old_Orders GROUP BY 1 ORDER BY 1;
Use the "Add Totals" feature to automatically generate a total sum for all of the orders.
What is the sum of the orders using this feature? ______________
Teradata SQL Assistant Page 15-55
Lab Exercise 15-1 (cont.)
Check your understanding of the concepts discussed in this module by completing the lab exercise as directed by your instructor.
Use the following SQL to create a view.
CREATE VIEW viewname AS SELECT column1, column2 FROM table_or_view_name [WHERE condition];
Use the following SQL to create a simple macro.
CREATE MACRO macroname AS (SELECT * FROM table_or_view_name;);
Use the following SQL to execute a simple macro.
EXEC macroname;
Page 15-56 Teradata SQL Assistant
Lab Exercise 15-1 (cont.)
10. Format only the cells containing the sum of totalprice to be in italics and green.
11. Using the Query Builder feature, create a view named "Old_Orders_v" for the Old_Orders table that includes the following columns and only includes orders for December, 2008.
orderid, custid, totalprice, orderdate
SELECT all of the rows from the view named "Old_Orders_v". How many rows are displayed from this view? _______
12. Using the Query Builder feature, create a simple macro named "Old_Orders_m" which selects all of the orders from the view named "Old_Orders_v". Execute the macro "Old_Orders_m".
What is the sum of the orders for December using the "Add Totals" feature? ______________
13. (Optional) Use the Collaterals database to access the Photos table to display various JPG files. Execute the following:
SELECT * FROM Collaterals.Photos ORDER BY 1;
Note: Set the file type to JPG and check the option "Display as picture in Answerset".
Teradata SQL Assistant Page 15-57
Notes
Page 15-58 Teradata SQL Assistant
Module 16
Analyze Primary Index Criteria
After completing this module, you will be able to:
Identify Primary Index choice criteria.
Describe uniqueness and how it affects space utilization.
Explain row access, selection, and selectivity.
Choose between single and multiple-column Primary Indexes.
Describe why a table might be created without a primary index.
Specify the syntax to create a table without a primary index.
Teradata Proprietary and Confidential
Analyze Primary Index Criteria Page 16-1
Notes
Page 16-2 Analyze Primary Index Criteria
Table of Contents
Primary Index Choice Criteria ................................................................................................... 16-4
Primary Index Defaults .............................................................................................................. 16-6
CREATE TABLE – Indexing Rules .......................................................................................... 16-8
Order of Preference Exercise ................................................................................................... 16-10
Primary Index Characteristics .................................................................................................. 16-12
Multi-Column Primary Indexes ............................................................................................... 16-14
Primary Index Considerations .................................................................................................. 
16-16 PKs and Duplicate Rows.......................................................................................................... 16-18 NUPI Duplicate Row Check .................................................................................................... 16-20 Primary Index Demographics .................................................................................................. 16-22 Column Distribution Demographics for a PI Candidate .......................................................... 16-24 SQL to View Data Demographics ........................................................................................... 16-26 Example of Using Data Demographic SQL ............................................................................. 16-28 TableSize View ........................................................................................................................ 16-32 SQL to View Data Distribution ............................................................................................... 16-34 E-R Diagram for Exercises ...................................................................................................... 16-36 Exercise 2 – Sample ................................................................................................................. 16-38 Exercise 2 – Choosing PI Candidates ...................................................................................... 16-40 What is a NoPI Table? ............................................................................................................. 16-52 Reasons to Consider Using NoPI Tables ................................................................................. 16-54 Creating a Table without a PI .................................................................................................. 16-56 How is a NoPI Table Implemented? ........................................................................................ 16-58 NoPI Random Generator .......................................................................................................... 16-60 The Row ID for a NoPI Table .................................................................................................. 16-62 Multiple NoPI Tables at the AMP Level ................................................................................. 16-66 Loading Data into a NoPI Table .............................................................................................. 16-68 NoPI Options............................................................................................................................ 16-70 Summary .................................................................................................................................. 16-72 Module 16: Review Questions ................................................................................................. 16-74 Module 16: Review Questions (cont.) ..................................................................................... 16-76 Lab Exercise 16-1 .................................................................................................................... 16-78 Lab Exercise 16-2 .................................................................................................................... 16-82 Analyze Primary Index Criteria Page 16-3 Primary Index Choice Criteria There are three Primary Index Choice Criteria: Access Demographics, Distribution Demographics, and Volatility. 
Access demographics are the first of three Primary Index Choice Criteria. Access columns are those that would appear (with a value) in a WHERE clause in an SQL statement. Choose the column most frequently used for access to maximize the number of one-AMP operations. Distribution demographics are the second of the Primary Index Choice Criteria. The more unique the index, the better the distribution. Optimizing distribution optimizes parallel processing. In choosing a Primary Index, there is a trade-off between the issues of access and distribution. The most desirable situation is to find a PI candidate that has good access and good distribution. Many times, however, index candidates offer great access and poor distribution or vice versa. When this occurs, the physical designer must balance these two qualities to make the best choice for the index. The third of the Primary Index Choice Criteria is volatility, or how often the data values will change. The Primary Index should not be very volatile. Any changes to Primary Index values may result in heavy I/O overhead, as the rows themselves may have to be moved from one AMP to another. Choose a column with stable data values. Degree of Uniqueness and Space Utilization The degree of uniqueness of a Primary Index has a direct influence on the space utilization. The more unique the index, the better the space is used. Fewer Distinct PI Values than Amps For larger tables, it is not a good idea to choose a Primary Index with fewer distinct values than the number of AMPs in the system when other columns are available. At best, one index value would be hashed to each AMP and the remaining AMPs would carry no data. Non-Unique PIs Choosing a Non-Unique PI (NUPI) with some very non-unique values can cause “spikes” in the distribution. Unique (or Nearly-Unique) PIs The designer should choose an index which is unique or nearly unique to optimize the use of disk space. Remember that the PERM limit of a database (or user) is divided by the number of AMPs in the system to yield a threshold that cannot be exceeded on any AMP. Page 16-4 Analyze Primary Index Criteria Primary Index Choice Criteria ACCESS Maximize one-AMP operations: Choose the column(s) most frequently used for access. Consider both join and value access. DISTRIBUTION Optimize parallel processing: Choose the column(s) that provides good distribution. VOLATILITY Reduce maintenance resource overhead (I/O): Choose the column(s) with stable data values. Note: Data distribution has to be balanced with Access usage in choosing a PI. General Notes: • A good logical model identifies the Primary Key for each table or entity. – Do not assume that the Primary Key will become the Primary Index. – It is common for many tables in a database to have a Primary Index that is different than the Primary Key. – This module will first cover PI tables then cover details of the NO PRIMARY INDEX option. The general assumption in this course is that tables will have a PI. Analyze Primary Index Criteria Page 16-5 Primary Index Defaults 1. If the NO PRIMARY INDEX clause is specified, then the table is created without a primary index. If this clause is used, you cannot specify a primary index for the table. There are a number of limitations associated with a NoPI table that will be listed later. 2. 
If the PRIMARY INDEX, NO PRIMARY INDEX, PRIMARY KEY, or UNIQUE options are NOT specified in the CREATE TABLE DDL, then whether the table is created with or without a primary index is determined by a new DBSControl General flag, Primary Index Default. The default setting is "D", which effectively means the default is to create a table with the first column as a NUPI.
D – This is the default setting. This setting works the same as the P setting.
P – The first column in the table will be selected as the non-unique primary index. This setting works the same as that in the past when PRIMARY INDEX was not specified.
N – The table will be created without a primary index (NoPI table).
3. With the NoPI Table feature, the system default setting essentially remains the same as that in previous Teradata releases, where the first column was selected as the non-unique primary index when the user did not specify a PRIMARY INDEX, a PRIMARY KEY, or a UNIQUE constraint. Users can change the default setting for PrimaryIndexDefault to P or N and not rely on the system default setting, which might be changed in a future release.
Page 16-6 Analyze Primary Index Criteria
Primary Index Defaults
A Teradata 13.0 DBSControl flag determines if a PI or NoPI table is created when a CREATE TABLE DDL does NOT have any of the following explicitly specified:
• PRIMARY INDEX clause
• NO PRIMARY INDEX clause
• PRIMARY KEY or UNIQUE constraints
Values for DBS Control General field #53 "Primary Index Default":
D – "Teradata Default" (effectively same as option P)
P – "First Column is NUPI" – create tables with first column as a NUPI
N – "No Primary Index" – create tables without a primary index (NoPI)
The PRIMARY INDEX and NO PRIMARY INDEX clauses have precedence over PRIMARY KEY and UNIQUE constraints. If the NO PRIMARY INDEX clause is specified AND if PRIMARY KEY or UNIQUE constraints are also defined, these will be implemented as Unique Secondary Indexes.
• It may be unusual to create a NoPI table with these additional indexes.
Analyze Primary Index Criteria Page 16-7
CREATE TABLE – Indexing Rules
The primary index may be explicitly specified at table create time. If not, a primary index choice will be made based on other choices made. Primary key and uniqueness constraints are always implemented by Teradata as unique indexes, either primary or secondary. This chart assumes the system default is to create tables with a Primary Index. The index implementation schedule is as follows:
Is a PI specified? – No:
PK specified? | PK = UPI
PK specified and UNIQUE constraints specified? | PK = UPI; UNIQUE constraints = USI(s)
UNIQUE column level constraints only specified? | 1st UNIQUE column level constraint = UPI; Other UNIQUE constraints = USI(s)
UNIQUE column level constraints and table level UNIQUE constraints specified? | 1st UNIQUE column level constraint = UPI; Other UNIQUE constraints = USI(s)
UNIQUE table level constraints only specified? | 1st UNIQUE table level constraint = UPI; Other table level UNIQUE constraints = USI(s)
Neither specified? | 1st column = NUPI
Is a PI specified? – Yes:
PK specified? | PK = USI
PK specified and UNIQUE constraints specified? | PK = USI; UNIQUE constraints = USI(s)
UNIQUE constraints only specified? | UNIQUE constraints = USI(s)
Page 16-8 Analyze Primary Index Criteria
CREATE TABLE – Indexing Rules
Unspecified Primary Index option – assuming system default is "Primary Index"
If PRIMARY KEY specified → PK column = UPI
else 1st UNIQUE column level constraint specified → column = UPI
else 1st UNIQUE table level constraint specified → column(s) = UPI
else * → 1st column = NUPI
* If system default is "No Primary Index" AND none of the following have been specified (Primary Index, PK, or UNIQUE), then the table is created as a NoPI table.
Specified PRIMARY INDEX or NO PRIMARY INDEX
If PRIMARY KEY is also specified → PK = USI
and any UNIQUE constraint (column or table level) → column(s) = USI
Every PK or UNIQUE constraint is always implemented as a unique index.
Analyze Primary Index Criteria Page 16-9
Order of Preference Exercise
Complete the exercise on the facing page. Answers will be provided by your instructor. Some additional examples include:
If table_5 was created as follows:
CREATE TABLE table_5
(col1 INTEGER NOT NULL
,col2 INTEGER NOT NULL
,col3 INTEGER NOT NULL
,CONSTRAINT uniq1 UNIQUE (col1,col2)
,CONSTRAINT uniq2 UNIQUE (col3));
Then, the indexes are a UPI on (col1,col2) and a USI on (col3).
If table_5 was created as follows:
CREATE TABLE table_5
(col1 INTEGER NOT NULL
,col2 INTEGER NOT NULL
,col3 INTEGER NOT NULL
,CONSTRAINT uniq1 UNIQUE (col3)
,CONSTRAINT uniq2 UNIQUE (col1,col2));
Then, the indexes are a UPI on (col3) and a USI on (col1,col2).
Notes:
Recommendation: Specify Primary Index when creating a table.
Table level constraints are typically used to specify a PK or UNIQUE constraint for multiple columns.
Page 16-10 Analyze Primary Index Criteria
Order of Preference Exercise
Assuming the system default is "Primary Index", show the indexes that are created as a result of the DDL.
CREATE TABLE table_1
(col1 INTEGER NOT NULL UNIQUE
,col2 INTEGER NOT NULL PRIMARY KEY);
col1 =   col2 =
CREATE TABLE table_2
(col1 INTEGER NOT NULL PRIMARY KEY
,col2 INTEGER)
PRIMARY INDEX (col2);
col1 =   col2 =
CREATE TABLE table_3
(col1 INTEGER
,col2 INTEGER NOT NULL);
col1 =   col2 =
CREATE TABLE table_4
(col1 INTEGER NOT NULL
,col2 INTEGER NOT NULL
,col3 INTEGER NOT NULL UNIQUE
,CONSTRAINT pk1 PRIMARY KEY (col1,col2));
col1 =   col2 =   col3 =   (col1,col2) =
CREATE TABLE table_5
(col1 INTEGER NOT NULL
,col2 INTEGER NOT NULL
,col3 INTEGER NOT NULL UNIQUE
,CONSTRAINT uniq1 UNIQUE (col1,col2));
col1 =   col2 =   col3 =   (col1,col2) =
UPI = Unique Primary Index   NUPI = Non Unique Primary Index   USI = Unique Secondary Index
Page 16-11
Primary Index Characteristics
Each table has one and only one Primary Index. A Primary Index may be different than a Primary Key.
UPI = Best Performance, Best Distribution
UPIs offer the best performance possible for several reasons. They are:
A Unique Primary Index involves a single base table row at most
No Spool file is ever required
Single value access via the Primary Index is a one-AMP operation and uses only one I/O
NUPI = Good Performance, Good Distribution
NUPI performance differs from UPI performance because:
Non-Unique Primary Indexes may involve multiple table rows.
Duplicate values go to the same AMP and the same data block, if possible.
Multiple I/Os are required if the rows do not fit in a single data block. Spool files are used when necessary. A duplicate row check is required on INSERT and UPDATE for a SET table. Analyze Primary Index Criteria Primary Index Characteristics Primary Indexes (UPI and NUPI) • A Primary Index may be different than a Primary Key. • Every table has only one Primary Index. • A Primary Index may contain null(s). • Single-value access uses ONE AMP and, typically, one I/O. Unique Primary Index (UPI) • Involves a single base table row at most. • No spool file is ever required. • The system automatically enforces uniqueness on the index value. Non-Unique Primary Index (NUPI) • • • • • May involve multiple base table rows. A spool file is created when needed. Duplicate values go to the same AMP and the same data block. Only one I/O is needed if all the rows fit in a single data block. Duplicate row check is required for a Set table. Analyze Primary Index Criteria Page 16-13 Multi-Column Primary Indexes In practice, Primary Indexes are sometimes composed of several columns. Such composite indexes are known as multi-column Primary Indexes. They are used quite commonly and you can probably think of several existing applications that utilize them. Increased Uniqueness There are both advantages and disadvantages to using multi-column PIs. Perhaps the most important advantage is that by combining several columns, you can produce an index that is much more unique than any of the component columns. This increased uniqueness will result in better data distribution, among other benefits. For example: PI = Lastname PI = Lastname + Firstname PI = Lastname + Firstname + MI The above example points out how better data distribution occurs. Notice that each succeeding Primary Index is more unique than the one preceding it. That is, there are far less individuals with identical last and first names then there are with the same last name, and so on. Increasing uniqueness means that as the number of columns increases: The number of distinct values increases. The number of rows per value decreases. The selectivity increases. Trade-off The disadvantage involved with multi-column indexes is that as the number of columns increases, the index becomes less usable. A multi-column index can only be accessed when values for all columns are specified in the SQL statement. If a single value is omitted, the Primary Index cannot be used. It is important for the physical designer to balance these factors and use multi-column indexes that have just enough columns. This will result in optimum uniqueness while reducing unnecessary full table scans. Page 16-14 Analyze Primary Index Criteria Multi-Column Primary Indexes Advantage More columns = more uniqueness • Number of distinct values increase. • Rows/value decreases. • Selectivity increases. Disadvantage More columns = less usability • PI can only be used when values for all PI columns are provided in SQL statement. • Partial values cannot be hashed. Analyze Primary Index Criteria Page 16-15 Primary Index Considerations The facing page summarizes the concepts you have seen throughout this module and provides a list of the most important considerations when choosing a Primary Index. The first three considerations summarize the three types of demographics: Access, Distribution, and Volatility. You should choose a column with good distribution to maximize parallel processing. 
A good rule-of-thumb is to base your Primary Index on the column(s) most often used for access (if you don't have too many rows per value) to maximize one-AMP operations. Finally, Primary Index values should be stable to reduce maintenance resource overhead.
Make sure that the number of distinct values for a PI is greater than the number of AMPs in the system, whenever possible, or some AMPs will have no rows.
Duplicate values hash to the same AMP and are stored in the same data block. If the index is very non-unique, multiple data blocks are used and incur multiple I/Os. Very non-unique PIs may skew space usage among AMPs and cause Database Full conditions on AMPs where excessive numbers of rows are stored.
Page 16-16 Analyze Primary Index Criteria

Primary Index Considerations
• Base the PI on the column(s) most often used for access, provided that the values are unique or nearly unique.
• Choose a column (or columns) with good distribution and no spikes.
  – NULLs and zero (for numeric data types) hash to binary zeroes and to the same AMP.
• Distinct values distribute evenly across all AMPs.
  – For large tables, the number of distinct Primary Index values should be much greater (at least 10X; 50X may be a better guideline) than the number of AMPs.
• Duplicate values hash to the same AMP and are stored in the same data block when possible.
  – Very non-unique values use multiple data blocks and incur multiple I/Os.
  – Very non-unique values may skew space usage among AMPs and cause premature Database Full conditions.
  – A large number of NUPI duplicate values on a SET table can cause expensive duplicate row checks.
• Primary Index values should not be highly volatile.
Analyze Primary Index Criteria Page 16-17

PKs and Duplicate Rows
Each row in a table (entity) in a good logical model will be uniquely identified by the table's primary key. Every table must have a Primary Key. Primary Keys (PKs) must be unique. Primary Keys cannot be changed.
In Set tables, the Teradata Database does not allow duplicate rows. When a table has a Unique Primary Index (UPI), the UPI enforces uniqueness. When a table has a Non-Unique Primary Index (NUPI), the matter can become more complicated. In the case of a NUPI (without a USI defined), the file system must compare data values byte-by-byte within a Row Hash in order to ensure uniqueness. Many NUPI duplicates result in lots of duplicate row checks, which can be quite expensive in terms of system resources.
The way to avoid such a situation is to define a USI on the table whenever you have a NUPI. The USI does the job of enforcing uniqueness and thus saves you the cost of doing duplicate row checks. Often, the best column(s) to use when defining such a USI is the PK. Specifying a UNIQUE constraint on column(s) other than the Primary Index also causes the creation of a Unique Secondary Index.
An exception to the above is found when using the load utilities, such as FastLoad and MultiLoad. These utilities do not allow the use of Secondary Indexes to enforce uniqueness. Therefore, a full row comparison is still necessary.
Page 16-18 Analyze Primary Index Criteria

PKs and Duplicate Rows
Rule: Primary Keys must be UNIQUE and NOT NULL.
• This rule of Relational Theory eliminates duplicate rows, which have plagued the industry for decades.
• With Set tables (the default in Teradata transaction mode), the Teradata database does not allow duplicate rows.
• With Multiset tables, the Teradata database allows duplicate rows.
  – All indexes must be non-unique indexes (NUPI and NUSI) in order to allow duplicate values.
  – A unique index (UPI or USI) will prevent duplicate index values, and therefore duplicate rows (even if the table is created as Multiset).
• If no unique index exists for a SET table, the file system compares data values byte by byte within a Row Hash to ensure row uniqueness in a table.
  – Many NUPI duplicates result in expensive duplicate row checks.
  – To avoid these duplicate row checks, use a Multiset table.
Analyze Primary Index Criteria Page 16-19

NUPI Duplicate Row Check
Set tables (the default) do not allow duplicate rows. When a new row is inserted into a Set table with a Non-Unique Primary Index, the system must perform a NUPI Duplicate Row Check. The table on the facing page illustrates the number of logical reads that must occur when this happens. The middle column is the number of logical reads required before that one row can be inserted. The right-hand column shows how many cumulative logical reads would be required to insert all the rows up to and including that one.
As you can see, when you have a NUPI with excessive rows per value, the number of logical reads becomes prohibitively high. It is very important to limit the NUPI rows per value whenever possible. The best way to avoid NUPI duplicate row checks is to create the table as a MULTISET table.
Note: USIs should be used for access or uniqueness enforcement. They should not be used just to avoid duplicate row checking, since sometimes they may be used and at other times they will not be used. The overhead of a USI does not justify trying to avoid the duplicate row check, and in most cases it does not avoid that cost anyway.
As a suggestion, keep the number of NUPI rows per value within the number of rows that will fit into your largest block. This will allow the system to satisfy a single-value NUPI access with one or two data block I/Os.
Page 16-20 Analyze Primary Index Criteria

NUPI Duplicate Row Check
Limit NUPI rows per value to rows per block whenever possible.
To avoid NUPI duplicate row checks, create the table as a MULTISET table.
This chart illustrates the additional I/O overhead:

Row Number      Number of Rows that            Cumulative Number of
to be inserted  must be logically read first   logical row reads
      1                    0                           0
      2                    1                           1
      3                    2                           3
      4                    3                           6
      5                    4                          10
      6                    5                          15
      7                    6                          21
      8                    7                          28
      9                    8                          36
     10                    9                          45
     20                   19                         190
     50                   49                        1225
    100                   99                        4950
    200                  199                       19900
    500                  499                      124750
   1000                  999                      499500

Analyze Primary Index Criteria Page 16-21

Primary Index Demographics
As you have seen, the three types of demographics important to choosing a Primary Index are: Access demographics, Distribution demographics, and Volatility demographics. To make proper PI selections, you must have accurate demographics. Accurate demographics serve to quantify all three index selection determinants.
Access Demographics
Access demographics identify index candidates that maximize one-AMP operations. Both Value Access and Join Access are important to PI selection. The higher the value, the more often the column is used for access.
Distribution Demographics
Distribution demographics identify index candidates that optimize parallel processing. Choose the column(s) that provides the best distribution.
Volatility
Volatility demographics identify table columns that are UPDATEd. This item does not refer to INSERT or DELETE operations. Volatility demographics identify index candidates that reduce maintenance I/O.
You want to have columns with stable data values as your PI candidates. In this module, you will see how to use Distribution demographics to select PI candidates. Access and Volatility demographics will be presented in a later module. Page 16-22 Analyze Primary Index Criteria Primary Index Demographics Access Demographics • Identify index candidates that maximize one-AMP operations. • Columns most frequently used for access (Value and Join). Distribution Demographics • Identify index candidates that optimize parallel processing. • Columns that provide good distribution. Volatility Demographics • Identify index candidates with low maintenance I/O. Without accurate demographics, index choices are unsubstantiated. Demographics quantify all 3 index selection determinants. Analyze Primary Index Criteria Page 16-23 Column Distribution Demographics for a PI Candidate Column Distribution demographics are expressed in four ways: Distinct Values, Maximum Rows per Value, Maximum Rows NULL and Typical Rows per Value. These items are defined below: Distinct Values is the total number of different values a column contains. For PI selection, the higher the Distinct Values (in comparison with the table row count), the better. Distinct Values should be greater than the number of AMPs in the system, whenever possible. We would prefer that all AMPs have rows from each TABLE. Maximum Rows per Value is the number of rows in the most common value for the column or columns. When selecting a PI, the lower this number is, the better the candidate. For a column or columns to qualify as a UPI, Maximum Rows per Value must be 1. Maximum Rows NULL should be treated the same as Maximum Rows Per Value when being considered as a PI candidate. Typical Rows per Value gives you an idea of the overall distribution which the column or columns would give you. The lower this number is, the better the candidate. Like Maximum Rows per Value, Typical Rows per Value should be small enough to fit on one data block. The illustration at the bottom of the facing page shows a distribution graph for a column whose values are states. Note in the graph that 30K = Maximum Rows NULL, and 15K = Maximum Rows per Value (CA). Typical Rows per Value is approximately 30. You should monitor all demographics periodically as they change over time. Page 16-24 Analyze Primary Index Criteria Column Distribution Demographics for a PI Candidate Distinct Values • The more the better (compared to table row count). • Should have enough values to allow for distribution to all AMPs. Maximum Row Per Value 15K • The fewer the better. 30K Maximum Rows Null • The fewer the better. • A very large number indicates a very large distribution spike. • Large spikes can cause serious space consumption problems. Typical Rows Per Value • The fewer the better. • Monitor periodically as it changes over time. ROWS 46K 30K 15K 100 70 0 Values: NULL AZ CA Analyze Primary Index Criteria 30 10 30 30 30 GA HI MI MO NV 30 NY 30 OH OK 30 30 30 30 TX VA VT WA Page 16-25 SQL to View Data Demographics The facing page contains simple examples of SQL that can be used to determine data demographics for a column. The Average Rows per value and Typical Rows per value can be thought of as the Mean and Median of a data set. 
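One demographic discussed above but not shown among the queries on the facing page is Maximum Rows NULL. It can be checked with a simple count; the following sketch uses the same generic placeholders (tablename, column_name) as the facing page:

Max Rows with NULL in a column:
SELECT COUNT(*)
FROM   tablename
WHERE  column_name IS NULL;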
Page 16-26 Analyze Primary Index Criteria

SQL to View Data Demographics

# of Distinct Values for a column:
SELECT COUNT(DISTINCT(column_name))
FROM   tablename;

Max Rows per Value for all values in a column:
SELECT   column_name, COUNT(*)
FROM     tablename
GROUP BY 1
ORDER BY 2 DESC;

Max Rows per Value for 5 most frequent values:
SELECT   TOP 5 t_colvalue, t_count
FROM     (SELECT column_name, COUNT(*)
          FROM tablename
          GROUP BY 1) t1 (t_colvalue, t_count)
ORDER BY t_count DESC;

Average Rows per Value for a column (mean value):
SELECT COUNT(*) / COUNT(DISTINCT(col_name))
FROM   tablename;

Typical Rows per Value for a column (median value):
SELECT  t_count AS "Typical Rows per Value"
FROM    (SELECT col_name, COUNT(*)
         FROM tablename
         GROUP BY 1) t1 (t_colvalue, t_count),
        (SELECT COUNT(DISTINCT(col_name))
         FROM tablename) t2 (num_rows)
QUALIFY ROW_NUMBER () OVER (ORDER BY t1.t_colvalue) = t2.num_rows / 2;

Analyze Primary Index Criteria Page 16-27

Example of Using Data Demographic SQL
The facing page contains simple examples of SQL that can be used to determine data demographics for a column.
Page 16-28 Analyze Primary Index Criteria

Example of Using Data Demographic SQL

# of Distinct Values for a column:
SELECT COUNT(DISTINCT(Last_name)) AS "# Values"
FROM   Customer;

Result:
# Values
     464

Max Rows per Value for all values:
SELECT   Last_name, COUNT(*)
FROM     Customer
GROUP BY 1
ORDER BY 2 DESC;

Result:
Last_name   Count(*)
Smith          52
Jones          41
Wilson         38
White          36
Lee            36
:               :

Max Rows per Value for 3 most frequent values:
SELECT  t_colvalue, t_count
FROM    (SELECT Last_name, COUNT(*)
         FROM Customer
         GROUP BY 1) t_table (t_colvalue, t_count)
QUALIFY RANK (t_count) <= 3;

Result:
t_colvalue   t_count
Smith           52
Jones           41
Wilson          38

Analyze Primary Index Criteria Page 16-29

Example of Data Demographic SQL (cont.)
The facing page contains simple examples of SQL that can be used to determine data demographics for a column.
Page 16-30 Analyze Primary Index Criteria

Example of Data Demographic SQL (cont.)

Average Rows per Value for a column (mean):
SELECT 'Last_name' AS "Column Name"
      ,COUNT(*) / COUNT(DISTINCT(Last_name)) AS "Average Rows"
FROM   Customer;

Result:
Column Name   Average Rows
Last_name          15

Typical Rows per Value for a column (median):
SELECT  'Last_name' AS "Column Name"
       ,t_count     AS "Typical Rows"
FROM    (SELECT Last_name, COUNT(*)
         FROM Customer
         GROUP BY 1) t_table (t_colvalue, t_count),
        (SELECT COUNT(DISTINCT(Last_name))
         FROM Customer) t_table2 (t_distinct_count)
QUALIFY RANK (t_colvalue) = (t_distinct_count / 2);

Result:
Column Name   Typical Rows
Last_name          11

Analyze Primary Index Criteria Page 16-31

TableSize View
The TableSize[V][X] views are Data Dictionary views that provide AMP vproc information about disk space usage at a table level, optionally for tables the current User owns or has SELECT privileges on.
Example
The SELECT statement on the facing page looks for poorly distributed tables by displaying the CurrentPerm figures for a single table on all AMP vprocs. The result displays one table, table2, which is evenly distributed across all AMP vprocs in the system. The CurrentPerm figure is nearly identical across all vprocs.
The other table, table2_nupi, is poorly distributed. The CurrentPerm figures range from 9,216 bytes to 71,680 bytes on different AMP vprocs.
Page 16-32 Analyze Primary Index Criteria

TableSize View
Provides AMP vproc disk space usage at table level.
DBC.TableSize[V][X] columns: Vproc, TableName, DatabaseName, CurrentPerm, AccountName, PeakPerm
Example: Display table distribution across AMPs.
SELECT Vproc ,CAST (TableName AS CHAR(20)) ,CurrentPerm ,PeakPerm FROM DBC.TableSizeV WHERE DatabaseName = USER ORDER BY TableName, Vproc ; Analyze Primary Index Criteria Vproc 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 TableName table2 table2 table2 table2 table2 table2 table2 table2 table2_nupi table2_nupi table2_nupi table2_nupi table2_nupi table2_nupi table2_nupi table2_nupi CurrentPerm 41,472 41,472 40,960 40,960 40,960 40,960 40,960 40,960 22,528 22,528 71,680 71,680 9,216 9,216 59,392 59,392 PeakPerm 53,760 53,760 52,736 52,736 53,760 53,760 54,272 54,272 22,528 22,528 71,680 71,680 9,216 9,216 59,392 59,392 Page 16-33 SQL to View Data Distribution The facing page contains simple examples of SQL that can be used to determine actual data distribution for a table. Page 16-34 Analyze Primary Index Criteria SQL to View Data Distribution Ex: Display the distribution of Customer by AMP space usage. SELECT FROM WHERE AND ORDER BY Vproc ,TableName (CHAR(15)) ,CurrentPerm DBC.TableSizeV DatabaseName = DATABASE TableName = 'Customer' 1; Vproc 0 1 2 3 4 5 6 7 TableName Customer Customer Customer Customer Customer Customer Customer Customer CurrentPerm 127488 127488 127488 127488 128000 128000 126976 126976 Ex: Display the distribution of Customer by AMP row counts. SELECT FROM GROUP BY ORDER BY HASHAMP (HASHBUCKET (HASHROW (Customer_number))) AS "AMP #" ,COUNT(*) Customer 1 1; The Row Hash functions can be used to predict the distribution of data rows for any column in a table. Analyze Primary Index Criteria AMP # 0 1 2 3 4 5 6 7 Count(*) 867 886 877 870 881 878 879 862 Page 16-35 E-R Diagram for Exercises The E-R diagram on the facing page depicts the tables used in the exercises. Though the names of the tables and their columns are generic, the model is properly normalized to Third Normal Form (3NF). Page 16-36 Analyze Primary Index Criteria E-R Diagram for Exercises ENTITY 1 DEPENDENT HISTORY ENTITY 2 ASSOCIATIVE 1 ASSOCIATIVE 2 Note: The exercise table and column names are generic so that index selections are not influenced by names. Analyze Primary Index Criteria Page 16-37 Exercise 2 – Sample The facing page has an example of how to use Distribution demographics to identify PI candidates. On the following pages, you will be asked to identify PI candidates in a similar manner. Use the Primary Index Candidate Guidelines below to identify the PI candidates. Indicate whether they are UPI or NUPI candidates. Indicate borderline candidates with a ? In later exercises, you will make the final index choices for these tables. Primary Index Candidate Guidelines: ALL Unique Columns are PI candidates. These columns will be identified with the abbreviation ND for No Duplicates. The Primary Key (PK) is a UPI candidate. Any single column with high Distinct Values (maybe at least 10 times the number of AMPs), low Maximum Rows NULL, and with a Typical Rows per Value that is relatively close to the Maximum Rows per Value is a PI candidate. Page 16-38 Analyze Primary Index Criteria 4N*100amps Exercise 2 – Sample On the following pages, there are sample tables with distribution demographics. • Indicate ALL possible Primary Index candidates (UPI and NUPI). • Later exercises will guide your final choices. 
Example 60,000,000 Rows PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating A B PK,SA 5K 12 1M 50M 60M 1 0 1 0 UPI 2.6K 0 0 7M 12 5 7 1 NUPI Primary Index Candidate Guidelines: • PK and UNIQUE COLUMNS (ND) • Any single column with: – High Distinct values (at least 10X) – Low Maximums for NULLs or a Value – Typical Rows that is close to Max Rows C D FK,NN NN,ND 0 0 1K 5K 1.5M 500 0 35 5 NUPI? 500K 0 0 0 60M 1 0 1 3 UPI E F G H 0 0 0 0 8 8M 0 7M 0 0 0 0 0 15M 9 725K 3 4 0 0 0 0 15M 725K 5 3 4 52 4K 0 700 90K 10K 80K 9 PI/SI Collect Statistics (Y/N) Analyze Primary Index Criteria Page 16-39 Exercise 2 – Choosing PI Candidates Use the Primary Index Candidate Guidelines to identify the PI candidates. Indicate whether they are UPI or NUPI candidates. Indicate borderline candidates with a question mark (?). Primary Index Candidate Guidelines: ALL Unique Columns are PI candidates and will be identified with the abbreviation ND for No Duplicates. The Primary Key (PK) is a UPI candidate. Any single column with high Distinct Values (at least 100% greater than the number of AMPs), low Maximum Rows NULL, and with a Typical Rows per Value that is relatively close to the Maximum Rows per Value is a PI candidate. Page 16-40 Analyze Primary Index Criteria Exercise 2 – Choosing PI Candidates ENTITY 1 100,000,000 Rows PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating A B C D E F 0 0 0 0 95M 2 0 1 3 0 0 0 0 300K 400 0 325 2 0 0 0 0 250K 350 0 300 1 0 0 0 0 40M 3 1.5M 2 1 0 0 0 PK,UA 50K 0 10M 10M 100M 1 0 1 0 1M 110 0 90 1 PI/SI Collect Statistics (Y/N) Analyze Primary Index Criteria Page 16-41 Exercise 2 – Choosing PI Candidates (cont.) Use the Primary Index Candidate Guidelines to identify the PI candidates. Indicate whether they are UPI or NUPI candidates. Indicate borderline candidates with a question mark (?). Primary Index Candidate Guidelines: ALL Unique Columns are PI candidates and will be identified with the abbreviation ND for No Duplicates. The Primary Key (PK) is a UPI candidate. Any single column with high Distinct Values (at least 100% greater than the number of AMPs), low Maximum Rows NULL, and with a Typical Rows per Value that is relatively close to the Maximum Rows per Value is a PI candidate. Page 16-42 Analyze Primary Index Criteria Exercise 2 – Choosing PI Candidates (cont.) ENTITY 2 10,000,000 Rows G PK/FK PK,SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 5K 12 100M 100M 10M 1 0 1 0 H I J K L 365 0 0 0 100K 200 0 100 0 12 0 0 0 9M 2 100K 1 9 12 0 0 0 12 1M 0 800K 1 0 0 0 0 50 240K 0 190K 2 0 260 0 180K 60 0 50 0 PI/SI Collect Statistics (Y/N) Analyze Primary Index Criteria Page 16-43 Exercise 2 – Choosing PI Candidates (cont.) Use the Primary Index Candidate Guidelines to identify the PI candidates. Indicate whether they are UPI or NUPI candidates. Indicate borderline candidates with a question mark (?). Primary Index Candidate Guidelines: ALL Unique Columns are PI candidates and will be identified with the abbreviation ND for No Duplicates. The Primary Key (PK) is a UPI candidate. Any single column with high Distinct Values (at least 100% greater than the number of AMPs), low Maximum Rows NULL, and with a Typical Rows per Value that is relatively close to the Maximum Rows per Value is a PI candidate. 
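When working these exercises against a real table rather than printed demographics, the distinct-value guideline can be checked directly. The following is a minimal sketch (tablename and column_name are placeholders); HASHAMP() + 1 returns the number of AMPs in the system:

SELECT COUNT(DISTINCT(column_name)) AS "Distinct Values"
      ,HASHAMP() + 1                AS "Number of AMPs"
FROM   tablename;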
Page 16-44 Analyze Primary Index Criteria Exercise 2 – Choosing PI Candidates (cont.) DEPENDENT 5,000,000 Rows A PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating M N O PK P Q NN, ND FK SA 0 0 700K 1M 2M 4 0 1 0 0 0 0 0 50 200K 0 60K 0 0 0 0 0 90K 75 0 50 3 0 0 0 0 3M 2 390K 1 1 0 0 0 0 5M 1 0 1 0 0 0 0 0 2M 5 1M 1 1 PI/SI Collect Statistics (Y/N) Analyze Primary Index Criteria Page 16-45 Exercise 2 – Choosing PI Candidates (cont.) Use the Primary Index Candidate Guidelines to identify the PI candidates. Indicate whether they are UPI or NUPI candidates. Indicate borderline candidates with a question mark (?). Primary Index Candidate Guidelines: ALL Unique Columns are PI candidates and will be identified with the abbreviation ND for No Duplicates. The Primary Key (PK) is a UPI candidate. Any single column with high Distinct Values (at least 100% greater than the number of AMPs), low Maximum Rows NULL, and with a Typical Rows per Value that is relatively close to the Maximum Rows per Value is a PI candidate. Page 16-46 Analyze Primary Index Criteria Exercise 2 – Choosing PI Candidates (cont.) ASSOCIATIVE 1 300,000,000 Rows A PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating G R S 0 0 0 0 15K 21K 0 19K 0 0 0 0 0 800K 400 0 350 0 PK FK FK,SA 260 0 0 0 100M 5 0 3 0 0 0 8M 300M 10M 50 0 30 0 PI/SI Collect Statistics (Y/N) Analyze Primary Index Criteria Page 16-47 Exercise 2 – Choosing PI Candidates (cont.) Use the Primary Index Candidate Guidelines to identify the PI candidates. Indicate whether they are UPI or NUPI candidates. Indicate borderline candidates with a question mark (?). Primary Index Candidate Guidelines: ALL Unique Columns are PI candidates and will be identified with the abbreviation ND for No Duplicates. The Primary Key (PK) is a UPI candidate. Any single column with high Distinct Values (at least 100% greater than the number of AMPs), low Maximum Rows NULL, and with a Typical Rows per Value that is relatively close to the Maximum Rows per Value is a PI candidate. Page 16-48 Analyze Primary Index Criteria Exercise 2 – Choosing PI Candidates (cont.) ASSOCIATIVE 2 100,000,000 Rows A M PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating G T U 0 0 0 0 560K 180 0 170 0 0 0 0 0 750 135K 0 100K 0 PK FK FK 0 0 7M 800M 50M 3 0 1 0 0 0 250K 20M 10M 150 0 8 0 PI/SI Collect Statistics (Y/N) Analyze Primary Index Criteria Page 16-49 Exercise 2 – Choosing PI Candidates (cont.) Use the Primary Index Candidate Guidelines to identify the PI candidates. Indicate whether they are UPI or NUPI candidates. Indicate borderline candidates with a question mark (?). Primary Index Candidate Guidelines: ALL Unique Columns are PI candidates and will be identified with the abbreviation ND for No Duplicates. The Primary Key (PK) is a UPI candidate. Any single column with high Distinct Values (at least 100% greater than the number of AMPs), low Maximum Rows NULL, and with a Typical Rows per Value that is relatively close to the Maximum Rows per Value is a PI candidate. Page 16-50 Analyze Primary Index Criteria Exercise 2 – Choosing PI Candidates (cont.) 
HISTORY 730,000,000 Rows A PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating DATE D E F 0 0 0 0 N/A N/A N/A N/A N/A 0 0 0 0 N/A N/A N/A N/A N/A 0 0 0 0 N/A N/A N/A N/A N/A PK FK SA 10M 0 800M 2.4B 100M 18 0 3 0 5K 20K 0 0 730 1100K 0 900K 0 PI/SI Collect Statistics (Y/N) Analyze Primary Index Criteria Page 16-51 What is a NoPI Table? A NoPI Table is simply a table without a primary index. Prior to Teradata Database 13.0, Teradata tables required a primary index. The primary index was primarily used to hash and distribute rows to the AMPs according to hash ownership. The objective was to divide data as evenly as possible among the AMPs to make use of Teradata’s parallel processing. Each row stored in a table has a RowID which includes the row hash that is generated by hashing the primary index value. For example, the optimizer can choose an efficient single-AMP execution plan for SQL requests that specify values for the columns of the primary index. Starting with Teradata Database 13.0, a table can be defined without a primary index. This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index. Without a PI, the hash value as well as AMP ownership of a row is arbitrary. Within the AMP, there are no row-ordering constraints and therefore rows can be appended to the end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket value that is internally generated. A NoPI table is internally treated as a hashed table; it is just that typically all the rows on one AMP will have the same hash bucket value. Page 16-52 Analyze Primary Index Criteria What is a NoPI Table? What is a No Primary Index (NoPI) Table? • It is simply a table without a primary index – a Teradata 13.0 feature. • As rows are inserted into a NoPI table, rows are always appended at the end of the table and never inserted in a middle of a hash sequence. – Organizing/sorting rows based on row hash is therefore avoided. Basic Concepts • Rows will still be distributed between AMPs. New code (Random Generator) will determine which AMP will receive rows or blocks of rows. • Within an AMP, rows are simply appended to the end of the table. Rows will have a unique RowID – the Uniqueness Value is incremented. Benefits • A NoPI table will reduce skew in intermediate ETL tables which have no natural Primary Index. • Loads (FastLoad and TPump Array Insert) into a NoPI staging table are faster. Analyze Primary Index Criteria Page 16-53 Reasons to Consider Using NoPI Tables The facing page identifies various reasons to consider using NoPI tables. Why is a NoPI table useful? A NoPI can be very useful in those situations when the default primary index (first column) causes skewing of data between AMPs and performance degradation. This type of table provides a performance advantage in that data can be loaded and stored quickly into a NoPI table using FastLoad or TPump Array INSERT. Page 16-54 Analyze Primary Index Criteria Reasons to Consider Using NoPI Tables Reasons to consider using a NoPI Table • Utilize NoPI tables instead of arbitrarily defaulting to first table column or creating an unnatural Primary Index from many columns. • Some ETL tools generate intermediate tables to store data without a known distribution of values. If the first column is used (defaults) as the primary index (NUPI), this may lead to skewed data and performance issues. 
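As a simple illustration of the concept (the table and column names below are hypothetical and not part of the course database), a staging table might be declared without a primary index as follows; the CREATE TABLE considerations are covered on the following pages:

CREATE TABLE Sales_Staging
  (Store_Id     INTEGER
  ,Item_Id      INTEGER
  ,Sale_Date    DATE
  ,Sale_Amount  DECIMAL(10,2))
NO PRIMARY INDEX;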
– The system default can be set to create tables without a primary index. • As a staging table to be used with the mini-batch loading technique. • A NoPI table can be used as a Sandbox table (or any table) where data can be inserted until an appropriate indexing method is determined. • A NoPI table can be used as a Log file. • As a Column Partitioned (columnar) table – Teradata 14.0 feature. Analyze Primary Index Criteria Page 16-55 Creating a Table without a PI The facing page identifies the syntax to create a table without a primary index. If you attempt to include the key word SET (set table) and NO PRIMARY INDEX in the same CREATE TABLE statement, you will receive a syntax error. Page 16-56 Analyze Primary Index Criteria Creating a Table without a PI To create a NoPI table, specify the NO PRIMARY INDEX clause in the CREATE TABLE statement. CREATE TABLE ( , , …) NO PRIMARY INDEX; Considerations: – When a table is created with no primary index, the TableKind column is set to 'O' instead of 'T' and appears in the DBC.TVM table. – If PRIMARY KEY or UNIQUE constraints are also defined, these will be implemented as Unique Secondary Indexes. – A NoPI table is automatically created as a MULTISET table. Analyze Primary Index Criteria Page 16-57 How is a NoPI Table Implemented? The NoPI Table feature is another step toward extending or supporting Mini-Batch. By allowing a table with no primary index acting as a staging table, data can be loaded into the table a lot more efficiently and in turn faster. All of the rows in a data request, after being received by Teradata and converted into proper internal format, can be appended to a NoPI table without having to be redistributed to their hash-owning AMPs. Rows in a NoPI table are not hashed based on the primary index because there isn’t one. The hash values are all internally controlled and generated and therefore the rows can be stored in any particular order and in any AMP. That means sorting of the rows is avoided. The performance advantage, especially for FastLoad, from using a NoPI table is most significant for applications that currently load data into a staging table to be transformed or standardized before being stored into another staging table or the target table. For those applications, using a NoPI table can avoid the unnecessary row redistribution and sorting work. Another advantage for FastLoad is that users can quickly load data into a NoPI table and be done with the acquisition phase freeing up Client resources for other work. For TPump, the performance advantage can be much bigger especially for applications that were not able to pack many rows into the same AMP step in a traditional PI table. On a NoPI table, all rows in a data request are packed into the same AMP step independently from the system configuration and the clustering of data. This will generally lead to big reductions in CPU and IO usage. Page 16-58 Analyze Primary Index Criteria How is a NoPI Table Implemented? Rows are distributed between AMPs using a random generator. Within an AMP, rows are simply added to a table in sequential order. • The random generator is designed in such as way that data will be balanced out between the AMPs. • Although there is no primary index in a NoPI table, rows will still have a valid 64-bit RowID. The first part of the RowID is based on a hash bucket value (16 or 20 bits) that is internally generated and controlled by the AMP. 
• Typically, all the rows in a table on one AMP will have the same hash bucket value, but will have different uniqueness values. There are two separate steps used with a NoPI table. 1. A new internal function (e.g., random generator) is used to choose a hash bucket which effectively determines which AMP the row(s) are sent to. 2. The AMP internally selects a hash bucket value that the AMP owns and uses it as the first part (16 or 20 bits) of the RowID. Analyze Primary Index Criteria Page 16-59 NoPI Random Generator For SQL-based functions, the PE uses the following technique for the random generator. The DBQL Query ID is used by the random generator to select a random row hash. The approach is to generate a random row hash in such a way that for a new request, data will generally be sent to a different AMP from the one that the previous request sent data to. The goal is to balance out the data as much as possible without the use of the primary index. The DBQL Query ID is selected for this purpose because it uses the PE vproc ID in its high digits and a counter-based value in its low digits. There are two cases for INSERT; one is when only one single data row is processed and the other is when multiple data rows are processed with an Array INSERT request. In the case of an Array INSERT request, rows are sorted by their hash-owning AMPs so that the rows going to the same AMP can easily be grouped together into the same step. This random row hash will be generated once per request so that in the case of Array INSERT, the same random row hash is used for all of the rows. This means they all will be sent to the same AMP and usually in the same step. FastLoad sends blocks of data to the AMPs. Each AMP (that receives blocks of data) uses random generator code to distribute blocks of data between all of the AMPs in a round robin fashion. Page 16-60 Analyze Primary Index Criteria NoPI Random Generator How is the AMP selected that will receive the row (or block of rows)? • The random generator can be executed at the PE or at the AMP level depending on the type of request (e.g., SQL versus FastLoad). For SQL-based functions, the PE uses the random generator. • The DBQL Query ID is used by the random generator to select a random hash value. – The approach is to generate a random hash bucket value in such a way that for a new request, data will generally be sent to a different AMP from the one that the previous request sent data to. – In the case of an Array INSERT request, this random hash bucket value will be generated once per request so that in the case of Array INSERT, the same random hash bucket value is used for all of the rows. For FastLoad-based functions, the AMP uses random generator code to distribute blocks of data between the AMPs in a round robin fashion. Analyze Primary Index Criteria Page 16-61 The Row ID for a NoPI Table For a NoPI table, the AMP will assign a RowID (64 bits) for a row or a set of rows using a hash bucket that the AMP owns. For a NoPI table, the RowID will consist of a 20-bit hash bucket followed by 44 bits that are used for the uniqueness part of the RowID. Only the first 20 bits (hash bucket) are used. As more rows are added to the table, the uniqueness value is sequentially incremented. For systems using a 16-bit hash buckets, the RowID for a NoPI table will have 16 bits for the hash bucket value and 48 bits for the uniqueness id. 
Page 16-62 Analyze Primary Index Criteria The Row ID for a NoPI Table The RowID will still be 64 bits, but it is utilized a little differently in a NoPI table. • The first 20 bits represents the hash bucket that is internally selected by the AMP. • Remaining 44 bits are used for the uniqueness value of rows in a NoPI table. • Note: Systems may be configured to use 16 bits for the hash bucket numbers – if so, then the uniqueness value will utilize 48 bits of the RowID. Row ID for NoPI table Hash Bucket 20 (or 16) bits Row ID Each row still has a Row ID as a prefix. Rows are logically maintained in Row ID sequence. Uniqueness Value 44 (or 48) bits Hash Bucket 000E7 000E7 000E7 000E7 000E7 000E7 : Analyze Primary Index Criteria Row Data Uniqueness 00000000001 00000000002 00000000003 00000000004 00000000005 00000000006 : Cust_No Last_Name First_Name 001018 001020 001031 001014 001012 001021 : Reynolds Davidson Green Jacobs Garcia Carnet : Jane Evan Jason Paul Jose Jean : Page 16-63 The Row ID for a NoPI Table (cont.) For a NoPI table, the AMP will assign a RowID (64 bits) for a row or a set of rows using a hash bucket that the AMP owns. This 64-bit RowID can be used by secondary and join indexes. What is a different about the RowID for a NoPI table is that the uniqueness id is 44 bits long instead of 32 bits. The additional 12 bits available in the row hash are added to the 32-bit uniqueness. This gives a total of 44 bits to use for the uniqueness part of the RowID. For each hash bucket, there can be up to 17 trillion rows per AMP (approximately). For systems using a 16-bit hash buckets, the RowID for a NoPI table will have 16 bits for the hash bucket value and 48 bits for the uniqueness id. The RowID is still 64 bits long and a unique identifier of a row within a table. Page 16-64 Analyze Primary Index Criteria The Row ID for a NoPI Table (cont.) The RowID is 64 bits and can be referenced by secondary and join indexes. • The first 20 (or 16) bits represent the hash bucket value which is internally chosen by and controlled by the AMP. • Remaining 44 (or 48) bits are used for the uniqueness value of rows in a NoPI table. This module assumes that 20-bit hash bucket numbers are used. – The uniqueness value starts from 1 and will be sequentially incremented. – With 44 bits, there can be approximately 17 trillion rows on an AMP. • Normally, all rows in a NoPI table on an AMP will have the same hash bucket value (first 20 bits) and the 44-bit uniqueness value will start at 1 and be sequentially incremented. • Each row in a NoPI table will have a RowID with a hash bucket value that is actually owned by the AMP storing the row. Fallback and index maintenance work the same as if the table is a primary index table. As always, the RowID is transparent to the end-user. Analyze Primary Index Criteria Page 16-65 Multiple NoPI Tables at the AMP Level The facing page illustrates an example of two NoPI tables in a 27-AMP system. Other NoPI considerations include: Archive/Recovery Issues Archive/Restore will be supported for NoPI table. Archiving a table or a database and restoring or copying that to the same system or a different system should work out fine with the existing scheme for NoPI table when no data redistribution takes place (same number of AMPs). Data redistribution takes place when there is a difference in configuration or hash function between the source system and the target system. 
In the case of a difference in configuration, each row in a table will be looked at and if its hash bucket belongs to some other AMP using the new configuration, that row will be redistributed to its hash-owning AMP. Since one hash bucket is normally enough to use to assign RowID to all of the rows on each AMP, when we restore or copy data to a different configuration with more AMPs, there will be AMPs that will not have any data at all. This means that data in a NoPI table can be skewed after a Restore or Copy. This is because permanent space is divided equally among the AMPs whether or not any of them get any data. As some AMPs not getting any data from a Restore or Copy, some other AMPs will get more data compared to what it was in the source system and this will require more space allocated overall. However, as a staging table, NoPI table is not intended to stay around for too long so it is not expected to have many NoPI tables being restored or copied. Reconfig Issues Reconfig will be supported for NoPI table. The issue with Reconfig is very similar to that of Restore or Copy to a different configuration. Although rows in a NoPI table are not hashed based on the primary index and the AMPs where they reside are arbitrary, but each row does have a RowID with a hash bucket that is owned by the AMP storing that row. Redistributing rows in a NoPI table via Reconfig can be done by sending each row to the AMP that owns the hash bucket in that row based on the new configuration map. As with Restore and Copy, Reconfig can make a NoPI table skewed by going to a configuration with more AMPs. Page 16-66 Analyze Primary Index Criteria Multiple NoPI Tables at the AMP Level AMP 0 ... TableID AMP 3 ... Row ID Hash Uniq Value AMP 17 Row Data TableID ... Row ID Hash Uniq Value NoPI Table1 00089A (Base) 00089A (Base) 00089A (Base) 00089A (Base) 000E7 000E7 000E7 000E7 00000000001 00000000002 00000000003 00000000004 00089A (Base) 00089A (Base) 00089A (Base) 00089A (Base) 0003F 0003F 0003F 0003F 00000000001 00000000002 00000000003 00000000004 NoPI Table2 00089B (Base) 00089B (Base) 00089B (Base) 00089B (Base) 000E7 000E7 000E7 000E7 00000000001 00000000002 00000000003 00000000004 00089B (Base) 00089B (Base) 00089B (Base) 00089B (Base) 0003F 0003F 0003F 0003F 00000000001 00000000002 00000000003 00000000004 AMP 26 Row Data Data within an AMP is logically stored in Table ID / Row ID sequence. Analyze Primary Index Criteria Page 16-67 Loading Data into a NoPI Table The facing page summarizes various techniques of getting data inserted into a NoPI table. Page 16-68 Analyze Primary Index Criteria Loading Data into a NoPI Table Simple INSERTs • For a simple INSERT, the PE selects a random AMP where the row is sent to. That AMP then turns the row into proper internal format and appends it to the end of the NoPI table. INSERT–SELECT • When inserting data from a source PI (or NoPI) table into a NoPI target table, data from the source table will NOT be redistributed and will be locally appended into the target table. INSERT-SELECT to a target NoPI table can result in a skewed NoPI table if the source table is skewed. FastLoad • Blocks of data are sent to the AMP load sessions and the AMP random generator code randomly distributes the blocks between the AMPs usually resulting in even distribution of the data between AMPs. TPump • With TPump Array INSERT, rows are packed together in a request and distributed to an AMP and then appended to the NoPI table on that AMP. 
Different requests are distributed to different AMPs by the PE. This will usually result in even distribution of the data between the AMPs. Analyze Primary Index Criteria Page 16-69 NoPI Options The following options are available to a NoPI table: • • • • • • FALLBACK Secondary indexes – USI and NUSI Join and reference indexes Primary Key and Foreign Key constraints are allowed on a NoPI table. LOBs are allowed on a NoPI table. INSERT and DELETE trigger actions are allowed on a NoPI table. – UPDATE trigger actions will be allowed starting with Teradata 13.00.00.03. • NoPI table can be a Global Temporary or Volatile table. • COLLECT/DROP STATISTICS are allowed on a NoPI table. • FastLoad – note that duplicate rows are loaded and not deleted with a NoPI table The following limitations apply to a NoPI table: • • • • • • • • • Page 16-70 SET is not allowed. Default is MULTISET for both Teradata and ANSI mode. No columns are allowed to be specified for the primary index. Partitioned primary index is not allowed. Permanent journaling is not allowed. Identity column is not allowed. Cannot be created as a queue or as an error table. Hash index is not allowed on a NoPI table. MultiLoad cannot be used to load a NoPI table. UPDATE, UPSERT, and MERGE-INTO operations are using the NoPI table as the target table. – UPDATE will be available with Teradata 13.00.00.03 Analyze Primary Index Criteria NoPI Table Options Options available with NoPI tables • • • • FALLBACK Secondary indexes – USI and NUSI Join and reference indexes Primary Key and Foreign Key constraints are allowed. • LOBs are allowed on a NoPI table. • INSERT and DELETE trigger actions are allowed on a NoPI table. – UPDATE trigger actions will be allowed starting with Teradata 13.00.00.03. • Can be a Global Temporary or Volatile table. • COLLECT/DROP STATISTICS are Limitations of NoPI tables • • • • • • SET tables are not allowed. Partitioned primary index is not allowed. Permanent journaling is not allowed. Identity column is not allowed. Cannot be a queue or as an error table. Hash index is not allowed on a NoPI table. • MultiLoad cannot be used on a NoPI table. • UPDATE, UPSERT, and MERGE-INTO operations using the NoPI table as the target table are not allowed. – UPDATE will be available with Teradata 13.00.00.03 allowed. • FastLoad – note that duplicate rows are loaded and not deleted with a NoPI table Analyze Primary Index Criteria Page 16-71 Summary The facing page summarizes some of the key concepts covered in this module. Page 16-72 Analyze Primary Index Criteria Summary Tables with a Primary Index: • Base PI on the column(s) most often used for access, provided that the values are unique or nearly unique. • Duplicate values hash to the same AMP and are stored in the same data block when possible. • PRIMARY KEY and/or UNIQUE constraints are always implemented as a unique index (either a UPI or a USI. Tables without a Primary Index: • Although there is no primary index in a NoPI table, rows do have a valid row ID with both hash and uniqueness. – Hash value is internally selected in the AMP • Rows in a NoPI table will be even distributed between the AMPs based upon a new code (i.e., random generator). Analyze Primary Index Criteria Page 16-73 Module 16: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 16-74 Analyze Primary Index Criteria Module 16: Review Questions 1. 
Which trade-off must be balanced to make the best choice for a primary index? ____ a. b. c. d. Access and volatility Access and block size Block size and volatility Access and distribution 2. When volatility is considered as one of the Primary Index choice criteria, what is analyzed? ____ a. b. c. d. Degree of uniqueness How often the data values will change How often the fixed length rows will change How frequently the column is used for access 3. To optimize the use of disk space, the designer should choose a primary index that ________. a. b. c. d. e. is non-unique consists of one column is unique or nearly unique consists of multiple columns has fewer distinct values than AMPs Analyze Primary Index Criteria Page 16-75 Module 16: Review Questions (cont.) Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 16-76 Analyze Primary Index Criteria Module 16: Review Questions (cont.) 4. For NoPI tables, what are 2 ways in which the Random Generator is executed? a. b. c. d. At the AMP level with FastLoad At the PE level for ad hoc SQL requests At the TPump client level for array insert operations At the AMP level for INSERT-SELECT into an empty NoPI table 5. Assume DBSControl flag #53 (Primary Index Default) is set to N (No Primary Index), which two indexes are created for TableX given the following DDL command? CREATE TABLE TableX (col1 INTEGER NOT NULL UNIQUE ,col2 CHAR(10) NOT NULL PRIMARY KEY ,col3 CHAR(80)); a. b. c. d. col1 will be a UPI col1 will be a USI col2 will be a UPI col2 will be a USI 6. Which two options are permitted for NoPI tables? a. b. c. d. Fallback MultiLoad Hash Index BLOBs and CLOBs Analyze Primary Index Criteria Page 16-77 Lab Exercise 16-1 Check your understanding of the concepts discussed in this module by completing the lab exercise as directed by your instructor. Page 16-78 Analyze Primary Index Criteria Lab Exercise 16-1 Lab Exercise 16-1 Purpose In this lab, you will use Teradata SQL Assistant to evaluate various columns of table as primary index candidates. What you need Populated PD.Employee table; your empty Employee table Tasks 1. INSERT/SELECT all rows from the populated PD.Employee table to your “Employee” table. Verify the number of rows in your table. INSERT INTO Employee SELECT * FROM PD.Employee; SELECT COUNT(*) FROM Employee; Analyze Primary Index Criteria Count = _________ Page 16-79 Lab Exercise 16-1 (cont.) Use the following SQL to determine the column metrics for this Lab. # of Distinct Values for a column: SELECT COUNT(DISTINCT(column_name)) FROM tablename; Max Rows per Value for all values in a column: SELECT column_name, COUNT(*) FROM tablename GROUP BY 1 ORDER BY 2 DESC; Max Rows with NULL in a column: SELECT COUNT(*) FROM tablename WHERE column_name IS NULL; Average Rows per Value for a column (mean value): SELECT COUNT(*) / COUNT(DISTINCT(col_name)) FROM tablename; Typical Rows per Value for a column (median value): SELECT t_count AS "Typical Rows per Value" FROM (SELECT col_name, COUNT(*) FROM tablename GROUP BY 1) t1 (t_colvalue, t_count), (SELECT COUNT(DISTINCT(col_name)) FROM tablename) t2 (num_rows) QUALIFY ROW_NUMBER () OVER (ORDER BY t1.t_colvalue) = t2.num_rows /2 ; Page 16-80 Analyze Primary Index Criteria Lab Exercise 16-1 (cont.) 2. Collect column demographics for each of these columns in Employee and determine if the column would be a primary index candidate or not. 
By using the SHOW TABLE Employee command, you should be able to complete the Employee_number information without executing any SQL. Distinct Values Max Rows for a Value Max Rows NULL Avg Rows per Value Candidate for PI (Y/N) Employee_Number Dept_Number Job_Code Last_name Analyze Primary Index Criteria Page 16-81 Lab Exercise 16-2 Distribution of table space by AMP: SELECT FROM WHERE AND ORDER BY Page 16-82 Vproc, TableName (CHAR(15)), CurrentPerm DBC.TableSizeV DatabaseName = DATABASE TableName = 'tablename' 1; Analyze Primary Index Criteria Lab Exercise 16-2 Lab Exercise 16-2 Purpose In this lab, you will use the DBC.TableSizeV view to determine space distribution on a per AMP basis. What you need Your populated Employee table. Tasks 1. Use SHOW TABLE command to determine which column is the Primary Index. PI = ______________ Determine the AMP space usage of your Employee table using DBC.TableSizeV. AMP #_____ has the least amount of permanent space – amount __________ AMP #_____ has the greatest amount of permanent space – amount __________ 2. Create a new table named Employee_2 with the same columns as Employee except specify Last_name as the Primary Index. Use INSERT/SELECT to populate Employee_2 from Employee. Determine the AMP space usage of your Employee_2 table using DBC.TableSizeV. AMP #_____ has the least amount of permanent space – amount __________ AMP #_____ has the greatest amount of permanent space – amount __________ Analyze Primary Index Criteria Page 16-83 Notes Page 16-84 Analyze Primary Index Criteria Module 17 Partitioned Primary Indexes After completing this module, you will be able to: Describe the components that comprise a Row ID in a partitioned table. List two advantages of partitioning a table. List two potential disadvantages of partitioning a table. Create single-level and multi-level partitioned tables. Use the PARTITION key word to display partition information. Teradata Proprietary and Confidential Partitioned Primary Indexes Page 17-1 Notes Page 17-2 Partitioned Primary Indexes Table of Contents Partitioning a Table .................................................................................................................... 17-4 How is Partitioning Implemented?............................................................................................. 17-6 Logical Example of NPPI versus PPI ........................................................................................ 17-8 Primary Index Access (NPPI) .................................................................................................. 17-10 Primary Index Access (PPI) ..................................................................................................... 17-12 Why Partition a Table? ............................................................................................................ 17-14 Advantages/Disadvantages of Partitioning .............................................................................. 17-16 Disadvantages of Partitioning .............................................................................................. 17-16 PPI Considerations ................................................................................................................... 17-18 Access of Tables with a PPI ................................................................................................. 17-18 How to Define a PPI ................................................................................................................ 
17-20 Partitioning with CASE_N and RANGE_N ............................................................................ 17-22 Partitioning with RANGE_N – Example 1 .............................................................................. 17-24 Access using Partitioned Data – Example 1 (cont.) ............................................................. 17-26 Access Using Primary Index – Example 1 (cont.) ............................................................... 17-28 Place a USI on NUPI – Example 1 (cont.) ........................................................................... 17-30 Place a NUSI on NUPI – Example 1 (cont.) ........................................................................ 17-32 Partitioning with RANGE_N – Example 2 .............................................................................. 17-34 Partitioning – Example 3.......................................................................................................... 17-36 Special Partitions with CASE_N and RANGE_N ................................................................... 17-38 Special Partition Examples ...................................................................................................... 17-40 Partitioning with CASE_N – Example 4 ................................................................................. 17-42 Additional examples: ........................................................................................................... 17-42 SQL Use of PARTITION Key Word ....................................................................................... 17-44 SQL Use of CASE_N .............................................................................................................. 17-46 Using ALTER TABLE with PPI Tables .................................................................................. 17-48 ALTER TABLE – Example 5 .................................................................................................. 17-50 ALTER TABLE – Example 5 (cont.) ...................................................................................... 17-52 ALTER TABLE TO CURRENT ............................................................................................. 17-54 ALTER TABLE TO CURRENT – Example 6 ........................................................................ 17-56 PPI Enhancements.................................................................................................................... 17-58 Multi-level PPI Concepts ......................................................................................................... 17-60 Multi-level PPI Concepts (cont.) ............................................................................................. 17-62 Multi-level Partitioning – Example 7....................................................................................... 17-64 Multi-level Partitioning – Example 7 (cont.) ........................................................................... 17-66 How is the MLPPI Partition # Calculated? .............................................................................. 17-68 Character PPI ........................................................................................................................... 17-70 Character PPI – Example 8 ...................................................................................................... 
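Tying this back to the Orders example on the earlier facing pages (the column names below are illustrative; the logical example abbreviates them as O_# and O_Date), a single-value access on the Primary Index such as the sketch below is a one-AMP operation; for an NPPI table, the PE simply includes Partition #0 in the message it builds:

SELECT *
FROM   Orders
WHERE  Order_Number = 1028;   -- equality on the PI column results in a one-AMP access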
17-72 Summary .................................................................................................................................. 17-74 Module 17: Review Questions ................................................................................................. 17-76 Lab Exercise 17-1 .................................................................................................................... 17-80 Partitioned Primary Indexes Page 17-3 Partitioning a Table As part of implementing a physical design, Teradata provides numerous indexing options that can improve performance for different types of queries and workloads. For example, secondary indexes, join indexes, or hash indexes may be utilized to improve performance for known queries. Teradata provides additional new indexing options to provide even more flexibility in implementing a Teradata database. One of these new indexing options is the Partitioned Primary Index (PPI). Key characteristics of Partitioned Primary Indexes are listed on the facing page. Primary indexes can be partitioned or non-partitioned. A non-partitioned primary index (NPPI) is the traditional primary index by which rows are assigned to AMPs. Apart from maintaining their storage in row hash order, no additional assignment processing of rows is performed once they are hashed to an AMP. A partitioned primary index (PPI) permits rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries that are predicated on primary index values. The Partitioned Primary Index (PPI) feature allows a class of queries to access a portion of a large table, instead of the whole table. The traditional uses of the Primary Index (PI) for data placement and rapid access of the data when the PI values are specified are retained. Some common business queries generally require a full-table scan of a large table, even though it’s predictable that a fairly small percentage of the rows will qualify. One example of such a query is a trend analysis application that compares current month sales to the previous month, or to the same month of the previous year, using a table with several years of sales detail. Another example is an application that compares customer behavior in one (fairly small) geographic region to another region. Acronyms: PI – Primary Index PPI – Partitioned Primary Index NPPI – Non-Partitioned Primary Index Page 17-4 Partitioned Primary Indexes Partitioning a Table What is a “Partitioned Primary Index” or PPI? • A indexing mechanism in Teradata for use in physical database design. • Data rows are grouped into partitions at the AMP level – partitioning is simply an ordering of the rows within a table on an AMP. What advantages does partitioning provide? • Increases the available options to improve the performance of certain types of queries – specifically range-constrained queries. • Only the rows of the qualified partitions in a query need to be accessed – avoid full table scans. How is a PPI created and managed? • A PPI is easy to create and manage. – The CREATE TABLE and ALTER TABLE statements contain options to create and/or alter partitions. • As always, data is distributed among AMPs and automatically placed within partitions. Partitioned Primary Indexes Page 17-5 How is Partitioning Implemented? The PRIMARY INDEX clause (part of the CREATE TABLE statement) has been extended to include a PARTITION BY clause. 
This new partition expression definition is the only thing that needs to be done to create a partitioned table. Advantages to this approach are: No separate partition layout No disk layout for partitions No definition of location in the system for partition No need to define/manage separate tables per segment of the table that needs to be accessed Even data distribution and even processing of a logical partition is automatic due to the PI distribution of the rows No query has to be modified to take advantage of a PPI table. For tables with a PPI, Teradata utilizes a 3-level scheme to distribute and later locate the data. The 3 levels are: Rows are distributed across all AMPs (and accessed via the Primary Index) based upon HBN (Hash Bucket Number) portion of the Row Hash. At the AMP level, rows are first ordered by their partition number. Within the partition, data rows are logically stored in Row ID sequence. A new term is associated with PPI tables. The Row Key is a combination of the Partition # and the Row Hash. The term Row Key will appear in EXPLAIN reports. Page 17-6 Partitioned Primary Indexes How is Partitioning Implemented? Provides an additional level of data distribution and ordering. • Rows are distributed across all AMPs (via Primary Index) based upon HBN portion of the Row Hash. • Rows are first ordered by their partition number within the AMP. • Within the partition, data rows are logically stored in Row ID sequence. If a table is partitioned, rows are placed into partitions. • Teradata 13.10 (and before) – partitions are numbered 1 to 65,535. • Teradata 14.0 – maximum combined partitions is increased to 9.223 Quintillion. – If combined partitions is <= 65,535, then 2-byte partition numbers are used. – If combined partitions is > 65,535, then 8-byte partition numbers are used. In a partitioned table, each row is uniquely identified by the following: • Row ID = Partition # + Row Hash + Uniqueness Value • Row Key = Partition # + Row Hash (e.g., Row Key will appear in Explain plans) – In a partitioned table, data rows will have the Partition # included as part of the data row. To help understand how partitioning is implemented, this module will include examples of data access using tables defined with NPPI and PPI. Partitioned Primary Indexes Page 17-7 Logical Example of NPPI versus PPI The facing page provides a logical example of an Orders table implemented with a NPPI (Non-Partitioned Primary Index) and the same table implemented with a PPI (Partitioned Primary Index). Only the Order_Number and a portion (YY/MM) of the Order_Date are shown in the example. The column headings in this example represent the following: RH – Row Hash – the two-digit row hash is used for simplification purposes. A true table would contain a Row ID for each row (Row Hash + Uniqueness Value). Note that as just in a real implementation, two different order numbers happen to hash to the same row hash value. Order numbers 1012 and 1043 on AMP 2 both hash to ‘36’. O_# – Order Number – this example assumes that Order Number is the Primary Index and the data rows are hash distributed based on this value. O_Date – Order Date – another column in the table. This example only contains orders for 4 months – from January, 2012 through April, 2012. For example, an order date, such as 12/01, represents January of 2012 (or 2012/01). Important points to understand from this example: All of the rows in the NPPI table are stored in logical Row ID sequence (row hash + uniqueness value) within each AMP. 
The rows in the PPI table are first ordered by Partition Number, and then by Row Hash (actually Row ID) sequence within the Partition. This example illustrates 4 partitions – one for each of the 4 months shown in the example. A query that requests “order information” (with a WHERE condition that specifies a range of dates) will result in a full table scan of the NPPI table. The same query will only have to access the required partitions in the PPI table. Page 17-8 Partitioned Primary Indexes Logical Example of NPPI versus PPI 4 AMPs with Orders Table defined with Non-Partitioned Primary Index (NPPI). 4 AMPs with Orders Table defined with PPI on O_Date. SELECT … WHERE O_Date BETWEEN '2012-03-01' AND '2012-03-31'; Partitioned Primary Indexes RH O_# O_Date RH O_# O_Date RH O_# O_Date RH O_# '01' 1028 12/03 '06' 1009 12/01 '04' 1008 12/01 '02' 1024 12/02 '03' 1016 12/02 '07' 1017 12/02 '05' 1048 12/04 '08' 1006 12/01 '12' 1031 12/03 '10' 1034 12/03 '09' 1018 12/02 '11' 1019 12/02 '14' 1001 12/01 '13' 1037 12/04 '15' 1042 12/04 '18' 1041 12/04 '17' 1013 12/02 '16' 1021 12/02 '19' 1025 12/03 '20' 1005 12/01 '23' 1040 12/04 '21' 1045 12/04 '24' 1004 12/01 '22' 1020 12/02 '28' 1032 12/03 '26' 1002 12/01 '27' 1014 12/02 '25' 1036 12/03 '30' 1038 12/04 '29' 1033 12/03 '32' 1003 12/01 '31' 1026 12/03 '35' 1007 12/01 '34' 1029 12/03 '33' 1039 12/04 '38' 1046 12/04 '39' 1011 12/01 '36' 1012 12/01 '40' 1035 12/03 '41' 1044 12/04 '42' 1047 12/04 '36' 1043 12/04 '44' 1022 12/02 '43' 1010 12/01 '48' 1023 12/02 '45' 1015 12/02 '47' 1027 12/03 '46' 1030 12/03 RH O_# O_Date RH O_# O_Date RH O_# O_Date RH O_# O_Date '14' 1001 12/01 '06' 1009 12/01 '04' 1008 12/01 '08' 1006 12/01 '35' 1007 12/01 '26' 1002 12/01 '24' 1004 12/01 '20' 1005 12/01 '39' 1011 12/01 '36' 1012 12/01 '32' 1003 12/01 '43' 1010 12/01 '03' 1016 12/02 '07' 1017 12/02 '09' 1018 12/02 '02' 1024 12/02 '17' 1013 12/02 '16' 1021 12/02 '27' 1014 12/02 '11' 1019 12/02 '48' 1023 12/02 '45' 1015 12/02 '44' 1022 12/02 '22' 1020 12/02 '01' 1028 12/03 '10' 1034 12/03 '19' 1025 12/03 '25' 1036 12/03 '12' 1031 12/03 '29' 1033 12/03 '40' 1035 12/03 '31' 1026 12/03 '28' 1032 12/03 '34' 1029 12/03 '47' 1027 12/03 '46' 1030 12/03 '23' 1040 12/04 '13' 1037 12/04 '05' 1048 12/04 '18' 1041 12/04 '30' 1038 12/04 '21' 1045 12/04 '15' 1042 12/04 '38' 1046 12/04 '42' 1047 12/04 '36' 1043 12/04 '33' 1039 12/04 '41' 1044 12/04 O_Date Page 17-9 Primary Index Access (NPPI) A non-partitioned table (NPPI) has a traditional primary index by which rows are assigned to AMPs. Apart from maintaining their storage in row hash order, no additional assignment processing of rows is performed once they are hashed to an AMP. With a NPPI table, the PARSER will include Partition Number 0 in the request. For a table with a NPPI, all of the rows are assumed to be part of one partition (Partition 0). Assuming that an SQL statement (e.g., SELECT) provides equality value(s) to the column(s) of a Primary Index, the TD Database software retrieves the row or rows from a single AMP as described below. The Parsing Engine (PE) creates a four-part message composed of the Table ID, Partition #0, the Row Hash, and Primary Index value(s). The 48-bit Table ID is located via the Data Dictionary, the 32 bit Row Hash value is generated by the Hashing Algorithm, and the Primary Index value(s) come from the SQL request. The Parsing Engine (via the Data Dictionary) knows if a table has a NPPI and sets the Partition Number to 0. 
The Message Passing Layer uses a portion of the Row Hash to determine to which AMP to send the request. The Message Passing Layer uses the HBN portion of the Row Hash (first 16 or 20 bits of the Row Hash) to locate a bucket in the Hash Map(s). This bucket identifies to which AMP the PE will send the request. The Hash Maps are part of the Message Passing Layer interface. The AMP uses the Table ID and Row Hash to identify and locate the proper data block, then uses the Row Hash and PI value to locate the specific row(s). The PI value is required to distinguish between Hash Synonyms. The AMP implicitly assumes the rows are part of partition #0. Note: The Partition Number (effectively 0) is not stored within the data rows for a table with a NPPI. The FLAG or SPARE byte (within the row overhead) has a bit set to zero for a NPPI row and it is set to one for a PPI row. Acronyms: HBN – Hash Bucket Number PPI – Partitioned Primary Index NPPI – Non-Partitioned Primary Index Page 17-10 Partitioned Primary Indexes Primary Index Access (NPPI) SQL with primary index values and data. PARSER Hashing Algorithm Base TableID (48 bits) Part. # 0 Row Hash PI values and data Bucket # Message Passing Layer (Hash Maps) AMP 0 AMP 1 ... AMP x ... AMP n - 1 AMP n Data Table Row ID Row Hash Uniq Value x '00000000' P# 0 RH Data x'068117A0' 0000 0001 x'068117A0' 0000 0002 x'068117A0' 0000 0003 x 'FFFFFFFF' Partitioned Primary Indexes Row Data Notes: 1. For tables with a NPPI, the rows are implicitly associated with Partition #0. 2. Partition #0 is not stored within each of the rows. 3. Rows are logically stored in Row ID sequence. Page 17-11 Primary Index Access (PPI) The process to locate a data row(s) via a PPI is similar to the process in retrieving data rows with a table defined with a NPPI – a process described earlier. If the SQL request provides data about columns associated with the partitions, then the PARSER will include specific partition information in the request. The key to remember is that a specific Row Hash value can be found in different partitions on the AMP. The Partition Number, Row Hash, and Uniqueness Value are needed to uniquely identify a row in a PPI-based table. A Row Hash and Uniqueness Value combination is only unique within a partition of a PPI table. The same Row Hash and Uniqueness Value combination can be present in different partitions (e.g., x’068117A0’). Assuming that an SQL statement (e.g., SELECT) provides equality value(s) to the Primary Index, then Teradata software retrieves the row(s) from a single AMP. If the SQL request also provides data for partition columns, then the AMP will only have to access the partition(s) identified in the request sent to it by the PE. If the SQL request only provides Primary Index values and the partitioning columns are outside of the Primary Index (and partitioning information is not included in the SQL request), the AMP will check each of the Partitions for the associated Row Hash. The Parsing Engine (PE) creates a four-part message composed of the Table ID, Partition Information, the Row Hash, and Primary Index value(s). The 48-bit Table ID is located via the Data Dictionary, the 32-bit Row Hash value is generated by the Hashing Algorithm, and the Partition information and Primary Index value(s) come from the SQL request. The Parsing Engine (via the Data Dictionary) knows if a table has a PPI and determines the Partitions to include in the request based on the SQL request. 
The Message Passing Layer uses a portion of the Row Hash to determine to which AMP to send the request. The Message Passing Layer uses the DSW portion of the Row Hash (first 16 or 20 bits of the Row Hash) to locate a bucket in the Hash Map(s). This bucket identifies to which AMP the PE will send the request. The AMP uses the Table ID, Partition Number(s), and Row Hash to identify and locate the proper data block(s). The AMP then uses the Row Hash and PI value to locate the specific row(s). The PI value is required to distinguish between Hash Synonyms. Each data row will have the Partition Number stored within it. In the general case, there can be up to 65,535 partitions, numbered from one. As rows are inserted into the table, the partitioning expression is evaluated to determine the proper partition placement for that row. The two-byte partition number is embedded in the row, as part of the row identifier, making PPI rows two bytes wider than they would be if the table wasn’t partitioned. Page 17-12 Partitioned Primary Indexes Primary Index Access (PPI) PARSER SQL with primary index values and data, or SQL expressions that include partition related values. Hashing Algorithm Base TableID (48 bits) Part. # Row Hash 1 or more Bucket # PI values and data Message Passing Layer (Hash Maps) AMP 0 AMP 1 ... AMP x ... AMP n - 1 AMP n Data Table P# RH Row ID Data Part # 1 1 1 1 1 1 2 2 2 2 2 3 3 3 Partitioned Primary Indexes Row Hash Row Data Uniq Value x'00000000' x'068117A0' x'068117A0' 0000 0001 0000 0002 1. Within the AMP, rows are ordered first by their partition number. x’FFFFFFFF' x'00000000' x'068117A0' x'FFFFFFFF' x'00000000' Notes: 0000 0001 2. Within each partition, rows are logically stored in row hash and uniqueness value sequence. x'FFFFFFFF' Page 17-13 Why Partition a Table? The decision to define a Partitioned Primary Index (PPI) for a table depends on how its rows are most frequently accessed. PPI tables are designed to optimize range queries while also providing efficient primary index join strategies. For range queries, only rows of the qualified partitions need to be accessed. One of the reasons to define a PPI on a table is to increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes. The facing page provides one example using a sales data table that has 5 years of sales history. A PPI is placed on this table which partitions the data into 60 partitions (one for each month of the 5 years). Queries that request a subset of the data (some number of months) only need to access the required partitions instead of the entire table. For example, a query that requests two months of sales data only needs to read 2 partitions of the data from each AMP. This is about 1/30 of the table. Without a PPI or any secondary indexes, this query has to perform a full table scan. Even with a secondary index, a full table scan would probably be done for 1/30 or 3% of the table. The more partitions there are, the greater the potential benefit. Some of the performance opportunities available by using the PPI feature include: Get more efficiency in querying against a subset of large volumes of transactional detail data as well as to manage this data more effectively. – Businesses have recognized the analytic value of detailed transactions and are storing larger and larger volumes of this data. – Increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes. 
– As the retention volume of detailed transactions increases, the percent of transactions that an “average” query requires for execution decreases. Allow “instantaneous” dropping of “old” data and simple addition of “new” data. – Support a “rolling n periods” methodology for transactional data. The term “partition elimination” refers to an automatic optimization in which the optimizer determines, based on query conditions, that some partitions can't contain qualifying rows, and causes those partitions to be skipped. Partitions that are skipped for a particular query are called excluded partitions. Generally, the greatest benefit of a PPI table is obtained from partition elimination. Page 17-14 Partitioned Primary Indexes Why Partition a Table? • Increase query efficiency by avoiding full table scans without the overhead and maintenance costs of secondary indexes. – Partition Elimination – the key advantage to partitioning a table is that the optimizer can eliminate partitions for queries. • For example, assume a sales data table has 5 years of sales history. – A PPI is placed on this table which partitions the data into 60 partitions (one for each month of the 5 years). – Assume a query only needs to read 2 months of the data from each AMP. • • Only 1/30 (2 partitions) of the table has to be read. With a NPPI, this query has to perform a full table scan. – A Valued-Ordered NUSI may be used to help performance for this type of query. • However, there is NUSI subtable permanent space and maintenance overhead. • Deleting large volumes of rows in entire partitions can be extremely fast. – ALTER TABLE … DROP RANGE … ; – Disclaimer: Fast deletes assume that the table doesn't have a NO RANGE partition defined and has no secondary indexes, join indexes, or hash indexes. Partitioned Primary Indexes Page 17-15 Advantages/Disadvantages of Partitioning The main advantage of a PPI table is the automatic optimization that occurs for queries that specify a restrictive condition on the partitioning column. For example, a query which examines two months of sales data in a table with two years of sales history can read about one-twelfth of the table, instead of the entire table. The more partitions there are, the greater the potential benefit. Disadvantages of Partitioning The two main potential disadvantages of using a PPI table occur with PI access and direct PI-based joins. The PI access potential disadvantage occurs only when the partitioning column is not part of the PI. In this situation, a query specifying a PI value, but no value for the partitioning column, must look in each partition for that value, instead of positioning directly to the first row for the PI value. The direct join potential disadvantage occurs when another table with the same PI is joined with an equality condition on every PI column. For two non-PPI tables, the rows of the two tables will be ordered the same, and the join can be performed directly. If one of the tables is partitioned, the rows won’t be ordered the same, and the task, in effect, becomes a set of sub-joins, one for each partition of the PPI table. In both of these situations, the disadvantage is proportional to the number of partitions, with fewer partitions being better than more partitions. With the Aligned Row Format (Linux 64-bit), the two-byte partition number is embedded in the row, as part of the row identifier, plus an additional 2 bytes for a total of 4 additional bytes per data row. 
With the Packed64 Row Format (Linux 64-bit 13.10 new install), the overhead within in row for a PPI table is only 2 bytes for the partition number. Secondary Indexes referencing PPI tables use the 10-byte row identifier, making those subtable rows 2 bytes wider as well. Join Indexes always use a 10-byte row identifier regardless if the base tables are partitioned or not. When the primary index is unique (but can’t be defined as unique because of the partitioning), a USI or NUSI can be defined on the same columns as the primary index. Access via the secondary index won’t be as fast as non-partitioned access via the primary index, but is fast enough for most applications. Why can't a Primary Index be defined as Unique unless the partitioning expression columns are part of the PI column(s)? It’s because of the difficulty of performing the duplicate PI check for inserts. If there was already a row with that PI, it could be in any partition, so every partition would have to be checked to determine whether the duplicate PI exists. There can be thousands of partitions. An insert-select could take a very long time in such a situation. It’s more efficient to check uniqueness (and it also provides an efficient access path) to define a unique secondary index (USI) on the same columns as the PI in this case. Page 17-16 Partitioned Primary Indexes Advantages/Disadvantages of Partitioning Advantages: • The partition expression definition is the only thing that needs to be done by the DBA or the database designer. No separate partition layout – no disk layout for partitions. – For example, the last row in one partition and the first row in the next partition will usually be in the same data block. – No definition of location in the system for partitions. • Even data distribution and even processing of a logical partition is automatic. – Due to the PI distribution of the rows • No modifications of queries required. Potential disadvantages: • PPI rows are 2 or 8 bytes longer. Table uses more PERM space. – Secondary index subtable rows are also increased in size. • A PI access may be degraded if the partitioning column is not part of the PI. – A query specifying only a PI value must look in each partition for that value. • Joins to non-partitioned tables with the same PI may be degraded. • The PI can’t be defined as unique when the partitioning column is not part of the PI. Partitioned Primary Indexes Page 17-17 PPI Considerations Starting with Teradata V2R6.1, base tables, global temporary tables, and volatile temporary tables can be partitioned. This restriction doesn’t mean that a PPI table can’t have secondary indexes, or can’t be referenced in the definition of a Join Index or Hash Index. It merely means that the PARTITION BY clause is not available on a CREATE JOIN INDEX or CREATE HASH INDEX statement. In Teradata Database V2R6.2, Partitioned Primary Indexes (PPIs) are supported for noncompressed join indexes. In the general case, there can be up to 65,535 partitions, numbered from one. The two-byte partition number is embedded in the data row, as part of the row identifier. Secondary Indexes and Join Indexes referencing PPI tables also use the wider row identifier. Except for the embedded partition number, PPI rows have the same format as non-PPI rows. A data block can contain rows from more than one partition. There are no new control structures needed to implement the partitioning scheme. 
Access of Tables with a PPI Some of the issues associated with accessing a table that has a defined PPI are listed below: If the SELECT statement does not provide values for any of the partitioning columns, then all of the partitions may be probed to find row(s) with the hash value. If the SELECT statement provides values for some of the partitioning columns, then partition elimination may reduce the number of the partitions that will be probed to find row(s) with the hash value. A common situation is with SQL specifying a range of values for partitioning columns. This allows some partitions to be excluded. If the SELECT statement provides values for all of the partitioning columns, then partition elimination will cause a single partition to be probed to find row(s) with the hash value. In summary, a NUPI access of a PPI table will take longer when a query specifies the PI column values, but doesn't include the partitioning column(s). In this situation, each partition must be probed for the appropriate PI value. In the worst case, the number of disk reads could increase by a factor equal to the number of partitions. While probing a partition is a fast operation, a table with thousands of partitions might not provide acceptable performance for PI accesses for some applications. Page 17-18 Partitioned Primary Indexes PPI Considerations PPI considerations include … • Base tables are partitioned, secondary indexes are not. • However, a PPI table can have secondary indexes which reference rows in a PPI table via a RowID in the SI subtable. – Global and Volatile Temporary Tables can also be partitioned. – Non-Compressed Join Indexes can also be partitioned. • A join or hash index can also reference rows in a PPI table. • A table has a max of 65,535 (or 9.223 Quintillion) partitions. • Partitioning columns do not have to be columns in the primary index. • There are numerous options for partitioning. As rows are inserted into the table, the partitioning expression is evaluated to determine the proper partition placement for that row. Partitioned Primary Indexes Page 17-19 How to Define a PPI Primary indexes can be partitioned or non-partitioned. A primary index is defined as part of the CREATE TABLE statement. The PRIMARY INDEX definition has a new option to create partitioned primary indexes: PARTITION BY <partitioning expression>. A partitioned primary index (PPI) permits rows to be assigned to user-defined data partitions on the AMPs, enabling enhanced performance for range queries that are predicated on partitioning column(s) values. The <partitioning expression> is evaluated and Teradata determines the appropriate partition number or assignment. The <partitioning expression> is a general expression, allowing wide flexibility in tailoring the partitioning scheme to the unique characteristics of the table. Two functions, CASE_N and RANGE_N, are provided to simplify the creation of common partitioning schemes. You can write any valid SQL expression as a partitioning expression with a few exceptions. The reference manual has details on SQL expressions that are not permitted in the <partitioning expression>. Limitations on the PARTITION BY option include: Partitioning expression must be a scalar expression that is INTEGER or can be cast to INTEGER. Multiple columns from the table may be specified in the expression – These are called the partitioning columns. Before Teradata 13.10, the expression must not require character/graphic comparison in order to be evaluated. – Expression must not contain aggregate/ordered-analytic/statistical functions, DATE, TIME, ACCOUNT, RANDOM, HASH, etc. functions.
PARTITION BY clause not allowed for global temporary tables, volatile tables, join indexes, hash indexes, and secondary indexes in the first release of PPI. UNIQUE only allowed if all partitioning columns are included in the PI. Partitioning expression limited to approximately 8100 characters. – Stored as an implicit check constraint in DBC.TableConstraints. One or more columns can make up the partitioning expression, although it is anticipated that for most tables one column will be specified. The partitioning column(s) can be part of the primary index, but are not required to be. The result of the partitioning expression must be a scalar value that is INTEGER or can be cast to INTEGER. Most deterministic functions can be used within the expression. The expression must not require character or graphic comparisons, although character or graphic columns can be used in some circumstances. Page 17-20 Partitioned Primary Indexes How to Define a PPI The PRIMARY INDEX definition portion of a CREATE TABLE statement has an optional PARTITION BY option.
CREATE TABLE … [UNIQUE] PRIMARY INDEX (col1, col2, …) PARTITION BY <partitioning expression>
Options for the <partitioning expression> include: • Range partitioning • Conditional partitioning, modulo partitioning, and general expression partitioning. • Partitioning columns do not have to be columns in the primary index. If they aren't, then the primary index cannot be unique. Column(s) included in the partitioning expression are called the "partitioning column(s)". • Two functions, CASE_N and RANGE_N, are provided to simplify the creation of common partitioning schemes. Partitioned Primary Indexes Page 17-21 Partitioning with CASE_N and RANGE_N For many tables, there is no suitable column that lends itself to direct usage as a partitioning column. For these situations, the CASE_N and RANGE_N functions can be used to concisely define partitioning expressions. When CASE_N or RANGE_N is used, two partitions are reserved for specific uses, leaving a maximum of 65,533 user-defined partitions. Note that the table still has a total of 65,535 available partitions. The PARTITION BY phrase requires a partitioning expression that determines the partition assignment of a row. You can use the CASE_N function to construct a partitioning expression such that a row with any value or NULL for the partitioning column is assigned to a partition. The CASE_N function is patterned after the SQL CASE expression. It evaluates a list of conditions and returns the position of the first condition that evaluates to TRUE, provided that no prior condition in the list evaluates to UNKNOWN. The returned value will map directly into a partition number. Another option is to use the RANGE_N function to construct a partitioning expression with a list of ranges such that a row with any value or NULL for the partitioning column is assigned to a partition. If CASE_N or RANGE_N is used in a partitioning expression in a CREATE TABLE or ALTER TABLE statement, it: Must not involve character or graphic comparisons. Can specify a maximum of 65,533 user-defined partitions. The table can have a total of 65,535 partitions including the NO CASE (NO RANGE) and UNKNOWN partitions. Page 17-22 Partitioned Primary Indexes Partitioning with CASE_N and RANGE_N The <partitioning expression> may use one of the following functions to help define partitions. • CASE_N • RANGE_N Use of CASE_N results in the following: • Evaluates a list of conditions and returns the position of the first condition that evaluates to TRUE.
• Result is the data row being placed into a partition associated with that condition. • Note: Patterned after SQL CASE expression. Use of RANGE_N results in the following: • The expression is evaluated and is mapped into one of a list of specified ranges. • Ranges are listed in increasing order and must not overlap with each other. • Result is the data row being placed into a partition associated with that range. NO CASE, NO RANGE, and UNKNOWN options are also available. Partitioned Primary Indexes Page 17-23 Partitioning with RANGE_N – Example 1 One of most common partitioning expression is to use RANGE_N partitioning to partition the table based on a group of dates (e.g., month partitions). A range is defined by a starting boundary and an optional ending boundary. If an ending boundary is not specified, the range is defined by its starting boundary, inclusively, up to but not including the starting boundary of the next range. The list of ranges must specify ranges in increasing order, where the ending boundary of a range is less than the starting boundary of the next range. RANGE_N Limitations include: Multiple test values are not allowed in a RANGE_N function. Test value in RANGE_N function must be INTEGER, BYTEINT, SMALLINT, or DATE. Range value and range size in a RANGE_N function must be constant. Ascending ranges only and ranges must not overlap with other. For example, the following CREATE TABLE statement can be used to establish the monthly partitioning. This example does not have the NO RANGE partition defined. CREATE SET TABLE Claim (claim_id INTEGER NOT NULL ,cust_id INTEGER NOT NULL ,claim_date DATE NOT NULL : PRIMARY INDEX (claim_id) PARTITION BY RANGE_N (claim_date BETWEEN DATE '2003-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH); To maintain uniqueness on the claim_id, you can include a USI on claim_id by including the following option. UNIQUE INDEX (claim_id) If the claim_date column for an attempted INSERT or UPDATE has a date outside of the partitioning range or NULL, then an error will be returned and the row won’t be inserted or updated. Notes: UPI not allowed because partitioning column is not included in the PI. Unique Secondary Index is allowed on PI to enforce uniqueness. The facing page contains examples of inserting data rows into a table partitioned by month and how the date is evaluated into the appropriate partition. Page 17-24 Partitioned Primary Indexes Partitioning with RANGE_N – Example 1 For example, partition the Claim table by "Claim Date". CREATE TABLE Claim ( claim_id INTEGER NOT NULL ,cust_id INTEGER NOT NULL ,claim_date DATE NOT NULL …) PRIMARY INDEX (claim_id) PARTITION BY RANGE_N (claim_date BETWEEN DATE '2003-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH, NO RANGE); The following INSERTs place new rows into the Claim table. The date is evaluated and the rows are placed into the appropriate partitions. INSERT INTO Claim VALUES INSERT INTO Claim VALUES INSERT INTO Claim VALUES INSERT INTO Claim VALUES (100039,1009, '2003-01-13', …); placed in partition #1 (260221,1020, '2012-01-07', …); placed in partition #109 (350221,1020, '2013-01-01', …); placed in no range partition (#121) (100039, 1009, NULL, …); Error 3811 – NOT NULL violation If the table did not have the NO RANGE partition defined, then the following error occurs: INSERT INTO Claim VALUES (100039, 1009, '2013-01-01', …); (5728 – Partitioning violation) Note: claim_id must be defined as a NUPI because claim_date is not part of PI. 
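To confirm the placements noted above, the system-derived PARTITION column (covered later in this module under "SQL Use of PARTITION Key Word") can be queried after the successful inserts. This is a quick verification sketch against the Claim table defined above, not part of the course example itself.

SELECT   claim_id, claim_date, PARTITION AS "Part #"
FROM     Claim
WHERE    claim_id IN (100039, 260221, 350221)
ORDER BY 3;

The result should show partition 1 for the 2003-01-13 claim, partition 109 for the 2012-01-07 claim, and partition 121 (the NO RANGE partition) for the 2013-01-01 claim, matching the notes above.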
Partitioned Primary Indexes Page 17-25 Access using Partitioned Data – Example 1 (cont.) The EXPLAIN text for these queries is shown below. EXPLAIN SELECT FROM WHERE BETWEEN * Claim_PPI claim_date DATE '2012-01-01' AND DATE '2012-01-31'; 1) First, we lock a distinct DS."pseudo table" for read on a RowHash to prevent global deadlock for DS.Claim_PPI. 2) Next, we lock DS.Claim_PPI for read. 3) We do an all-AMPs RETRIEVE step from a single partition of DS.Claim_PPI with a condition of ("(DS.Claim_PPI.claim_date <= DATE '2012-01-31') AND (DS.Claim_PPI.claim_date >= DATE '2012-01-01')") into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with high confidence to be 21,100 rows (2,869,600 bytes). The estimated time for this step is 0.44 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.44 seconds. The table named Claim_NPPI is similar to Claim_PPI except it does not have a Partitioned Primary Index, but does have “claim_id” as a UPI. EXPLAIN SELECT FROM WHERE BETWEEN * Claim_NPPI claim_date DATE '2011-01-01' AND DATE '2011-01-31'; 1) First, we lock a distinct DS."pseudo table" for read on a RowHash to prevent global deadlock for DS.Claim_NPPI. 2) Next, we lock DS.Claim_NPPI for read. 3) We do an all-AMPs RETRIEVE step from DS.Claim_NPPI by way of an all-rows scan with a condition of ("(DS.Claim_NPPI.claim_date <= DATE '2012-01-31') AND (DS.Claim_NPPI.claim_date >= DATE '2012-01-01')") into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with high confidence to be 21,100 rows (2,827,400 bytes). The estimated time for this step is 49.10 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 49.10 seconds. Note: Statistics were collected on the claim_id, cust_id, and claim_date of both tables. The Claim table has 1,440,000 rows. Page 17-26 Partitioned Primary Indexes Access using Partitioned Data – Example 1 AMP AMP ... AMP Part 1 – Jan, 03 Part 1 – Jan, 03 Part 1 – Jan, 03 Part 2 Part 2 Part 2 . . . P# 109 Part n . . . P# 109 . . . QUERY – PPI … P# 109 Part n SELECT * FROM Claim_PPI WHERE claim_date BETWEEN DATE '2012-01-01' AND DATE '2012-01-31' ; PLAN – PPI ALL-AMPs – Single Partition Scan EXPLAIN estimated cost – 0.44 sec. Part n AMP AMP AMP QUERY – NPPI SELECT * FROM Claim_NPPI WHERE claim_date BETWEEN DATE '2012-01-01' AND DATE '2012-01-31' ; ... ... PLAN – NPPI ALL-AMPs – Full Table Scan EXPLAIN estimated cost – 49.10 sec. Partitioned Primary Indexes Page 17-27 Access Using Primary Index – Example 1 (cont.) The EXPLAIN text for these queries is shown below. EXPLAIN SELECT FROM WHERE * Claim_PPI claim_id = 260221; 1) First, we do a single-AMP RETRIEVE step from all partitions of DS.Claim_PPI by way of the primary index "DS.Claim_PPI.claim_id = 260221" with a residual condition of ("DS.Claim_PPI.claim_id = 260221") into Spool 1 (one-amp), which is built locally on that AMP. The input table will not be cached in memory, but it is eligible for synchronized scanning. 
The size of Spool 1 (136 bytes) is estimated with high confidence to be 1 row. The estimated time for this step is 0.09 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.09 seconds. The table named Claim_NPPI is similar to Claim_PPI except it does not have a Partitioned Primary Index, but does have “claim_id” as a UPI. EXPLAIN SELECT FROM WHERE * Claim_NPPI claim_id = 260221; 1) First, we do a single-AMP RETRIEVE step from DS.Claim_NPPI by way of the unique primary index "DS.Claim_NPPI.claim_id = 260221" with no residual conditions. The estimated time for this step is 0.00 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.00 seconds. Page 17-28 Partitioned Primary Indexes Access Using Primary Index – Example 1 (cont.) AMP AMP AMP Part 1 – Jan, 03 Part 1 – Jan, 03 Part 1 – Jan, 03 Part 2 Part 2 Part 2 . . . SELECT FROM WHERE . . . . . . Part 109 Part 109 Part 109 Part n Part n Part n ... QUERY – PPI … SELECT * FROM Claim_NPPI WHERE claim_id = 260221; ... PLAN – PPI One AMP – All Partitions are probed EXPLAIN estimated cost – 0.09 sec. AMP QUERY – NPPI * Claim_PPI claim_id = 260221; AMP AMP Only one block has to be read to locate the row. ... PLAN – NPPI One AMP – UPI Access EXPLAIN estimated cost – 0.00 sec. Partitioned Primary Indexes Page 17-29 Place a USI on NUPI – Example 1 (cont.) If the partitioning columns are not part of the Primary Index, the Primary Index cannot be unique (e.g., claim_date). To maintain uniqueness on the Primary Index, you can create a USI on the PI (e.g., Claim ID or claim_id). Reasons for this may include: USI access to specific rows may be faster than scanning multiple partitions on a single AMP. Establish the USI as a referenced parent in Referential Integrity. CREATE UNIQUE INDEX (claim_id) ON Claim_PPI; EXPLAIN SELECT FROM WHERE * Claim_PPI claim_id = 260221; 1) First, we do a two-AMP RETRIEVE step from DS.Claim_PPI by way of unique index # 4 "DS.Claim_PPI.claim_id = 260221" with no residual conditions. The estimated time for this step is 0.00 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.00 seconds. As an alternative, the SELECT can include the Primary Index values and the partitioning information. This allows the PE to build a request that has the AMP scan a specific partition. However, in this example, the user may not know the claim date in order to include it in the query. EXPLAIN SELECT * FROM Claim_PPI WHERE claim_id = 260221 AND claim_date = DATE '2012-01-11'; 1) First, we do a single-AMP RETRIEVE step from DS.Claim_PPI by way of the primary index "DS.Claim_PPI.claim_id = 260221, DS.Claim_PPI.claim_date = DATE '201201-11'" with a residual condition of ("(DS.Claim_PPI.claim_date = DATE '2012-01-11') AND (DS.Claim_PPI.claim_id = 260221)") into Spool 1 (one-amp), which is built locally on that AMP. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 (136 bytes) is estimated with high confidence to be 1 row. The estimated time for this step is 0.00 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.00 seconds. Page 17-30 Partitioned Primary Indexes Place a USI on NUPI – Example 1 (cont.) 
Notes: • If the partitioning column(s) are not part of the Primary Index, the Primary Index cannot be unique (e.g., Claim Date is not part of the PI). • To maintain uniqueness on the Primary Index, you can create a USI on the PI (e.g., Claim ID). This is a two-AMP operation. AMP AMP ... USI Subtable USI Subtable AMP USI Subtable ... USI subtable row specifies part #, row hash, & uniq. value of data row. Part 1 Part 1 Part 2 ... Part 2 . . . . . . Part 109 Part 109 Part 109 Part n Part n Part n Partitioned Primary Indexes SELECT * FROM Claim_PPI WHERE claim_id = 260221; Part 1 Part 2 . . . CREATE UNIQUE INDEX (claim_id) ON Claim_PPI ; ... USI Considerations: • Eliminate partition probing • Row-hash locks • 2-AMP operation • Can only be used if values in PI column(s) are unique • Will maintain uniqueness • USI on NUPI only supported on PPI tables Page 17-31 Place a NUSI on NUPI – Example 1 (cont.) If the partitioning columns are not part of the Primary Index, the Primary Index cannot be unique (e.g., Claim ID). You can use a NUSI on the same columns that make up the PI and actually get a single-AMP access operation. This feature only applies to a NUSI created on the same columns as a PI on PPI table. Additionally, instead of table level locks (typical NUSI), row hash locks will be used. Reasons to choose a NUSI for your PI may include: The primary index is non-unique (can’t use a USI) and you need faster access than scanning or probing multiple partitions on a single AMP. MultiLoad can be used to load a table with a NUSI, not a USI. The access time for a USI and NUSI will be similar (each will access a subtable block) – however, the USI is a 2-AMP operation and requires BYNET message passing. The amount of space for a USI and NUSI subtable in this case will be similar. A typical NUSI with duplicate values will have multiple row ids (keys) in a subtable row and will save space per subtable row. However, a NUSI used as an index for columns with unique values will use approximately the same amount of subtable space as a USI. This is because each NUSI subtable row only contains 1 row id. CREATE INDEX (claim_id) ON Claim_PPI; EXPLAIN SELECT * FROM Claim_PPI WHERE claim_id = 260221; 1) First, we do a single-AMP RETRIEVE step from DS.Claim_PPI by way of index # 4 "DS.Claim_PPI.claim_id = 260221" with no residual conditions into Spool 1 (group_amps), which is built locally on that AMP. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 (136 bytes) is estimated with high confidence to be 1 row. The estimated time for this step is 0.00 seconds. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.00 seconds. Page 17-32 Partitioned Primary Indexes Place a NUSI on NUPI – Example 1 (cont.) Notes: • You can optionally create a NUSI on the same columns as the Primary Index (e.g., Claim ID). The PI may be unique or not. • Optimizer generates a plan for a single-AMP NUSI access with row-hash locking (instead of table-level locking). AMP AMP ... NUSI Subtable NUSI Subtable AMP NUSI Subtable ... NUSI subtable row specifies part#, row hash, & uniq. value of data row. Part 1 Part 1 Part 2 ... Part 2 . . . . . . Part 109 Part 109 Part 109 Part n Part n Part n Partitioned Primary Indexes SELECT * FROM Claim_PPI WHERE claim_id = 260221; Part 1 Part 2 . . . CREATE INDEX (claim_id) ON Claim_PPI; ... 
NUSI Considerations: • Eliminate partition probing • Row-hash locks • 1-AMP operation • Can be used with unique or non-unique PI columns • Must be equality condition • NUSI Single-AMP operation only supported on PPI tables • Use MultiLoad to load table Page 17-33 Partitioning with RANGE_N – Example 2 This example illustrates that a table can be partitioned with different size intervals. The current Sales data and Sales History data are placed in the same table. It typically is not practical to create a partitioning expression as shown in example #2, but the example is included to show the flexibility that you have with the partitioning expression. For example, you may decide to partition the Sales History by month and the current sales data by day. You may want to do this if users frequently access the Sales History data with range constraints, resulting in full table scans. It may be that users access the current year data frequently looking at data for a specific day. The example on the facing page partitions the years 2003 to 2011 by month and the year 2012 by day. One option may be to partition by week as follows: PARTITION BY RANGE_N (sales_date BETWEEN DATE '2003-01-01' AND DATE '2003-12-31' EACH INTERVAL '7' DAY, DATE '2004-01-01' AND DATE '2004-12-31' EACH INTERVAL '7' DAY, : : DATE '2012-01-01' AND DATE '2012-12-31' EACH INTERVAL '7' DAY); One may think that a simple partitioning scheme to partition by week would be as follows: PARTITION BY RANGE_N (sales_date BETWEEN DATE '2003-01-01' AND DATE '2012-12-31' EACH INTERVAL '7' DAY); This is a simpler PARTITION expression to code initially, but may require more work or thought later. There is a minor drawback to partitioning by weeks because a 7-day partition usually spans one year into the next. Assume that a year from now, you wish to ALTER this table and DROP the partitions for the year 2003. The ALTER TABLE DROP RANGE option has to specify a range of dates that actually represent a complete partition or partitions in the table. A complete partition ends on 2003-12-19, not 2003-12-31. The ALTER TABLE command will be described later in this module. If daily partitions are desired for all of the years, the following partitioning expression can be used to create a partitioned table with daily partitions. PARTITION BY RANGE_N ( sales_date BETWEEN DATE '2003-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' DAY); Performance Note: Daily partitions for ten years create 3653 partitions (10 x 365 plus three leap days) and may not be useful in many situations. Try to avoid daily partitions over a long period of time. Page 17-34 Partitioned Primary Indexes Partitioning with RANGE_N – Example 2 Notes: • This example places current and history sales data into one table. • Current year data is partitioned on a more granular basis (daily) while historical sales data is placed into monthly partitions. • Partitions of varying intervals can be created on the same PPI for a table. CREATE TABLE Sales_and_SalesHistory ( store_id INTEGER NOT NULL, item_id INTEGER NOT NULL, sales_date DATE FORMAT 'YYYY-MM-DD', total_revenue DECIMAL(9,2), total_sold INTEGER, note VARCHAR(256)) PRIMARY INDEX (store_id, item_id) PARTITION BY RANGE_N ( sales_date BETWEEN DATE '2003-01-01' AND DATE '2011-12-31' EACH INTERVAL '1' MONTH, DATE '2012-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' DAY); To partition by week, the following partitioning can be used.
PARTITION BY RANGE_N (sales_date BETWEEN DATE '2003-01-01' AND DATE '2003-12-31' EACH INTERVAL '7' DAY, DATE '2004-01-01' AND DATE '2004-12-31' EACH INTERVAL '7' DAY, : : Partitioned Primary Indexes Page 17-35 Partitioning – Example 3 This example partitions by Store Id (store number). Prior to Teradata 14.0, a table has a maximum limit of 65,535 partitions. Therefore, the partitioning expression value from Store Id or an expression involving Store Id must be between 1 and 65,535. If a company had a small number of stores, you could use the RANGE_N expression to limit the number of possible partitions. The alternative partitioning (that is shown on facing page) expression allows for ten partitions instead of 65,535. The optimizer may be able to more accurately cost join plans when the maximum number of partitions is known and small, making this a better choice than using the column directly. Assume that a company has 1000 stores, and the store numbers (store_id) are from 100001 to 101001. To utilize 1000 partitions, the following partitioning expression could be defined. ... PRIMARY INDEX (store_id, item_id, sales_date) PARTITION BY store_id – 100000; If a company has a small number of stores and a small number of products, another option may be to partition by a combination of Store Id and Item Id. Assume the following: Store numbers – 100001 to 100065 - less than 65 stores Item numbers – 5000 to 5999 - less than 1000 item ids Basically, the table has three-digit item_id codes and less than 65 stores. This table could be partitioned as follows: ... PRIMARY INDEX (store_id, item_id, sales_date) PARTITION BY ((store_id – 100000) * 1000 + (item_id – 5000)); Assume that the store_id is 100009 and the item_id is 5025. This row would be placed in partition # 9025. If many queries specify both a Store Id and an Item Id, this might be a useful partitioning scheme. Even if it wouldn’t be useful, it demonstrates that the physical designers and/or database administrators have wide latitude in defining generalized partitioning schemes to meet the needs of individual tables. Page 17-36 Partitioned Primary Indexes Partitioning – Example 3 Notes: • The simplest partitioning expression uses one column from the row without modification. Before Teradata 14.0, the column values must be between 1 and 65,535. • Assume the store_id is a value between 100001 and 101001. Therefore, a simple calculation can be performed. • This example will partition by data by store_id and effectively utilize 1000 partitions. CREATE TABLE Store_Sales ( store_id INTEGER NOT NULL, item_id INTEGER NOT NULL, sales_date DATE FORMAT 'YYYY-MM-DD', total_revenue DECIMAL(9,2), total_sold INTEGER, note VARCHAR(256)) UNIQUE PRIMARY INDEX (store_id, item_id, sales_date) PARTITION BY store_id - 100000; Alternative Definition: • Assume the customer wishes to group these 1000 stores into 100 partitions. • The RANGE_N expression can be used to identify the number of partitions and group multiple stores into the same partition. PARTITION BY RANGE_N ( (store_id - 100000) BETWEEN 1 AND 1000 EACH 10); Partitioned Primary Indexes Page 17-37 Special Partitions with CASE_N and RANGE_N The keywords, NO CASE (or NO RANGE) [OR UNKNOWN] and UNKNOWN are used to define the specific-use partitions. Even if these options are not specified with the CASE_N (or RANGE_N) expressions, these two specific-use partitions are still reserved in the event the ALTER TABLE command is later used to add these options. 
If it is necessary to test a CASE_N condition directly as NULL, it needs to be the first condition listed. The following example is correct. NULLs will be placed in partition #1. PARTITION BY CASE_N (col3 IS NULL, col3 < 10, col3 < 100, NO CASE OR UNKNOWN) INSERT INTO PPI_TabA VALUES (1, 'A', NULL, DATE); INSERT INTO PPI_TabA VALUES (2, 'B', 5, DATE); INSERT INTO PPI_TabA VALUES (3, 'C', 50, DATE); INSERT INTO PPI_TabA VALUES (4, 'D', 500, DATE); INSERT INTO PPI_TabA VALUES (5, 'E', NULL, DATE); SELECT PARTITION AS "Part #", COUNT(*) FROM PPI_TabA GROUP BY 1 ORDER BY 1;
Part #   Count(*)
1        2
2        1
3        1
4        1
Although you can code an example as follows, it should not be coded this way and will provide inconsistent results. NULLs will be placed in partition #4. PARTITION BY CASE_N (col3 < 10, col3 IS NULL, col3 < 100, NO CASE OR UNKNOWN) SELECT PARTITION AS "Part #", COUNT(*) FROM PPI_TabA GROUP BY 1 ORDER BY 1;
Part #   Count(*)
1        1
3        1
4        3
Page 17-38 Partitioned Primary Indexes Special Partitions with CASE_N and RANGE_N The CASE_N and RANGE_N functions can place rows into specific-use partitions when ... • the expression doesn't meet any of the CASE and RANGE expressions. • the expression evaluates to UNKNOWN. • two partition numbers are reserved even if the above options are not used. The PPI keywords used to define two specific-use partitions are: • NO CASE (or NO RANGE) [OR UNKNOWN] – If this option is used, then a specific-use partition is used when the expression isn't true for any case (or is out of range). – If OR UNKNOWN is included with the NO CASE (or NO RANGE), then UNKNOWN expressions are also placed in this partition. • UNKNOWN – If this option is specified, a different specific-use partition is used for unknowns. • NO CASE (or NO RANGE), UNKNOWN – If this option is used, then two separate specific-use partitions are used when the expression isn't true for any case (or is out of range) and a different special partition is used for NULLs. Partitioned Primary Indexes Page 17-39 Special Partition Examples This example assumes the following CASE_N expression. PARTITION BY CASE_N ( col3 < 10 , col3 < 100 , col3 < 1000 , NO CASE OR UNKNOWN) This statement creates four partitions, conceptually numbered (*Note) from one to four in the order they are defined. The first partition is when col3 is less than 10, the second partition is when col3 is at least 10 but less than 100, and the third partition is when col3 is at least 100 but less than 1,000. The NO CASE OR UNKNOWN partition is for any value which isn't true for any previous CASE_N expression. In this case, it would be when col3 is equal to or greater than 1,000 or when col3 is NULL. This partition is also used for values for which it isn't possible to determine the truth of the previous CASE_N expressions. Usually, this is a case where col3 is NULL or unknown. Internally, UNKNOWN (option by itself) rows are assigned to partition #1. NO CASE (NO RANGE) OR UNKNOWN rows are physically assigned to partition #2. Internally, the first user-defined partition is actually partition #3. The physical implementation in the file system is:
col3 < 10            – partition #1   (internally, rows placed in partition #3)
col3 < 100           – partition #2   (internally, rows placed in partition #4)
col3 < 1000          – partition #3   (internally, rows placed in partition #5)
NO CASE or UNKNOWN   – partition #4   (internally, rows placed in partition #2)
It is NOT syntactically possible to code a partitioning expression that has both NO CASE OR UNKNOWN, and UNKNOWN in the same expression.
UNKNOWN expressions will either be placed in the partition with NO CASE or in a partition of their own. The following SQL is NOT permitted. PARTITION BY CASE_N ( col3 < 10 , : NO CASE OR UNKNOWN, UNKNOWN) Page 17-40 - causes an error Partitioned Primary Indexes Special Partition Examples The following examples illustrate the use of NO CASE and UNKNOWN options. Ex. 1 PARTITION BY CASE_N ( col3 < 10 , col3 < 100 , col3 < 1000 , NO CASE OR UNKNOWN) If col3 = 5, If col3 = 50, If col3 = 500, If col3 = 5000, If col3 = NULL, row is assigned to Partition #1. row is assigned to Partition #2. row is assigned to Partition #3. row is assigned to Partition #4. row is assigned to Partition #4. In summary, NO CASE and UNKNOWN rows are placed into the same partition. Ex. 2 PARTITION BY CASE_N ( col3 < 10 , col3 < 100 , col3 < 1000 , NO CASE, UNKNOWN) If col3 = 5, row is placed in Partition #1. If col3 = 50, row is placed in Partition #2. If col3 = 500, row is placed in Partition #3. If col3 = 5000, row is placed in Partition #4. If col3 = NULL, row is placed in Partition #5. In summary, NO CASE and UNKNOWN rows are placed into separate partitions. Note: RANGE_N works in a similar manner. Partitioned Primary Indexes Page 17-41 Partitioning with CASE_N – Example 4 This example illustrates the capability of partitioning based upon conditions (CASE_N). For example, assume a table has a total revenue column, defined as decimal. The table could be partitioned on that column, so that low revenue products are separated from high revenue products. The partitioning expression could be written as shown on the facing page. In this example, 8 partitions are defined for total revenue values up to 100,000. Two additional partitions are defined – one for revenues greater than 100,000 and another for unknown revenues (e.g., NULL). Teradata 13.10 Note: Teradata 13.10 allows CURRENT_DATE and/or CURRENT_TIMESTAMP with partitioning expressions. However, it is recommended to NOT use these in a CASE expression for a partitioned primary index (PPI). Why? In this case, all rows are scanned during reconciliation. Additional examples: The following examples illustrate the use of the NO CASE option by itself or the UNKNOWN option by itself. Ex.1 PARTITION BY CASE_N ( col3 < 10 , col3 < 100 , col3 < 1000 , NO CASE) If col3 = 5, row is assigned to Partition #1. If col3 = 50, row is assigned to Partition #2. If col3 = 500, row is assigned to Partition #3. If col3 = 5000, row is assigned to Partition #4. If col3 = NULL, Error 5728 5728: Partitioning violation for table DBname.Tablename. Ex. 2 PARTITION BY CASE_N ( col3 < 10 , col3 < 100 , col3 < 1000 , UNKNOWN) If col3 = 5, If col3 = 50, If col3 = 500, If col3 = 5000, If col3 = NULL, row is assigned to Partition #1. row is assigned to Partition #2. row is assigned to Partition #3. Error 5728 row is assigned to Partition #4. 5728: Partitioning violation for table DBname.Tablename. Page 17-42 Partitioned Primary Indexes Partitioning with CASE_N – Example 4 Notes: • Partition the data based on total revenue for the products. • The NO CASE and UNKNOWN options allow for total_revenue >=100,000 or “unknown revenue”. • A UPI is NOT allowed because the partitioning columns are NOT part of the PI. 
CREATE TABLE Sales_Revenue ( store_id INTEGER NOT NULL, item_id INTEGER NOT NULL, sales_date DATE FORMAT 'YYYY-MM-DD', total_revenue DECIMAL(9,2), total_sold INTEGER, note VARCHAR(256)) PRIMARY INDEX (store_id, item_id, sales_date) PARTITION BY CASE_N ( total_revenue < 2000 , total_revenue < 4000 , total_revenue < 6000 , total_revenue < 8000 , total_revenue < 10000 , total_revenue < 20000 , total_revenue < 50000 , total_revenue < 100000 , NO CASE , UNKNOWN ); Partitioned Primary Indexes Page 17-43 SQL Use of PARTITION Key Word The facing page contains an example of using the key word PARTITION to determine the number of rows there are in physical partitions. This example is based on the Sales_Revenue table is defined on the previous page. The following table shows the same result as the facing page, but also identifies the internal partition #’s as allocated. Part # Row Count 1 169690 2 163810 3 68440 4 33490 5 18640 6 27520 7 1760 internally mapped to partition #3 internally mapped to partition #4 internally mapped to partition #5 internally mapped to partition #6 internally mapped to partition #7 internally mapped to partition #8 internally mapped to partition #9 Note that this table does not have any rows with a total_revenue value greater than 50,000 and less than 100,000. Partition #8 was not assigned. Also, there are no rows with a total_revenue >=100,000 or NULL because the NO CASE and UNKNOWN partitions are not used. Assume the following three SQL INSERT commands are executed: INSERT INTO Sales_Revenue VALUES (1003, 5051, CURRENT_DATE, 51000, 45, NULL); INSERT INTO Sales_Revenue VALUES (1003, 5052, CURRENT_DATE, 102000, 113, NULL); INSERT INTO Sales_Revenue VALUES (1003, 5053, CURRENT_DATE, NULL, NULL, NULL); The result of executing the SQL statement again would now be as follows: Part # Row Count 1 169690 2 163810 3 68440 4 33490 5 18640 6 27520 7 1760 8 1 9 1 10 1 Page 17-44 internally mapped to partition #3 internally mapped to partition #4 internally mapped to partition #5 internally mapped to partition #6 internally mapped to partition #7 internally mapped to partition #8 internally mapped to partition #9 internally mapped to partition # 10 internally mapped to partition # 2 (NO CASE) internally mapped to partition # 1 (UNKNOWN) Partitioned Primary Indexes SQL Use of PARTITION Key Word The PARTITION SQL key word can be used to return partition numbers that have rows and a count of rows that are currently located in partitions of a table. SQL: SELECT PARTITION AS "Part #", COUNT(*) AS "Row Count" FROM Sales_Revenue GROUP BY 1 ORDER BY 1; Result: Part # 1 2 3 4 5 6 7 Row Count 169690 163810 68440 33490 18640 27520 1760 total_revenue < total_revenue < total_revenue < total_revenue < total_revenue < total_revenue < total_revenue < 2,000 4,000 6,000 8,000 10,000 20,000 50,000 SQL - insert two rows: INSERT INTO Sales_Revenue VALUES (1003, 5052, CURRENT_DATE, 102000, 113, NULL); INSERT INTO Sales_Revenue VALUES (1003, 5053, CURRENT_DATE, NULL, NULL, NULL); SQL (same as above): SELECT PARTITION AS "Part #", COUNT(*) AS "Row Count" FROM Sales_Revenue GROUP BY 1 ORDER BY 1; Partitioned Primary Indexes Result: Part # 1 2 : 7 9 10 Row Count 169690 163810 : 1760 1 1 total_revenue < total_revenue < : total_revenue < NO CASE UNKNOWN 2,000 4,000 50,000 Page 17-45 SQL Use of CASE_N The facing page contains an example of using the CASE_N expression with SQL. You may wish to use this function to determine/forecast how rows will be mapped to various partitions in a table. 
The Sales_Revenue table was created as follows:

CREATE TABLE Sales_Revenue
  ( store_id       INTEGER NOT NULL,
    item_id        INTEGER NOT NULL,
    sales_date     DATE FORMAT 'YYYY-MM-DD',
    total_revenue  DECIMAL(9,2),
    total_sold     INTEGER,
    note           VARCHAR(256))
PRIMARY INDEX (store_id, item_id, sales_date)
PARTITION BY CASE_N
  ( total_revenue <   2000,
    total_revenue <   4000,
    total_revenue <   6000,
    total_revenue <   8000,
    total_revenue <  10000,
    total_revenue <  20000,
    total_revenue <  50000,
    total_revenue < 100000,
    NO CASE, UNKNOWN);

The CASE_N expression in the query on the facing page is simply an SQL statement that shows how the rows would be partitioned.

SQL Use of RANGE_N

An example of using the RANGE_N expression with SQL is:

SELECT RANGE_N ( Calendar_Date BETWEEN DATE '2004-11-28' AND DATE '2004-12-31' EACH INTERVAL '7' DAY,
                                       DATE '2005-01-01' AND DATE '2005-01-09' EACH INTERVAL '7' DAY)
                             AS "Part #",
       MIN (Calendar_Date)   AS "Minimum Date",
       MAX (Calendar_Date)   AS "Maximum Date"
FROM   Sys_Calendar.Calendar
WHERE  Calendar_Date BETWEEN DATE '2004-11-28' AND DATE '2005-01-09'
GROUP BY "Part #"
ORDER BY "Part #";

Output from this SQL is:

Part #   Minimum Date   Maximum Date
  1      2004-11-28     2004-12-04
  2      2004-12-05     2004-12-11
  3      2004-12-12     2004-12-18
  4      2004-12-19     2004-12-25
  5      2004-12-26     2004-12-31
  6      2005-01-01     2005-01-07
  7      2005-01-08     2005-01-09

Page 17-46 Partitioned Primary Indexes

SQL Use of CASE_N

The CASE_N (and RANGE_N) expressions can be used with SQL to forecast the number of rows that will be placed into partitions. This example uses a different partitioning scheme than the table actually has to determine how many rows would be placed into various partitions.

SELECT CASE_N ( total_revenue <  1500 ,
                total_revenue <  2000 ,
                total_revenue <  3000 ,
                total_revenue <  5000 ,
                total_revenue <  8000 ,
                total_revenue < 12000 ,
                total_revenue < 20000 ,
                total_revenue < 50000 ,
                NO CASE, UNKNOWN )   AS "Case #",
       COUNT(*)                      AS "Row Count"
FROM   Sales_Revenue
GROUP BY 1
ORDER BY 1;

Result:

Case #   Row Count
  1        81540
  2        88150
  3        97640
  4       103230
  5        64870
  6        31290
  7        14870
  8         1760

Notes:
• Currently, in this table, there are no rows with total_revenue >= 50,000 or NULL.
• The Case # would become the Partition # if the table was partitioned in this way.

Partitioned Primary Indexes Page 17-47

Using ALTER TABLE with PPI Tables

The ALTER TABLE statement has been extended in support of PPI. For empty tables, the primary index and partitioning expression may be re-specified. For tables with rows, the partitioning expression may be modified only in ways that don't require existing rows to be re-evaluated. The permitted changes for populated tables are to drop ranges at the ends or to add ranges at the ends. For example, a common use of this capability would be to drop ranges for the oldest dates and to prepare additional ranges for future dates, among other things.

Limitations with ALTER TABLE:
• The Primary Index of a non-empty table may not be altered.
• Partitioning of a non-empty table is generally limited to altering the "ends".
• If a table has Delete triggers, they must be disabled if the WITH DELETE option is specified.
• If a save table has Insert triggers, they must be disabled if the WITH INSERT option is specified.

For empty tables with a PPI, the ALTER TABLE statement can be used to do the following:
• Remove partitioning for a partitioned table
• Establish partitions for a table (adds or replaces)
• Change the columns that comprise the primary index
• Change a unique primary index to non-unique
• Change a non-unique primary index to unique
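As a hedged illustration of these empty-table changes, the statements might look like the following. The staging table name is made up for this sketch, its columns are assumed to mirror Sales_Revenue, and the exact ALTER TABLE syntax for a given release should be confirmed against the DDL reference.

-- Re-specify the primary index and establish range partitioning (table must be empty):
ALTER TABLE Sales_Revenue_Stage
  MODIFY PRIMARY INDEX (store_id, item_id, sales_date)
  PARTITION BY RANGE_N (sales_date BETWEEN DATE '2003-01-01'
                                       AND DATE '2012-12-31'
                                       EACH INTERVAL '1' MONTH);

-- Remove the partitioning again (table must still be empty):
ALTER TABLE Sales_Revenue_Stage
  MODIFY PRIMARY INDEX (store_id, item_id, sales_date) NOT PARTITIONED;

-- Change the non-unique primary index to a unique primary index:
ALTER TABLE Sales_Revenue_Stage
  MODIFY UNIQUE PRIMARY INDEX (store_id, item_id, sales_date);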
For empty or non-empty tables, the ALTER TABLE statement can also be used to name an unnamed primary index or drop the name of a named primary index. To name an unnamed primary index or change the existing name of a primary index to something else, specify … MODIFY PRIMARY INDEX index_name; To drop the name of a named index, specify … MODIFY PRIMARY INDEX NOT NAMED; Assume you have a populated data table (and the table is quite large) defined with a “nonunique partitioned primary index” and all of the partitioning columns are part of the PI. You realize that the table should have been defined with a “unique partitioned primary index”, but the table is already loaded with data. Here is a technique to convert this NUPI into a UPI without copying or reloading the data. Page 17-48 CREATE a USI on the columns making up the PI. ALTER the table, effectively changing the NUPI to a UPI, and the software will automatically drop the USI. Partitioned Primary Indexes Using ALTER TABLE with PPI Tables The ALTER TABLE statement has enhancements for a partitioned table to modify the partitioning properties of the primary index for a table. For populated tables, ... • You are permitted to drop and/or add ranges at the “ends” of existing partitions on a range-partitioned table. – ALTER TABLE includes ADD / DROP RANGE options. – You can also add or drop special partitions (NO RANGE or UNKNOWN). – You cannot drop all the ranges. • Possible use – drop ranges for the oldest dates and prepare additional ranges for future dates. • The set of primary index columns cannot be altered for a populated table. Teradata 13.10 Feature • ALTER TABLE has a new option to resolve partitioned table definitions with DATE, CURRENT_DATE, and CURRENT_TIMESTAMP to their current values. – This feature only applies to partitioned tables and join indexes. To use ALTER TABLE for any purpose other than the above situations, the table must be empty. Partitioned Primary Indexes Page 17-49 ALTER TABLE – Example 5 The DROP RANGE option is used to drop a range set from the RANGE_N function on which the partitioning expression for the table is based. You can only drop ranges if the partitioning expression for the table is derived only from a RANGE_N function. You can drop empty partitions without specifying the WITH DELETE or WITH INSERT option. Some of the ALTER TABLE statement options include: DROP RANGE WHERE conditional_expression – a conditional partitioning expression used to drop a range set from the RANGE_N function on which the partitioning expression for the table is based. You can only drop ranges if the partitioning expression for the table is derived only from a RANGE_N function. You must base conditional_partitioning_expression on the system-derived PARTITION column. DROP RANGE BETWEEN … [NO RANGE [OR UNKNOWN]] – used to drop a set of ranges from the RANGE_N function on which the partitioning expression for the table is based. You can also drop NO RANGE OR UNKNOWN and UNKNOWN specifications from the definition for the RANGE_N function. You can only drop ranges if the partitioning expression for the table is derived exclusively from a RANGE_N function. Ranges must be specified in ascending order. ADD RANGE BETWEEN … [NO RANGE [OR UNKNOWN]] – used to add a set of ranges to the RANGE_N function on which the partitioning expression for the table is based. You can also add NO RANGE OR UNKNOWN and UNKNOWN specifications to the definition for the RANGE_N function. 
You can only add ranges if the partitioning expression for the table is derived exclusively from a RANGE_N function. DROP Does NOT Mean DELETE If a table does not have the NO RANGE partition, then partitions are dropped from the table without using the Transient Journal and the rows are either deleted or are copied (WITH INSERT) into a user-specified table. If a table has a NO RANGE partition, rows are copied from dropped partition into the NO RANGE partition. Page 17-50 Partitioned Primary Indexes ALTER TABLE – Example 5 To drop/add partitions and NOT COPY the old data to another table: ALTER TABLE Sales MODIFY PRIMARY INDEX DROP RANGE BETWEEN DATE '2003-01-01' AND DATE '2003-12-31' EACH INTERVAL '1' MONTH ADD RANGE BETWEEN DATE '2013-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' MONTH WITH DELETE; To drop/add partitions and COPY the old data to another table: ALTER TABLE Sales MODIFY PRIMARY INDEX DROP RANGE BETWEEN DATE '2003-01-01' AND DATE '2003-12-31' EACH INTERVAL '1' MONTH ADD RANGE BETWEEN DATE '2013-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' MONTH WITH INSERT INTO SalesHistory; Notes: • Ranges are dropped and/or added to the "ends". • DROP does NOT necessarily mean DELETE! – If a table has a NO RANGE partition, rows are moved from the dropped partitions into the NO RANGE partition. This can be time consuming. • • The SalesHistory table must exist before using the WITH INSERT option. The Sales table was partitioned as follows: PARTITION BY RANGE_N (sales_date BETWEEN DATE '2003-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH ); Partitioned Primary Indexes Page 17-51 ALTER TABLE – Example 5 (cont.) This page contains notes on the internal implementation. The important point is to understand that dropping or adding partitions (to the “ends” of an already partitioned table with data) does not cause changes to the internal partitioning numbers that are currently implemented. The logical partition numbers change, but the internal partition numbers do not. For this reason, dropping or adding partitions does not cause an undue amount of work. The following table shows the same result as the facing page, but also identifies the internal partition #’s as allocated. PARTITION 1 2 : 13 14 : 119 120 Count(*) 10850 10150 : 12400 11200 : 14800 14950 internally mapped to partition #3 internally mapped to partition #4 : internally mapped to partition #15 internally mapped to partition #16 : internally mapped to partition #121 internally mapped to partition #122 In the example on the facing page, 12 partitions were dropped for the year 2003 and 12 partitions were added for the year 2013. The partitions for 2013 don’t appear because they are empty. The following table shows the same result as the facing page, but also identifies the internal partition #’s as allocated after the partitions for the year 2003 were dropped. PARTITION 1 2 : 107 108 12400 11200 : 14800 14950 internally mapped to partition #15 internally mapped to partition #16 : internally mapped to partition #121 internally mapped to partition #122 You can add the NO RANGE and/or UNKNOWN partitions to an already partitioned table. ALTER TABLE Sales MODIFY PRIMARY INDEX ADD RANGE NO RANGE OR UNKNOWN; If this table had NO RANGE partition defined and the 12 partitions were dropped (as in this example), the data rows from the dropped partitions are moved to the NO RANGE partition. 
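Before issuing a DROP RANGE against a populated table, it can be useful to gauge how many rows sit in the ranges about to be dropped, since those rows will be deleted, copied to the save table, or moved into the NO RANGE partition. A minimal sketch against the Sales table from this example:

SELECT COUNT(*) AS "Rows in 2003 ranges"
FROM   Sales
WHERE  sales_date BETWEEN DATE '2003-01-01' AND DATE '2003-12-31';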
To remove the special partitions and delete the data, use the following command: ALTER TABLE Sales MODIFY PRIMARY INDEX DROP RANGE NO RANGE OR UNKNOWN WITH DELETE; Page 17-52 Partitioned Primary Indexes ALTER TABLE – Example 5 (cont.) Partitions may only be dropped or added from/to the “ends” of a populated table. SQL: SELECT PARTITION, COUNT(*) FROM Sales GROUP BY 1 ORDER BY 1; Result: PARTITION COUNT(*) 1 2 : 119 120 10850 10150 : 14800 14950 Part #1 - January 2003 Part #120 - December 2012 ALTER TABLE Sales MODIFY PRIMARY INDEX DROP RANGE BETWEEN DATE '2003-01-01' AND DATE '2003-12-31' EACH INTERVAL '1' MONTH ADD RANGE BETWEEN DATE '2013-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' MONTH WITH DELETE; SQL: SELECT PARTITION, COUNT(*) FROM Sales GROUP BY 1 ORDER BY 1; Partitioned Primary Indexes Result: PARTITION COUNT(*) 1 2 : 107 108 12400 11200 : 14800 14950 Part #1 - January 2004 Part #108 - December 2012 Page 17-53 ALTER TABLE TO CURRENT Staring with Teradata 13.10, you can now specify CURRENT_DATE and CURRENT_TIMESTAMP functions in a partitioned primary index for base tables and join indexes. Also starting with Teradata 13.10, Teradata provides a new option with the ALTER TABLE statement to modify a partitioned table that has been defined with a moving CURRENT_DATE (or DATE) or moving CURRENT_TIMESTAMP. This new option is called ALTER TABLE TO CURRENT. When you specify CURRENT_DATE and CURRENT_TIMESTAMP as part of a partitioning expression for a partitioned table, these functions resolve to the date and timestamp when you define the PPI. To partition on a new CURRENT_DATE or CURRENT_TIMESTAMP value, submit an ALTER TABLE TO CURRENT request. The ALTER TABLE TO CURRENT syntax is shown on the facing page. The WITH DELETE option is used to delete any row whose partition number evaluates to a value outside the valid range of partitions. The WITH INSERT [INTO] save_table option is used to insert any row whose partition number evaluates to a value outside the valid range of partitions into the table specified by save_table. The WITH DELETE or INSERT INTO save_table clause is sometimes referred to as a null partition handler. You cannot specify a null partition handler for a join index. Save_table and the table being altered must be different tables with different names. Page 17-54 Partitioned Primary Indexes ALTER TABLE TO CURRENT This Teradata 13.10 option allows you to periodically resolve the CURRENT_DATE (or DATE) and CURRENT_TIMESTAMP of a partitioned table to their current values. Benefits include: • You do not have to change the partitioning expression to update the value for CURRENT_DATE or CURRENT_TIMESTAMP. • To partition on a new CURRENT_DATE or CURRENT_TIMESTAMP value, simply submit an ALTER TABLE TO CURRENT request. Considerations: • The ALTER TABLE TO CURRENT request causes the CURRENT_DATE and/or CURRENT_TIMESTAMP to effectively repartition the rows in the table. • If RANGE_N specifies CURRENT_DATE or CURRENT_TIMESTAMP in a partitioning expression, you cannot use ALTER TABLE to add or drop ranges for the table. You must use the ALTER TABLE TO CURRENT statement to achieve this function. ALTER TABLE table_name join_index_name Partitioned Primary Indexes TO CURRENT ; WITH DELETE INSERT [INTO] save_table Page 17-55 ALTER TABLE TO CURRENT – Example 6 The ALTER TABLE TO CURRENT option allows you to periodically modify the partitioning. This option resolves the CURRENT_DATE (or DATE) and CURRENT_TIMESTAMP to their current values. 
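In its simplest forms, and reusing the Sales and SalesHistory tables from the earlier example, the request looks like the following. The save table must already exist and must be a different table than the one being altered.

-- Repartition on the current date and delete rows that fall outside the valid partitions:
ALTER TABLE Sales TO CURRENT WITH DELETE;

-- Or save such rows into another table instead of deleting them:
ALTER TABLE Sales TO CURRENT WITH INSERT INTO SalesHistory;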
The example on the facing page assumes partitioning begins on a year boundary. Using this example, consideration for the two options are: With hard-coded dates in the CREATE TABLE statement, you must compute the new dates and specify them explicitly in the ADD RANGE clause of the request. This requires manual intervention every year you submit the request. With CURRENT_DATE in the CREATE TABLE statement, you can schedule the ALTER TABLE TO CURRENT request to be submitted annually or simply execute the next year. This request rolls the partition window forward by efficiently dropping and adding partitions. As a result of executing the ALTER TABLE TO CURRENT WITH DELETE, Teradata deletes the rows from the table because they are no longer needed. Considerations: You should evaluate how a DATE, CURRENT_DATE, or CURRENT_TIMESTAMP function will require reconciliation in a partitioning expression before you define such expressions on a table or join index. If you specify multiple ranges using a DATE or CURRENT_DATE function in one of the ranges, and then later reconcile the partitioning the range specified using CURRENT_DATE might overlap one of the existing ranges. If so, reconciliation aborts the request and returns an error to the requestor. If this happens, you must recreate the table with a new partitioning expression based on DATE or CURRENT_DATE. Because of this, you should design a partitioning expression that uses a DATE or CURRENT_DATE function in one of its ranges with care. DATE, CURRENT_DATE, and CURRENT_TIMESTAMP functions in a partitioning expression are most appropriate when the data must be partitioned as one or more Current partitions and one or more History partitions, where the terms Current and History are defined with respect to the resolved DATE, CURRENT_DATE, or CURRENT_TIMESTAMP values in the partitioning expression. This enables you to reconcile a table or join index periodically to move older data from the current partition into one or more history partitions using an ALTER TABLE TO CURRENT request instead of redefining the partitioning using explicit dates that must be determined each time you alter a table using ALTER TABLE requests to ADD or DROP ranges. Page 17-56 Partitioned Primary Indexes ALTER TABLE TO CURRENT – Example 6 This example creates a partitioning expression to maintain the last 8 years of historical data, data for the current year, and data for one future year for a total of 10 years. CREATE TABLE Sales ( store_id INTEGER NOT NULL, item_id INTEGER NOT NULL, sales_date DATE FORMAT 'YYYY-MM-DD', : PRIMARY INDEX (store_id, item_id) PARTITION BY RANGE_N (sales_date BETWEEN DATE '2004-01-01' AND DATE '2013-12-31' EACH INTERVAL '1' MONTH); Assuming the current year is 2012, an equivalent definition using CURRENT_DATE is: PRIMARY INDEX (store_id, item_id) PARTITION BY RANGE_N (sales_date BETWEEN CAST(((EXTRACT(YEAR FROM CURRENT_DATE) - 8 - 1900) * 10000 + 0101) AS DATE) AND CAST(((EXTRACT(YEAR FROM CURRENT_DATE) +1 - 1900) * 10000 + 1231) AS DATE) EACH INTERVAL '1' MONTH); In 2013, execute ALTER TABLE Sales TO CURRENT WITH DELETE; • Teradata deletes the rows from 2004 because they are no longer needed. • To view the date when the table was last resolved, then DBC.IndexConstraintsV provides new columns named "ResolvedCurrent_Date" and "ResolvedCurrent_TimeStamp". Partitioned Primary Indexes Page 17-57 PPI Enhancements The facing page identifies various enhancements with different Teradata releases. 
Page 17-58 Partitioned Primary Indexes PPI Enhancements Teradata V2R6.0 • Selected Partition Archive, Restore, and Copy • Dynamic partition elimination for merge join • Single-AMP NUSI access when NUSI on same columns as NUPI; • Partition elimination on RowIDs referenced by NUSI Teradata V2R6.1 • PPI for global temporary tables and volatile tables • Collect statistics on system-derived column PARTITION Teradata V2R6.2 • PPI for non-compressed join indexes Teradata 12.0 • Multi-level partitioning Teradata 13.10 • Tables and non-compressed join indexes can now include partitioning on a character column. • PPI tables allow a test value (e.g., RANGE_N) to have a TIMESTAMP(n) data type. • ALTER TABLE tablename TO CURRENT …; Teradata 14.0 • Increased partition limit to 9.223 quintillion • New data types for RANGE_N – BIGINT and TIMESTAMP • ADD option for a partitioning level Partitioned Primary Indexes Page 17-59 Multi-level PPI Concepts The facing page identifies the basic concepts of using a multi-level PPI. Multi-level partitioning allows each partition at a given level to be further partitioned into sub-partitions. Each partition for a level is sub-partitioned the same per a partitioning expression defined for the next lower level. The system hash orders the rows within the lowest partition levels. A multilevel PPI (MLPPI) undertakes efficient searches by using partition elimination at the various levels or combinations of levels. Notes associated with multilevel partitioning: Note that the number of levels of partitioning cannot exceed 15. Each level must define at least two partitions. The number of levels of partitioning may be further restricted by other limits such as the maximum size of the table header, data dictionary entry sizes, etc. The number of partitions in a table cannot exceed 65,535 partitions. The number of partitions in an MLPPI is determined by multiplying the number of partitions at the different levels (d1 * d2 * d3 * …). The specification order of partitioning expressions can be important for multi-level partitioning. The system maps multi-level partitioning expressions into a singlelevel combined partitioning expression. It then maps the resulting combined partition number 1-to-1 to an internal partition number. A usage implication - you can alter only the highest partition level, which by definition is always level 1, to change the number of partitions at that level when the table is populated with rows. Page 17-60 Partitioned Primary Indexes Multi-level PPI Concepts • Allows multiple partitioning expressions instead of only one for a table or a noncompressed join index. • Multilevel partitioning allows each partition at a level to be sub-partitioned. – Each partitioning level is defined independently using a RANGE_N or CASE_N expression. • A multi-level PPI allows efficient searches by using partition elimination at the various levels or combination of levels. • Allows more flexibility in which partitioning expression to use when there are multiple choices for the partitioning expressions. • Teradata 14 allows for a maximum of 9.223 quintillion partitions and 62 levels. • Syntax: PARTITION BY partitioning_expression 14* ( partitioning_expression , partitioning_expression ) 14* – Teradata 13.10 limit. Partitioned Primary Indexes Page 17-61 Multi-level PPI Concepts (cont.) The facing page contains an example showing the benefit of using a multi-level PPI. 
You can use a multilevel PPI to improve query performance via partition elimination, either at each of the partition levels or by combining all of them. An MLPPI provides multiple access paths to the rows in the base table. As with other indexes, the Optimizer determines if the index is usable for a query and, if usable, whether its use provides the estimated least costly plan for executing the query. The following list describes the various access methods that are available when a multilevel PPI is defined for a table: Page 17-62 If there is an equality constraint on the primary index and there are constraints on the partitioning columns such that access is limited to a single partition at each level, access is as efficient as with an NPPI. This is a single-AMP, single-hash access in a single sub-partition at the lowest level of the partition hierarchy. With constraints defined on the partitioning columns, performance of a primary index access can approach the performance of an NPPI depending on the extent of partition elimination that can be achieved. This is a single-AMP, single-hash access in multiple (but not all) sub-partitions at the lowest level of the partition hierarchy. Access by means of equality constraints on the primary index columns that does not also include all the partitioning columns, and without constraints defined on the partitioning columns, might not be as efficient as access with an NPPI. The efficiency of the access depends on the number of non-empty sub-partitions at the lowest level of the partition hierarchy. This is a single-AMP, single-hash access in all sub-partitions at the lowest level of the partition hierarchy. With constraints on the partitioning columns of a partitioning expression such that access is limited to a subset of, say n percent, of the partitions for that level, the scan of the data is reduced to about n percent of the time required by a full-table scan. This is an all-AMP scan of only the non-eliminated partitions for that level. This allows multiple access paths to a subset of the data: one for each partitioning expression. If constraints are defined on partitioning columns for more than one of the partitioning expressions in the MLPPI definition, partition elimination can lead to even less of the data needing to be scanned. Partitioned Primary Indexes Multi-level PPI Concepts (cont.) Query – Compare District 25 revenue for Week 6 vs. same period last year? Sales for 2 Years Week 6 Sales Only Week 6 Sales for District 25 Only Full File Scan No Partitioning Partitioned Primary Indexes Single Level Partitioning Multi-Level Partitioning Page 17-63 Multi-level Partitioning – Example 7 You create an MLPPI by specifying two or more partitioning expressions, where each expression must be defined using either a RANGE_N function or a CASE_N function exclusively. The system combines the individual partitioning expressions internally into a single partitioning expression that defines how the data is partitioned on an AMP. The first partitioning expression is the highest level partitioning. Within each of those partitions, the second partitioning expression defines how each of the highest-level partitions is sub-partitioned. Within each of those second-level partitions, the third-level partitioning expression defines how each of the second level partitions is sub-partitioned. Within each of these lowest level partitions, rows are ordered by the row hash value of their primary index and their assigned uniqueness value. 
You define the ordering of the partitioning expressions in your CREATE TABLE SQL text, and that ordering implies the logically ordering by RowID. Because the partitions at each level are distributed among the partitions of the next higher level in the hierarchy, scanning a partition at a certain level requires skipping some internal partitions. Partition expression order does not affect the ability to eliminate partitions, but does affect the efficiency of a partition scan. As a general rule, this should not be a concern if there are many rows, which implies multiple data blocks, in each of the partitions. The facing page contains an example of creating a multi-level PPI. There are two levels of partitioning defined in this example. The first level defines 120 partitions and the second defines 75 partitions. Therefore, the total number of partitions for the combined partitioning expression is the product of 120 * 75, or 9000. Page 17-64 Partitioned Primary Indexes Multi-level Partitioning – Example 7 For example, partition Claim table by "Claim Date" and "State ID". CREATE TABLE Claim ( claim_id INTEGER NOT NULL ,cust_id INTEGER NOT NULL ,claim_date DATE NOT NULL ,state_id BYTEINT NOT NULL ,…) PRIMARY INDEX (claim_id) PARTITION BY ( /* First level of partitioning */ RANGE_N (claim_date BETWEEN DATE '2003-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH ), /* Second level of partitioning */ RANGE_N (state_id BETWEEN 1 AND 75 EACH 1) ) UNIQUE INDEX (claim_id); Notes: • For multi-level PPI, the set of partitioning expressions must be enclosed in parentheses. • Each level must define at least two partitions for a multi-level PPI. • The number of defined partitions in this example is (120 * 75) or 9000. Partitioned Primary Indexes Page 17-65 Multi-level Partitioning – Example 7 (cont.) The facing page continues the example of using a multi-level PPI. This example assumes that the query has conditions where only claims for a specific month and for a specific state need to be returned. Teradata only needs to scan the data blocks associated with the specified criteria. Page 17-66 Partitioned Primary Indexes Multi-level PPI Example 7 (cont.) Assume • Eliminating all but one month out of many years of claims history would facilitate scanning less than 2% of the claims history. • Similarly, eliminating all but the California claims out of the many states would facilitate scanning less than 4% of the claims history. Then, combining both of these predicates for partition elimination would facilitate scanning less than 0.08% of the claims history for satisfying the following query. SELECT FROM WHERE AND AND … Claim C, States S C.state_id = S.state_id S.state_name = 'California' C.claim_date BETWEEN DATE '2012-01-01' AND DATE '2012-01-31'; Partitioned Primary Indexes Page 17-67 How is the MLPPI Partition # Calculated? The facing page shows the calculation that is used to determine the partition number for a MLPPI table. Page 17-68 Partitioned Primary Indexes How is the MLPPI Partition # Calculated? Multilevel partitioning is rewritten internally to single-level partitioning to generate a combined partition number as follows: (p1 - 1) * dd1 + (p2 - 1) * dd2 + ... + (pn-1 - 1) * ddn-1 + pn where n is the number of partitioning expressions pi is the value of the partitioning expression for level i di is the number of partitions for level i ddi is the product of di+1 through dn dd = d1* d2 * ... 
* dn <= 65535 dd is the total number of combined partitions Example: Assume January, 2012 is the 109th first level partition and California is the 6th state code for the second level partition. Also assume that the first level has 120 partitions and the second level has 75 partitions. (109 – 1) * 75 + 6 = 8106 is the logical partition number for claims in California for January of 2012. Partitioned Primary Indexes Page 17-69 Character PPI This Teradata 13.10 feature extends the capabilities and options when defining a PPI for a table or a non-compressed join index. Tables and non-compressed join indexes can now include partitioning on a character column. This feature provides the improved query performance benefits of partitioning when queries have conditions on columns with character (alphanumeric) data. Before Teradata 13.10, customers were limited to creating partitioning on tables that did not involve comparison of character data. Partitioning expressions were limited to numeric or date type data. The Partitioned Primary Index (PPI) feature of Teradata has always allowed a class of queries to access a portion of a large table, instead of the entire table. This capability has simply been extended to include character data. The traditional uses of the Primary Index (PI) for data placement and rapid access of the data when the PI values are specified are still retained. When creating a table or a join index, the PARTITION BY clause (part of PRIMARY INDEX) can now include partitioning on a character column. This allows the comparison of character data. This feature allows a partitioning expression to involve comparison of character data (CHAR, VARCHAR, GRAPHIC, VARGRAPHIC) types. A comparison may involve a predicate (=, >, <, >=, <=, <>, BETWEEN, LIKE) or a string function. The use of a character expression is a PPI table is referred to as CPPI (Character PPI). The most common partitioning expressions utilize RANGE_N or CASE_N expressions. Prior to Teradata 13.10, both the CASE_N and RANGE_N functions did not allow the PPI definition of character data. This limited the useful partitioning that could be done using character columns as a standard ordering (collation) of the character data is not preserved. Both the RANGE_N and CASE_N functions support the definition of character data in Teradata 13.10. The term "character or char" will be used to represent CHAR, VARCHAR, GRAPHIC, or VARGRAPHIC data types. The test value of a RANGE_N function should be a simple column reference, involving no other functions or expressions. For example, if SUBSTR is added, then static partition elimination will not occur. Keep the partitioning expressions as simple as possible. RANGE_N (SUBSTR (state_code, 1, 1) BETWEEN 'AK' and 'CA', … This definition will not allow static partition elimination. Page 17-70 Partitioned Primary Indexes Character PPI Tables and non-compressed join indexes can now include partitioning on a character column. This feature is referred to as CPPI (Character PPI). • Prior to Teradata 13.10, partitioning expressions (RANGE_N and CASE_N) are limited to numeric or date type data. This feature allows a partitioning expression to involve comparison of character data (CHAR, VARCHAR, GRAPHIC, VARGRAPHIC) types. A comparison may involve a predicate (=, >, <, >=, <=, <>, BETWEEN, LIKE) or a string function. 
Collation and case sensitivity considerations: • The session collation in effect when the character PPI is created determines the ordering of data used to evaluate the partitioning expression. • The ascending order of ranges in a character PPI RANGE_N expression is defined by the session collation in effect when the PPI is created or altered, as well as the case sensitivity of the column or expression in the test value. • The default case sensitivity of character data for the session transaction semantics in effect when the PPI is created will also determine case sensitivity of comparison unless overridden with an explicit CAST to a specific case sensitivity. Partitioned Primary Indexes Page 17-71 Character PPI – Example 8 In this example, the Claim table is first partitoned by claim_date (monthly intervals). Claim_date is then sub-partitioned by state codes. State codes are then sub-partitioned by the first two letters of a city name. The special partitions of NO RANGE and UNKNOWN are defined at the claim_date, state_code, and city levels. Why is the facing page partitioning example defined with intervals of 1 month for claim_date? Teradata 13.10 has a maximum limit of 65,535 partitions in a table and defining 8 years of day partitioning with two levels of sub-partitioning cause this limit to be exceeded. The following queries will benefit from this type of partitioning. SELECT * FROM Claim_MLPPI2 WHERE state_code = 'GA' AND city LIKE 'a%'; SELECT * FROM Claim_MLPPI2 WHERE claim_date = '2012-08-24' AND city LIKE 'a%'; The session mode when these tables were created and when these queries were executed was Teradata mode (BTET). Teradata mode defaults to "not case specific". The session collation in effect when the character PPI is created determines the ordering of data used to evaluate the partitioning expression. The ascending order of ranges in a character PPI RANGE_N expression is defined by the session collation in effect when the PPI is created or altered, as well as the case sensitivity of the column or expression in the test value. The default case sensitivity of character data for the session transaction semantics in effect when the PPI is created will also determine case sensitivity of comparison unless overridden with an explicit CAST to a specific case sensitivity. The default case sensitivity in effect when the character PPI is created will also affect the ordering of character data for the PPI. Default case sensitivity of comparisons involving character constants is influenced by the session mode. String literals have a different default CASESPECIFIC attribute depending on the session mode. – – Page 17-72 Teradata Mode (BTET) is NOT CASESPECIFIC ANSI mode is CASESPECIFIC If any expression in the comparison is case specific, then the comparison is case sensitive. Partitioned Primary Indexes Character PPI – Example 8 In this example, 3 levels of partitioning are defined. CREATE TABLE Claim_MLPPI2 (claim_id INTEGER cust_id INTEGER claim_date DATE city VARCHAR(30) state_code CHAR(2) claim_info VARCHAR (256)) PRIMARY INDEX (claim_id) PARTITION BY ( RANGE_N (claim_date BETWEEN RANGE_N (state_code RANGE_N (city NOT NULL, NOT NULL, NOT NULL, NOT NULL, NOT NULL, DATE '2005-01-01' and DATE '2012-12-31' EACH INTERVAL '1' MONTH, NO RANGE), BETWEEN 'A', 'D', 'I', 'N', 'T' AND 'ZZ', NO RANGE), BETWEEN 'A', 'C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S', 'U', 'W' AND 'ZZ', NO RANGE) ) UNIQUE INDEX (claim_id); The following queries will benefit from this type of partitioning. 
• SELECT * FROM Claim_MLPPI2 WHERE state_code = 'OH'; • SELECT * FROM Claim_MLPPI2 WHERE state_code = 'GA' AND city LIKE 'a%'; • SELECT * FROM Claim_MLPPI2 WHERE claim_date = DATE '2012-08-24' AND city LIKE 'a%'; Partitioned Primary Indexes Page 17-73 Summary The customer (e.g., DBA, Database Designer, etc.) has a flexible and powerful tool to structure tables to allow automatic optimization of frequently used queries. This tool is the Partitioned Primary Index (PPI) feature. This feature allows tables to be partitioned on columns of interest, while retaining the traditional use of the primary index (PI) for data distribution and efficient access when the PI values are specified in the query. The facing page contains a summary of the key customer benefits that can be obtained by using Partitioned Primary Indexes. Whether and how to partition a table is a physical design choice. A well-chosen partitioning scheme may be able to turn many frequently run queries into partial-table scans instead of full-table scans, with much improved performance. However, understand that there are trade-off considerations that must be understood and carefully considered to get the most benefit from the PPI feature. Page 17-74 Partitioned Primary Indexes Summary • Improves the performance of queries that use range constraints on the range partitioning column(s) by allowing for range/partition elimination. – Allows primary index access and range access without a secondary index. • General Recommendations – Collect statistics on system-derived column PARTITION. – Do not define or name a column PARTITION in a PPI table – you won’t be able to reference the system-derived column PARTITION for the table. – If possible, avoid use of NO RANGE, NO RANGE OR UNKNOWN, or UNKNOWN options with RANGE_N partitioning for DATE columns. – Consider only having as many date ranges as needed currently plus some for the future – helps the optimizer cost plans better, especially when partitioning column is not included in the Primary Index. • Note (as with all partitioning/indexing schemes) there are tradeoffs due to performance impacts on table access, joins, maintenance, and other operations. Partitioned Primary Indexes Page 17-75 Module 17: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 17-76 Partitioned Primary Indexes Module 17: Review Questions Partition # Row Hash Uniqueness Value 1. In a PPI table, every row is uniquely identified by its ______ _____ + ______ ______ + _____ ______ . Partition # + _______ Row _______ Hash . 2. The Row Key consists of the ________ ________ FALSE (it only cares about the first part of 4. True or False. For a PPI table, the partition number and the Row Hash are both used by the the row hash) 0 3. In an NPPI table, the partition number defaults to ________ . Message Passing Layer to determine which AMP(s) should receive the request. 5. Which two options apply to the RANGE_N expression in a partitioning expression? ____ ____ a. b. c. d. Ranges can be listed in descending order Allows use of NO RANGE OR UNKNOWN option Partitioning column must be part of the Primary Index Has a maximum of 65,535 partitions with Teradata Release 13.10 6. With a populated table, select 2 actions that are allowed with the ALTER TABLE command. ____ ____ a. b. c. d. 
Drop all of the ranges Add or drop ranges from the partition “ends” Change the columns that comprise the primary index Add or drop special partitions (NO RANGE, UNKNOWN) 7. Which 2 choices are advantages of partitioning a table? ____ ____ a. b. c. d. Fast delete of rows in partitions Fewer AMPs are involved when accessing data Faster access (than an NPPI table) if the table has a UPI Range queries can be executed without a secondary index Partitioned Primary Indexes Page 17-77 Module 17: Review Questions (cont.) Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 17-78 Partitioned Primary Indexes Module 17: Review Questions (cont.) Given this CREATE TABLE statement, answer the following questions. CREATE TABLE Orders (Order_id INTEGER NOT NULL, Cust_id INTEGER NOT NULL, Order_status CHAR(1), Total_price DECIMAL(9,2) NOT NULL, Order_date DATE FORMAT 'YYYY-MM-DD' NOT NULL, Order_priority SMALLINT, Clerk CHAR(16), Ship_priority SMALLINT, Order_Comment VARCHAR(80) ) PRIMARY INDEX (Order_id) PARTITION BY RANGE_N (Order_date BETWEEN DATE '2003-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH) UNIQUE INDEX (Order_id); Order_date 8. What is the name of partitioning column? ______________ 1 Month 9. What is the time period for each partition? ______________ 10. Why is there a Unique Secondary Index specified instead of defining Order_id as a UPI? _____ a. This is a coding style choice. b. You cannot have a UPI when using a partitioned primary index. c. You cannot have a UPI if the partitioning column is not part of the primary index. d. This is a mistake. You cannot have a secondary and a primary index on the same column(s). Partitioned Primary Indexes Page 17-79 Lab Exercise 17-1 Check your understanding of the concepts discussed in this module by completing the lab exercises as directed by your instructor. SQL hints: INSERT INTO table_1 SELECT * FROM table_2; SELECT COUNT(*) FROM table_name; SHOW TABLE table_name; A count of rows in the Orders table is 31,200. A count of rows in the Orders_2012 table is 12,000. Page 17-80 Partitioned Primary Indexes Lab Exercise 17-1 Lab Exercise 17-1 Purpose In this lab, use Teradata SQL Assistant to create tables with primary indexes partitioned in various ways. What you need Populated DS tables and empty tables in your database Tasks 1. Using INSERT/SELECT, populate your Orders and Orders_2012 tables from the DS.Orders and DS.Orders_2012 tables, respectively. Your Orders table will have data for the years 2003 to 2011 and the Orders_2012 table will have data for 2012. Verify the number of rows in your tables. SELECT COUNT(*) FROM Orders; SELECT COUNT(*) FROM Orders_2012; 2. Count = _________ (should be 31,200) Count = _________ (should be 12,000) Use the SHOW TABLE for Orders to help create a new, similar table (same column names and definitions, etc.) named "Orders_PPI" that has a PPI based on the following: Primary Index – orderid Partitioning column – orderdate – From '2003-01-01' through '2012-12-31', partition by month – Include the NO RANGE option (the UNKNOWN option is not needed for orderdate) – Do not create any secondary indexes for this table How many partitions are logically defined for the Orders_PPI table? ______ Partitioned Primary Indexes Page 17-81 Lab Exercise 17-1 (cont.) Check your understanding of the concepts discussed in this module by completing the lab exercises as directed by your instructor. 
SQL hints: INSERT INTO table_1 SELECT * FROM table_2; SELECT FROM COUNT(*) table_name; SELECT FROM GROUP BY ORDER BY PARTITION, COUNT(*) table_name 1 1; SELECT FROM WHERE GROUP BY ORDER BY PARTITION, COUNT(*) table_name orderdate BETWEEN '2012-01-01' AND '2012-12-31' 1 1; SELECT COUNT(DISTINCT(PARTITION)) FROM table_name; Page 17-82 Partitioned Primary Indexes Lab Exercise 17-1 (cont.) 3. INSERT/SELECT all of the rows from your Orders table into the Orders_PPI table. Verify the number of rows in your table. Count = ________ How many partitions would you estimate have data at this time? ________ 4. Use the PARTITION key word to list the partitions and number of rows in various partitions. How many partitions actually have data? ________ How many rows are in each partition for the year 2003? ________ How many rows are in each partition for the year 2011? ________ 5. Use INSERT/SELECT to add the rows from the Orders_2012 table to your Orders_PPI table. Verify the number of rows in your table. Count = ________ Use the PARTITION key word to determine the number of partitions used and the number of rows in various partitions. How many partitions actually have data? ________ How many rows are in each partition for the year 2012? ________ Partitioned Primary Indexes Page 17-83 Lab Exercise 17-1 (cont.) Check your understanding of the concepts discussed in this module by completing the lab exercises as directed by your instructor. SQL hints: INSERT INTO table_1 SELECT * FROM table_2; SELECT COUNT(*) FROM table_name; SELECT FROM WHERE COUNT(DISTINCT(PARTITION)) table_name orderdate … ; SELECT FROM MAX (PARTITION) table_name; SELECT FROM GROUP BY ORDER BY PARTITION, COUNT(*) table_name 1 1; SELECT COUNT(DISTINCT(PARTITION)) FROM table_name; The PARTITION key word only returns partition numbers of partitions that contain rows. The following “canned” SQL can be used to return a list of partitions that are not used between the first and last used partitions. SELECT p + 1 AS "The missing partitions are:" FROM (SELECT p1 - p2 AS p, PARTITION AS p1, MDIFF(PARTITION, 1, PARTITION) AS p2 FROM table_name QUALIFY p2 > 1) AS temp; Page 17-84 Partitioned Primary Indexes Lab Exercise 17-1 (cont.) 6. INSERT the following row (using these values) into the Orders_PPI table. (100000, 1000, 'C', 1000, '2000-12-31', 10, 'your name', 5, 20, 'old order') How many partitions are now in Orders_PPI? ____ What is the partition number (highest partition #) of the NO RANGE OR UNKNOWN partition? ____ 7. (Optional) Create a new table named "Orders_PPI_ML" that has a Multi-level PPI based on the following: Primary Index – orderid First Level of Partitioning column – orderdate (use month ranges for all 10 years) Include the NO RANGE option for orderdate Second Level of Partitioning column – location (10 different order locations (1 through 10) Place the NO RANGE and UNKNOWN rows into the same special partition for location Unique secondary index – orderid 8. (Optional) Populate the Orders_PPI_ML table from the Orders and Orders_2012 tables using INSERT/SELECT. Verify the number of rows in Orders_PPI_ML. Count = ________ Partitioned Primary Indexes Page 17-85 Lab Exercise 17-1 (cont.) Check your understanding of the concepts discussed in this module by completing the lab exercises as directed by your instructor. 
SQL hints: INSERT INTO table_1 VALUES (value1, value2, … ); INSERT INTO table_1 SELECT * FROM table_2; SELECT COUNT(*) FROM table_name; SELECT COUNT(DISTINCT(PARTITION)) FROM table_name; Page 17-86 Partitioned Primary Indexes Lab Exercise 17-1 (cont.) 9. (Optional) For the Orders_PPI_ML table, use the PARTITION key word to answer the following questions. How many partitions actually have data? ________ What is the highest partition number? _________ What is the partition number for orders in January, 2012 and location 1? _____ What is the partition number for orders in February, 2012 and location 1? _____ Is there a difference of 11 partitions between these two months? _____ Why or why not? _________________________________________________________________ 10. (Optional) Before altering the table, verify the number of rows in Orders_PPI. Count = _______ Use the ALTER TABLE command on Orders_PPI to do the following. – – DROP RANGE (with DELETE) for year 2003 ADD RANGE for orders that will be placed in year 2013 with an interval of 1 month Use SHOW TABLE on Orders_PPI to view the altered partitioning. Use the PARTITION key word to list the partitions and the number of rows in various partitions. How many partitions currently have data rows? _______ How many rows now exist in the table? _______ Has the row count changed? ___ If the row count did not change, why not? ____________________________________________ Partitioned Primary Indexes Page 17-87 Notes Page 17-88 Partitioned Primary Indexes Module 18 Teradata Columnar After completing this module, you will be able to: Describe the components that comprise a Row ID in a column partitioned table. Identify two advantages of creating a column partitioned table. Identify two disadvantages of creating a column partitioned table. Identify the recommended way to populate a column partitioned table. Specify how rows are deleted in a column partitioned table. Teradata Proprietary and Confidential Column Partitioning Page 18-1 Notes Page 18-2 Column Partitioning Table of Contents Teradata Columnar ..................................................................................................................... 18-4 Teradata Columnar Benefits ...................................................................................................... 18-6 Columnar Join Indexes........................................................................................................... 18-6 No Primary Index Table DDL ................................................................................................... 18-8 The No Primary Index Table.................................................................................................... 18-10 Column Partition Table DDL (without Auto-Compression) ................................................... 18-12 Characteristics of a Columnar Table .................................................................................... 18-12 Column Partition Container (No Automatic Compression) ..................................................... 18-14 The Column Partition Table (without Auto-Compression) ..................................................... 18-16 CP Table Query #1 (without Auto-Compression) ................................................................... 18-18 CP Table Query #1 (without Auto-Compression) ................................................................... 18-20 Column Partition Table DDL (with Auto-Compression) ........................................................ 
18-22 Auto-Compression for CP Tables ............................................................................................ 18-24 Auto-Compression Techniques for CP Tables......................................................................... 18-26 User-Defined Compression Techniques .................................................................................. 18-28 Column Partition Container (Automatic Compression)........................................................... 18-30 The Column Partition Table (with Auto-Compression)........................................................... 18-32 CP Table Query #2 (with Auto-Compression) ........................................................................ 18-34 CP Table with Row Partitioning DDL ..................................................................................... 18-36 Determining the Column Partition Level ............................................................................. 18-36 The Column Partition Table (with Row Partitioning) ............................................................. 18-38 CP Table with Multi-Column Container DDL ........................................................................ 18-40 The CP Table with Multi-Column Container........................................................................... 18-42 CP Table Hybrid Row & Column Store DDL ......................................................................... 18-44 COLUMN Format Considerations ....................................................................................... 18-44 ROW Format Considerations ............................................................................................... 18-44 The CP Table (with Hybrid Row & Column Store) ................................................................ 18-46 Populating a CP Table.............................................................................................................. 18-48 INSERT-SELECT ................................................................................................................ 18-48 Options ................................................................................................................................. 18-48 DELETE Considerations.......................................................................................................... 18-50 The Delete Column Partition ........................................................................................... 18-50 UPDATE Considerations ......................................................................................................... 18-52 USI Access ........................................................................................................................... 18-52 NUSI Access ........................................................................................................................ 18-52 CP Table Restrictions............................................................................................................... 18-54 Summary .................................................................................................................................. 18-56 Module 18: Review Questions ................................................................................................. 18-58 Lab Exercise 18-1 .................................................................................................................... 
18-60 Column Partitioning Page 18-3 Teradata Columnar Teradata Column or Column Partitioning (CP) is a new physical database design implementation option (starting with Teradata 14.0) that allows single columns or sets of columns of a NoPI table to be stored in separate partitions. Column partitioning can also be applied to join indexes. Columnar is simply a new approach for organizing the data of a user-defined table or join index on disk. Teradata Columnar offers the ability to partition a table or join index by column. Teradata Columnar can be used alone or in combination with row partitioning in multilevel partitioning definitions. Column partitions may be stored using traditional ‘ROW’ storage or alternatively stored using the new ‘COLUMN’ storage option. In either case, columnar can automatically compress physical rows where appropriate. The key benefit in defining row-partitioned (PPI) tables is when queries access a subset of rows based on constraints on one or more partitioning columns. The major advantage of using column partitioning is to improve the performance of queries that access a subset of the columns from a table, either for predicates (e.g., WHERE clause) or projections (i.e., SELECTed columns). Because sets of one or more columns can be stored in separate column partitions, only the column partitions that contain the columns needed by the query need to be accessed. Just as row-partitioning can eliminate rows that need not be read, column partitioning eliminates columns that are not needed. The advantages of both can be combined, meaning even less data moved and thus reduced I/O. Fewer data blocks need to be read since more data of interest is packed together into fewer data blocks. Columnar makes more sense in CPU-rich environments because CPU cycles are needed to “glue” columns back together into rows, for compression and for different join strategies (mainly hash joins). Page 18-4 Column Partitioning Teradata Columnar • Description – Columnar (or Column Partitioning) is a new physical database design implementation option that allows sets of columns (including just a single column) of a table or join index to be stored in separate partitions. – This is effectively an I/O reduction feature to improve performance for suitable classes of workloads. – This allows the capability for a table or join index to be column (vertically) partitioned, row (horizontally) partitioned or both by using the already existing multilevel partitioning capability. • Considerations – Note that column partitioning is a physical database design choice and may not be suitable for all workloads using that table/join index. – It is especially suitable if both a small number of rows are selected and a few columns are projected. – When individual rows are deleted, they are not physically deleted, but are marked as deleted. Column Partitioning Page 18-5 Teradata Columnar Benefits The facing page lists a number of Teradata Columnar benefits. Columnar Join Indexes A join index can also be created as column-partitioned for either a columnar table or a noncolumnar table. Conversely, a join index can be created as non-columnar for either type of table as well. Sometime within a mixed workload, some queries might perform better if data is not column partitioned and some where it is column partitioned. Or, perhaps some queries perform better with one type of partitioning on a table, whereas other queries do better with another type of partitioning. 
Join indexes allow creation of alternate physical layouts for the data with the optimizer automatically choosing whether to access the base table and/or one of its join indexes. A column-partitioned join index must be a single-table, non-aggregate, non-compressed, join index with no primary index, and no value-ordering, and must include RowID of the base table. A column-partitioned join index may optionally be row partitioned. It may also be a sparse join index. This module will only describe and include examples of base tables that utilize column partitioning. Page 18-6 Column Partitioning Teradata Columnar Benefits Benefits of using the Teradata Columnar feature include: • Improved query performance Column partitioning can be used to improve query performance via column partition elimination. Column partition elimination reduces the need to access all the data in a row while row partition elimination reduces the need to access all the rows. • Reduced disk space The feature also allows for the possibility of using a new auto-compression capability which allows data to be automatically (as applicable) compressed as physical rows are inserted into a column-partitioned table or join index. • Increased flexibility Provides a new physical database design option to improve performance for suitable classes of workloads. • Reduced I/O Allows fast and efficient access to selected data from column partitions, thus reducing query I/O. • Ease of use Provides simple default syntax for the CREATE TABLE and CREATE JOIN INDEX statements. No change is needed to queries. Column Partitioning Page 18-7 No Primary Index Table DDL The facing page simply illustrates the DDL to create a NoPI table. This example will be as a basis for multiple examples of creating tables with various column partitioning options. Page 18-8 Column Partitioning No Primary Index Table DDL CREATE TABLE Super_Bowl (Winner CHAR(25) ,Loser CHAR(25) ,Game_Date DATE ,Game_Score CHAR(7) ,Attendance INTEGER) NO PRIMARY INDEX; NOT NOT NOT NOT NULL NULL NULL NULL In this module, we will use a example of Super Bowl history information to simply demonstrate column partitioning. Column Partitioning Page 18-9 The No Primary Index Table The No Primary Index table is shown on the facing page. Page 18-10 Column Partitioning The No Primary Index Table Partition: For NOPI tables this number is 0 HB: The lowest number hashbucket on this AMP Row # (Uniqueness ID): The row number of this row P-Bits #: Presence Bits for the nullable columns Partition HB Winner Loser Game_Date Game_Score 0 n Row # P-Bits 1 0 Dallas Cowboys Denver Broncos 01-15-1978 27-10 Attendance (null) 0 n 2 1 Chicago Bears New England Patriots 01-26-1986 46-10 73,818 0 n 3 1 Pittsburgh Steelers Arizona Cardinals 02-01-2009 27-23 70,774 0 n 4 1 Pittsburgh Steelers Minnesota Vikings 01-12-1975 16-6 80,997 0 n 5 1 Pittsburgh Steelers Seattle Seahawks 02-05-2006 21-10 68,206 0 n 6 1 New York Jets Baltimore Colts 01-12-1969 16-7 75,389 0 n 7 0 Dallas Cowboys Buffalo Bills 01-31-1993 52-17 (null) 0 n 8 1 Oakland Raiders Philadelphia Eagles 01-25-1981 27-10 76,135 0 n 9 1 San Francisco 49ers Cincinnati Bengals 01-24-1982 26-21 81,270 Collectively known as the ROWID Column Partitioning Page 18-11 Column Partition Table DDL (without AutoCompression) With column partitioning, each column or specified group of columns in the table can become a partition containing the column partition values of that column partition. 
This is the simplest partitioning approach since there is no need to define partitioning expressions, as seen in the example on the facing page. The clause PARTITION BY COLUMN specifies that the table has column partitioning. Each column of this table will have its own partition and will be (by default) in column storage since no explicit column grouping is specified. Note that a primary index is not specified since this is NoPI table. A primary index may not be specified if the table is column partitioned. Characteristics of a Columnar Table A table or join index that is partitioned by column has several key characteristics: It does not have a primary index. Each column partition can be composed of single or multiple columns. Each column partition usually consists of multiple physical rows. A new physical row format COLUMN may be utilized for a column partition. Such a physical row is called a ‘container’ and it is used to implement columnar-storage for a column partition. Alternatively, a column partition may also have traditional physical rows with ROW format. Such a physical row for columnar partitions is called a subrow. This is used to implement row-storage for a column partition. Note that in subsequent discussions, when ‘row storage’ or ‘row format’ is stated, it is referring to columnar partitioning with the ROW storage option selected. This is not to be confused with row-partitioning which we associate with a PPI table. In a table with multiple levels of partitioning, only one level may be column partitioned. All other levels must be row-partitioned (i.e., PPI). Page 18-12 Column Partitioning Column Partition Table DDL (without Auto-Compression) CREATE TABLE Super_Bowl (Winner CHAR(25) NOT NULL ,Loser CHAR(25) NOT NULL ,Game_Date DATE NOT NULL ,Game_Score CHAR(7) NOT NULL ,Attendance INTEGER) NO PRIMARY INDEX PARTITION BY COLUMN (Winner NO AUTO COMPRESS ,Loser NO AUTO COMPRESS ,Game_Date NO AUTO COMPRESS ,Game_Score NO AUTO COMPRESS ,Attendance NO AUTO COMPRESS); Defaults for a column partitioned table. • Single-column partitions; options include multicolumn partitions. • Auto compression is on; NO AUTO COMPRESS turns off auto-compression for the column. • System-determined column-store for above column partitions; options include column-store (COLUMN) or row-store (ROW). Column Partitioning Page 18-13 Column Partition Container (No Automatic Compression) In order to support columnar-storage for a column partition, a new format, referred to as a COLUMN format in the syntax, is available for a physical row. A physical row with this format is referred to as a container and each container holds a series of column partition values for a column partition. Each container is assigned a specific partition number which identifies the column or group of columns whose column partition values are held in the container. When a column partition is stored on disk, one or more containers may be needed to hold all the column partition values of the column partition. Since a container is a physical row, the size of a container is limited by the maximum physical row size. The example on the facing page assumes that NO AUTO COMPRESS has been specified for the column. Containers hold multiple values for the same column (or columns) of a table. For purposes of this explanation, the assumption is being made that each partition contains only a single column so a column partition value is the same as a column value. 
Recall that each column value belongs to a specific row and that each row is identified by a RowID consisting of a row-hash and uniqueness value. Since all of the rows on a single AMP of a NoPI table share the same row hash, the uniqueness value becomes the real differentiator. So the connection between a specific column value for a particular row on a given AMP and its uniqueness value is the key in locating the corresponding column value. Assume that a given container holds 1000 values. The RowID of each container carries a hash bucket and uniqueness which represents the first column value entry in the container. The first value’s hash bucket and uniqueness is explicit while the other values’ hash bucket and uniqueness are implicit and are understood based on their position in their container. The exact location of a column partition value is known based on its relative position within the container. For example, if the uniqueness value in the container’s RowID is 201 and a column partition value with a uniqueness value of 205 needs to be located, the 5th entry in that container is the corresponding column partition value. Page 18-14 Column Partitioning Column Partition Container (No Automatic Compression) Partition Column Store RowID HB Row # Starting row number 1’s & 0’s Auto-Compression & NULL Bits Dallas Cowboys Chicago Bears Pittsburgh Steelers Pittsburgh Steelers Column Data Pittsburgh Steelers New York Jets Dallas Cowboys Oakland Raiders San Francisco 49ers Column Container is effectively a row in the partition. Column Partitioning Page 18-15 The Column Partition Table (without AutoCompression) The result of creating a column partitioned table (as shown previously) is shown on the facing page with some sample data. The clause PARTITION BY COLUMN specifies that the table has column partitioning. Each column of this table will have its own partition and will be (by default) in column storage since no explicit column grouping is specified. The default of auto-compression is overridden for each of the columns. Page 18-16 Column Partitioning The Column Partition Table (without Auto-Compression) Column Containers Winner Loser Game_Date Game_Score Attendance Part 1-HB-Row #1 Part 2-HB-Row #1 Part 3-HB-Row #1 Part 4-HB-Row #1 Part 5-HB-Row #1 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s Dallas Cowboys Denver Broncos 01-15-1978 27-10 (null) Chicago Bears New England Patriots 01-26-1986 46-10 73,818 Pittsburgh Steelers Arizona Cardinals 02-01-2009 27-23 70,774 Pittsburgh Steelers Minnesota Vikings 01-12-1975 16-6 80,997 Pittsburgh Steelers Seattle Seahawks 02-05-2006 21-10 68,206 75,389 New York Jets Baltimore Colts 01-12-1969 16-7 Dallas Cowboys Buffalo Bills 01-31-1993 52-17 (null) Oakland Raiders Philadelphia Eagles 01-25-1981 27-10 76,135 San Francisco 49ers Cincinnati Bengals 01-24-1982 26-21 81,270 Column containers are effectively separate rows in a NoPI table. Column Partitioning Page 18-17 CP Table Query #1 (without Auto-Compression) One of the key advantages of column partitioning is opportunity for reduced I/O. This can be realized if only a subset of the columns in a table are read and if those column values are held in separate column partitions. Data is stored on disk by partition, so when partition elimination takes place, data blocks in the eliminated partitions are simply not read. 
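To make this concrete, the facing-page question could be written as the following query. This is a sketch only; the column names come from the Super_Bowl examples earlier in this module, and the comments describe the column partition elimination this design makes possible rather than a guaranteed plan.

SELECT Loser
FROM   Super_Bowl
WHERE  Winner = 'Dallas Cowboys';

-- Only two column partitions have to be read:
--   Winner  (evaluated for the WHERE predicate)
--   Loser   (projected in the SELECT list)
-- The Game_Date, Game_Score, and Attendance partitions are eliminated,
-- so their data blocks are simply not read.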
There are three ways to initiate read access to data within a column-partitioned table: A full column partition scan Indexed access (using a secondary, join index, or hash index), A RowID join. Both unique and non-unique secondary indexes are allowed on column-partitioned tables, as are join indexes and hash indexes. Queries best suited for scanning a column-partitioned table are queries that: Page 18-18 Contain one or a few predicates that are very selective in combination. Require a small enough number of columns to be accessed that the caches required to support their consolidation can be held in memory. Column Partitioning CP Table Query #1 (without Auto-Compression) Which teams have lost to the "Dallas Cowboys" in the Super Bowl? Winner Loser Game_Date Game_Score Attendance Part 1-HB-Row #1 Part 2-HB-Row #1 Part 3-HB-Row #1 Part 4-HB-Row #1 Part 5-HB-Row #1 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s Dallas Cowboys Denver Broncos 01-15-1978 27-10 (null) Chicago Bears New England Patriots 01-26-1986 46-10 73,818 Pittsburgh Steelers Arizona Cardinals 02-01-2009 27-23 70,774 Pittsburgh Steelers Minnesota Vikings 01-12-1975 16-6 80,997 Pittsburgh Steelers Seattle Seahawks 02-05-2006 21-10 68,206 75,389 New York Jets Baltimore Colts 01-12-1969 16-7 Dallas Cowboys Buffalo Bills 01-31-1993 52-17 (null) Oakland Raiders Philadelphia Eagles 01-25-1981 27-10 76,135 San Francisco 49ers Cincinnati Bengals 01-24-1982 26-21 81,270 Only the accessed columns are needed. Column Partitioning Page 18-19 CP Table Query #1 (without Auto-Compression) If indexing is not available, Teradata can access data in a CP table by scanning a column partition on all the AMPs in parallel. In the example on the facing page, the “Winner” column containers are scanned for “Dallas Cowboys”. The following describes the scanning of CP data: 1. Columns within the table definition that aren’t referenced in the query are ignored due to partition elimination. 2. If there is a predicate column in the query, its column partition is read. 3. Values within the predicate column partition are examined and compared against the value passed in the query WHERE clause. 4. Each time a qualifying value is located, the next step is building up a row for the output spool. 5. All the column partition values for a logical row have the same RowID except for the column partition number. The RowID associated with each predicate column value that matches the constraint in the query becomes the link to other column partition values of the same logical row by simply modifying the column partition number of the RowID to the column partition number for each of these other column partition values. If there is more than one predicate column in the query that can be used to disqualify rows, the column for one of these predicates is chosen and its column partition is scanned. Statistics, as well as other heuristics, are used by the optimizer to pick the most selective and least costly predicate. Once that decision has been made, only that single column partition needs to be scanned. If there are no useful predicate columns in the query (for instance, OR’ed predicates), one column partition is chosen to be scanned and for each of its column partition values additional corresponding column partition values are accessed until either predicate evaluation disqualifies the logical row or all the projected column values have been retrieved and brought together to form rows for the output spool. 
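As a hedged illustration of the multi-predicate case described above (column names from the earlier examples; which partition actually drives the scan depends on statistics and optimizer costing):

SELECT Loser, Game_Date
FROM   Super_Bowl
WHERE  Winner     = 'Pittsburgh Steelers'
  AND  Game_Score = '27-23';

-- Only one of the two predicate column partitions is chosen to drive the
-- scan (ideally the more selective one).  The qualifying RowIDs then drive
-- access to the other predicate partition and to the projected Loser and
-- Game_Date partitions.
-- Collecting statistics on the predicate columns, for example
--   COLLECT STATISTICS COLUMN (Winner) ON Super_Bowl;
-- helps the optimizer pick the most selective partition to scan.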
Page 18-20 Column Partitioning CP Table Query #1 (without Auto-Compression) Which teams have lost to the "Dallas Cowboys" in the Super Bowl? (1) (7) Winner Loser Part 1-HB-Row #1 Part 2-HB-Row #1 1’s & 0’s 1’s & 0’s Dallas Cowboys (1) Denver Broncos Chicago Bears New England Patriots Pittsburgh Steelers Arizona Cardinals Pittsburgh Steelers Minnesota Vikings Pittsburgh Steelers Seattle Seahawks New York Jets Baltimore Colts Dallas Cowboys (7) Buffalo Bills Oakland Raiders Philadelphia Eagles San Francisco 49ers Cincinnati Bengals The relative row number in each container is used to access the data. Column Partitioning Page 18-21 Column Partition Table DDL (with Auto-Compression) The DDL to create a column partitioned table with auto-compression is shown on the facing page. Each column will be maintained in a separate partition. The clause PARTITION BY COLUMN specifies that the table has column partitioning. Each column of this table will have its own partition and will be (by default) in column storage since no explicit column grouping is specified. Page 18-22 Column Partitioning Column Partition Table DDL (with Auto-Compression) CREATE TABLE Super_Bowl (Winner CHAR(25) ,Loser CHAR(25) ,Game_Date DATE ,Game_Score CHAR(7) ,Attendance INTEGER) NO PRIMARY INDEX PARTITION BY COLUMN; NOT NOT NOT NOT NULL NULL NULL NULL Note: Auto Compression is on by Default. Column Partitioning Page 18-23 Auto-Compression for CP Tables Auto-compression is a completely transparent compression option for column partitions. It is applied to a container when a container is full after appending some number of column partition values without auto-compression by an INSERT or UPDATE statement. Each container is assessed separately to see how, and if, it can be compressed. Several available compression techniques are considered for compressing a container but, unless there is some size reduction, no compression is performed. If a container is compressed, the needed data is automatically uncompressed as it is read. Auto-compression happens automatically and is most effective when the column partition is based on a single column only, and less effectively as more columns are included in the column partition. User-defined compression, such as multi-value or algorithmic compression that is already defined by the user is honored and carried forward if it helps compress the container. If block level compression is specified, it applies for data blocks holding the physical rows of the table independent of whether auto-compression is applied or not. Page 18-24 Column Partitioning Auto-Compression for CP Tables Auto Compression • When a column partition is defined to have auto-compression (i.e., the NO AUTO COMPRESS option is not specified), data is compressed by the system as physical rows that are inserted into a column-partitioned table or join index. • For some values, there is no applicable compression technique that reduces the size of the physical row and the system will determine not to compress the values for that physical row. • The system decompresses any compressed column-partition values when they are retrieved. • Auto-compression is most effective for a column partition with a single column and COLUMN format. • There is overhead in determining whether or not a physical row is to be compressed and, if it is to be compressed, what compression techniques are to be used. • This overhead can be eliminated by specifying the NO AUTO COMPRESS option for the column partition. 
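The NO AUTO COMPRESS option can also be applied selectively rather than table-wide. The sketch below assumes the column grouping list accepts per-partition options as in the earlier examples; here auto-compression is kept for every partition except Attendance, whose values are judged (hypothetically) not worth the compression attempt.

CREATE TABLE Super_Bowl
  (Winner      CHAR(25)  NOT NULL
  ,Loser       CHAR(25)  NOT NULL
  ,Game_Date   DATE      NOT NULL
  ,Game_Score  CHAR(7)   NOT NULL
  ,Attendance  INTEGER)
NO PRIMARY INDEX
PARTITION BY COLUMN
  (Winner                          -- auto-compression (default)
  ,Loser                           -- auto-compression (default)
  ,Game_Date                       -- auto-compression (default)
  ,Game_Score                      -- auto-compression (default)
  ,Attendance NO AUTO COMPRESS);   -- skip the compression attempt here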
Column Partitioning Page 18-25 Auto-Compression Techniques for CP Tables The facing page lists and briefly describes each of the auto-compression techniques that Teradata may utilize. Page 18-26 Column Partitioning Auto-Compression Techniques for CP Tables • Run-Length Encoding Each series of one or more column-partition values that are the same are compressed by having the column-partition value occur once with an associated count of the number of occurrences in the series. • Local Dictionary Compression This is similar to a user-specified value-list compression for a column. Often occurring column-partition values within a physical row are placed in a value-list dictionary local to the physical row. • Trim Compression Trim high-order zero bytes of numeric values and trailing pad bytes of character and byte values with bits to indicate how many bytes were trimmed or what the length is after trimming. • Null Compression Similar to null compression (COMPRESS NULL) for a column except applied to a column-partition value. A single-column or multicolumn-partition value is a candidate for null compression if all the column values in the column-partition value are null (this also means all these columns must be nullable). • Delta on Mean Compression Delta on Mean compression computes the mean/average of all the values in the column container. This mean value is saved and stored in the container. After Delta on Mean compression, the value that is stored for a row is the difference with the mean. So for instance, if the average is say 100 and the four values in four different rows are 99, 98, 101, and 102. Then the values stored are -1, -2, 1, and 2. • Unicode to UTF8 Compression For a column defined with UNICODE character set but where the value consists of ASCII characters, compress the Unicode representation (2 bytes per character) to UTF8 (1 byte per character). Column Partitioning Page 18-27 User-Defined Compression Techniques User-defined compression, such as multi-value or algorithmic compression that is already defined by the user is honored and carried forward if it helps compress the container. If block level compression is specified, it applies for data blocks holding the physical rows of the table independent of whether auto-compression is applied or not. Note that auto-compression is applied locally to a container based on column partition values (which may be multicolumn) while user-specified MVC and ALC are applied globally for a column and are applicable to both containers and subrows. Auto-compression is differentiated from block level compression in several key ways: Auto-compression requires no parameter setting, but rather is completely transparent to the user while block level compression is a result of the appropriate settings of parameters. Auto-compression acts on a container (a physical row) while block level compression acts on a data block (which consists of one or more physical rows). Decompressing a column partition value in a container has little overhead while software-based block level compression incurs noticeable decompression overhead. Only column partition values that are needed by the query are decompressed. BLC has to decompress the entire data block even if only one or a few values are needed from the data block. Determining the auto-compression to use for a container, compressing a container, and compressing additional values to be inserted into the container can cause an increase in the CPU needed for appending values to column partitions. 
You can expect additional CPU to be required when inserting rows into a CP table that uses auto-compression. This is similarly to the increase in CPU if MVC or ALC compression is added to the columns. Page 18-28 Column Partitioning User-Defined Compression Techniques All the current compression techniques available in Teradata today, can be leveraged and used for column partitioned tables. • Dictionary-Based Compression: Allows end-users to identify and target specific values that would be compressed in a given column. Also known as, Multi-Value Compression. • Algorithmic Compression: Allows users the option of defining compression/decompression algorithms that would be implemented as UDFs and that would be specified and applied to data at the column level in a row. Teradata provides three compression/decompression algorithms that offer compression for UNICODE and LATIN data columns. • Block-Level Compression: Feature provides the capability to perform compression on whole data blocks at the file system level before the data blocks are actually written to storage. Column Partitioning Page 18-29 Column Partition Container (Automatic Compression) In order to support columnar-storage for a column partition, a new format, referred to as a COLUMN format in the syntax, is available for a physical row. The example on the facing page assumes that automatic compression is on for the column. Page 18-30 Column Partitioning Column Partition Container (Automatic Compression) Partition Column Store RowID HB Row # Starting row number 1’s & 0’s Auto-Compression & NULL Bits Dallas Cowboys Chicago Bears Pittsburgh Steelers (3) New York Jets Column Data Dallas Cowboys Oakland Raiders San Francisco 49ers Column (Local) Compression Dictionary Column Container is effectively a row in the partition. Column Partitioning Page 18-31 The Column Partition Table (with Auto-Compression) The result of creating a column partitioned table with auto-compression is shown on the facing page. Page 18-32 Column Partitioning The Column Partition Table (with Auto-Compression) Winner Loser Game_Date Game_Score Attendance Part 1-HB-Row #1 Part 2-HB-Row #1 Part 3-HB-Row #1 Part 4-HB-Row #1 Part 5-HB-Row #1 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s Denver Broncos Denver Broncos 01-15-1978 46-10 27-10 73,818 (Null) New England Patriots New England Patriots 01-26-1986 27-23 46-10 70,774 73,818 Pittsburgh Pittsburgh Steelers Steelers (3) Arizona Cardinals Arizona Cardinals 02-01-2009 27-23 16-6 80,997 70,774 Pittsburgh New York Steelers Jets Minnesota Vikings Minnesota Vikings 01-12-1975 21-10 16-6 68,206 80,997 Pittsburgh Dallas Cowboys Steelers Seattle Seahawks Seattle Seahawks 02-05-2006 21-10 16-7 75,389 68,206 Baltimore Baltimore Colts Colts 01-12-1969 52-17 16-7 76,135 75,389 Bills Buffalo Buffalo Bills 01-31-1993 52-17 26-21 81,270 (Null) Oakland Raiders Philadelphia Eagles Philadelphia Eagles 01-25-1981 27-10 76,135 San Francisco 49ers Cincinnati Bengals Cincinnati Bengals 01-24-1982 26-21 81,270 Dallas Cowboys Chicago Bears Oakland New York Raiders Jets San Dallas Francisco Cowboys 49ers Run-Length Encoding Trim Trailing Spaces No Compression Local Dictionary Compression (27-10 is compressed) Null Compression Columnar compression is based on each Container. Therefore, each Container may have different compression characteristics and even different compression methods. 
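Since user-specified compression is honored alongside column partitioning, multi-value and null compression can be declared in the column definitions of a CP table. This is a sketch only; the compressed value list below is purely illustrative.

CREATE TABLE Super_Bowl
  (Winner      CHAR(25)  NOT NULL
  ,Loser       CHAR(25)  NOT NULL
  ,Game_Date   DATE      NOT NULL
  ,Game_Score  CHAR(7)   NOT NULL
               COMPRESS ('27-10', '16-7')  -- multi-value compression (illustrative value list)
  ,Attendance  INTEGER   COMPRESS)         -- COMPRESS with no list compresses NULLs
NO PRIMARY INDEX
PARTITION BY COLUMN;

-- The MVC and null compression are applied globally to these columns;
-- auto-compression is still attempted locally, container by container.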
Column Partitioning Page 18-33 CP Table Query #2 (with Auto-Compression) The Pittsburgh Steelers team was compressed, but effectively represented 3 values in the container. These 3 values correspond to 3, 4, and 5 in the other container (Loser column). Page 18-34 Column Partitioning CP Table Query #2 (with Auto-Compression) Which teams have lost to the "Pittsburgh Steelers" in the Super Bowl? Winner Loser Part 1-HB-Row #1 Part 2-HB-Row #1 1’s & 0’s 1’s & 0’s Dallas Cowboys Denver Broncos Chicago Bears (3, 4, 5) Pittsburgh Steelers (3) New York Jets Dallas Cowboys Oakland Raiders San Francisco 49ers New England Patriots (3) Arizona Cardinals (4) Minnesota Vikings (5) Seattle Seahawks Baltimore Colts Buffalo Bills Philadelphia Eagles Cincinnati Bengals Column Partitioning Page 18-35 CP Table with Row Partitioning DDL Row partitioning can be combined with column partitioning on the same table. This allows queries to read only non-eliminated combined partitions. Such partitions are defined by the intersection of the columns referenced in the query and any partitioning column selection criteria. There is usually an advantage to putting the column partitioning at level one of the combined partitioning scheme. The DDL to create a column partitioned table with auto-compression and Row partitioning is shown on the facing page. Determining the Column Partition Level It is initially recommended that column partitioning either be defined as the first level or, if not as the first level, at least as the second level. When column partitioning is defined as the first level it is easier for the file system to locate related data that is from the same logical row of the table. When column partitioning is defined at a lower level, more boundary checks have to be made, possibly causing an impact on performance. If you are inserting a new table row, it takes more effort if the column partitioning is not the first level. Values of columns from the newly-inserted table row need to be appended at the end of each column partition. If column-partitioning is not first, it is necessary to read through several combined partitions to find the correct container that represents the end point. On the other hand, if you have row partitioning at the second or a lower level partitioning level so that column partitioning can be at the first level, this can be less efficient when row partition elimination based on something like a date range is taking place. Page 18-36 Column Partitioning CP Table with Row Partitioning DDL CREATE TABLE Super_Bowl (Winner CHAR(25) NOT NULL ,Loser CHAR(25) NOT NULL ,Game_Date DATE NOT NULL ,Game_Score CHAR(7) NOT NULL ,City CHAR(40)) NO PRIMARY INDEX PARTITION BY (COLUMN ,RANGE_N(Game_Date BETWEEN DATE '1960-01-01' and DATE '2059-12-31' EACH INTERVAL '10' YEAR)); Note: Auto Compression is on by Default. Column Partitioning Page 18-37 The Column Partition Table (with Row Partitioning) The result of creating a column partitioned table with auto-compression and row partitioning is shown on the facing page. Page 18-38 Column Partitioning The Column Partition Table (with Row Partitioning) • In the 1970s, which teams won Super Bowls, who were the losing teams , and what was the date the game was played? 
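Expressed as SQL against the row-and-column partitioned definition above, the question might look like the following sketch (the exact date range is an assumption about what "the 1970s" means here):

SELECT Winner, Loser, Game_Date
FROM   Super_Bowl
WHERE  Game_Date BETWEEN DATE '1970-01-01' AND DATE '1979-12-31';

-- Row partition elimination limits the scan to the RANGE_N partitions that
-- cover the 1970s, and column partition elimination limits it to the
-- Winner, Loser, and Game_Date column partitions, as suggested by the
-- containers shown on the facing page.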
Winner Loser Game_Date Game_Score City Part 1-HB-Row #1 Part 2-HB-Row #1 Part 3-HB-Row #1 Part 4-HB-Row #1 Part 5-HB-Row #1 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s New York Jets Baltimore Colts 01-12-1969 16-7 Miami, FL Winner Loser Game_Date Game_Score Attendance Part 11-HB-Row #1 Part 12-HB-Row #1 Part 13-HB-Row #1 Part 14-HB-Row #1 Part 15-HB-Row #1 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s Dallas Cowboys Denver Broncos 01-15-1978 27-10 New Orleans, LA Pittsburgh Steelers Minnesota Vikings 01-12-1975 16-6 Winner Loser Game_Date Game_Score Attendance Part 41-HB-Row #1 Part 42-HB-Row #1 Part 43-HB-Row #1 Part 44-HB-Row #1 Part 45-HB-Row #1 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s 1’s & 0’s Pittsburgh Steelers Seattle Seahawks 02-05-2006 21-10 68,206 Column Partitioning Page 18-39 CP Table with Multi-Column Container DDL When a table is defined with column partitioning, by default each column becomes its own column partition. However, it is possible to group multiple columns into a single partition. This has the result of fewer column partitions with more data held within each column partition. Grouping columns into fewer column partitions may be appropriate in these situations: When the table has a large number of columns (having fewer column partitions may improve the performance of INSERT-SELECT and UPDATE statements). When access to the table often involves a large percentage of the columns and the access is not very selective. When a common subset of columns are frequently accessed together. When a multicolumn NUSI is created on a group of columns. There are too few available column partition contexts to access all the needed column partitions for queries. Note that auto-compression will probably be less effective if columns are grouped together instead of being in their own column partitions. Page 18-40 Column Partitioning CP Table with Multi-Column Container DDL CREATE TABLE Super_Bowl ( Winner CHAR(25) NOT NULL ,Loser CHAR(25) NOT NULL ,Game_Date DATE NOT NULL ,Game_Score CHAR(7) NOT NULL ,Attendance INTEGER) NO PRIMARY INDEX PARTITION BY COLUMN (Winner NO AUTO COMPRESS Recommendation: ,Loser NO AUTO COMPRESS The group of multiple ,(Game_Date columns should be less ,Game_Score than 256 bytes. ,Attendance) NO AUTO COMPRESS) ; Note that this example is without Auto-Compression. Watch the difference between 'Projection' and 'Predicate'. If you are always projecting three columns, it may make sense to group these columns into one Container. If, however, one of these columns is used in a WHERE Predicate, then it may be better to place this column into its own Container. Column Partitioning Page 18-41 The CP Table with Multi-Column Container The example on the facing page illustrates a CP table that has a multi-column container. 
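As a hedged sketch of the projection-versus-predicate point above, consider a query against the grouping defined on the previous page, where Game_Date, Game_Score, and Attendance share one multi-column partition:

SELECT Game_Date, Game_Score
FROM   Super_Bowl
WHERE  Winner = 'Pittsburgh Steelers';

-- The single-column Winner partition is scanned for the predicate.
-- Because Game_Date and Game_Score live in the multi-column container,
-- the whole (Game_Date, Game_Score, Attendance) group is read to project
-- them, even though Attendance is not needed by the query.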
Page 18-42 Column Partitioning The CP Table with Multi-Column Container Single Column Containers Winner Loser Part 1-HB-Row #1 Part 2-HB-Row #1 Multi-Column Container Game_Date Game_Score Attendance Part 3-HB-Row #1 1’s & 0’s 1’s & 0’s Dallas Cowboys Denver Broncos 01-15-1978 1’s & 0’s 27-10 (null) Chicago Bears New England Patriots 01-26-1986 46-10 73,818 Pittsburgh Steelers Arizona Cardinals 02-01-2009 27-23 70,774 Pittsburgh Steelers Minnesota Vikings 01-12-1975 16-6 80,997 Pittsburgh Steelers Seattle Seahawks 02-05-2006 21-10 68,206 75,389 New York Jets Baltimore Colts 01-12-1969 16-7 Dallas Cowboys Buffalo Bills 01-31-1993 52-17 (null) Oakland Raiders Philadelphia Eagles 01-25-1981 27-10 76,135 San Francisco 49ers Cincinnati Bengals 01-24-1982 26-21 81,270 General recommendations: • If you have a lot of columns in a table, then multi-column containers may be needed. • Multi-column containers will not compress as well as single-column containers. • If you select any column in a multi-column container you will get all of the other columns. Column Partitioning Page 18-43 CP Table Hybrid Row & Column Store DDL The example on the facing page illustrates the DDL to create a column partitioned table that has a combination of row and column storage. COLUMN Format Considerations The COLUMN format packs column partition values into a physical row, referred to as a container, up to a system-determined limit. Whether or not to change a column partition to use ROW format depends on the whether the benefit of row header compression and autocompression can be realized or not. A row header occurs once per container, with the RowID of the first column partition value becoming the RowID of the container itself. In a column-partitioned table, each column partition value is assigned its own RowID, but in a container these RowIDs are implicit except for the first one specified in the header. The uniqueness value can be determined from the position of a column partition value relative to the first column partition value. Thus the row id for each value in the container is implicitly available and an explicit RowID does not need be carried for each individual column value in the container. If many column partition values can be packed into a container, this form of compression (referred to as row header compression) can reduce the space needed for a columnpartitioned table compared to the table without column partitioning. If only a few column partition values (because they are wide) can be placed in a container, there can actually be an increase in the space needed for the table compared to the table without column partitioning. In this case, ROW format may be more appropriate. ROW Format Considerations A subrow, on the other hand, has a format that is the same as traditional row (except it only has the values of a subset of the columns). Subrows are appropriate when column partition values are very wide and you expect only one or a few values to fit in a columnar container. In this case, auto-compression and row header compression using COLUMN format might not be very effective. ROW format provides quicker access to specific values because no search through a physical row is required to find only one value. Each column partition value is in it owns subrow with a row header. Subrows are not subject to auto-compression but may be in the future. 
Page 18-44 Column Partitioning CP Table Hybrid Row & Column Store DDL CREATE TABLE Super_Bowl (Winner CHAR(25) NOT NULL ,Loser CHAR(25) NOT NULL ,Game_Date DATE NOT NULL ,Game_Score CHAR(7) NOT NULL ,Attendance INTEGER ,City CHAR(40)) NO PRIMARY INDEX PARTITION BY COLUMN (Winner NO AUTO COMPRESS ,Loser NO AUTO COMPRESS ,ROW (Game_Date ,Game_Score ,Attendance ,City) NO AUTO COMPRESS); This example illustrates the syntax to create a row store, but in reality you would only define the row format if the set of columns was greater than 256 bytes. General recommendation: • A column (or set of columns) should be at least 256 bytes wide before considering ROW format. • Row stores will take up more space, but act like a row in terms of retrieving data. • Each row will have a row header and require more space. Column Partitioning Page 18-45 The CP Table (with Hybrid Row & Column Store) As an alternative to COLUMN format, column partition values may be held in a physical row using what is known in Teradata syntax as ROW format. The type of physical row supports row-storage for a column partition and is referred to as a subrow. Each subrow holds one column partition value for a column partition. A subrow has the same format as regular row except that it is generally a subset of the columns for a table row instead of all the columns. Just like a container, each subrow is assigned to a specific partition. One or more subrows may be needed to hold the entire column partition. Since a subrow is a physical row, the size of a subrow is limited by the maximum physical row size. A column partition may have COLUMN format or ROW format but not a mix of both. However, different column partitions in column-partitioned table may have different formats. Page 18-46 Column Partitioning The CP Table (with Hybrid Row & Column Store) Column and Row Store in "one" table. Column Store Containers Winner Loser Part 1-HB-Row #1 Part 2-HB-Row #1 Row Store 1’s & 0’s 1’s & 0’s Partition HB Row # Game_Date Dallas Cowboys Denver Broncos 0 n 1 01-15-1978 Game_Score Attendance 27-10 (null) New Orleans, LA City New Orleans, LA Chicago Bears New England Patriots 0 n 2 01-26-1986 46-10 73,818 Pittsburgh Steelers Arizona Cardinals 0 n 3 02-01-2009 27-23 70,774 Tampa, FL Pittsburgh Steelers Minnesota Vikings 0 n 4 01-12-1975 16-6 80,997 New Orleans, LA Pittsburgh Steelers Seattle Seahawks 0 n 5 02-05-2006 21-10 68,206 Detroit, MI New York Jets Baltimore Colts 0 n 6 01-12-1969 16-7 75,389 Miami, FL Dallas Cowboys Buffalo Bills 0 n 7 01-31-1993 52-17 (null) Pasadena, CA Oakland Raiders Philadelphia Eagles 0 n 8 01-25-1981 27-10 76,135 New Orleans, LA San Francisco 49ers Cincinnati Bengals 0 n 9 01-24-1982 26-21 81,270 Pontiac, MI Column Partitioning Page 18-47 Populating a CP Table INSERT-SELECT INSERT-SELECT is the expected and most efficient method of loading data into a columnpartitioned table. If the data originates from an external source, FastLoad can be used to load it into a staging table from which the INSERT-SELECT can take place. If the source was a SELECT that included several joins and as a result skewed data was produced, options can be added to the INSERT-SELECT statement to avoid a skewed column-partitioned table and improve the effectiveness of auto-compression: Options HASH BY (RANDOM or hash_spec_list): The selected rows are redistributed by the hash value of the expressions in the hash_spec_list. Alternatively, HASH BY RANDOM can be specified to have data blocks redistributed randomly. 
It is important that a column or columns be selected that distributes well if the HASH BY option is used. LOCAL ORDER BY: A local sort is done on each AMP before physically storing the rows. This could help autocompression to be more effective by ensuring that like values of the sorting columns appear together. During an INSERT-SELECT process, each source row is read, and its columns individually appended to the column partitions to which they belong. As many column partition values as can fit are built up simultaneously in memory, and written out to disk when the buffer is filled. If the column-partitioned table being loaded has a large number of columns, additional passes of the source table may be required to append all of the columns to their respective column partitions. Page 18-48 Column Partitioning Populating a CP Table CREATE TABLE Super_Bowl_Staging (Winner CHAR(25) NOT NULL ,Loser CHAR(25) NOT NULL ,Game_Date DATE NOT NULL ,Game_Score CHAR(7) NOT NULL ,Attendance INTEGER) NO PRIMARY INDEX; CREATE TABLE Super_Bowl (Winner CHAR(25) ,Loser CHAR(25) ,Game_Date DATE ,Game_Score CHAR(7) ,Attendance INTEGER) NO PRIMARY INDEX PARTITION BY COLUMN; NOT NOT NOT NOT NULL NULL NULL NULL 1. Load data into staging table. 2. INSERT INTO Super_Bowl ….. SELECT * FROM Super_Bowl_Staging … Column Partitioning Page 18-49 DELETE Considerations Rows can be deleted from a column-partitioned table using the DELETE ALL, or selectively using DELETE. DELETE ALL uses the standard fast-path delete as would be done on a primary-indexed table. If a column-partitioned table also happens to include row partitioning, the same fast-path delete can be applied to one or more row partitions. Space is immediately reclaimed. The selective DELETE, in which only one or a few rows of the table are deleted, requires a scan of a column partition or indexed access to the column-partitioned table. In this case, the row being deleted is not physically removed, but only flagged as having been deleted. The space taken by a row being deleted is scattered across multiple column partitions and is not reclaimed at the time of the deletion. This form of delete should only be used to delete a small percentage of rows. During a delete operation, all large objects are immediately deleted, as are entries in secondary indexes. Join indexes are updated to reflect the change as it happens. The Delete Column Partition Each column-partitioned table has one delete column partition, in addition to the userspecified column partitions. It holds information about deleted rows so they do not get included in an answer set. When a single row delete takes place in a column-partitioned table, rather than removing each deleted value across all the column partitions, which would involve multiple physical row updates, a single action is performed. One bit in the delete column partition is set as an indication that the hash bucket and uniqueness associated with the table row has been deleted. This delete column partition is accessed any time a query is made against a columnpartitioned table without indexed access. At the time a column partition is scanned, the delete column partition is checked to make sure a table row being requested by the query has not been deleted (if it has, the value is skipped). This additional partition access can be observed in the EXPLAIN text. Page 18-50 Column Partitioning DELETE Considerations • DELETE ALL uses the standard fast-path delete as would be done on a primary-indexed table. 
– If a CP table also happens to include row partitioning, the same fast-path delete can be applied to one or more row partitions. Space is immediately reclaimed. • The selective DELETE, in which only one or a few rows of the table are deleted, requires a scan of a column partition or indexed access to the column-partitioned table. – In this case, the row being deleted is not physically removed, but only flagged as having been deleted. – This form of delete should only be used to delete a small percentage of rows. • The Delete Column Partition - each column-partitioned table has one delete column partition, in addition to the user-specified column partitions. It holds information about deleted rows so they do not get included in an answer set. – One bit in the delete column partition is set as an indication that the hash bucket and uniqueness associated with the table row has been deleted. Column Partitioning Page 18-51 UPDATE Considerations Updating rows in column partitioned table requires a delete and an insert operation. It involves marking the appropriate bit in the delete column partition, and then re-inserting columns for the new updated version of the table row. The cost of this update is less severe than a Primary Index update (also a delete plus insert) because in the column-partitioned table update, the deletion and reinsertion takes place on the same AMP. The part of the update that re-inserts a new table row is essentially a re-append. The highest uniqueness value on that AMP is incremented by one, and all the column values for that updated row are appended to their corresponding column partitions. Because multiple I/Os are performed in doing this re-append, row-at-a-time updates on column-partitioned tables should be approached with caution. The space that is being used by the old row is not reclaimed, but a delete bit is turned on in the delete column partition, indicating that the old version of the row is obsolete. An UPDATE statement should only be used to update a small percentage of rows. USI Access For example, consider a unique secondary index (USI) access. The USI subtable provides the specific RowID of the base table row. In the columnar case, the base table row is located on a specific AMP which can be stored in multiple containers. As it is for a PI or NoPI tables, the hash bucket in the RowID carried in the USI is used to locate the AMP that contains the base table row. The column values from the multiple containers are located the same as using a RowID retrieved from a NUSI which is described below. NUSI Access With non-unique secondary indexes (NUSIs), a row-id list is retrieved from the NUSI subtable. In the case of a column-partitioned table, the table row has been decomposed into columns that are located in different column partitions on disk. Several different internal partition numbers come into play in reconstructing the table row. Rather than relying on the column partition number, it is only the hash bucket and uniqueness that is of importance in the NUSI subtable RowID list. The hash bucket and uniqueness identifies the table row on that AMP, while the column partition number plays no meaningful role. Because column partition numbers in the NUSI subtable are not useful in the case of a columnpartitioned table, all RowIDs in a NUSI carry a column partition number of 1. The hash bucket and uniqueness value are the important link between the NUSI subtable and the column values of interest for the query. 
Once the hash bucket and uniqueness value is known, a RowID is formulated using the internal partition number of the column of interest. This RowID is then used to read the containing physical row/container. A relative position is determined using the uniqueness value which is then used to access the appropriate column value. This process is repeated to locate the column value of each of the remaining columns for this row. These individual column values are brought together to formulate the row. This process occurs for each RowID in the NUSI subtable entry. Page 18-52 Column Partitioning UPDATE and USI/NUSI Considerations UPDATE Considerations • Updating rows in column partitioned table requires a delete and an insert operation. • It involves marking the appropriate bit in the delete column partition, and then reinserting columns for the new updated version of the table row. • An UPDATE statement should only be used to update a small percentage of rows. USI/NUSI Considerations • For a USI on a CP table, the base table row is located on a specific AMP which can be stored in multiple containers. The hash bucket in the RowID carried in the USI is used to locate the AMP that contains the base table row. • For a NUSI on a CP table, the table row has been decomposed into columns that are located in different column partitions on disk. Several different internal partition numbers come into play in reconstructing the table row. • Rather than relying on the column partition number, it is only the hash bucket and uniqueness that is of importance in the NUSI subtable RowID list. Column Partitioning Page 18-53 CP Table Restrictions The following limitations apply: Column partitioning for join indexes is restricted to single-table, non-aggregate, non-compressed join indexes with no PI and no ORDER BY clause – ROWID of base table must be included in a CP join index Column partitioning is not allowed for the following: – Primary index (PI) base tables – Global temporary, volatile, and queue tables – Secondary indexes Column partitioning is not applicable for the following: – Global temporary trace tables – Error tables – Compressed join indexes NoPI table with only row partitioning is not allowed A column cannot be specified to be in more than one column partition Column grouping cannot be specified in both the column definition list of CREATE TABLE statement and in the COLUMN clause. Column grouping cannot be specified in both the select list of a CREATE JOIN INDEX statement and in the COLUMN clause. Page 18-54 Column Partitioning CP Table Restrictions • Column Partitioning is predicated on the NoPI table structure and as such the following restrictions apply: – – – – – – – – – Set Tables Queue tables Global Temporary Tables Volatile Tables Derived Tables Multi-table or Aggregate Join Index Compressed Join Index Hash Index Secondary Index are not column partitioned • Column Partitioned tables cannot be loaded with either the FastLoad or the MultiLoad utilities. • Merge-Into and UPSERT statements are not supported. • Population of Column Partition tables will require an Insert-Select process after data has been loaded into a staging table. • No synchronized scanning with Columnar Tables. • Since Columnar Tables are NoPI Tables, Teradata cannot use Full Cylinder Reads. Column Partitioning Page 18-55 Summary The facing page contains a summary of key concepts from this module. Page 18-56 Column Partitioning Summary When is column partitioning useful? 
• Queries access varying subsets of the columns of table or Queries of the table are selective (Best if both occur for queries) • For example, ad hoc, data analytics • Data can be loaded with large INSERT-SELECTs • There is no or little update/delete maintenance between refreshes or appends of the data for the table or for row partitions. Do NOT use this feature when: • Queries need to be run on current data that is changing (deletes and updates). • Performing tactical queries or OLTP queries. • Workload is CPU bound such that a trade-off of reduced I/O with increased CPU does not improve the throughput. – Column partitioning is not intended to be a CPU savings feature. Column Partitioning Page 18-57 Module 18: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 18-58 Column Partitioning Module 18: Review Questions 1. Which two choices apply to Column Partitioning? a. SET table b. NoPI table c. Table with multi-level partitioning d. Table with existing row partitioning 2. What are two benefits of Column Partitioning? a. b. c. d. Reduced Reduced Reduced Reduced I/O CPU disk space usage tactical query response times 3. True or False? Deleting a row in a column partitioned table will reclaim table space. 4. True or False? In a multi-level partitioned table, only one level may be column partitioned. 5. True or False? The preferred way to load a columnar table is using INSERT/SELECT. Column Partitioning Page 18-59 Lab Exercise 18-1 Check your understanding of the concepts discussed in this module by completing the lab exercises as directed by your instructor. SQL hints: INSERT INTO table_1 SELECT * FROM table_2; SELECT COUNT(*) FROM table_name; SHOW TABLE table_name; A count of rows in the Orders table is 31,200. A count of rows in the Orders_2012 table is 12,000. Page 18-60 Column Partitioning Lab Exercise 18-1 Lab Exercise 18-1 Purpose In this lab, you will use Teradata SQL Assistant to create tables with column partitioning in various ways. What you need Populated DS tables and Orders and Orders_2012 tables in your database Tasks 1. Use the SHOW TABLE for Orders to help create a new, similar table (same column names and definitions, etc.) that does NOT have a primary index and name this table "Orders_NoPI". 2. Populate the Orders_NoPI table (via INSERT/SELECT) with all of the rows from the DS.Orders and DS.Orders_2012 tables. Verify the number of rows in your table. Count = ________ (count should be 43,200) 3. Use the SHOW TABLE for Orders_NoPI to create a new column partitioned table named "Orders_CP" based on the following: – Each column is created as a separate partition – Utilize auto compression for every column Populate the Orders_CP table (via INSERT/SELECT) from the Orders_NoPI table. Column Partitioning Page 18-61 Lab Exercise 18-1 (cont.) Check your understanding of the concepts discussed in this module by completing the lab exercises as directed by your instructor. SQL hints: INSERT INTO table_1 SELECT * FROM table_2; SELECT COUNT(*) FROM table_name; SELECT COUNT(DISTINCT(PARTITION)) FROM table_name; Page 18-62 Column Partitioning Lab Exercise 18-1 (cont.) 4. Verify the number of rows in your table. Count = ________ (count should be 43,200) How many partitions actually have data? ________ Note: The table only has 1 logical partition. 5. 
Use the SHOW TABLE for Order_CP to create a new column partitioned table named "Orders_CP_noAC" based on the following: – Each column is created as a separate partition – Turn off auto compression for every column Populate the Orders_CP_noAC table (via INSERT/SELECT) from the Orders_NoPI table. Column Partitioning Page 18-63 Lab Exercise 18-1 (cont.) Check your understanding of the concepts discussed in this module by completing the lab exercises as directed by your instructor. SQL hints: INSERT INTO table_1 SELECT * FROM table_2; SELECT COUNT(*) FROM table_name; SELECT COUNT(DISTINCT(PARTITION)) FROM table_name; SELECT FROM WHERE AND GROUP BY ORDER BY Page 18-64 TableName, SUM(CurrentPerm) DBC.TableSizeV DatabaseName = DATABASE TableName In ('tablename1', 'tablename2', …) 1 1; Column Partitioning Lab Exercise 18-1 (cont.) 6. (Optional) Use the SHOW TABLE for Order_CP to create a new column partitioned table named "Orders_CP_TP based on the following: – – – Each column is created as a separate partition (COLUMN partitioning is the first level) Utilize auto compression for every column Incorporate table partitioning with orderdate as the partitioning column • • From '2003-01-01' through '2012-12-31', partition by month Do not use the NO RANGE or UNKNOWN options. Populate the Orders_CP_TP table (via INSERT/SELECT) from the Orders_NoPI table. 7. (Optional) Use the PARTITION key word to determine the number of partitions defined in the Orders_CP_TP. How many partitions actually have data? ________ 8. (Optional) Determine the AMP space usage of the Orders_CP, Orders_CP_noAC, and Orders_CP_TP tables using DBC.TableSizeV. CurrentPerm of Orders_CP ________________ CurrentPerm of Orders_CP_noAC ________________ CurrentPerm of Orders_CP_TP ________________ Column Partitioning Page 18-65 Notes Page 18-66 Column Partitioning Module 19 Secondary Index Usage After completing this module, you will be able to: Describe USI and NUSI implementations. Describe dual NUSI access. Describe NUSI bit mapping. Explain NUSI and Aggregate processing. Compare NUSI vs. full table scan (FTS). Teradata Proprietary and Confidential Secondary Index Usage Page 19-1 Notes Page 19-2 Secondary Index Usage Table of Contents Secondary Indexes ..................................................................................................................... 19-4 Defining Secondary Indexes ...................................................................................................... 19-6 Secondary Index Subtables ........................................................................................................ 19-8 Primary Indexes (UPIs and NUPIs) ....................................................................................... 19-8 Unique Secondary Indexes (USIs) ......................................................................................... 19-8 Non-Unique Secondary Indexes (NUSIs) .............................................................................. 19-8 USI Subtable General Row Layout .......................................................................................... 19-10 USI Change for PPI.............................................................................................................. 19-10 USI Hash Mapping................................................................................................................... 19-12 NUSI Subtable General Row Layout ....................................................................................... 
19-14 NUSI Change for PPI ........................................................................................................... 19-14 NUSI Hash Mapping ................................................................................................................ 19-16 Table Access – A Complete Example...................................................................................... 19-18 Secondary Index Considerations .............................................................................................. 19-20 Single NUSI Access (Between, Less Than, or Greater Than) ................................................. 19-22 Dual NUSI Access ................................................................................................................... 19-24 AND with Equality Conditions ............................................................................................ 19-24 OR with Equality Conditions ............................................................................................... 19-24 NUSI Bit Mapping ................................................................................................................... 19-26 Example ............................................................................................................................... 19-26 Value-Ordered NUSIs .............................................................................................................. 19-28 Value-Ordered NUSIs (cont.) .................................................................................................. 19-30 Covering Indexes ..................................................................................................................... 19-32 Join Index Note: ............................................................................................................... 19-32 Example ........................................................................................................................... 19-32 Covering Indexes (cont.) .......................................................................................................... 19-34 NUSIs and Aggregate Processing ........................................................................................ 19-34 Example ............................................................................................................................... 19-34 NUSI vs. Full Table Scan (FTS) .............................................................................................. 19-36 Example ............................................................................................................................... 19-36 Full Table Scans – Sync Scans ................................................................................................ 19-38 Module 19: Review Questions ................................................................................................. 19-40 Secondary Index Usage Page 19-3 Secondary Indexes Secondary Indexes are generally defined to provide faster set selection. The Teradata Database allows up to 32 Secondary Indexes per table. Teradata handles Unique Secondary Indexes (USIs) and Non-Unique Secondary Indexes (NUSIs) very differently. The diagram illustrates how Secondary Index values are stored in subtables. Secondary Index values, like Primary Index values, are input to the Hashing Algorithm. As with Primary Indexes, the Hashing Algorithm takes the Secondary Index value and outputs a Row Hash. 
These Row Hash values point to a subtable which stores index rows containing the base table SI column values and Row IDs which point to the row(s) in the base table with the corresponding SI value. The Teradata Database can determine the difference between a base table and a SI subtable by checking the Subtable ID, which is part of the Table ID. Page 19-4 Secondary Index Usage Secondary Indexes Secondary indexes provide faster set selection. • • • • They may be unique (USI) or non-unique (NUSI). A USI may be used to maintain uniqueness on a column. The system maintains a separate subtable for each secondary index. A secondary index can consist of 1 to 64 columns. Subtables keep base table secondary index row hash, column values, and RowID (which point to the row(s) in the base table with that value). • The implementation of a USI is different than a NUSI. • Users cannot access subtables directly. SI Subtable Secondary Index Value Hashing Algorithm C Row ID Base Table Table_X A Primary Index Value Secondary Index Usage Hashing Algorithm PI B C D E SI Page 19-5 Defining Secondary Indexes Use the CREATE INDEX statement to create a secondary index on an existing table or join index. The index can be optionally named. Notes on ORDER BY option: If the ORDER BY option is not used, the default is to order by hash. If the ORDER BY option is specified and neither of the keywords (HASH or VALUES) is specified, then the default is to order by values. Recommendation: If the ORDER BY option is used, specify one of the keywords – HASH or VALUES. Notes on the ALL option: The ALL option indicates that a NUSI should retain row ID pointers for each logical row of a join index (as opposed to only the compressed physical rows). ALL also ignores the NOT CASESPECIFIC attribute of data types so a NUSI can include case-specific values. ALL enables a NUSI to cover a join index, enhancing performance by eliminating the need to access a join index when all values needed by a query are in the secondary index. However, ALL might also require the use of additional index storage space. Use this keyword when a NUSI is being defined for a join index and you want to make it eligible for the Optimizer to select when covering reduces access plan cost. ALL can also be used for an index on a table, however. You cannot specify multiple indexes that differ only by the presence or absence of the ALL option. The use of the ALL option for a NUSI on a data table does not cause a syntax error. Additional notes: column_name_2 specifies the sort order to be used. column_name_2 is a column name that must appear in the column_name_1 list. You can put two NUSI secondary indexes on the same column (or set of columns) if one of the indexes is ordered by hash and the other index is ordered by value. You cannot define a USI on a join index. Other secondary indexes are allowed. Page 19-6 Secondary Index Usage Defining Secondary Indexes Secondary indexes can be defined … • when a table is created (CREATE TABLE). • for an existing table (CREATE INDEX). , CREATE INDEX UNIQUE A , ( col_name_1 A index_name ALL ) B ORDER BY (col_name_2) VALUES HASH B ON table_name TEMPORARY ; join_index_name Examples: Unnamed USI: Named Value-Ordered NUSI: CREATE UNIQUE INDEX (item_id, store_id, sales_date) ON Daily_Sales; CREATE INDEX ds_vo_nusi (sales_date) ORDER BY VALUES ON Daily_Sales; Secondary Index Usage Page 19-7 Secondary Index Subtables The diagram on the facing page illustrates how the Teradata Database retrieves rows based upon their index values. 
It compares and contrasts examples of Primary (UPIs and NUPIs), Unique Secondary (USIs) and Non-Unique Secondary Indexes (NUSIs). Primary Indexes (UPIs and NUPIs) As you have seen previously, in the case of a Primary Index, the Teradata Database hashes the value and uses the Row Hash to find the desired row. This is always a one-AMP operation and is shown in the top diagram on the facing page. Unique Secondary Indexes (USIs) The middle diagram illustrates the process of a USI row retrieval. An index subtable contains index rows, which in turn point to base table rows matching the supplied index value. USI rows are globally hash- distributed across all AMPs, and are retrieved using the same procedure for Primary Index data row retrieval. Since the USI row is hash-distributed on different columns than the Primary Index of the base table, the USI row typically lands on an AMP other than the one containing the data row. Once the USI row is located, it “points” to the corresponding data row. This requires a second access and usually involves a different AMP. In effect, a USI retrieval is like two PI retrievals: Master Index - Cylinder Index - Index Block Master Index - Cylinder Index - Data Block Non-Unique Secondary Indexes (NUSIs) NUSIs are implemented on an AMP-local basis. Each AMP is responsible for maintaining only those NUSI subtable rows that correspond to base table rows located on that AMP. Since NUSIs allow duplicate index values and are based on different columns than the PI, data rows matching the supplied NUSI value could appear on any AMP. In a NUSI retrieval (illustrated at the bottom of the facing page), a message is sent to all AMPs to see if they have an index row for the supplied value. Those that do use the “pointers” in the index row to access their corresponding base table rows. Any AMP that does not have an index row for the NUSI value will not access the base table to extract rows. Page 19-8 Secondary Index Usage Secondary Index Subtables One AMP Operation Primary Index Value Hashing Algorithm Base Table Two AMP Operation Unique Secondary Index Value Hashing Algorithm USI Subtable Base Table All AMP Operation Non-Unique Secondary Index Value Hashing Algorithm NUSI Subtable Base Table Secondary Index Usage Page 19-9 USI Subtable General Row Layout The layout of a USI subtable row is illustrated at the top of the facing page. It is composed of several sections: The first two bytes designate the row length. The next 8 bytes contain the Row ID of the row. Within this Row ID, there are 4 bytes of Row Hash and 4 bytes of Uniqueness Value. The following 2 bytes are additional system bytes that will be explained later as will be the 7 bytes for row offsets. The next section contains the SI value. This is the value that was used by the Hashing Algorithm to generate the Row Hash for this row. This section varies in length depending on the index. Following the SI value are 8 bytes containing the Row ID of the base table row. The base table Row ID tells the system where the row corresponding to this particular USI value is located. If the table is partitioned, then the USI subtable row needs 10 or 16 bytes to identify the Row ID of the base table row. The Row ID (of the base table row) is combination of the Partition Number, Row Hash, and Uniqueness Value. The last two bytes contain the reference array pointer at the end of the block. The Teradata Database creates one index subtable row for each base table row. 
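To connect this row layout to how a USI is actually used, the following is a minimal sketch of the statements the subtable supports. The Orders table and order_id column are illustrative assumptions only (they are not the examples on the facing pages); order_id is assumed to be unique but not the Primary Index.

CREATE UNIQUE INDEX (order_id) ON Orders;     -- builds the USI subtable

SELECT *
FROM   Orders
WHERE  order_id = 100245;                     -- equality on the USI value

The equality retrieval is typically the two-AMP operation described earlier: one AMP reads the USI subtable row for the hashed value, and the base table Row ID stored in that row directs a second access, usually to a different AMP. Prefixing the SELECT with EXPLAIN confirms whether the unique secondary index is actually chosen.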
USI Change for PPI For tables defined with a PPI, a two-byte or optionally eight-byte (TD 14.0) partition number is embedded in the data row. Therefore, the unique row identifier is comprised of the Partition Number, the Row Hash, and the Uniqueness Value. The USI subtable rows use the wider row identifier to identify the base table row, making these subtable rows wider as well. Except for the embedded partition number, USI subtable rows (for a PPI table) have the same format as non-PPI rows. The facing page shows the row layout for USI subtable rows. Page 19-10 Secondary Index Usage USI Subtable General Row Layout Base Table Row Identifier Row ID of USI USI Subtable Row Layout Row Length Row Hash Uniq. Value 4 4 Secondary Index Value Part. # Row Hash Uniq. Value 2 or 8 4 4 Ref. Array Pointer Notes: • USI subtable rows are distributed by the Row Hash, like any other row. • The Row Hash is based on the unique secondary index value. • The subtable row includes the secondary index value and a second Row ID which identifies a single base table row (usually on a different AMP). • There is one index subtable row for each base table row. • For PPI tables, a two-byte (or optionally eight-byte with Teradata 14.0) partition number is embedded in the base table row identifier. – Therefore, the base table row identifier is comprised of the Partition Number, Row Hash, and the Uniqueness Value. Secondary Index Usage Page 19-11 USI Hash Mapping The example on the facing page shows the three-part message that is put onto the Message Passing Layer for USI access. The only difference between this and the three-part message used in PI access (previously discussed) is that the Subtable ID portion of the Table ID references the USI subtable not the data table. Using the DSW for the Row Hash, the Message Passing Layer (a.k.a., Communication Layer) directs the message to the correct AMP which uses the Table ID and Row Hash as a logical index block identifier and the Row Hash and USI value as the logical index row identifier. If the AMP succeeds in locating the index row, it extracts the base table Row ID (“pointer”). The Subtable ID portion of the Table ID is then modified to refer to the base table and a new three-part message is put onto the Communications Layer. Once again, the Message Passing Layer uses the DSW to identify the correct AMP. That AMP uses the Table ID and Row ID to locate the correct data block and then uses the Row ID to locate the correct row. Page 19-12 Secondary Index Usage USI Hash Mapping PARSER SELECT FROM WHERE Hashing Algorithm USI TableID Row Hash * Table_Name USI_col = 'siv'; siv Message Passing Layer (Request is sent to a specific AMP) AMP 0 AMP 1 USI Subtable ... AMP 2 USI Subtable USI Subtable USI RID RH USI Value (Base Table) Row ID siv RIDx Message Passing Layer (Request is sent to a specific AMP) Base Table Base Table Data Rows RID (8-16) Data Columns Data Rows RID (8-16) Data Columns RIDx Base Table Data Rows RID (8-16) Data Columns siv Secondary Index Usage Page 19-13 NUSI Subtable General Row Layout The layout of a NUSI subtable row is illustrated on the facing page. It is almost identical to the layout of a USI subtable row. There are, however, two major differences: First, NUSI subtable rows are not distributed across the system via AMP number in the Hash Map. NUSI subtable rows are built from the base table rows found on that particular AMP and refer only to the base rows of that AMP. Second, NUSI rows may point to or reference more than one base table row. 
There can be many base table Row IDs (8, 10, or 16 bytes) in a NUSI subtable row. Because NUSIs are always AMP-local to the base table rows, it is possible to have the same NUSI value represented on multiple AMPs. A NUSI subtable is just another table from the perspective of the file system. NUSI Change for PPI For tables defined with a PPI, the two-byte partition number is embedded in the data row. Therefore, the unique row identifier is comprised of the Partition Number, Row Hash, and Uniqueness Value. PPI data rows are two bytes wider than they would be if the table was not partitioned. If the base table is partitioned, then the NUSI subtable row needs 10 or 16 bytes for each RowID entry to identify the Row ID of the base table row. The Row ID (of the base table row) is combination of the Partition Number, Row Hash, and Uniqueness Value. The NUSI subtable rows use the wider row identifier to identify the base table row, making these subtable rows wider as well. Except for the embedded partition number, NUSI subtable rows (for a PPI table) have the same format as non-PPI rows. The facing page shows the row layout for NUSI subtable rows. Page 19-14 Secondary Index Usage NUSI Subtable General Row Layout Row ID of NUSI NUSI Subtable Row Layout Row Length Row Hash Uniq. Value 4 4 Secondary Index Value Table Row ID List P RH U 2/8 4 4 P RH U 2/8 4 Ref. Array Pointer 4 Notes: • The Row Hash is based on the base table secondary index value. • The NUSI subtable row contains Row IDs that identify the base table rows on this AMP that carry the Secondary Index Value. • The Row IDs reference (or "point") to base table rows on this AMP only. • There are one (or more) subtable rows for each secondary index value on the AMP. – One NUSI subtable row can hold approximately 4000 – 8000 Row IDs; assuming a NUSI data type less than 200 characters (CHAR(200)). – If an AMP has more than 4000 – 8000 rows with the same NUSI value, another NUSI subtable row is created for the same NUSI value. • The maximum size of a single NUSI row is 64 KB. Secondary Index Usage Page 19-15 NUSI Hash Mapping The example on the facing page shows the standard, three-part Message Passing Layer rowaccess message. Because NUSIs are AMP-local indexes, this message gets broadcast to all AMPs. Each AMP uses the values to search the appropriate index block for a corresponding NUSI row. Only those AMPs with one or more of the desired rows use the base table Row IDs to access the proper data blocks and data rows. In the example, the SELECT statement is designed to find those rows with a NUSI value of ‘siv’. Examination of the NUSI subtables on each AMP shows that AMPs 0, 2 and 3 (not shown) all have a subtable index row, and, therefore, base table rows satisfying this condition. These AMPs then participate in the base table access. The NUSI subtable on AMP 1, on the other hand, shows that there are no rows with a NUSI value of ‘siv’ located on this AMP. AMP 1 does not participate in the base table access process. If the table is not partitioned, the subtable rows will identify the 8-byte Row IDs of the base table rows. If the table is partitioned with less than (or equal) 65,535 partitions, the subtable rows will identify the 10-byte Row IDs of the base table rows. This Row ID includes the Partition Number. If the table is partitioned with more than 65,535 partitions, the subtable rows will identify the 16-byte Row IDs of the base table rows. This Row ID includes the Partition Number. 
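Before working through the facing-page diagram, the following minimal sketch shows the statements behind this all-AMP access pattern. The Customer table and last_name column are illustrative assumptions; last_name is assumed to be neither unique nor the Primary Index.

CREATE INDEX (last_name) ON Customer;     -- builds the AMP-local NUSI subtables

SELECT *
FROM   Customer
WHERE  last_name = 'Rice';                -- equality on the NUSI value

The request is broadcast to all AMPs; only the AMPs that find a subtable row for 'Rice' use its Row ID list to read their local base table rows, and the other AMPs do not touch the base table.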
Page 19-16 Secondary Index Usage NUSI Hash Mapping PARSER SELECT * FROM Table_Name WHERE NUSI_col = 'siv'; Hashing Algorithm NUSI TableID Row Hash siv Message Passing Layer (Broadcast to all AMPs) AMP 0 AMP 1 NUSI Subtable NUSI RID RH NUSI Value siv (Base Table) Row IDs AMP 2 NUSI Subtable NUSI Subtable NUSI RID NUSI Value (Base Table) Row IDs RID1 RID2 NUSI RID RH NUSI Value siv (Base Table) Row IDs RID3 Base Table Base Table Base Table Data Rows RID (8-16) Data Columns RID1 siv Data Rows RID (8-16) Data Columns Data Rows RID (8-16) Data Columns RID3 RID2 ... siv siv Secondary Index Usage Page 19-17 Table Access – A Complete Example The example on the facing page shows a four-AMP configuration with Base Table Rows, NUSI Subtable rows, and USI Subtable Rows. The table and index can be used to answer the following queries without having to do a full table scan: SELECT * FROM Customer WHERE Phone = '666-5555' ; SELECT * FROM Customer WHERE Cust = 80; SELECT * FROM Customer WHERE Name = 'Rice' ; Page 19-18 Secondary Index Usage Table Access – A Complete Example CUSTOMER Cust Name USI NUSI NUPI 37 98 74 95 27 56 45 31 40 72 80 49 12 62 77 51 White Brown Smith Peters Jones Smith Adams Adams Smith Adams Rice Smith Young Black Jones Rice 555-4444 333-9999 555-6666 555-7777 222-8888 555-7777 444-6666 111-2222 222-3333 666-7777 666-5555 111-6666 777-7777 444-5555 777-6666 888-2222 Example: SELECT * FROM Customer WHERE Phone = '666-5555' ; SELECT * FROM Customer WHERE Cust = 80; SELECT * FROM Customer WHERE Name = 'Rice' ; Phone AMP 1 USI Subtable RowID 244, 1 505, 1 744, 4 757, 1 Cust 74 77 51 27 RowID 884, 1 639, 1 915, 1 388, 1 NUSI Subtable RowID 432, 8 448, 1 567, 3 656, 1 Name RowID Smith 640, 1 White 107, 1 Adams 638, 1 Rice 536, 5 Base Table RowIDCust Name Phone USI NUSI NUPI 107, 1 37 White 555-4444 536, 5 80 Rice 666-5555 638, 1 31 Adams 111-2222 640, 1 40 Smith 222-3333 Secondary Index Usage AMP 2 USI Subtable RowID 135, 1 296, 1 602, 1 969, 1 Cust 98 80 56 49 RowID 555, 6 536, 5 778, 7 147, 1 NUSI Subtable RowID Name 432, 3 Smith 567, 2 Adams 852, 1 Brown RowID 884, 1 471,1 717,2 555, 6 Base Table RowID Cust Name Phone USI NUSI NUPI 471, 1 45 Adams 444-6666 555, 6 98 Brown 333-9999 717, 2 72 Adams 666-7777 884, 1 74 Smith 555-6666 AMP 3 USI Subtable RowID 288, 1 339, 1 372, 2 588, 1 Cust 31 40 45 95 RowID 638, 1 640, 1 471, 1 778, 3 NUSI Subtable RowID 432, 1 448, 4 567, 6 770, 1 Name Smith Black Jones Young RowID 147, 1 822, 1 338, 1 147, 2 Base Table RowIDCust Name Phone USI NUSI NUPI 147, 1 49 Smith 111-6666 147, 2 12 Young 777-4444 388, 1 27 Jones 222-8888 822, 1 62 Black 444-5555 AMP 4 USI Subtable RowID 175, 1 489, 1 838, 1 919, 1 Cust 37 72 12 62 RowID 107, 1 717, 2 147, 2 822, 1 NUSI Subtable RowID 262, 1 396, 1 432, 5 656, 1 Name RowID Jones 639, 1 Peters 778, 3 Smith 778, 7 Rice 915, 1 Base Table RowID Cust Name USI NUSI 639, 1 77 Jones 778, 3 95 Peters 778, 7 56 Smith 915, 1 51 Rice Phone NUPI 777-6666 555-7777 555-7777 888-2222 Page 19-19 Secondary Index Considerations As mentioned at the beginning of this module, a table may have up to 32 Secondary Indexes that can be created and dropped dynamically. It is probably not a good idea to create 32 SIs for each table just to speed up set selection because SIs consume the following extra resources: SIs require additional storage to hold their subtables. In the case of a Fallback table, the SI subtables are Fallback also. Twice the additional storage space is required. SIs require additional I/O to maintain these subtables. 
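Because of these storage and maintenance costs, it is worth confirming that a secondary index is actually chosen before keeping it. The following is a minimal sketch of that validation cycle, using hypothetical names (an Employee table with a candidate NUSI on job_code).

CREATE INDEX (job_code) ON Employee;

COLLECT STATISTICS ON Employee INDEX (job_code);

EXPLAIN
SELECT last_name, first_name
FROM   Employee
WHERE  job_code = 2147;                  -- the plan should reference the NUSI, not a FTS

DROP INDEX (job_code) ON Employee;       -- drop it if the plan still shows a full table scan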
When deciding whether or not to define a NUSI, there are other considerations. The Optimizer may choose to do a Full Table Scan rather than utilize the NUSI in two cases:

When the NUSI is not selective enough.
When no COLLECTed STATISTICS are available.

As a guideline, choose only those columns having frequent access as NUSI candidates. After the table has been loaded, create the NUSI indexes, COLLECT STATISTICS on the indexes, and then do an EXPLAIN referencing each NUSI. If the Parser chooses a Full Table Scan over using the NUSI, drop the index.

Page 19-20 Secondary Index Usage

Secondary Index Considerations
• A table may have up to 32 secondary indexes.
• Secondary indexes may be created and dropped dynamically.
  – They require additional storage space for their subtables.
  – They require additional I/Os to maintain their subtables.
• If the base table is Fallback, the secondary index subtable is Fallback as well.
• The Optimizer may, or may not, use a NUSI, depending on its selectivity.
• Without COLLECTed STATISTICS, the Optimizer often does a FTS.
• The following approach is recommended:
  – Create the index.
  – COLLECT STATISTICS on the index (or column).
  – Use EXPLAIN to see if the index is being used.

Secondary Index Usage Page 19-21

Single NUSI Access (Between, Less Than, or Greater Than)
The Teradata Database accesses data from a NUSI-defined column in three ways:

If the NUSI is not ordered by value, utilize the NUSI and do a Full Table Scan (FTS) of the NUSI subtable. In this case, the Row IDs of the qualifying base table rows would be retrieved into spool. The Teradata Database would use those Row IDs in spool to access the base table rows themselves.
If the NUSI is ordered by values, the NUSI subtable may be used to locate matching base table rows.
Ignore the NUSI and do an FTS of the base table itself.

In order to make this decision, the Optimizer requires COLLECTed STATISTICS.

REMEMBER: The only way to determine for certain whether an index is being used is to utilize the EXPLAIN facility.

Page 19-22 Secondary Index Usage

Single NUSI Access (Between, Less Than, or Greater Than)
If the NUSI is not value-ordered, the system may do a FTS of the NUSI subtable.
• Retrieve Row IDs of qualifying base table rows into spool.
• Access the base table rows from the spooled Row IDs.
The Optimizer requires COLLECTed STATISTICS to make this choice.

CREATE INDEX (hire_date) ON Employee;

SELECT last_name, first_name, hire_date
FROM   Employee
WHERE  hire_date BETWEEN DATE '2012-01-01' AND DATE '2012-12-31';

SELECT last_name, first_name, hire_date
FROM   Employee
WHERE  hire_date < DATE '2012-01-01';

SELECT last_name, first_name, hire_date
FROM   Employee
WHERE  hire_date > DATE '1999-12-31';

If the NUSI is ordered by values, the NUSI subtable is much more likely to be used to locate matching base table rows. Use EXPLAIN to see if, and how, indexes are being used.

Secondary Index Usage Page 19-23

Dual NUSI Access
In the example on the facing page, two NUSIs are CREATEd on separate columns of the EMPLOYEE table. The Teradata Database decides how to use these NUSIs based on their selectivity.

AND with Equality Conditions
If one of the two indexes is strongly selective, the system uses it alone for access.
If both indexes are weakly selective, but together they are strongly selective, the system does a bit-map intersection.
If both indexes are weakly selective separately and together, the system does an FTS.
In any case, any conditions in the SQL statement not used for access (residual conditions) become row qualifiers. OR with Equality Conditions When accessing data with two NUSI equality conditions joined by the OR operator (as shown in the last example on the facing page), the Teradata Database may do one of the following: Do a FTS of the base table. If each of the NUSIs is strongly selective, it may use each of the NUSIs to return the appropriate rows. Do an FTS of the two NUSI subtables and do the following steps. – – – Retrieve Rows IDs of qualifying base table rows into two separate spools. Eliminate duplicates from the two spools of Row IDs. Access the base rows from the resulting spool of Row IDs. If only one of the two columns joined by the OR is indexed, the Teradata Database always does an FTS of the base tables. Page 19-24 Secondary Index Usage Dual NUSI Access Each column is a separate NUSI: CREATE INDEX (department_number) ON Employee; CREATE INDEX (job_code) ON Employee; AND with Equality Conditions: SELECT FROM WHERE AND last_name, first_name, … Employee department_number = 500 job_code = 2147; OR with Equality Conditions: SELECT FROM WHERE OR last_name, first_name, ... Employee department_number = 500 job_code = 2147; Secondary Index Usage Optimizer options with AND: • Use one of the two indexes if it is strongly selective. • If the two indexes together are strongly selective, optionally do a bit-map intersection. • If both indexes are weakly selective separately and together, the system does a FTS. Optimizer options with OR: • • Do a FTS of the base table. • Do a FTS of the two NUSI subtables and retrieve Rows IDs of qualifying rows into spool and eliminate duplicate Row IDs from spool. If each of the NUSIs is strongly selective, it may use each of the NUSIs to return the appropriate rows. Page 19-25 NUSI Bit Mapping NUSI Bit Mapping is a process that determines common Row IDs between multiple NUSI values by a process of intersection: It is much faster than copying, sorting and comparing the Row ID lists. It dramatically reduces the number of base table I/Os. NUSI bit mapping can be used with conditions other than equality if all of the following conditions are satisfied: All conditions must be linked by the AND operator. At least two NUSI equality conditions must be specified. The Optimizer is more likely to consider if you have COLLECTed STATISTICS on the NUSIs. Even when the above conditions are satisfied, the only way to be absolutely certain that NUSI bit mapping is occurring is to use the EXPLAIN facility. Example The SQL statement and diagram on the facing page show how NUSI bit-map intersections can narrow down the number of rows even though each condition is weakly selective. In this example, the designer wants to access rows from the employee table. There are three NUSIs defined: salary_amount, country_code, and job_code. All three of these NUSIs are weakly selective. You can see that 7% of the employees earn more than $75,000 per year (>75000), 40% of the employees are located in the USA, and 12% of the employees have a job code of IT. In this case, the bit map intersection of these three NUSIs has an aggregate selectivity of .3%. That is, only .3% of the employees satisfy all three conditions: earning over $75,000, USA based, and work in IT. Page 19-26 Secondary Index Usage NUSI Bit Mapping • • • • • Determines common Row IDs between multiple NUSI values. • • Use EXPLAIN to see if bit mapping is being used. 
Faster than copying, sorting, and comparing the Row ID lists. Dramatically reduces the number of base table I/Os. All NUSI conditions must be linked by the AND operator. The Optimizer is much more likely to consider bit mapping if you COLLECT STATISTICS. Requires at least 2 NUSI equality conditions. .3% SELECT * FROM Employee WHERE salary_amount > 75000 AND country_code = 'USA' AND job_code = 'IT'; Secondary Index Usage 7% 40% 12% Page 19-27 Value-Ordered NUSIs NUSIs are maintained as separate subtables on each AMP. Their index entries point to base table or Join Index rows residing on the same AMP as the index. The row hash for NUSI rows is based on the secondary index column(s). Unlike row hash values for base table rows, this row hash does not determine the distribution of subtable rows; only the local sort order of each subtable. Enhancements have been made to support the user-specified option of sorting the index rows by data value rather than by hash code. This is referred to as "value ordered" indexes and is presented to the user in the form of new syntax options in the CREATE INDEX statement. By using the “value-ordered” indexes feature, this option can be specified to sort the index rows by data value rather than by hash code. The typical use of a hash-ordered NUSI is with an equality condition on the secondary index column(s). The specified secondary index value is hashed and then each NUSI subtable is probed for rows with the same row hash. For each matching NUSI entry, the corresponding Row IDs are used to access the base rows on the same AMP. Because the NUSI rows are stored in row hash order, searching the NUSI subtable for a particular row hash is very efficient. Value-ordered NUSIs, on the other hand, are useful for processing range conditions and conditions with an inequality on the secondary index column set. Although hash-ordered NUSIs can be selected by the Optimizer to access rows for range conditions, a far more common response is to specify a full table scan of the NUSI subtable to find the matching secondary key values. Therefore, depending on the size of the NUSI subtable, this might not be very efficient. By sorting the NUSI rows by data value, it is possible to search only a portion of the index subtable for a given range of key values. The major advantage of a value-ordered NUSI is in the performance of range queries. Value-ordered NUSIs have the following limitations. The sort key is limited to a single numeric column. The sort key column must be four or fewer bytes. The following query is an example of the sort of SELECT statement for which valueordered NUSIs were designed. SELECT FROM WHERE BETWEEN Page 19-28 * Orders orderdate DATE '2012-02-01' AND DATE '2012-02-29'; Secondary Index Usage Value-Ordered NUSIs A Value-Ordered NUSI is limited to a single column numeric (4-byte) value. Some benefits of using value-ordered NUSIs: • • • • Index subtable rows are sorted (sequenced) by data value rather than hash value. Optimizer can search only a portion of the index subtable for a given range of values. Can provide major advantages in performance of range queries. Even with PPI, the Value-Ordered NUSI is still a valuable index selection for other columns in a table. 
Example of creating a Value-Ordered NUSI:

CREATE INDEX (sales_date) ORDER BY VALUES (sales_date) ON Daily_Sales;

SELECT   sales_date, SUM(sales)
FROM     Daily_Sales
WHERE    sales_date BETWEEN DATE '2012-02-09' AND DATE '2012-02-15'
GROUP BY 1
ORDER BY 1;

The optimizer may choose to traverse the NUSI using a range constraint rather than do a FTS.

Secondary Index Usage Page 19-29

Value-Ordered NUSIs (cont.)
Rules for using an ORDER BY clause are described below. Multiple indexes can be defined on the same columns as long as each index differs in its ordering option (VALUES versus HASH).

column_1_name – The names of one or more columns whose field values are to be indexed. You can specify up to 64 columns for the new index. The index is based on the combined values of each column. Unless you use the ORDER BY clause, all columns are hash-ordered.

ORDER BY – Row ordering on each AMP by a single NUSI column: either value-ordered or hash-ordered.

VALUES – Value-ordering for the ORDER BY column. Select VALUES to optimize queries that return a contiguous range of values, especially for a covered index or a nested join.

HASH – Hash-ordering for the ORDER BY column. Select HASH to limit hash-ordering to one column, rather than all columns (the default). Hash-ordering a multi-column NUSI on one of its columns allows the NUSI to participate in a nested join where join conditions involve only that ordering column.

Note: A Value-Ordered NUSI actually reserves two subtable IDs and this counts as 2 secondary indexes in the maximum count of 32 for a table.

Page 19-30 Secondary Index Usage

Value-Ordered NUSIs (cont.)
• Option that increases the ability of a NUSI to “cover” SQL queries without having to access the base table.
• Value-Ordered is sorted by the ‘ORDER BY VALUES’ clause and the sort column is limited to a single numeric column that cannot exceed 4 bytes.
  – Value-Ordered is useful for range constraint queries.
• The ‘ORDER BY HASH’ clause provides the ability to create a multi-valued index, but have the NUSI hashed based on a single attribute within the index, not the entire composite value.
  – Hash-Ordered is useful for equality searches based on a single attribute.
  – Example: A NUSI may contain 10 columns for covering purposes and a single value 'ORDER BY HASH' for equality searches on that NUSI value.
• The Optimizer is much more likely to use a value-ordered NUSI if you have collected statistics on the value-ordered NUSI.

Secondary Index Usage Page 19-31

Covering Indexes
If the query references only columns of that table that are fully contained within a given index, the index is said to "cover" the table in the query. In these cases, it is often more efficient to access only the index subtable and avoid accessing the base table rows altogether.

Covering will be considered for any table in the query that references only columns defined in a given NUSI. These columns can be specified anywhere in the query, including the:

SELECT list
WHERE clause
Aggregate functions
GROUP BY expressions

The presence of a WHERE condition on each indexed column is not a prerequisite for using the index to cover the query. The optimizer will consider the legality and cost of covering versus other alternative access paths and choose the optimal plan. Many of the potential performance gains from index covering require no user intervention and will be transparent except for the execution plan returned by EXPLAIN.
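Pulling the ordering options and the covering discussion together, the following is a minimal sketch of a multi-column covering NUSI that is hash-ordered on a single column. The Orders table and column names here are illustrative assumptions, not the examples on the facing pages.

CREATE INDEX Ord_Cover (orderkey, orderdate, orderstatus, totalprice)
ORDER BY HASH (orderkey)
ON Orders;

SELECT orderdate, orderstatus, totalprice
FROM   Orders
WHERE  orderkey = 1000;

Every column the query references is carried in the NUSI, so the optimizer may satisfy it from the index subtable alone; the ORDER BY HASH (orderkey) clause keeps equality searches on orderkey efficient even though the index carries several columns for covering purposes. As always, EXPLAIN is the only way to confirm the choice.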
Join Index Note: This course hasn’t covered Join Indexes to this point, but it is possible to create a NUSI on top of a Join Index. The CREATE INDEX has a special option of ALL which is required if these columns will be potentially used for covering. The class of indexed data that will require user intervention to take advantage of covering is NUSIs, which may be defined on a Join Index. By default, a NUSI defined on a Join Index will maintain RowID pointers to only physical rows. In order to use the NUSI to cover the data stored in a Join Index, Row IDs must be kept for each associated logical row. As a result, when defining a potential covering NUSI on top of a Join Index, users should specify the ALL option to indicate the NUSI rows should point to logical rows. Example Defining a NUSI on top of a Join Index CREATE JOIN INDEX OrdCustIdx as SELECT (custkey, custname), (orderstatus, orderdate, ordercomment) FROM Orders O LEFT JOIN Customer C ON O.custkey = C.custkey ORDER BY custkey PRIMARY INDEX (custname); CREATE INDEX idx_name_stat ALL (custname, orderstatus) on OrdCustIdx; Page 19-32 Secondary Index Usage Covering Indexes • The optimizer considers using a NUSI subtable to “cover” any query that references only columns defined in a given NUSI. • These columns can be specified anywhere in the query including: – – – – – SELECT list WHERE clause Aggregate functions GROUP BY clauses Expressions • Presence of a WHERE condition on each indexed column is not a prerequisite for using the index to cover the query. • NUSIs (especially a covering NUSI) are considered by the optimizer in join plans and can be joined to other tables in the system. Query considered for index covering: CREATE INDEX IdxOrd (orderkey, orderdate, totalprice) ON Orders ; Query to access the table via the OrderKey. Secondary Index Usage SELECT FROM WHERE GROUP BY Query considered for index covering and ordering: CREATE INDEX IdxOrd2 (orderkey, orderdate, totalprice) ORDER BY VALUES (orderkey) ON Orders ; orderdate, AVG(totalprice) Orders orderkey >1000 orderdate ; Page 19-33 Covering Indexes (cont.) NUSIs and Aggregate Processing When aggregation is performed on a NUSI column, the Optimizer accesses the NUSI subtable that offers much better performance than accessing the base table rows. Better performance is achieved because there should be fewer index blocks and rows in the subtable than data blocks and rows in the base table, thus requiring less I/O. Example In the example on the facing page, there is a NUSI defined on the state column of the location table. Aggregate processing of this NUSI column produces much faster results for the SELECT statement, which counts the number of rows for each state. Page 19-34 Secondary Index Usage Covering Indexes (cont.) • • • • The Optimizer uses NUSI subtables for aggregation when possible. If the aggregated column is a NUSI, subtable access may be sufficient. The system counts Row ID List entries for each AMP for each value. Also referred to as a “Covered NUSI”. SELECT FROM GROUP BY NUSI Subtable NUSI Subtable COUNT (*), state Location state; NUSI Subtable = subtable Row ID NUSI Subtable NY NY NY NY OH OH OH OH GA GA GA GA CA CA CA CA Secondary Index Usage Page 19-35 NUSI vs. Full Table Scan (FTS) The Optimizer generally chooses an FTS over a NUSI when one of the following occurs: Rows per value is greater than data blocks per AMP. It does not have COLLECTed STATISTICS on the NUSI. The index is too weakly selective. The Optimizer determines this by using COLLECTed STATISTICS. 
Example The table on the facing page shows how the access method chosen affects the number of physical I/Os per AMP. In the case of a NUSI, there is ONE I/O necessary to read the Index Block on each AMP plus 0-ALL (where ALL = Number of Data Blocks) I/Os required to read the Data Blocks for a possible total ranging from the Number of AMPs - (Number of AMPs + ALL) I/Os. In the case of a Full Table Scan, there are no I/Os required to read any Index Blocks, but the system reads ALL Data Blocks. The only way to tell whether or not a NUSI is being used is by using EXPLAIN. COLLECT STATISTICS on all NUSIs. Use EXPLAIN to see whether a NUSI is being used. Do not define NUSIs that will not be used. Page 19-36 Secondary Index Usage NUSI vs. Full Table Scan (FTS) The Optimizer generally chooses a FTS over a NUSI when: • It does not have COLLECTed STATISTICS on the NUSI. • The index is too weakly selective. • Small tables. Access Method Physical I/Os per AMP NUSI 1 0 – Many Index Subtable Block(s) Data Blocks Full Table Scan 0 ALL Index Subtable Blocks Data Blocks General Rules: • COLLECT STATISTICS on all NUSIs. • USE EXPLAIN to see whether a NUSI is being used. • Do not define NUSIs that will not be used. Secondary Index Usage Page 19-37 Full Table Scans – Sync Scans In the case of multiple users that access the same table at the same time, the system can do a synchronized scan (sync scan) on the table. Page 19-38 Secondary Index Usage Full Table Scans – Sync Scans In the case of multiple users that access the same table at the same time, the system can do a synchronized scan (sync scan) on the table. • Multiple simultaneous scans share reads – this is a sync scan at the block level. • New query joins scan at the current scan point. Table Rows 112747 Query 1766 1 034982 2212 Begins 310229 2231 100766 106363 108222 3001 3005 3100 Frankel Bench Palmer Allan John Carson 209181 123881 223431 1235 2433 2500 108221 101433 105200 3001 3007 3101 Smith Walton Brooks Buster Sam Steve 221015 121332 118314 1019 2281 2100 108222 3199 Query 2 101281 Begins 3007 101100 3002 Woods Walton Ramon Tiger John Anne 104631 210110 210001 1279 1201 1205 100279 101222 105432 3002 3003 3022 Roberts Douglas Morgan Julie Michael Joe 100076 100045 319116 1011 1012 1219 104321 101231 121871 3021 3087 3025 Anderson Sparky Query 3 Michelson Phil Begins Crawford Cindy : : : : : : : : Secondary Index Usage : : : : Page 19-39 Module 19: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 19-40 Secondary Index Usage Module 19: Review Questions 1. Because the row is hash-distributed on different columns, the subtable row will typically land on an AMP other than the one containing the data row. This index would be: a. UPI or NUPI b. USI c. NUSI d. None of the above 2. The Teradata DBS hashes the value and uses the Row Hash to find the desired rows. This is always a one-AMP operation. This index would be: a. b. c. d. UPI or NUPI USI NUSI None of the above 3. ___________________ is a process that determines common Row IDs between multiple NUSI values by a process of intersection. a. NUSI Bit Mapping b. Dual NUSI Access c. Full Table Scan d. NUSI Read 4. If aggregation is performed on a NUSI column, the Optimizer accesses the NUSI subtable and returns the result without accessing the base table. This is referred to as: a. b. c. d. 
NUSI bit mapping Full table scan Dual NUSI access Covering Index Secondary Index Usage Page 19-41 Notes Page 19-42 Secondary Index Usage Module 20 Analyze Secondary Index Criteria After completing this module, you will be able to: Describe Composite Secondary Indexes. Choose columns as candidate Secondary Indexes. Analyze Change Rating, Value Access, and Range Access. Teradata Proprietary and Confidential Analyze Secondary Index Criteria Page 20-1 Notes Page 20-2 Analyze Secondary Index Criteria Table of Contents Accessing Rows ......................................................................................................................... 20-4 Row Selection ............................................................................................................................ 20-6 Secondary Index Considerations ................................................................................................ 20-8 Secondary Index Usage ............................................................................................................ 20-10 Secondary Index Candidate Guidelines ................................................................................... 20-12 Exercise 3 – Sample ................................................................................................................. 20-14 Secondary Index Candidate Guidelines ............................................................................... 20-14 Exercise 3 – Choosing SI Candidates ...................................................................................... 20-16 Exercise 3 – Choosing SI Candidates (cont.) ....................................................................... 20-18 Exercise 3 – Choosing SI Candidates (cont.) ....................................................................... 20-20 Exercise 3 – Choosing SI Candidates (cont.) ....................................................................... 20-22 Exercise 3 – Choosing SI Candidates (cont.) ....................................................................... 20-24 Exercise 3 – Choosing SI Candidates (cont.) ....................................................................... 20-26 Change Rating .......................................................................................................................... 20-28 Value and Range Access .......................................................................................................... 20-30 Exercise 4 – Sample ................................................................................................................. 20-32 Exercise 4 – Eliminating Index Candidates ............................................................................. 20-34 Exercise 4 – Eliminating Index Candidates (cont.) .............................................................. 20-36 Exercise 4 – Eliminating Index Candidates (cont.) .............................................................. 20-38 Exercise 4 – Eliminating Index Candidates (cont.) .............................................................. 20-40 Exercise 4 – Eliminating Index Candidates (cont.) .............................................................. 20-42 Exercise 4 – Eliminating Index Candidates (cont.) .............................................................. 20-44 Module 20: Review Questions ................................................................................................. 
20-46

Analyze Secondary Index Criteria Page 20-3

Accessing Rows
Three SQL commands require that rows be physically read. They are SELECT, UPDATE, and DELETE. Their syntax and use are described below:

SELECT [expression] FROM tablename ...
UPDATE tablename SET col_name = [expression] ...
DELETE FROM tablename ...

The SELECT command returns the value(s) from the table(s) for display or processing. Many people confuse the SQL SELECT statement with a READ command (e.g., COBOL). SELECT simply asks for the column values expressed in the project list to be returned for display or processing. The rows which have their values returned, deleted, or updated are identified by the WHERE clause (when present). It is the WHERE clause that controls File System reads.

The UPDATE command changes one or more column values to new values. The DELETE command removes rows from a table.

Any of the three SQL statements can be modified with a WHERE clause. Values specified in a WHERE clause tell Teradata which rows should be acted upon. Proper use of the WHERE clause will improve throughput by limiting the number of rows that must be handled.

Page 20-4 Analyze Secondary Index Criteria

Accessing Rows
SELECT {expression} FROM tablename…
• Returns the value(s) from the table(s) for display or processing.
• The row(s) must be physically read first.
UPDATE tablename SET columns = {expression}…
• Changes one or more column values to new values.
• The row(s) must be physically located (read) first.
DELETE FROM tablename…
• Removes rows from a table.
• The row(s) must be physically located (read) first.
Any of the above SQL statements can contain a WHERE clause.
• Values in the WHERE clause tell Teradata what set of rows to act on.
• Without a WHERE clause, all rows participate in the operation.
• Limiting the number of rows Teradata must handle improves throughput.

Analyze Secondary Index Criteria Page 20-5

Row Selection
When Teradata processes an SQL statement with a WHERE clause, it examines the clause and builds an execution plan and access method to satisfy the clause conditions. Certain conditions contained in the WHERE clause take advantage of indexing (assuming that the appropriate index is in place). These conditions are shown in the upper box on the facing page. Notice that these conditions all ask the RDBMS to locate a specific value or set of values. Application programmers should use these conditions whenever possible as they offer the best performance.

Other WHERE clause conditions are not able to take advantage of indexing and will always cause a Full Table Scan of either the Base Table or a SI subtable. Though they benefit from the Teradata distributed architecture, they are less desirable from a performance standpoint. These kinds of conditions are listed in the middle box on the opposite page and do not focus on a specific value or set of values, thus forcing the system to do a Full Table Scan to find all the values that satisfy them. Note that poor relational models severely limit physical design choices and generally force more Full Table Scans.

The maximum number of ORed conditions or IN list values per request cannot exceed 1,048,576. There is really no other fixed limit on the number of entries in an IN list; however, the maximum SQL text size is 1 MB, and this places a request-specific upper bound on that number.

NOTE: The small box at the bottom of the facing page lists commands that operate on the answer sets generated by previous conditions, such as those shown in the boxes above.
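To make the facing-page lists concrete, the following sketch contrasts a predicate written in an indexable form with two forms that typically force a Full Table Scan. The Employee table, the salary_amount column, and the NUSI assumed on it are illustrative only.

SELECT *
FROM   Employee
WHERE  salary_amount = 50000;            -- equality on the bare column: can use the NUSI

SELECT *
FROM   Employee
WHERE  salary_amount * 1.05 = 52500;     -- computation applied to the column: typically a FTS

SELECT *
FROM   Employee
WHERE  salary_amount <> 50000;           -- non-equality comparison: typically a FTS

Logically similar requests can therefore land on very different access paths depending on how the condition is written; keeping the indexed column alone on one side of an equality gives the Optimizer the chance to use the index.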
Page 20-6 Analyze Secondary Index Criteria Row Selection WHERE clause conditions that may use indexing if available*: colname = value colname IS NULL colname IN (subquery) * colname IN (explicit list of values) t1.col_x = t1.col_y t1.col_x = t2.col_x condition1 AND condition2 condition1 OR condition2 colname = ANY, SOME or ALL Access methods for the above depend on whether the column(s) are indexed, type of index, and selectivity of the index. WHERE clause conditions that typically cause a Full Table Scan: non-equality comparisons colname IS NOT NULL colname NOT IN (explicit list of values) colname NOT IN (subquery) colname BETWEEN ... AND … Join condition1 OR Join condition2 t1.col_x [ computation ] = value t1.col_x [ computation ] = t1.col_y INDEX (colname) SUBSTRING (colname) SUM MIN MAX AVG COUNT The following functions affect output only, not base row selection. Poor relational models severely limit physical design choices and generally force more Full Table Scans. Analyze Secondary Index Criteria DISTINCT ANY ALL NOT (condition1) col1 || col2 = value colname LIKE ... missing a WHERE clause GROUP BY HAVING WITH WITH … BY ... ORDER BY UNION INTERSECT EXCEPT Page 20-7 Secondary Index Considerations The facing page describes key considerations involved in decisions regarding the use of Secondary Indexes. It is important to weigh the costs of Secondary Indexes against the benefits. Some of these costs are increased use of disk space and increased I/O. The main benefit of Secondary Indexes is faster set selection. Choose them on frequently used set selections. REMEMBER Data demographics change over time. Revisit all index choices regularly to make sure that they remain appropriate and serve you well. Page 20-8 Analyze Secondary Index Criteria Secondary Index Considerations • Secondary Indexes consume disk space for their subtables. • INSERTs, DELETEs, and UPDATEs (sometimes) cost double the I/Os. • Choose Secondary Indexes on frequently used set selections. – Secondary Index use is typically based on an Equality search. – A NUSI may have multiple rows per value. • • • • • The Optimizer may not use a NUSI if it is too weakly selective. Avoid choosing Secondary Indexes with volatile data values. Weigh the impact on Batch Maintenance and OLTP applications. USI changes are Transient Journaled. NUSI changes are not. Remove or drop NUSIs that are not used. Data demographics change over time. Revisit ALL index (Primary and Secondary) choices regularly. Make sure they are still serving you well. Analyze Secondary Index Criteria Page 20-9 Secondary Index Usage The facing lists common usage for a USI and a NUSI. Page 20-10 Analyze Secondary Index Criteria Secondary Index Usage Unique Secondary Index (USI) Usage • A USI is used to maintain uniqueness in a column or columns. • Usage is determined by specifying the USI value in an equality condition in the • WHERE clause or ON clause. Unique Secondary Indexes support … – Nested Joins – Row-hash locking Non-unique Secondary Index (NUSI) Usage • Usage is determined by specifying the NUSI value in an equality condition in the WHERE clause or ON clause. • Non-Unique Secondary Indexes support Nested Joins and Merge Joins • Optimizer can choose to use bit-mapping for weakly selective (>10%) NUSIs which can alleviate limitations associated with composite NUSIs. • In some cases, it may be better to use multiple single-column NUSIs (City, State) instead a single composite NUSI. 
– User has to balance the overhead of multiple NUSIs as compared to a single composite NUSI. • Can be used to “cover” a query, avoiding base table access. • Can significantly reduce base table I/O during value and join operations. Analyze Secondary Index Criteria Page 20-11 Secondary Index Candidate Guidelines All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: The optimizer does not only look at selectively of a column to determine if a FTS or an indexed access will be used in a given plan. The decision is made based after comparing the total cost of both approaches, after considering multiple factors, including row size, block size, number of rows in the table, and also the I/O and CPU cost (based on the current hardware cost factors). In this course, we are going to 5% as a guideline for NUSI selectivity. Example 1: Assume a table has 100M rows and a column has 50 distinct values that are evenly distributed (each value has the same number of rows). Therefore, each value has 2M rows and effectively represents 2% of the rows. The NUSI would be used. Example 2: Assume a table has 100M rows and a column has 20 distinct values that are evenly distributed (each value has the same number of rows). Therefore, each value has 5M rows and effectively represents 5% of the rows. The NUSI would be used. Example 3: Assume a table has 100M rows and a column has 10 distinct values that are evenly distributed (each value has the same number of rows). Therefore, each value has 10M rows and effectively represents 10% of the rows. The NUSI would not be used. The greater the discrepancy between typical rows per value and max rows per value, the higher the probability the NUSI would not be used based on the max value used to qualify the rows. Page 20-12 Analyze Secondary Index Criteria Secondary Index Candidate Guidelines • All Primary Index (PI) candidates are Secondary Index candidates. – A UPI is a USI candidate and a NUPI is a NUSI candidate. • Columns that are not PI candidates should also be considered as NUSI candidates. • A NUSI will be used depending on the percentage of table rows that will be accessed. For example: – If the number of rows accessed via a NUSI is ≤ 5%, the NUSI will be used. – If the number of rows accessed via a NUSI is 5 – 10%, the NUSI may or may not be used. – If the number of rows accessed via a NUSI is > 10%, the NUSI will not be used. • If 5% is used as the guideline, then any column with 20 or more distinct values is considered as a NUSI candidate. – The optimizer (based on statistics) will decide to use (or not) the NUSI for specific values. – The greater the discrepancy between typical rows per value and max rows per value, the higher the probability the NUSI would not be used based on the max value used to qualify the rows. • These are only guidelines for candidate selection. Validate (via Explain and testing) that the NUSI will be chosen AND that it will provide better performance. Analyze Secondary Index Criteria Page 20-13 Exercise 3 – Sample In this exercise, you will work with the same tables you used to identify PI candidates in Exercise 2 in Module 17. Use the Secondary Index Candidate Guidelines below to identify all USI and NUSI candidates. 
The table on the facing page provides you with an example of how to apply the Secondary Index Candidate Guidelines. You will make further index choices for these tables in following exercises. Note: These exercises do not provide row sizes. Therefore, assume that the rows could be as large as 960 bytes and assume a typical block size of 96 KB. Secondary Index Candidate Guidelines All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: If the number of distinct values ≥ 20, then the column is a NUSI candidate. Page 20-14 Analyze Secondary Index Criteria Exercise 3 – Sample Secondary Index Guidelines • All PI candidates are Secondary Index candidates. • Other columns are NUSI candidates if typical rows/value is ≤ 5% or # of distinct values ≥ 20. On the following pages, there are sample tables with typical rows per value demographics. • Indicate ALL possible Secondary Index candidates (USI and NUSI). • Later exercises will guide your final choices. Example 60,000,000 Rows PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating PI/SI A B PK,SA 5K 12 1M 50M 60M 1 0 1 0 UPI USI 2.6K 0 0 0 7M 12 5 7 1 NUPI NUSI C D FK,NN NN,ND 0 0 1K 5K 1.5M 500 0 35 5 NUPI? NUSI 500K 0 0 0 60M 1 0 1 3 UPI USI E F G H 0 0 0 0 8 8M 0 7M 0 0 0 0 0 15M 9 725K 3 4 0 0 0 0 15M 725K 5 3 4 52 4K 0 0 700 90K 10K 80K 9 NUSI NUSI NUSI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-15 Exercise 3 – Choosing SI Candidates In this exercise, you will work with the same tables you used to identify PI candidates in Exercise 2 in Module 17. Use the Secondary Index Candidate Guidelines below to identify all USI and NUSI candidates. All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: If the number of distinct values ≥ 20, then the column is a NUSI candidate. Page 20-16 Analyze Secondary Index Criteria Exercise 3 – Choosing SI Candidates ENTITY 1 100,000,000 Rows PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating A B C D E F 0 0 0 0 95M 2 0 1 3 NUPI 0 0 0 0 300K 400 0 325 2 NUPI 0 0 0 0 250K 350 0 300 1 NUPI 0 0 0 0 40M 3 1.5M 2 1 0 0 0 0 1M 110 0 90 1 NUPI PK,UA 50K 0 10M 10M 100M 1 0 1 0 UPI PI/SI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-17 Exercise 3 – Choosing SI Candidates (cont.) Use the Secondary Index Candidate Guidelines below to identify all USI and NUSI candidates. All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: If the number of distinct values ≥ 20, then the column is a NUSI candidate. Page 20-18 Analyze Secondary Index Criteria Exercise 3 – Choosing SI Candidates (cont.) 
ENTITY 2 10,000,000 Rows G PK/FK PK,SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 5K 12 100M 100M 10M 1 0 1 0 UPI H I J K L 365 0 0 0 100K 200 0 100 0 NUPI 12 0 0 0 9M 2 100K 1 9 12 0 0 0 12 1M 0 800K 1 0 0 0 0 50 240K 0 190K 2 0 260 0 0 180K 60 0 50 0 NUPI PI/SI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-19 Exercise 3 – Choosing SI Candidates (cont.) Use the Secondary Index Candidate Guidelines below to identify all USI and NUSI candidates. All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: If the number of distinct values ≥ 20, then the column is a NUSI candidate. Page 20-20 Analyze Secondary Index Criteria Exercise 3 – Choosing SI Candidates (cont.) DEPENDENT 5,000,000 Rows A PK/FK M N O PK P Q NN,ND FK SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 0 0 700K 1M 2M 4 0 1 0 0 0 0 0 50 200K 0 60K 0 PI/SI NUPI 0 0 0 0 90K 75 0 50 3 UPI 0 0 0 0 3M 2 390K 1 1 0 0 0 0 5M 1 0 1 0 UPI 0 0 0 0 2M 5 1M 1 1 NUPI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-21 Exercise 3 – Choosing SI Candidates (cont.) Use the Secondary Index Candidate Guidelines below to identify all USI and NUSI candidates. All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: If the number of distinct values ≥ 20, then the column is a NUSI candidate. Page 20-22 Analyze Secondary Index Criteria Exercise 3 – Choosing SI Candidates (cont.) ASSOCIATIVE 1 300,000,000 Rows A PK/FK G R S PK FK FK,SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 260 0 0 0 100M 5 0 3 0 0 0 8M 300M 10M 50 0 30 0 0 0 0 0 15K 21K 0 19K 0 0 0 0 0 800K 400 0 350 0 PI/SI NUPI NUPI NUPI? NUPI UPI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-23 Exercise 3 – Choosing SI Candidates (cont.) Use the Secondary Index Candidate Guidelines below to identify all USI and NUSI candidates. All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: If the number of distinct values ≥ 20, then the column is a NUSI candidate. Page 20-24 Analyze Secondary Index Criteria Exercise 3 – Choosing SI Candidates (cont.) ASSOCIATIVE 2 100,000,000 Rows A M PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating G T U 0 0 0 0 750 135K 0 100K 0 PK FK FK 0 0 7M 800M 50M 3 0 1 0 0 0 250K 20M 10M 150 0 8 0 0 0 0 0 560K 180 0 170 0 NUPI NUPI UPI PI/SI NUPI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-25 Exercise 3 – Choosing SI Candidates (cont.) Use the Secondary Index Candidate Guidelines below to identify all USI and NUSI candidates. 
All Primary Index candidates are Secondary Index candidates. Columns that are not Primary Index candidates have to also be considered as NUSI candidates. A NUSI will be used by the Optimizer to select data if it is strongly selective. A guideline to use in initially selecting NUSI candidates is the following: If the number of distinct values ≥ 20, then the column is a NUSI candidate. Page 20-26 Analyze Secondary Index Criteria Exercise 3 – Choosing SI Candidates (cont.) HISTORY 730,000,000 Rows A PK/FK DATE D E F 0 0 0 0 N/A N/A N/A N/A N/A 0 0 0 0 N/A N/A N/A N/A N/A 0 0 0 0 N/A N/A N/A N/A N/A PK FK SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 10M 0 800M 2.4B 100M 18 0 3 0 5K 20K 0 0 730 1100K 0 900K 0 PI/SI NUPI UPI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-27 Change Rating Change Rating is a number that comes from Application & Transaction Modeling (ATM). Change Rating indicates how often the values in a column, or columns, are updated. It is a value from 0 to 10, with 0 describing those columns which never change and 10 describing those columns which change with every write operation. The Change Rating values of various types of columns are shown on the facing page. Change Rating has nothing to do with the SQL INSERT or DELETE statements. A table may be subject to frequent INSERTs and/or DELETEs, but the Change Ratings of columns will be low as long as the values within those columns remain stable throughout the lifetime of the row. Change Rating is dependent only on the SQL UPDATE statement. Change Rating is affected when column values are UPDATEd. Utilize Change Rating when choosing indexes. Primary Indexes must be based on columns with very stable data values. PI columns should never have Change Ratings higher than 1. Secondary Indexes should be based on columns with at least fairly stable data values. You should not choose columns with Change Ratings higher than 3 for SIs. Page 20-28 Analyze Secondary Index Criteria Change Rating Change Rating indicates how often values in a column are UPDATEd: • 0 = column values never change. • 10 = column changes with every write operation. PK columns are always 0. Historical data columns are always 0. Data that does not normally change = 1. Update tracking columns = 10. All other columns are rated 2 - 9. Base Primary Index choices on columns with very stable data values: • A change rating of 0 - 1 is reasonable. Base Secondary Index choices on columns with fairly stable data values: • A change rating of 0 - 3 is reasonable. Analyze Secondary Index Criteria Page 20-29 Value and Range Access Value Access Frequency is a numeric rating which tells you how many times all known transactions access the table in a given time interval (e.g., a one-year period). It measures how frequently a column, or columns, is accessed by SQL statements containing an equality value. Range Access Frequency is a numeric rating which tells you how many times all known transactions access the table in a given time interval (e.g., a one-year period). It measures how frequently a column, or columns, is accessed by SQL statements that access a range of values such as a DATE range. These types of queries may contain inequality or BETWEEN expressions. A Value Access or Range Access of 0 implies that there is no need to access the table through that column. 
Since NUSIs require system resources to maintain them (INSERTs and DELETEs require additional I/O to update the SI subtables), there is no point in having a NUSI if it is not used for access. All NUSI candidates with very low Value Access or Range Access Frequency should be eliminated. Page 20-30 Analyze Secondary Index Criteria Value and Range Access Value Access: • How often a column appears with an equality value. For example: WHERE column_name = hardcoded_value or substitutable_value Range Access: • How often a column is used to access a range of data values (e.g., range of dates). For example: WHERE column_name BETWEEN value AND value or WHERE column_name > value Value Access or Range Access Frequency: • How often in a given time interval (e.g., annually) all known transactions access rows from the table through this column either with an equality value or with a range of values. Notes: • The above demographics result from Activity Modeling. • Low Value Access or Range Access Frequency: – Secondary Index overhead may cost more than doing the FTS. • NUSIs may be considered by the optimizer for joins. In the following exercises, we are going to eliminate NUSIs with a value access of 0, but we may need to reconsider the NUSI as an index choice depending on join access (when given join metrics). • EXPLAINs indicate if the index choices are utilized or not. Analyze Secondary Index Criteria Page 20-31 Exercise 4 – Sample In this exercise, you will again work with the same tables that you used in Exercises 2 and 3. In this exercise, you will look at three additional demographics to eliminate potential index candidates and to possibly choose Value-Ordered NUSI candidates. The three additional data demographics that you will look at are: Change Rating Value Access Range Access Use the following Change Rating demographics guidelines to eliminate those candidates that do not fit the guidelines. The table on the right provides you with an example of how to apply these guidelines. PI candidates should have Change Ratings from 0 - 1. SI candidates should have Change Ratings from 0 - 3. Also, eliminate those NUSI candidates which have Value Access = 0 and Range Access = 0. If a Range Access is greater than 0, then consider the column as a possible Value-Ordered NUSI (VONUSI) candidate. The table on the facing page provides you with an example of how to apply these guidelines. You will make final index choices for these tables in Exercise 5 (later module). Page 20-32 Analyze Secondary Index Criteria Exercise 4 – Sample Change Rating Guidelines: • PI – change rating 0 - 1. • SI – change rating 0 - 3. Value Access Guideline: • NUSI Value Access > 0 • VONUSI Range Access > 0 On the following pages, there are sample tables with change row and value access demographics. • Eliminate Index candidates based on change rating and value access. • Identify any VONUSI candidates with a Range Access > 0 • Later exercises will guide your final choices. Example 60,000,000 Rows PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating PI/SI A B PK,SA 5K 12 1M 50M 60M 1 0 1 0 UPI USI 2.6K 0 0 0 7M 12 5 7 1 NUPI NUSI C D FK,NN NN,ND 0 0 1K 5K 1.5M 500 0 35 5 NUPI? 
NUSI 500K 0 0 0 60M 1 0 1 3 UPI USI E F G H 0 0 0 0 8 8M 0 7M 0 0 0 0 0 15M 9 725K 3 4 0 0 0 0 15M 725K 5 3 4 52 4K 0 0 700 90K 10K 80K 9 NUSI NUSI NUSI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-33 Exercise 4 – Eliminating Index Candidates In this exercise, you will look at three additional demographics to eliminate potential index candidates and to possibly choose Value-Ordered NUSI candidates. The three additional data demographics that you will look at are: Change Rating Value Access Range Access Use the following Change Rating demographics guidelines to eliminate those candidates that do not fit the guidelines. PI candidates should have Change Ratings from 0 - 1. SI candidates should have Change Ratings from 0 - 3. Also, eliminate those NUSI candidates which have Value Access = 0 and Range Access = 0. If a Range Access is greater than 0, then consider the column as a possible Value-Ordered NUSI (VONUSI) candidate. Page 20-34 Analyze Secondary Index Criteria Exercise 4 – Eliminating Index Candidates ENTITY 1 100,000,000 Rows PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating PI/SI A B C D E F 0 0 0 0 95M 2 0 1 3 NUPI NUSI 0 0 0 0 300K 400 0 325 1 NUPI NUSI 0 0 0 0 250K 350 0 300 1 NUPI NUSI 0 0 0 0 40M 3 1.5M 2 1 0 0 0 0 1M 110 0 90 1 NUPI NUSI PK,UA 50K 0 10M 10M 100M 1 0 1 0 UPI USI NUSI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-35 Exercise 4 – Eliminating Index Candidates (cont.) In this exercise, you will look at three additional demographics to eliminate potential index candidates and to possibly choose Value-Ordered NUSI candidates. The three additional data demographics that you will look at are: Change Rating Value Access Range Access Use the following Change Rating demographics guidelines to eliminate those candidates that do not fit the guidelines. PI candidates should have Change Ratings from 0 - 1. SI candidates should have Change Ratings from 0 - 3. Also, eliminate those NUSI candidates which have Value Access = 0 and Range Access = 0. If a Range Access is greater than 0, then consider the column as a possible Value-Ordered NUSI (VONUSI) candidate. Page 20-36 Analyze Secondary Index Criteria Exercise 4 – Eliminating Index Candidates (cont.) ENTITY 2 10,000,000 Rows G PK/FK PK,SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 5K 12 100M 100M 10M 1 0 1 0 UPI USI PI/SI H I J K L 365 0 0 0 100K 200 0 100 0 NUPI NUSI 12 0 0 0 9M 2 100K 1 9 12 0 0 0 12 1M 0 800K 1 0 0 0 0 50 240K 0 190K 2 0 260 0 0 180K 60 0 50 0 NUPI NUSI NUSI NUSI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-37 Exercise 4 – Eliminating Index Candidates (cont.) In this exercise, you will look at three additional demographics to eliminate potential index candidates and to possibly choose Value-Ordered NUSI candidates. The three additional data demographics that you will look at are: Change Rating Value Access Range Access Use the following Change Rating demographics guidelines to eliminate those candidates that do not fit the guidelines. PI candidates should have Change Ratings from 0 - 1. SI candidates should have Change Ratings from 0 - 3. Also, eliminate those NUSI candidates which have Value Access = 0 and Range Access = 0. If a Range Access is greater than 0, then consider the column as a possible Value-Ordered NUSI (VONUSI) candidate. 
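If a column survives these guidelines as a VONUSI candidate, it would eventually be implemented with the ORDER BY VALUES clause of CREATE INDEX. The sketch below assumes a hypothetical Claim table with a claim_date column (not one of the exercise tables) and is intended only to illustrate the syntax difference between the two NUSI forms.

/* Conventional (hash-ordered) NUSI */
CREATE INDEX (claim_date) ON Claim;

/* Value-ordered NUSI – the index subtable rows are sequenced by data value,
   which benefits range predicates such as BETWEEN on claim_date */
CREATE INDEX (claim_date) ORDER BY VALUES (claim_date) ON Claim;

Note that the ORDER BY VALUES column must be a single numeric or DATE column of four bytes or less, which is why dates and other compact codes are the usual VONUSI choices.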
Page 20-38 Analyze Secondary Index Criteria Exercise 4 – Eliminating Index Candidates (cont.) DEPENDENT 5,000,000 Rows A PK/FK M N O PK Q NN,ND FK SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 0 0 700K 1M 2M 4 0 1 0 0 0 0 0 50 200K 0 60K 0 PI/SI NUPI 0 0 0 0 90K 75 0 50 3 0 0 0 0 3M 2 390K 1 1 UPI 0 0 0 0 5M 1 0 1 0 UPI 0 0 0 0 2M 5 1M 1 1 NUPI USI USI NUSI P NUSI NUSI NUSI NUSI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-39 Exercise 4 – Eliminating Index Candidates (cont.) In this exercise, you will look at three additional demographics to eliminate potential index candidates and to possibly choose Value-Ordered NUSI candidates. The three additional data demographics that you will look at are: Change Rating Value Access Range Access Use the following Change Rating demographics guidelines to eliminate those candidates that do not fit the guidelines. PI candidates should have Change Ratings from 0 - 1. SI candidates should have Change Ratings from 0 - 3. Also, eliminate those NUSI candidates which have Value Access = 0 and Range Access = 0. If a Range Access is greater than 0, then consider the column as a possible Value-Ordered NUSI (VONUSI) candidate. Page 20-40 Analyze Secondary Index Criteria Exercise 4 – Eliminating Index Candidates (cont.) ASSOCIATIVE 1 300,000,000 Rows A PK/FK G R S PK FK FK,SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 260 0 0 0 100M 5 0 3 0 0 0 8M 300M 10M 50 0 30 0 0 0 0 0 15K 21K 0 19K 0 0 0 0 0 800K 400 0 350 0 PI/SI NUPI NUPI NUPI? NUPI NUSI NUSI NUSI UPI USI NUSI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-41 Exercise 4 – Eliminating Index Candidates (cont.) In this exercise, you will look at three additional demographics to eliminate potential index candidates and to possibly choose Value-Ordered NUSI candidates. The three additional data demographics that you will look at are: Change Rating Value Access Range Access Use the following Change Rating demographics guidelines to eliminate those candidates that do not fit the guidelines. PI candidates should have Change Ratings from 0 - 1. SI candidates should have Change Ratings from 0 - 3. Also, eliminate those NUSI candidates which have Value Access = 0 and Range Access = 0. If a Range Access is greater than 0, then consider the column as a possible Value-Ordered NUSI (VONUSI) candidate. Page 20-42 Analyze Secondary Index Criteria Exercise 4 – Eliminating Index Candidates (cont.) ASSOCIATIVE 2 100,000,000 Rows A M PK/FK Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating G T U 0 0 0 0 750 135K 0 100K 0 PK FK FK 0 0 7M 800M 50M 3 0 1 0 0 0 250K 20M 10M 150 0 8 0 0 0 0 0 560K 180 0 170 0 NUPI NUPI NUSI NUSI UPI PI/SI NUPI USI Collect Statistics (Y/N) Analyze Secondary Index Criteria NUSI NUSI Page 20-43 Exercise 4 – Eliminating Index Candidates (cont.) In this exercise, you will look at three additional demographics to eliminate potential index candidates and to possibly choose Value-Ordered NUSI candidates. The three additional data demographics that you will look at are: Change Rating Value Access Range Access Use the following Change Rating demographics guidelines to eliminate those candidates that do not fit the guidelines. PI candidates should have Change Ratings from 0 - 1. SI candidates should have Change Ratings from 0 - 3. 
Also, eliminate those NUSI candidates which have Value Access = 0 and Range Access = 0. If a Range Access is greater than 0, then consider the column as a possible Value-Ordered NUSI (VONUSI) candidate. Page 20-44 Analyze Secondary Index Criteria Exercise 4 – Eliminating Index Candidates (cont.) HISTORY 730,000,000 Rows A PK/FK DATE D E F 0 0 0 0 N/A N/A N/A N/A N/A 0 0 0 0 N/A N/A N/A N/A N/A 0 0 0 0 N/A N/A N/A N/A N/A PK FK SA Value Access Range Access Join Access Join Rows Distinct Values Max Rows/Value Max Rows/NULL Typical Rows/Value Change Rating 10M 0 800M 2.4B 100M 18 0 3 0 5K 20K 0 0 730 1100K 0 900K 0 PI/SI NUPI UPI USI NUSI NUSI Collect Statistics (Y/N) Analyze Secondary Index Criteria Page 20-45 Module 20: Review Questions Check your understanding of the concepts discussed in this module by completing the review questions as directed by your instructor. Page 20-46 Analyze Secondary Index Criteria Module 20: Review Questions 1. With a NUPI, a technique to avoid a duplicate row check is to ________. a. b. c. d. use set tables use the NOT NULL constraint on the column create the table as a MULTISET table compare data values byte-by-byte within a Row Hash in order to ensure uniqueness 2. Which type of usage normally applies to a USI? ____ a. b. c. d. Range access NOT condition Equality value access Inequality value access 3. Which two types of usage normally apply to a composite NUSI that is hash-ordered? ____ ____ a. b. c. d. Covering index Equality value access Inequality value access Non-covering range access Analyze Secondary Index Criteria Page 20-47 Notes Page 20-48 Analyze Secondary Index Criteria Module 21 Access Considerations and Constraints After completing this module, you will be able to: Analyze Optimizer Access scenarios. Explain partial value searches and data conversions. Identify the effects of conflicting data types. Determine the cost of I/Os. Identify column level attributes and constraints. Identify table level attributes and constraints. Add, modify and drop constraints from tables. Explain how the Identity column allocates new numbers. Teradata Proprietary and Confidential Access Considerations and Constraints Page 21-1 Notes Page 21-2 Access Considerations and Constraints Table of Contents Access Method Comparison ...................................................................................................... 21-4 Unique Primary Index (UPI) .................................................................................................. 21-4 Non-Unique Primary Index (NUPI) ....................................................................................... 21-4 Unique Secondary Index (USI) .............................................................................................. 21-4 Non-Unique Secondary Index (NUSI) ................................................................................... 21-4 Full-Table Scan (FTS)............................................................................................................ 21-4 Optimizer Access Scenarios ....................................................................................................... 21-6 Data Conversions ....................................................................................................................... 21-8 Storing Numeric Data .............................................................................................................. 
21-10 Data Conversion Example........................................................................................................ 21-12 Matching Data Types ............................................................................................................... 21-14 Counting I/O Operations .......................................................................................................... 21-16 Additional I/O ...................................................................................................................... 21-16 Transient Journal I/O ............................................................................................................... 21-18 INSERT and DELETE Operations .......................................................................................... 21-20 UPDATE Operations ............................................................................................................... 21-22 Primary Index Value UPDATE ............................................................................................... 21-24 Table Level Attributes ............................................................................................................. 21-26 Example of Column and Table Level Constraints ................................................................... 21-28 Table Level Constraints ....................................................................................................... 21-28 Example (13.0) – SHOW Department Table ........................................................................... 21-30 Example (13.10) – SHOW Department Table ......................................................................... 21-32 Altering Table Constraints ....................................................................................................... 21-34 Identity Column – Overview.................................................................................................... 21-36 Business Value ..................................................................................................................... 21-36 Business Usage .................................................................................................................... 21-36 Identity Column – Implementation .......................................................................................... 21-38 Performance ......................................................................................................................... 21-38 Process for Generating Identity Column Numbers .............................................................. 21-38 Identity Column – Example 1 .................................................................................................. 21-40 Identity Column – Example 2 .................................................................................................. 21-42 Identity Column – Considerations ........................................................................................... 21-44 Limited to DECIMAL(18,0) ................................................................................................ 21-44 Restrictions........................................................................................................................... 21-44 Module 21: Review Questions ................................................................................................. 
21-46 Access Considerations and Constraints Page 21-3 Access Method Comparison We have seen in preceding modules that Teradata can access data (through indexes or Partition, or Full Table Scans). The facing page illustrates these various access methods in order of number of AMPs affected. Unique Primary Index (UPI) The UPI is the most efficient way to access data. Accessing data through a UPI is a oneAMP operation that leads directly to the single row with the desired UPI value. The system does not have to create a Spool file during a UPI access. Non-Unique Primary Index (NUPI) Accessing data through a NUPI is a one-AMP operation that may lead to multiple rows with the desired NUPI value. The system creates a spool file during a NUPI access, if needed. NUPI access is efficient if the number of physical block reads is small. Unique Secondary Index (USI) A USI is a very efficient way to access data. Data access through a USI is usually a twoAMP operation, which leads directly to the single row with the desired USI value. The system does not have to create a spool file during a USI access. There are cases where a USI is actually more efficient than a NUPI. In these cases, the optimizer decides on a case-by-case basis which method is more efficient. Remember: the optimizer can only make informed decisions if it is provided with statistics. Non-Unique Secondary Index (NUSI) As we have seen, the non-unique secondary index (NUSI) is efficient only if the number of rows accessed is a small percentage of the total data rows in the table. NUSI access is an all-AMPs operation since the NUSI subtables must be scanned on each AMP. It is a multiple rows operation since there can be many rows per NUSI value. A spool file will be created if needed. Full-Table Scan (FTS) The Full-Table Scan is efficient in that each row is scanned only once. Although index access is generally preferred to a FTS, there are cases where they are the best way to access the data. Like the situation with NUPIs and USIs, Full Table Scans can sometimes be more efficient than a NUSI. The optimizer decides on a case-by-case basis which is more efficient (assuming that it has been provided with statistics). The Optimizer chooses what it thinks is the fastest access method. COLLECT STATISTICS to help the Optimizer make good decisions. Page 21-4 Access Considerations and Constraints Access Method Comparison Unique Primary Index Non-Unique Secondary Index • Very efficient • One AMP, one row • No spool file • Efficient only if the number of rows accessed • • Non-Unique Primary Index • Efficient if the number of rows per value • • is reasonable and there are no severe spikes. One AMP, multiple rows Spool file if needed No Primary Index • Access is a full table scan without secondary indexes. is a small percentage of the total data rows in the table. All AMPs, multiple rows Spool file if needed Partition Scan • Efficient since because of partition elimination. • All AMPs; all rows in specific partitions Full-Table Scan • Efficient since each row is touched only once. • All AMPs, all rows • Spool file may equal the table in size Unique Secondary Index • Very efficient • Two AMPs, one row • No spool file Access Considerations and Constraints The Optimizer chooses the fastest access method. COLLECT STATISTICS to help the Optimizer make good decisions. Page 21-5 Optimizer Access Scenarios Given the SQL WHERE clause on the facing page, the Optimizer decides which column it will use to access the data. 
This decision is based upon what indexes have been defined on the two columns (Col_1 and Col_2). When you examine the table, you can see that the Optimizer chooses the most efficient access method depending on the situation. Interesting cases to note are as follows:

If Col_1 is a NUPI and Col_2 is a USI, the Optimizer chooses Col_1 (NUPI) if its selectivity is close to a UPI (nearly unique). Otherwise, it accesses via Col_2 (USI) since only one row is involved, even though it is a two-AMP operation.

If both columns are NUSIs, the Optimizer must determine how selective each of them is. Depending on the relative selectivity, the Optimizer may choose to access via Col_1, Col_2, NUSI Bit Mapping, or a FTS.

If one of the columns is a NUSI and the other column is not indexed, the Optimizer determines the selectivity of the NUSI. Depending on this selectivity, it chooses either to utilize the NUSI or to do a FTS.

Whenever one of the columns is used to access the data, the remaining condition is used as a row qualifier. This is known as a residual condition.

Page 21-6 Access Considerations and Constraints

Optimizer Access Scenarios

SINGLE TABLE CASE
WHERE  Table_1.Col_1 = :value_1
AND    Table_1.Col_2 = :value_2 ;

Column the Optimizer uses for access:

                       Col_2: USI          Col_2: NUSI                  Col_2: NOT INDEXED
Col_1: UPI             UPI                 UPI                          UPI
Col_1: NUPI            NUPI or USI (1)     NUPI                         NUPI
Col_1: USI             USI                 USI                          USI
Col_1: NUSI            USI                 Either, Both, or FTS (2)     NUSI or FTS (3)
Col_1: NOT INDEXED     USI                 NUSI or FTS (3)              FTS

Notes:
1. The Optimizer prefers Primary Indexes over Secondary Indexes. It chooses the NUPI if only one I/O (block) is accessed. The Optimizer prefers unique indexes over non-unique indexes. Only one row is involved with a USI even though it is a two-AMP operation.
2. Depending on relative selectivity, the Optimizer may use either NUSI, may use both with NUSI Bit Mapping, or may do a FTS.
3. It depends on the selectivity of the index.

Access Considerations and Constraints Page 21-7

Data Conversions

Operands in an SQL statement must be of the same data type to be compared. If operands differ, internal data conversion is performed. Data conversion is expensive in terms of system overhead and adversely affects performance. The physical designer should make every effort to minimize the need for data conversion. The best way to do this is to implement data types at the Domain level, which should eliminate comparisons across data types. If data values come from the same Domain, they must be of the same data type and therefore can be compared without conversion.

Columns used in addition, subtraction, comparison, and join operations should always be from the same domain. Multiplication and division operations involve columns from two or three domains.

In the Teradata Database, the Byte data types can only be compared to a column with the Byte data type or a character string of XB'_ _ _ _...' For example, the system converts a CHARACTER value to a DATE value using the DATE conversion. On the other hand, converting from BYTE to NUMERIC is not possible (indicated by "ERROR").

Page 21-8 Access Considerations and Constraints

Data Conversions
• Columns (or values) must be of the same data type to be compared without conversion.
• Character data is compared using the host’s collating sequence.
  – Unequal-length character strings are converted by right-padding the shorter one with blanks.
• If column (or value) types differ, internal conversion is performed.
  – Numeric values are converted to the same underlying representation.
  – Character to numeric comparison requires the character value to be converted to a numeric value.
• Data conversion is expensive and generally unnecessary.
• Implement data types at the Domain level.
  – Comparison across data types may indicate that Domain definitions are not clearly understood.

Access Considerations and Constraints Page 21-9

Storing Numeric Data

You should always store numeric data in numeric data types. Teradata will always convert character data to numeric data prior to doing a comparison. When Teradata is asked to do a comparison, it will always apply the following rules: To compare two columns, they must be of the same data type. Character data types will always be converted to numeric.

The example on the facing page demonstrates the potential performance hit that can occur when you store numeric data as a character data type.

In Case 1 (numeric values stored as a character data type):
Statement 1 uses a character literal – Teradata will do a PI access (no data conversion required) to perform the comparison.
Statement 2 uses a numeric value – Teradata will do a Full Table Scan (FTS) against the Emp1 table, converting Emp_no to a numeric value and then doing the comparison.

In Case 2 (numeric values stored as a numeric data type):
Statement 1 uses a numeric value – Teradata will do a PI access (no data conversion required) to perform the comparison.
Statement 2 uses a character literal – Teradata will convert the character literal to a numeric value, then do a PI access to perform the comparison.

Page 21-10 Access Considerations and Constraints

Storing Numeric Data

When comparing character data to numeric, Teradata will always convert character to numeric, then do the comparison.

Comparison Rules:
To compare columns, they must be of the same data type.
Character data types will always be converted to numeric (when comparing character to numeric).

Bottom Line: Always store numeric data in numeric data types to avoid unnecessary and costly data conversions.

Case 1
CREATE TABLE Emp1
  (Emp_no     CHAR(6),
   Emp_name   CHAR(20))
UNIQUE PRIMARY INDEX (Emp_no);

Statement 1
SELECT  *
FROM    Emp1
WHERE   Emp_no = '1234';

Statement 2
SELECT  *
FROM    Emp1
WHERE   Emp_no = 1234;          (Results in Full Table Scan)

Case 2
CREATE TABLE Emp2
  (Emp_no     INTEGER,
   Emp_name   CHAR(20))
UNIQUE PRIMARY INDEX (Emp_no);

Statement 1
SELECT  *
FROM    Emp2
WHERE   Emp_no = 1234;

Statement 2
SELECT  *
FROM    Emp2
WHERE   Emp_no = '1234';        (Results in unnecessary conversion)

Access Considerations and Constraints Page 21-11

Data Conversion Example

The example on the facing page illustrates how data conversion adversely affects system performance. You can see the results of the first EXPLAIN. Note that the total estimated time to perform this SELECT is minimal. The system can process this request quickly because the data type of the literal value matches the column type. A character column value (col1) is being compared to a character literal ('8'), which allows Teradata to use the UPI defined on col1 for access and for maximum efficiency. The query executes as a UPI SELECT.

In the second SELECT statement, the character column value (col1) is compared with a numeric value (8). You should notice that the total “cost” for this SELECT is nearly 30 times the estimate for the preceding SELECT. The system must do a Full Table Scan and convert the character values in col1 to numeric to compare them against the numeric literal (8).
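The direction of the conversion can be confirmed with a stand-alone probe; the one-line comparison below is not part of the course tables and should return 'equal' because the character string is converted to a numeric value before the compare (compared as character strings, '0100' and '100' would not match).

SELECT  CASE
          WHEN '0100' = 100 THEN 'equal'      /* character operand is converted to numeric */
          ELSE 'not equal'
        END AS compare_result;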
If the column was numeric and the literal value was character, the literal would convert to numeric and the result could be hashed, allowing UPI access. Page 21-12 Access Considerations and Constraints Data Conversion Example CREATE SET TABLE TFACT01.Table1 (col1 CHAR(12) NOT NULL) UNIQUE PRIMARY INDEX (col1); EXPLAIN SELECT * FROM Table1 WHERE col1 = '8'; 1) First, we do a single-AMP RETRIEVE step from TFACT01.Table1 by way of the unique primary index "TFACT01.Table1.col1 = '8' " with no residual conditions. The estimated time for this step is 0.00 seconds. -> The row is sent directly back to the user as the result of statement 1. The total estimated time is 0.00 seconds. EXPLAIN SELECT * FROM Table1 WHERE col1 = 8; 1) First, we lock a distinct TFACT01."pseudo table" for read on a RowHash to prevent global deadlock for TFACT01.Table1. 2) Next, we lock TFACT01.Table1 for read. 3) We do an all-AMPs RETRIEVE step from TFACT01.Table1 by way of an all-rows scan with a condition of ("(TFACT01.Table1.col1 (FLOAT, FORMAT '-9.99999999999999E-999')UNICODE)= 8.00000000000000E 000") into Spool 1, which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 1,001 rows. The estimated time for this step is 0.28 seconds. 4) Finally, we send out an END TRANSACTION step to all AMPs involved in processing the request. -> The contents of Spool 1 are sent back to the user as the result of statement 1. The total estimated time is 0.28 seconds. Access Considerations and Constraints Page 21-13 Matching Data Types There are a few data types that the hashing algorithm treats identically. The best way to make sure that you don't run into this problem is to administer the data type assignments at the Domain level. Designing a system around domains helps ensure that you give matching Primary Indexes across tables the same data type. Page 21-14 Access Considerations and Constraints Matching Data Types The following data types are identical to the hashing algorithm: BYTEINT = SMALLINT = INTEGER = BIGINT = DATE = DECIMAL (x,0) CHAR = VARCHAR = LONG VARCHAR BYTE = VARBYTE GRAPHIC = VARGRAPHIC Administer data type assignments at the domain level. Give matching Primary Indexes across tables the same data type. Access Considerations and Constraints Page 21-15 Counting I/O Operations Understanding the cost of various Teradata transactions in terms of I/O will help you avoid unnecessary I/O overhead when doing your physical design. Many factors can influence the number of physical I/Os in a transaction. Some are listed on the facing page. The main concept of the next few pages is to help you understand the relative cost of doing INSERT, DELETE, and UPDATE operations. This understanding enables you to detect subsequent problems when doing performance analysis on a troublesome application. When counting I/O, it is important to remember that all such calculations give you a relative – not the absolute – cost of the transaction. Any given I/O operation may or may not cause any actual physical I/O. – Normally, when making a change to a table (INSERT, UPDATE, and DELETE), not only does the actual table have to be updated, but before-images have to be written in the Transient Journal to maintain transaction integrity. Transient Journal space is automatically allocated and is integrated with the WAL (Write-Ahead-Logic) Log, which has its own cylinders and file system. Additional I/O A table may also have Join Indexes, Hash indexes, or a Permanent Journal associated with it. 
Join Indexes can also have secondary indexes. In addition to the number of I/Os for changes to a table, these options will result in additional I/Os.

Whenever Permanent Journaling is used, additional I/O is incurred. The amount of this I/O varies according to whether you are using Before Imaging, After Imaging, or both, and whether the imaging is single or dual. The table on the facing page shows how many I/O operations are involved in writing the Permanent Journal block and the Cylinder Index. To calculate the Total Permanent Journal I/O for PJ INSERTs, DELETEs, and UPDATEs, you apply the appropriate formula shown on the facing page. Permanent Journal I/O is in addition to any I/O incurred during the operation itself. In order to calculate the TOTAL I/O for an operation, you must sum the I/Os from the operation with the Total PJ I/O corresponding to that operation.

Page 21-16 Access Considerations and Constraints

Counting I/O Operations
• Many factors influence the number of physical I/Os in a transaction:
  – Cache hits
  – Rows per block
  – Cylinder migrates
  – Mini-Cylpacks
  – Number of spool files and spool file sizes
• I/Os may be done serially or in parallel.
• Data and index block I/O may or may not require Cylinder Index I/O.
• Changes to data rows and USI rows require before-images (undo rows) and after-images (redo rows) to be written to the WAL log.
• Logical I/O counts indicate the relative cost of a transaction.
  – A given I/O operation may not cause any actual physical I/O.
• A table may also have Secondary, Join/Hash indexes, or a Permanent Journal associated with it. Join Indexes can also have secondary indexes.
  – In addition to the number of I/Os for changes to a table, these options will result in additional I/O.

Access Considerations and Constraints Page 21-17

Transient Journal I/O

The Transient Journal (TJ) exists to permit the successful rollback of a failed transaction. Transactions are not committed to the database until an End Transaction request has been received by the AMPs, either implicitly or explicitly. Until that time, there is always the possibility that the transaction may fail, in which case the participating table(s) must be restored to their pre-transaction state.

The Transient Journal maintains a copy of all before-images of all rows affected by the transaction. In the event of transaction failure, the before-images are reapplied to the affected tables, the images are deleted from the journal, and the rollback operation is completed. When the transaction completes successfully, at the point of transaction commit, the before-images for the transaction are discarded from the journal.

Normally, when making a change to a table (INSERT, UPDATE, and DELETE), not only does the actual table have to be updated, but before-images have to be written in the TJ to maintain transaction integrity.

– The preservation of the before-change row images for a transaction is the task of the Write Ahead Logic (WAL) component of the Teradata database management software. The system maintains a separate TJ (undo records) entry in the WAL log for each individual database transaction, whether it runs in ANSI or Teradata session mode.

– The WAL Log includes the following:
  – Before-image or undo records used for transaction rollback.
  – After-image or redo records for updating disk blocks and ensuring file system consistency during restarts, based on operations performed in cache during normal operation.
– – The WAL Log is conceptually similar to a table, but the log has a simpler structure than a table. Log data is a sequence of WAL records, different from normal row structure and not accessible via SQL. When are transient journal rows actually written to the WAL log? This occurs BEFORE the modification is made to the base table row. Some situations where Transient Journal is not used when updating a table include: INSERT / SELECT into an empty table DELETE FROM tablename ALL; Utilities such as FastLoad and MultiLoad When a DELETE ALL is done, the master index and the cylinder indexes are updated. An entry is actually placed in the Transient Journal indicating that a “DELETE ALL” has been issued. Before-images of the individual deleted rows are not stored in the TJ. In the event a node happens to fail in the middle of a DELETE ALL, the TJ is checked for the deferred action that indicates a DELETE ALL was issued. The system checks to ensure that the DELETE ALL has completed totally as part of the restart process. Page 21-18 Access Considerations and Constraints Transient Journal I/O The Transient Journal (TJ) is … • • • • • A journal of transaction before-images (or undo records) maintained in the WAL log. Provides for automatic rollback in the event of TXN failure. Provides “Transaction Integrity”. Is automatic and transparent. TJ images are maintained in the WAL Log. The WAL Log includes the following: – Before-images or undo records used for transaction rollback. – After-images or redo records for updating disk blocks and insuring file system consistency during restarts, based on operations performed in cache (FSG) during normal operation. Therefore, when modifying a table, there are I/O's for the data table and the WAL log (undo and redo records). Some situations where Transient Journal is not used include: • • • • INSERT / SELECT into an empty table DELETE tablename; (Deletes all the rows in a table) Utilities such as FastLoad and MultiLoad ALTER TABLE Access Considerations and Constraints Page 21-19 INSERT and DELETE Operations To calculate the number of I/Os required to INSERT a new data row or DELETE an existing row, it is necessary to do three subsidiary calculations. They are: Number of I/Os required to INSERT or DELETE the row itself = five. Number of I/Os required for each Unique Secondary Index (USI) = five. Number of I/Os required for each Non-Unique Secondary Index (NUSI) = three. The overall formula for counting I/Os for INSERT and DELETE operations is shown at the bottom of the facing page. The number of I/Os must be doubled if Fallback is used. Page 21-20 Access Considerations and Constraints INSERT and DELETE Operations INSERT INTO tablename . . . ; DELETE FROM tablename . . . 
; (* is an I/O operation) DATA ROW * * * * * READ Data Block WRITE Transient Journal record (UNDO row) to WAL Log INSERT or DELETE the data row, and WRITE REDO row (after-image) to WAL Log WRITE new Data Block WRITE Cylinder Index For each USI * * * READ USI subtable block WRITE Transient Journal record (UNDO index row) to WAL Log INSERT or DELETE the new USI subtable row, and WRITE REDO row (after-image) to WAL Log for the USI subtable row WRITE new USI subtable block WRITE Cylinder Index * * For each NUSI * * * READ NUSI subtable block ADD or DELETE the ROWID on the ROWID LIST or ADD or DELETE the NUSI subtable row WRITE new NUSI subtable block WRITE Cylinder Index I/O operations per row = 5 + [ 5 * (#USIs) ] + [ 3 * (#NUSIs) ] Double for FALLBACK Access Considerations and Constraints Page 21-21 UPDATE Operations To calculate the number of I/Os required when updating a data column, it is necessary to perform three subsidiary calculations. They are: The number of I/Os required to UPDATE the column in the data row itself = five. The number of I/Os required to change any USI subtable containing the particular column which was updated = ten (five to remove the old subtable row and five to add the new subtable row). The number of I/Os required to change the subtable of any NUSI containing the particular column which was updated = six (three to remove the old Row ID or subtable row and three to add the new Row ID or subtable row). The overall formula for counting I/Os for UPDATE operations is shown at the bottom of the facing page. REMEMBER You are simply estimating the relative cost of a transaction. Page 21-22 Access Considerations and Constraints UPDATE Operations UPDATE tablename SET colname = exp . . . (other than PI column) DATA ROW * * * * * READ Data Block WRITE Transient Journal record (UNDO row) to WAL Log UPDATE the data row, and WRITE REDO row (after-image) to WAL Log WRITE new Data Block WRITE Cylinder Index If colname = USI column * * * * * * * * * * (* = I/O Operations) READ current USI subtable block WRITE TJ record (UNDO row) into WAL Log DELETE USI subtable row, and WRITE REDO row (after-image) to WAL Log WRITE USI subtable block WRITE Cylinder Index READ new USI subtable block WRITE TJ record (UNDO row) into WAL Log INSERT new Index Subtable row, and WRITE REDO row (after-image) to WAL Log WRITE new USI subtable block WRITE Cylinder Index If colname = NUSI column * * * * * * READ current NUSI subtable block REMOVE data row's RowID from RowID list or REMOVE NUSI subtable row if last RowID WRITE NUSI subtable block WRITE Cylinder Index READ new NUSI subtable block ADD data row's RowID to RowID list or ADD new NUSI subtable row WRITE new NUSI subtable block WRITE Cylinder Index I/O operations per row = 5 + [ 10 * (#USIs) ] + [ 6 * (#NUSIs) ] Double for FALLBACK Access Considerations and Constraints Page 21-23 Primary Index Value UPDATE Updating the Primary Index Value is the most I/O intensive operation of all. This is due to the fact that any change to the PI invalidates all existing secondary index “pointers.” To calculate the number of I/Os required to UPDATE a PI column, it is necessary to perform three subsidiary calculations: The number of I/Os required to UPDATE the PI column in the data row itself. The number of I/Os required to change any USI subtable The number of I/Os required to change any NUSI subtable Study the steps on the facing page. Notice that updating a PI value is equivalent to first deleting and then inserting a row. 
All the steps necessary to do a DELETE are performed, and then all the steps necessary to do an INSERT are performed. Changing the PI value involves actually moving the row to the location determined by the new hash value. Thus, the number of steps involved in this process is exactly double the number of steps to perform either an INSERT or a DELETE. The formula for calculating the number of I/Os involved in a PI value update (shown at the bottom of the facing page) can be derived by doubling the formula for INSERTing or DELETing: Formula for PI Value Update = 10 + (5 * # USIs) + (6 * # NUSIs) Remember to double the number if Fallback is used. Note: If the USI changes, then the number of I/O’s for each changed USI is 8 in the preceding formula. Page 21-24 Access Considerations and Constraints Primary Index Value Update UPDATE tablename SET PI_column = new_value . . . ; (* = I/O Operations) Note: Assume only PI value is changed – all Secondary Index subtable rows are updated. DATA ROW ** * ** ** * ** READ current Data Block, WRITE TJ record (UNDO row) to WAL Log DELETE the Data Row, and WRITE REDO row (after-image) to WAL Log WRITE new Data Block, WRITE Cylinder Index READ new Data Block, WRITE TJ record (UNDO row) to WAL Log INSERT the DATA ROW, and WRITE REDO row (after-image) to WAL Log WRITE new Data Block, WRITE Cylinder Index For each USI * * * READ USI subtable block WRITE TJ record (UNDO row) into WAL Log UPDATE the USI subtable row with the new RowID, and WRITE REDO row (afterimage) to WAL Log WRITE new USI subtable block WRITE Cylinder index * * For each NUSI * * ** ** Read NUSI subtable block on AMP for current PI value Read NUSI subtable block on AMP for new value UPDATE the RowID list for both of the subtable blocks WRITE new NUSI subtable blocks WRITE Cylinder Indexes I/O operations per row = 10 + [ 5 * (#USIs) ] + [ 6 * (#NUSIs) ] Double for FALLBACK` Access Considerations and Constraints Page 21-25 Table Level Attributes Because ANSI permits the possibility of duplicate rows in a table, a table level attribute (SET, MULTISET) specifies whether or not to allow duplicates. Maximum data block sizes can now be specified as part of a table creation, thus allowing for smaller or larger blocks depending on the needs of the processing environment. Typically, decision support applications prefer larger block sizes while on-line transaction processing applications generally use smaller block sizes. Additionally, a parameter may be set to allow for a particular cylinder fill factor during table loading (FREESPACE). This factor may be set high for high subsequent file maintenance activity, or low for relatively static tables. The Checksum parameter (table level attribute not listed on facing page) feature improves Teradata’s ability to detect data corruption in user data at the earliest occurrence. The higher levels of checksums cause more sampling of data and more performance impact. The default system value is normally NONE which has no performance impact. The CHECKSUM is a calculated value (XOR logic) and is stored separate from the data segment. It is stored in the Cylinder Index. This option is not as necessary with latest Disk Array Controller's DAP-3 protection. When a CHECKSUM value other than NONE is used, the data rows (in blocks) are not updated in place. These “safe” writes prevent the system from not being able to recover from an interrupted write corruption error. 
Options for this parameter are: DEFAULT Calculate (or not) checksums based on system defaults as specified with the DBS Control utility and the Checksum fields. NONE Do not calculate checksums. LOW Calculate checksums by sampling a low percentage of the disk block. Default is to sample 2% of the disk block, but this value is determined by the value in the DBS Control Checksum definitions. MEDIUM Calculate checksums by sampling a medium percentage of the disk block. Default is to sample 33% of the disk block, but this value is determined by the value in the DBS Control Checksum definitions. HIGH Calculate checksums by sampling a high percentage of the disk block. Default is to sample 67% of the disk block, but this value is determined by the value in the DBS Control Checksum definitions. ALL Calculate checksums using the entire disk block (sample 100% of the disk block to generate a checksum). Page 21-26 Access Considerations and Constraints Table Level Attributes CREATE MULTISET TABLE Table_1, FALLBACK, DATABLOCKSIZE = 64 KBYTES, FREESPACE = 15, MERGEBLOCKRATIO = 60 (column1 INTEGER NOT NULL, column2 CHAR(5) NOT NULL, CONSTRAINT table_constraint CHECK (column1 > 0) ) PRIMARY INDEX (column1) INDEX (column2); SET MULTISET DATABLOCKSIZE = BYTES or KBYTES Don’t allow duplicate rows Allow duplicate rows (ANSI default) Maximum multi-row block size for table in: BYTES Rounded to nearest sector (512) KILOBYTES (or KBYTES) Increments of 1024 MINIMUM DATABLOCKSIZE MAXIMUM DATABLOCKSIZE IMMEDIATE (7168) (130,560) May be used to immediately re-block the data with ALTER. FREESPACE = integer [PERCENT] Percent of freespace to keep on cylinder during load operations (0 - 75%). DEFAULT MERGEBLOCKRATIO MERGEBLOCKRATIO = integer [PERCENT] NO MERGEBLOCKRATIO The merge block ratio to be used for this table when when Teradata combines smaller data blocks into a single larger data block (13.10). Typical system default is 60%. Access Considerations and Constraints Page 21-27 Example of Column and Table Level Constraints Constraints can be placed at the column or the table level. Constraints may be named or unnamed. PRIMARY KEY May only be defined on NOT NULL columns; guarantees uniqueness. UNIQUE May only be defined on NOT NULL columns; guarantees uniqueness. CHECK Allows range or value constraints to be placed on the column. REFERENCES Requires values to be referenced checked before being allowed. Note: Columns with a REFERENCES constraint must refer to a column that has been defined either with a PRIMARY KEY or UNIQUE constraint. With Teradata, attributes and/or constraints can be assigned at the column when the table is created (CREATE TABLE) or altered (ALTER TABLE). Some examples of attributes/constraints that can be implemented include: No Nulls – e.g., NOT NULL No duplicates – e.g., UNIQUE Data type – e.g., INTEGER Size – e.g., VARCHAR(30) Check – e.g., CHECK (col2 > 0) Default – e.g., DEFAULT CURRENT_DATE References – e.g., REFERENCES parent(col4) Table Level Constraints Constraints may also be specified at the table level. This is the only way to implement constraints that involve more than one column. Table level constraints follow all column level definitions. As previously, constraints may be either named or unnamed. Page 21-28 Access Considerations and Constraints Example of Column and Table Level Constraints There are four types of constraints. 
PRIMARY KEY UNIQUE CHECK REFERENCES No Nulls, No Duplicates No Nulls, No Duplicates Verify values or range Relates to other columns Constraints can be defined at the column or table level. Notes for the following example: • Some constraints are named, some are not. • Some constraints are at column level, some are defined at the table level. • The SHOW TABLE command will display this table differently for 13.0 and 13.10. CREATE TABLE Department ( dept_number INTEGER ,dept_name CHAR(20) ,dept_mgr_number INTEGER ,budget_amount DECIMAL (10,2) ,CONSTRAINT refer_1 ,CONSTRAINT ); dn_gt_1000 Access Considerations and Constraints NOT NULL CONSTRAINT primary_1 PRIMARY KEY NOT NULL UNIQUE COMPRESS 0 FOREIGN KEY (dept_mgr_number) REFERENCES Employee (employee_number) CHECK (dept_number > 1000) Page 21-29 Example (13.0) – SHOW Department Table The SHOW TABLE command shows a definition that is slightly altered from the original script. Note: The PRIMARY KEY is implemented as a unique primary index. The UNIQUE constraint is implemented as a unique secondary index. The REFERENCES constraint is implemented as a FOREIGN KEY at the table level. The CHECK constraint is implemented at the table level. Additional notes: Since this table was created in Teradata mode, the following also applies: The table is created as a SET table. The character field is implemented with a NOTCASESPECIFIC attribute. It is advisable to keep original scripts for documentation, as the original coding will otherwise be lost. Page 21-30 Access Considerations and Constraints Example (13.0) – SHOW Department Table This is an example of SHOW TABLE with Teradata 13.0. SHOW TABLE Department; CREATE SET TABLE PD.Department, FALLBACK, NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT ( dept_number INTEGER NOT NULL, dept_name CHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL, dept_mgr_number INTEGER, budget_amount DECIMAL(10,2) COMPRESS 0, CONSTRAINT dn_gt_1000 CHECK ( dept_number > 1000 ), CONSTRAINT refer_1 FOREIGN KEY ( dept_mgr_number ) REFERENCES PD.EMPLOYEE ( EMPLOYEE_NUMBER )) UNIQUE PRIMARY INDEX primary_1 ( dept_number ) UNIQUE INDEX ( dept_name ); Notes: • In Teradata 13.0, the SHOW TABLE command does not show the Primary Key and Unique constraints. • Since Primary Key and Unique constraints are implemented as unique indexes, the Show Table command shows these constraints as indexes. • All constraints are specified at table level with SHOW TABLE. Access Considerations and Constraints Page 21-31 Example (13.10) – SHOW Department Table An example of the same SHOW TABLE with Teradata 13.10 follows: SHOW TABLE Department; CREATE SET TABLE PD.Department , FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT, DEFAULT MERGEBLOCKRATIO ( dept_number INTEGER NOT NULL, dept_name CHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL, dept_mgr_number INTEGER, budget_amount DECIMAL(10,2) COMPRESS 0, CONSTRAINT dn_1000_plus CHECK ( dept_number > 999 ), CONSTRAINT primary_1 PRIMARY KEY ( dept_number ), UNIQUE ( dept_name ), CONSTRAINT refer_1 FOREIGN KEY ( dept_mgr_number ) REFERENCES PD.EMPLOYEE ( EMPLOYEE_NUMBER )) ; The SHOW TABLE command again shows a definition that is slightly altered from the original script; however the Teradata 13.10 version shows PRIMARY KEY and UNIQUE constraints as originally specified. Note: The PRIMARY KEY is implemented as a unique primary index. The UNIQUE constraint is implemented as a unique secondary index. The REFERENCES constraint is implemented as a FOREIGN KEY at the table level. 
The CHECK constraint is implemented at the table level. Additional notes: Since this table was created in Teradata mode, the following also applies: The table is created as a SET table. The character field is implemented with a NOTCASESPECIFIC attribute. As before, it is advisable to keep the original scripts for documentation, as the original coding will otherwise be lost. Page 21-32 Access Considerations and Constraints Example (13.10) – SHOW Department Table This is an example of SHOW TABLE with Teradata 13.10. SHOW TABLE Department; CREATE SET TABLE PD.Department, FALLBACK, NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT, DEFAULT MERGEBLOCKRATIO ( dept_number INTEGER NOT NULL, dept_name CHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL, dept_mgr_number INTEGER, budget_amount DECIMAL(10,2) COMPRESS 0, CONSTRAINT dn_gt_1000 CHECK ( dept_number > 1000 ), CONSTRAINT primary_1 PRIMARY KEY ( dept_number ), UNIQUE ( dept_name ), CONSTRAINT refer_1 FOREIGN KEY ( dept_mgr_number ) REFERENCES PD.EMPLOYEE ( EMPLOYEE_NUMBER )) ; Notes: • In Teradata 13.10, the SHOW TABLE command does show the Primary Key and Unique constraints. • As always, Primary Key and Unique constraints are implemented as unique indexes. • All constraints are specified at table level with SHOW TABLE. Access Considerations and Constraints Page 21-33 Altering Table Constraints Once a table has been created, constraints may be added, dropped and in some cases, modified. The ALTER TABLE command can also be used to add new columns (up to 2048) to an existing table. UNIQUE Constraints Uniqueness constraints may also be added or dropped as needed. They may apply to one or more columns. Columns must be defined as NOT NULL before a uniqueness constraint may be applied to them. Uniqueness constraints are physically implemented by Teradata as unique indexes, either primary or secondary. If the specified columns do not contain data that is unique, the constraint will be rejected and an error will be returned. Unique constraints may be dropped either by referencing their name, or by dropping the index on the specified columns. PRIMARY KEY Constraints Adding a primary key constraint to a table via ALTER TABLE will always result in the primary key being implemented as a unique secondary index (USI). This can only be done if there has not already been a primary key defined on the table. Dropping a primary key constraint may be done either by dropping the named constraint or by dropping the associated index. It is not possible to drop a primary key constraint that is implemented as a primary index. FOREIGN KEY Constraints Foreign key constraints ma