Big Data Protector Guide 6.6.5
Protegrity Big Data Protector Guide, Release 6.6.5

Copyright

Copyright © 2004-2017 Protegrity Corporation. All rights reserved. Protegrity products are protected by and subject to patent protections. Patents: http://www.protegrity.com/patents. The Protegrity logo is a trademark of Protegrity Corporation.

NOTICE TO ALL PERSONS RECEIVING THIS DOCUMENT

Some of the product names mentioned herein are used for identification purposes only and may be trademarks and/or registered trademarks of their respective owners.

Windows, MS-SQL Server, Internet Explorer and the Internet Explorer logo, Active Directory, and Hyper-V are registered trademarks of Microsoft Corporation in the United States and/or other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. SCO and SCO UnixWare are registered trademarks of The SCO Group. Sun, Oracle, Java, and Solaris, and their logos, are trademarks or registered trademarks of Oracle Corporation and/or its affiliates in the United States and other countries. Teradata and the Teradata logo are trademarks or registered trademarks of Teradata Corporation or its affiliates in the United States and other countries.

Hadoop or Apache Hadoop, the Hadoop elephant logo, HDFS, Hive, Pig, HBase, and Spark are trademarks of the Apache Software Foundation. Cloudera, Impala, and the Cloudera logo are trademarks of Cloudera and its suppliers or licensors. Hortonworks and the Hortonworks logo are trademarks of Hortonworks, Inc. in the United States and other countries. Greenplum is a registered trademark of EMC Corporation in the U.S. and other countries. Pivotal HD and HAWQ are registered trademarks of Pivotal, Inc. in the U.S. and other countries. The MapR logo is a registered trademark of MapR Technologies, Inc. PostgreSQL or Postgres is the copyright of The PostgreSQL Global Development Group and The Regents of the University of California.

IBM and the IBM logo, z/OS, AIX, DB2, Netezza, and BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Utimaco Safeware AG is a member of the Sophos Group. Jaspersoft, the Jaspersoft logo, and JasperServer products are trademarks and/or registered trademarks of Jaspersoft Corporation in the United States and in jurisdictions throughout the world. Xen, XenServer, and XenSource are trademarks or registered trademarks of Citrix Systems, Inc. and/or one or more of its subsidiaries, and may be registered in the United States Patent and Trademark Office and in other countries. VMware, the VMware "boxes" logo and design, Virtual SMP, and VMotion are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions.

HP is a registered trademark of the Hewlett-Packard Company. Dell is a registered trademark of Dell Inc. Novell is a registered trademark of Novell, Inc. in the United States and other countries. POSIX is a registered trademark of the Institute of Electrical and Electronics Engineers, Inc. Mozilla and Firefox are registered trademarks of the Mozilla Foundation. Chrome is a registered trademark of Google Inc.
Contents

Copyright
1 Introduction to this Guide
   1.1 Sections contained in this Guide
   1.2 Protegrity Documentation Suite
   1.5 Glossary
2 Overview of the Big Data Protector
   2.1 Components of Hadoop
   2.2 Features of Protegrity Big Data Protector
   2.3 Using Protegrity Data Security Platform with Hadoop
   2.4 Overview of Hadoop Application Protection
   2.5 HDFS File Protection (HDFSFP)
   2.6 Ingesting Data Securely
   2.7 Data Security Policy and Protection Methods
3 Installing and Uninstalling Big Data Protector
   3.1 Installing Big Data Protector on a Cluster
   3.2 Installing or Uninstalling Big Data Protector on Specific Nodes
   3.3 Utilities
   3.4 Uninstalling Big Data Protector from a Cluster
4 Hadoop Application Protector
   4.1 Using the Hadoop Application Protector
   4.2 Prerequisites
   4.3 Samples
   4.4 MapReduce APIs
   4.5 Hive UDFs
   4.6 Pig UDFs
5 HDFS File Protector (HDFSFP)
   5.1 Overview of HDFSFP
   5.2 Features of HDFSFP
   5.3 Protector Usage
   5.4 File Recover Utility
   5.5 HDFSFP Commands
   5.6 Ingesting Files Securely
   5.7 Extracting Files Securely
   5.8 HDFSFP Java API
   5.9 Developing Applications using HDFSFP Java API
   5.10 Quick Reference Tasks
   5.11 Sample Demo Use Case
   5.12 Appliance components of HDFSFP
   5.13 Access Control Rules for Files and Folders
   5.14 Using the DFS Cluster Management Utility (dfsdatastore)
   5.15 Using the ACL Management Utility (dfsadmin)
   5.16 HDFS Codec for Encryption and Decryption
6 HBase
   6.1 Overview of the HBase Protector
   6.2 HBase Protector Usage
   6.3 Adding Data Elements and Column Qualifier Mappings to a New Table
   6.4 Adding Data Elements and Column Qualifier Mappings to an Existing Table
   6.5 Inserting Protected Data into a Protected Table
   6.6 Retrieving Protected Data from a Table
   6.7 Protecting Existing Data
   6.8 HBase Commands
   6.9 Ingesting Files Securely
   6.10 Extracting Files Securely
   6.11 Sample Use Cases
7 Impala
   7.1 Overview of the Impala Protector
   7.2 Impala Protector Usage
   7.3 Impala UDFs
   7.4 Inserting Data from a File into a Table
   7.5 Protecting Existing Data
   7.6 Unprotecting Protected Data
   7.7 Retrieving Data from a Table
   7.8 Sample Use Cases
8 HAWQ
   8.1 Overview of the HAWQ Protector
   8.2 HAWQ Protector Usage
   8.3 HAWQ UDFs
   8.4 Inserting Data from a File into a Table
   8.5 Protecting Existing Data
   8.6 Unprotecting Protected Data
   8.7 Retrieving Data from a Table
   8.8 Sample Use Cases
9 Spark
   9.1 Overview of the Spark Protector
   9.2 Spark Protector Usage
   9.3 Spark APIs
   9.4 Displaying the Cleartext Data from a File
   9.5 Protecting Existing Data
   9.6 Unprotecting Protected Data
   9.7 Retrieving the Unprotected Data from a File
   9.8 Spark APIs and Supported Protection Methods
   9.9 Sample Use Cases
   9.10 Spark SQL
   9.11 Spark Scala
10 Data Node and Name Node Security with File Protector
   10.1 Features of the Protegrity File Protector
11 Appendix: Return Codes
12 Appendix: Samples
   12.1 Roles in the Samples
   12.2 Data Elements in the Security Policy
   12.3 Role-based Permissions for Data Elements in the Sample
   12.4 Data Used by the Samples
   12.5 Protecting Data using MapReduce
   12.6 Protecting Data using Hive
   12.7 Protecting Data using Pig
   12.8 Protecting Data using HBase
   12.9 Protecting Data using Impala
   12.10 Protecting Data using HAWQ
   12.11 Protecting Data using Spark
13 Appendix: HDFSFP Demo
   13.1 Roles in the Demo
   13.2 HDFS Directories used in Demo
   13.3 User Permissions for HDFS Directories
   13.4 Prerequisites for the Demo
   13.5 Running the Demo
14 Appendix: Using Hive with HDFSFP
   14.1 Data Used by the Samples
   14.2 Ingesting Data to Hive Table
   14.3 Tokenization and Detokenization with HDFSFP
15 Appendix: Configuring Talend with HDFSFP
   15.1 Verifying Prerequisites before Configuring Talend with HDFSFP
   15.2 Verifying the Talend Packages
   15.3 Configuring Talend with HDFSFP
   15.4 Starting a Project in Talend
   15.5 Configuring the Preferences for Talend
   15.6 Ingesting Data in the Target HDFS Directory in Protected Form
   15.7 Accessing the Data from the Protected Directory in HDFS
   15.8 Configuring Talend Jobs to run with HDFSFP with Target Exec as Remote
   15.9 Using Talend with HDFSFP and MapReduce
16 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database
   16.1 Migrating Tokenized Unicode Data from a Teradata Database
   16.2 Migrating Tokenized Unicode Data to a Teradata Database
1 Introduction to this Guide

This guide provides information about installing, configuring, and using the Protegrity Big Data Protector (BDP) for Hadoop.

1.1 Sections contained in this Guide

The guide is broadly divided into the following sections:

• Section 1 Introduction to this Guide defines the purpose and scope of this guide. In addition, it explains how information is organized in this guide.
• Section 2 Overview of the Big Data Protector provides a general idea of Hadoop and how it has been integrated with the Big Data Protector. In addition, it describes the protection coverage of the various Hadoop ecosystem applications, such as MapReduce, Hive, and Pig, and provides information about HDFS File Protection (HDFSFP).
• Section 3 Installing and Uninstalling Big Data Protector includes information common to all distributions, such as the prerequisites for installation, the installation procedure, and uninstallation of the product from the cluster. In addition, it provides information about the tools and utilities.
• Section 4 Hadoop Application Protector provides information about the Hadoop Application Protector. In addition, it covers the MapReduce APIs and the Hive and Pig UDFs.
• Section 5 HDFS File Protector (HDFSFP) provides information about the protection of files stored in HDFS using HDFSFP and the commands supported.
• Section 6 HBase provides information about the Protegrity HBase protector.
• Section 7 Impala provides information about the Protegrity Impala protector.
• Section 8 HAWQ provides information about the Protegrity HAWQ protector.
• Section 9 Spark provides information about the Protegrity Spark protector. In addition, it provides information about Spark SQL and Spark Scala.
• Section 10 Data Node and Name Node Security with File Protector provides information about the protection of the Data and Name nodes using the File Protector.
• Section 11 Appendix: Return Codes provides information about all possible error codes and error descriptions for Big Data Protector.
• Section 12 Appendix: Samples provides information about sample data protection for MapReduce, Hive, Pig, HBase, Impala, HAWQ, and Spark using Big Data Protector.
• Section 13 Appendix: HDFSFP Demo provides information about sample data protection with HDFSFP using Big Data Protector.
• Section 14 Appendix: Using Hive with HDFSFP provides information about using Hive with HDFSFP.
• Section 15 Appendix: Configuring Talend with HDFSFP provides the procedures for configuring Talend with HDFSFP.
• Section 16 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database describes the procedures for migrating tokenized Unicode data from and to a Teradata database.
1.2 Protegrity Documentation Suite

The Protegrity Documentation Suite comprises the following documents:

• Protegrity Documentation Master Index Release 6.6.5
• Protegrity Appliances Overview Release 6.6.5
• Protegrity Enterprise Security Administrator Guide Release 6.6.5
• Protegrity File Protector Gateway Server User Guide Release 6.6.4
• Protegrity Protection Server Guide Release 6.6.5
• Protegrity Data Security Platform Feature Guide Release 6.6.5
• Protegrity Data Security Platform Licensing Guide Release 6.6
• Protegrity Data Security Platform Upgrade Guide Release 6.6.5
• Protegrity Reports Guide Release 6.6.5
• Protegrity Troubleshooting Guide Release 6.6.5
• Protegrity Application Protector Guide Release 6.5 SP2
• Protegrity Big Data Protector Guide Release 6.6.5
• Protegrity Database Protector Guide Release 6.6.5
• Protegrity File Protector Guide Release 6.6.4
• Protegrity Protection Enforcement Point Servers Installation Guide Release 6.6.5
• Protegrity Protection Methods Reference Release 6.6.5
• Protegrity Row Level Protector Guide Release 6.6.5
• Protegrity Enterprise Security Administrator Quick Start Guide Release 6.6
• Protegrity File Protector Gateway Server Quick Start Guide Release 6.6.2
• Protegrity Protection Server Quick Start Guide Release 6.6

1.5 Glossary

This section includes Protegrity-specific terms, products, and abbreviations used in this document.

• BDP – The Big Data Protector (BDP) is the API for protecting data on platforms such as Hive, Impala, and HBase.
• ESA – Enterprise Security Administrator (ESA).
• DPS roles – The DPS roles relate to the security policy in the ESA and control the access permissions to the Access Keys. For instance, if a user does not have the required DPS role, then the user does not have access to the Access Keys.
• DPS – Protegrity Data Protection System (DPS) is the entire system where security policies are defined and enforced, including the ESA and the Protectors.

2 Overview of the Big Data Protector

The Protegrity Big Data Protector for Apache Hadoop is based on the Protegrity Application Protector. Data is split and distributed across all the data nodes in the Hadoop cluster. The Big Data Protector is deployed on each of these nodes together with the PEP server, with which the protection enforcement policies are shared.

The Protegrity Big Data Protector is scalable, and new nodes can be added as required. It is cost effective, since massively parallel computing is done on commodity servers, and it is flexible, as it can work with data from any number of sources. The Big Data Protector is fault tolerant, as the system redirects work to another node if a node is lost. It can handle all types of data, structured and unstructured, irrespective of their native formats.

The Big Data Protector protects data handled by the various Hadoop applications and protects files stored in the cluster. MapReduce, Hive, Pig, HBase, and Impala can use the Protegrity protection interfaces to protect data as it is stored in or retrieved from the Hadoop cluster. All standard protection techniques offered by Protegrity are applicable to the Big Data Protector. For more information about the available protection options, such as data types, tokenization or encryption types, or length-preserving and non-length-preserving tokens, refer to the Protection Methods Reference Guide 6.6.5.
2.1 Components of Hadoop The Big Data Protector works on the Hadoop framework as shown in the following figure. BI Applications Data Access Framework HBase Hive Pig Data Storage Framework (HDFS) Other Data Processing Framework (MapReduce) Figure 2-1 Hadoop Components The illustration of Hadoop components is an example. Based on requirements, the components of Hadoop might be different. Hadoop interfaces have been used extensively to develop the Big Data Protector. It is a common deployment practice to utilize Hadoop Distributed File System (HDFS) to store the data, and let MapReduce process the data and store the result back in HDFS. Confidential 16 Big Data Protector Guide 6.6.5 2.1.1 Overview of the Big Data Protector Hadoop Distributed File System (HDFS) Hadoop Distributed File System (HDFS) spans across all nodes in a Hadoop cluster for data storage. It links together the file systems on many nodes to make them into one big file system. HDFS assumes that nodes will fail, so data is replicated across multiple nodes to achieve reliability. 2.1.2 MapReduce The MapReduce framework assigns work to every node in large clusters of commodity machines. MapReduce programs are sets of instructions to parse the data, create a map or index, and aggregate the results. Since data is distributed across multiple nodes, MapReduce programs run in parallel, working on smaller sets of data. A MapReduce job is executed by splitting each job into small Map tasks, and these tasks are executed on the node where a portion of the data is stored. If a node containing the required data is saturated and not able to execute a task, then MapReduce shifts the task to the least busy node by replicating the data to that node. A Reduce task combines results from multiple Map tasks, and store all of them back to the HDFS. 2.1.3 Hive The Hive framework resides above Hadoop to enable ad hoc queries on the data in Hadoop. Hive supports HiveQL, which is similar to SQL. Hive translates a HiveQL query into a MapReduce program and then sends it to the Hadoop cluster. 2.1.4 Pig Pig is a high-level platform for creating MapReduce programs used with Hadoop. 2.1.5 HBase HBase is a column-oriented datastore, meaning it stores data by columns rather than by rows. This makes certain data access patterns much less expensive than with traditional row-oriented relational database systems. The data in HBase is protected transparently using Protegrity HBase coprocessors. 2.1.6 Impala Impala is an MPP SQL query engine for querying the data stored in a cluster. It provides the flexibility of the SQL format and is capable of running the queries on HDFS in HBase. The Impala daemon runs on each node in the cluster, reading and writing to data in the files, and accepts queries from the Impala shell command. The following are the core components of Impala: • • • Impala daemon (impalad) – This component is the Impala daemon which runs on each node in the cluster. It reads and writes the data in the files and accepts queries from the Impala shell command. Impala Statestore (statestored) – This component checks the health of the Impala daemons on all the nodes contained in the cluster. If a node is unavailable due to any error or failure, then the Impala statestore component informs all other nodes about the failed node to ensure that new queries are not sent to the failed node. 
Impala Catalog (catalogd) – This component is responsible for communicating any changes in the metadata received from the Impala SQL statements to all the nodes in the cluster. Confidential 17 Big Data Protector Guide 6.6.5 2.1.7 Overview of the Big Data Protector HAWQ HAWQ is an MPP database, which uses several Postgres database instances and HDFS storage. The database is distributed across HAWQ segments, which enable it to achieve data and processing parallelism. Since HAWQ uses the Postgres engine for processing queries, the query language is similar to PostgresSQL. Users connect to the HAWQ Master and interact using SQL statements, similar to the Postgres database. The following are the core components of HAWQ: HAWQ Master Server: Enables users to interact with HAWQ using client programs, such as PSQL or APIs, such as JDBC or ODBC Name Node: Enables client applications to locate a file HAWQ Segments: Are the units which process the individual data modules simultaneously HAWQ Storage: Is HDFS, which stores all the table data Interconnect Switch: Is the networking layer of HAWQ, which handles the communication between the segments • • • • • 2.1.8 Spark Spark is an execution engine that carries out batch processing of jobs in-memory and handles a wider range of computational workloads. In addition to processing a batch of stored data, Spark is capable of manipulating data in real time. Spark leverages the physical memory of the Hadoop system and utilizes Resilient Distributed Datasets (RDDs) to store the data in-memory and lowers latency, if the data fits in the memory size. The data is saved on the hard drive only if required. 2.2 Features of Protegrity Big Data Protector The Protegrity Big Data Protector (Big Data Protector) uses patent-pending vaultless tokenization and central policy control for access management and secures sensitive data at rest in the following areas: • Data in HDFS • Data used during MapReduce, Hive and Pig processing, and with HBase, Impala, HAWQ, and Spark • Data traversing enterprise data systems The data is protected from internal and external threats, and users and business processes can continue to utilize the secured data. Data protection may be by encryption or tokenization. In tokenization, data is converted to similar looking inert data known as tokens where the data format and type can be preserved. These tokens can be detokenized back to the original values when it is required. Protegrity secures files with volume encryption and also protects data inside files using tokenization and strong encryption protection methods. Depending on the user access rights and the policies set using Policy management in ESA, this data is unprotected. The Protegrity Hadoop Big Data Protector provides the following features: • Provides fine grained field-level protection within the MapReduce, Hive, Pig, HBase, and Spark frameworks. Confidential 18 Big Data Protector Guide 6.6.5 Overview of the Big Data Protector • Provides directory and file level protection (encryption). • Retains distributed processing capability as field-level protection is applied to the data. • Protects data in the Hadoop cluster using role-based administration with a centralized security policy. • Provides logging and viewing data access activities and real-time alerts with a centralized monitoring system. • Ensures minimal overhead for processing secured data, with minimal consumption of resources, threads and processes, and network bandwidth. 
• Performs file and volume encryption including the protection of files on the local file system of Hadoop nodes. • Provides transparent data protection and row level filtering based on the user profile with Protegrity HBase protectors. • Transparently protects files processed by MapReduce and Hive in HDFS using HDFSFP. The following figure illustrates the various components in an Enterprise Hadoop ecosystem. Figure 2-2 Enterprise Hadoop Components Currently, Protegrity supports MapReduce, Hive, Pig, and HBase which utilize HDFS as the data storage layer. The following points can be referred to as general guidelines: • Sqoop: Sqoop can be used for ingestion into HDFSFP protected zone (For Hortonworks, Cloudera and Pivotal HD). • Beeline, Beeswax, and Hue on Cloudera: Beeline, Beeswax, and Hue are certified with Hive protector and Hive with HDFSFP integrations. • Beeline, Beeswax, and Hue on Hortonworks & Pivotal HD: Beeline, Beeswax, and Hue are certified with Hive protector and Hive with HDFSFP integrations. • Ranger (Hortonworks): Ranger is certified to work with the Hive protector and Hive with HDFSFP integrations only. • Sentry (Cloudera): Sentry is certified with Hive protector, Hive with HDFSFP integrations, and Impala protector only. • MapReduce and HDFSFP integration is certified with TEXTFILE format only. Confidential 19 Big Data Protector Guide 6.6.5 • Overview of the Big Data Protector Hive and HDFSFP integration is certified with TEXTFILE, RCFile, and SEQUENCEFILE formats only. • Pig and HDFSFP integration is certified with TEXTFILE format only. We neither support nor have certified other components in the Hadoop stack. We strongly recommend consulting Protegrity, before using any unsupported components from the Hadoop ecosystem with our products. 2.3 Using Protegrity Data Security Platform with Hadoop To protect data, the components of the Protegrity Data Security Platform are integrated into the Hadoop cluster as shown in the following figure. Figure 2-3 Protegrity Data Security Platform with Hadoop The Enterprise Security Administrator (ESA) is a soft appliance that needs to be pre-installed on a separate server, which is used to create and manage policies. The following figure illustrates the inbound and outbound ports that need to be allowed on the network for communication between the ESA and the Big Data Protector nodes in a Hadoop cluster. Figure 2-4 Inbound and Outbound Ports between the ESA and Big Data Protector Nodes Confidential 20 Big Data Protector Guide 6.6.5 Overview of the Big Data Protector For more information about installing the ESA, and creating and managing policies, refer Protegrity Enterprise Security Administrator Guide Release 6.6.5. To achieve a parallel nature for the system, a PEP Server is installed on every data node. It is synchronized with the connection properties of ESA. Each task runs on a node under the same Hadoop user. Every user has a policy deployed for running their jobs on this system. Hadoop manages the accounts and users. You can get the Hadoop user information from the actual job configuration. HDFS implements a permission model for files and directories, based on the Portable Operating System Interface (POSIX) for Unix model. Each file and directory is associated with an owner and a group. 
Depending on the permissions granted, users for the file and directory can be classified into one of these three groups: • • • 2.4 Owner Other users of the group All other users Overview of Hadoop Application Protection This section describes the various levels of protection provided by Hadoop Application Protection. 2.4.1 Protection in MapReduce Jobs A MapReduce job in the Hadoop cluster involves sensitive data. You can use Protegrity interfaces to protect data when it is saved or retrieved from a protected source. The output data written by the job can be encrypted or tokenized. The protected data can be subsequently used by other jobs in the cluster in a secured manner. Field level data can be secured and ingested into HDFS by independent Hadoop jobs or other ETL tools. For more information about secure ingestion of data in Hadoop, refer to section 2.6.2 Ingesting Files Using Hive Staging. For more information on the list of available APIs, refer to section 4.4 MapReduce APIs. If Hive queries are created to operate on sensitive data, then you can use Protegrity Hive UDFs for securing data. While inserting data to Hive tables, or retrieving data from protected Hive table columns, you can call Protegrity UDFs loaded into Hive during installation. The UDFs protect data based on the input parameters provided. Secure ingestion of data into HDFS to operate Hive queries can be achieved by independent Hadoop jobs or other ETL tools. For more information about securely ingesting data in Hadoop, refer to section 2.6 Ingesting Data Securely. 2.4.2 Protection in Hive Queries Protection in Hive queries is done by Protegrity Hive UDFs, which translates a HiveQL query into a MapReduce program and then sends it to the Hadoop cluster. For more information on the list of available UDFs, refer to section 4.5 Hive UDFs. Confidential 21 Big Data Protector Guide 6.6.5 2.4.3 Overview of the Big Data Protector Protection in Pig Jobs Protection in Pig jobs is done by Protegrity Pig UDFs, which are similar in function to the Protegrity UDFs in Hive. For more information on the list of available UDFs, refer to section 4.6 Pig UDFs. 2.4.4 Protection in HBase HBase is a database which provides random read and write access to tables, consisting of rows and columns, in real-time. HBase is designed to run on commodity servers, to automatically scale as more servers are added, and is fault tolerant as data is divided across servers in the cluster. HBase tables are partitioned into multiple regions. Each region stores a range of rows in the table. Regions contain a datastore in memory and a persistent datastore(HFile). The Name node assigns multiple regions to a region server. The Name node manages the cluster and the region servers store portions of the HBase tables and perform the work on the data. The Protegrity HBase protector extends the functionality of the data storage framework and provides transparent data protection and unprotection using coprocessors, which provide the functionality to run code directly on region servers. The Protegrity coprocessor for HBase runs on the region servers and protects the data stored in the servers. All clients which work with HBase are supported. The data is transparently protected or unprotected, as required, utilizing the coprocessor framework. For more information about HBase, refer to section 6 HBase. 2.4.5 Protection in Impala Impala is an MPP SQL query engine for querying the data stored in a cluster. 
It provides the flexibility of the SQL format and is capable of running the queries on HDFS in HBase. The Protegrity Impala protector extends the functionality of the Impala query engine and provides UDFs which protect or unprotect the data as it is stored or retrieved. For more information about the Impala protector, refer to section 7 Impala. 2.4.6 Protection in HAWQ HAWQ is an MPP database, which enable it to achieve data and processing parallelism. The Protegrity HAWQ protector provides UDFs for protecting data using encryption or tokenization, and unprotecting data by using decryption or detokenization. For more information about the HAWQ protector, refer to section 8 HAWQ. 2.4.7 Protection in Spark Spark is an execution engine that carries out batch processing of jobs in-memory and handles a wider range of computational workloads. In addition to processing a batch of stored data, Spark is capable of manipulating data in real time. The Protegrity Spark protector extends the functionality of the Spark engine and provides APIs that protect or unprotect the data as it is stored or retrieved. For more information about the Spark protector, refer to section 9 Spark. Confidential 22 Big Data Protector Guide 6.6.5 2.5 Overview of the Big Data Protector HDFS File Protection (HDFSFP) Files are stored and retrieved by Hadoop system elements, such as file shell commands, MapReduce, Hive, Pig, HBase and so on. The stored files reside in HDFS and span multiple cluster nodes. Most of the files in HDFS are plain text files and stored in the clear, with access control like a POSIX file system. These files contain sensitive data, making it vulnerable with exposure to unwanted users. These files are transparently protected as they are stored in HDFS. In addition, the content is exposed only to authorized users. The content in the files is unprotected transparently to processes or users, authorized to view and process the files. The user is automatically detected from the job information provided by HDFSFP. The job accessing secured files must be initialized by an authorized user having the required privileges in ACL. The files encrypted by HDFSFP are suitable for distributed processing by Hadoop distributed jobs like MapReduce. HDFSFP protects individual files or files stored in a directory. The access control is governed by the security policy and ACL supplied by the security officer. The access control and security policy is controlled through ESA interfaces. Command line and UI options are available to control ACL entries for file paths and directories. 2.6 Ingesting Data Securely This section describes the ways in which data can be secured and ingested by various jobs in Hadoop at a field or file level. 2.6.1 Ingesting Data Using ETL Tools and File Protector Gateway (FPG) Protegrity provides the File Protector Gateway (FPG) for secure field level protection to ingest data through ETL tools. The ingested files data can be used by Hadoop applications for analytics and processing. The sensitive fields are secured by the FPG before Hadoop jobs operate on it. For more information about field level ingestion by custom MapReduce job for data at rest in HDFS, refer to File Protector Gateway Server Guide 6.6.4. 2.6.2 Ingesting Files Using Hive Staging Semi-structured data files can be loaded into a Hive staging table for ingestion into a Hive table with Hive queries and Protegrity UDFs. After loading data in the table, the data will be stored in protected form. 
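For illustration, such a staging flow might look like the following HiveQL sketch. The table names, column names, the data element name SSN_DE, and the UDF name pty_protect_str are placeholders only; the actual Protegrity Hive UDF names and signatures are listed in section 4.5 Hive UDFs.

-- Load the raw semi-structured file into a clear-text staging table.
CREATE TABLE customer_staging (cust_id STRING, ssn STRING, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/ingest/customers.csv' INTO TABLE customer_staging;

-- Insert into the target table, protecting the sensitive column with the UDF
-- and a policy data element.
CREATE TABLE customer (cust_id STRING, ssn STRING, name STRING);
INSERT INTO TABLE customer
  SELECT cust_id, pty_protect_str(ssn, 'SSN_DE'), name FROM customer_staging;

After the INSERT completes, only the protected (tokenized or encrypted) values of the sensitive column are stored in the target table, and the staging table can be dropped.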
2.6.3 Ingesting Files into HDFS by HDFSFP The HDFSFP component of Big Data Protector can be used for ingesting files securely in HDFS. It provides granular access control for the files in HDFS. You can ingest files using the command shell and Java API in HDFSFP. For more information about using HDFSFP, refer to section 5 HDFS File Protector (HDFSFP). 2.7 Data Security Policy and Protection Methods A data security policy establishes processes to ensure the security and confidentiality of sensitive information. In addition, the data security policy establishes administrative and technical safeguards against unauthorized access or use of the sensitive information. Depending on the requirements, the data security policy typically performs the following functions: • Classifies the data that is sensitive for the organization. Confidential 23 Big Data Protector Guide 6.6.5 Overview of the Big Data Protector Defines the methods to protect sensitive data, such as encryption and tokenization. Defines the methods to present the sensitive data, such as masking the display of sensitive information. • Defines the access privileges of the users that would be able to access the data. • Defines the time frame for privileged users to access the sensitive data. • Enforces the security policies at the location where sensitive data is stored. • Provides a means of auditing authorized and unauthorized accesses to the sensitive data. In addition, it can also provide a means of auditing operations to protect and unprotect the sensitive data. The data security policy contains a number of components, such as, data elements, datastores, member sources, masks, and roles. The following list describes the functions of each of these entities: • • Data elements define the data protection properties for protecting sensitive data, consisting of the data securing method, data element type and its description. In addition, Data elements describe the tokenization or encryption properties, which can be associated with roles. • Datastores consist of enterprise systems, which might contain the data that needs to be processed, where the policy is deployed and the data protection function is utilized. • Member sources are the external sources from which users (or members) and groups of users are accessed. Examples are a file, database, LDAP, and Active Directory. • Masks are a pattern of symbols and characters, that when imposed on a data field, obscures its actual value to the user. Masks effectively aid in hiding sensitive data. • Roles define the levels of member access that are appropriate for various types of information. Combined with a data element, roles determine and define the unique data access privileges for each member. For more information about the data security policies, protection methods, and the data elements supported by the components of the Big Data Protector, refer to Protection Methods Reference Guide 6.6.5. • Confidential 24 Big Data Protector Guide 6.6.5 3 Installing and Uninstalling Big Data Protector Installing and Uninstalling Big Data Protector This section describes the procedure to install and uninstall the Big Data Protector. 3.1 Installing Big Data Protector on a Cluster This section describes the tasks for installing Big Data Protector on a cluster. Starting from the Big Data Protector 6.6.4 release, you do not require root access to install Big Data Protector on a cluster. You need a sudoer user account to install Big Data Protector on a cluster. 
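Before working through the prerequisites below, it can be useful to confirm that the installing account actually has the required sudo access on the Lead node. The following is a minimal sketch using standard Linux commands; the account and host names are placeholders.

ssh bdpadmin@leadnode01 'sudo -l'                 # lists the commands the account may run with sudo
ssh bdpadmin@leadnode01 'ls /var/log/protegrity'  # should be empty or absent, per the prerequisites below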
3.1.1 Verifying Prerequisites for Installing Big Data Protector Ensure that the following prerequisites are met, before installing Big Data Protector: • • • • • The Hadoop cluster is installed, configured, and running. ESA appliance version 6.6.5 is installed, configured, and running. A sudoer user account with privileges to perform the following tasks: o Update the system by modifying the configuration, permissions, or ownership of directories and files. o Perform third party configuration. o Create directories and files. o Modify the permissions and ownership for the created directories and files. o Set the required permissions to the create directories and files for the Protegrity Service Account. o Permissions for using the SSH service. The sudoer password is the same across the cluster. The following user accounts to perform the required tasks: o ADMINISTRATOR_USER: It is the sudoer user account that is responsible to install and uninstall the Big Data Protector on the cluster. This user account needs to have sudo access to install the product. o EXECUTOR_USER: It is a user that has ownership of all Protegrity files, folders, and services. o OPERATOR_USER: It is responsible for performing tasks such as, starting or stopping tasks, monitoring services, updating the configuration, and maintaining the cluster while the Big Data Protector is installed on it. If you need to start, stop, or restart the Protegrity services, then you need sudoer privileges for this user to impersonate the EXECUTOR_USER. Depending on the requriements, a single user on the system may perform multiple roles. If a single user is performing multiple roles, then ensure that the following conditions are met: • • The user has the required permissions and privileges to impersonate the other user accounts, for performing their roles, and perform tasks as the impersonated user. The user is assigned the highest set of privileges, from the required roles that it needs to perform, to execute the required tasks. For instance, if a single user is performing tasks as ADMINISTRATOR_USER, EXECUTOR_USER, and Confidential 25 Big Data Protector Guide 6.6.5 • • • • • • • • Installing and Uninstalling Big Data Protector OPERATOR_USER, then ensure that the user is assigned the privileges of the ADMINISTRATOR_USER. The management scripts provided by the installer in the cluster_utils directory should be run only by the user (OPERATOR_USER) having privileges to impersonate the EXECUTOR_USER. o If the value of the AUTOCREATE_PROTEGRITY_IT_USR parameter in the BDP.config file is set to No, then ensure that a service group containing a user for running the Protegrity services on all the nodes in the cluster already exists. o If the Hadoop cluster is configured with LDAP or AD for user management, then ensure that the AUTOCREATE_PROTEGRITY_IT_USR parameter in the BDP.config file is set to No and that the required service account user is created on all the nodes in the cluster. If the Big Data Protector with versions lower than 6.6.3 was previously installed with HDFSFP, then ensure that you create the backup of DFSFP on the ESA. For more information about creating the DFSFP backup, refer to section 4.1.4 Backing Up DFSFP before Installing Big Data Protector 6.6.3 in Data Security Platform Upgrade Guide 6.6.5. 
If Big Data Protector, version 6.6.3, with build version 6.6.3.15, or lower, was previously installed and the following Spark protector APIs for Encryption/Decryption are utilized: o public void protect(String dataElement, ListerrorIndex, short[] input, byte[][] output) o public void protect(String dataElement, List errorIndex, int[] input, byte[][] output) o public void protect(String dataElement, List errorIndex, long[] input, byte[][] output) o public void unprotect(String dataElement, List errorIndex, byte[][] input, short[] output) o public void unprotect(String dataElement, List errorIndex, byte[][] input, int[] output) o public void unprotect(String dataElement, List errorIndex, byte[][] input, long[] output) For more information, refer to the Advisory for Spark Protector APIs, before installing Big Data Protector, version 6.6.5. If the Big Data Protector was previously installed then uninstall it. In addition, delete the directory from the Lead node. If the /var/log/protegrity/ directory exists on any node in the cluster, then ensure that it is empty. Password based authentication is enabled in the sshd_config file before installation. After the installation is completed, this setting might be reverted back by the system administrator. The lsb_release library is present on the client machine, at least on the Lead node. The Lead node can be any node, such as the Name node, Data node, or Edge node, that can access the Hadoop cluster. The Lead node would be driving the installation of the Big Data Protector across the Hadoop cluster and is responsible for managing the Big Data Protector services throughout the cluster. If the lsb_release library is not present, then the installation of the Big Data Protector fails. This can be verified by running the following command. lsb_release If you are configuring the Big Data Protector with a Kerberos-enabled Hadoop cluster, then ensure that the HDFS superuser (hdfs) has a valid Kerberos ticket. If you are configuring HDFSFP with Big Data Protector, then ensure that the following prerequisites are met: o Ensure that an unstructured policy is created in the ESA, containing the data elements to be linked with the ACL. Confidential 26 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector If a sticky bit is set for an HDFS directory, which is required to be protected by HDFSFP, then the user needs to remove the sticky bit before creating ACLs (for Protect/Reprotect/Unprotect/Update) for that HDFS directory. If required, then the user can set the sticky bit again after activating the ACLs. For more information about creating data elements, security policies, and user roles, refer to Enterprise Security Administrator Guide 6.6.5 and Protection Enforcement Point Servers Installation Guide 6.6.5. o 3.1.2 Extracting Files from the Installation Package To extract the files from the installation package: 1. After receiving the installation package from Protegrity, copy it to the Lead node in any temporary folder, such as /opt/bigdata. 2. 
Extract the files from the installation package using the following command: tar –xf BigDataProtector_ - -nCPU_64_6.6.5.x.tgz The following files are extracted: • • • • • • • • • • • • • • • • • • • • • • • • • • • • BDP.config BdpInstallx.x.x_Linux_ _6.6.5.x.sh FileProtector_ _x86- _AccessControl_6.6.x.x.sh FileProtector_ _x86- _ClusterDeploy_6.6.x.x.sh FileProtector_ _x86- _FileEncryption_6.6.x.x.sh FileProtector_ _x86- _PreInstallCheck_6.6.x.x.sh FileProtector_ _x86- _VolumeEncryption_6.6.x.x.sh FP_ClusterDeploy_hosts INSTALL.txt JpepLiteSetup_Linux_ _6.6.5.x.sh node_uninstall.sh PepHbaseProtectorx.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepHdfsFp_Setup_ -x.x_6.6.5.x.sh PepHivex.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepImpalax.xSetup_ _x86- _6.6.5.x.sh, only if it is a Cloudera or MapR distribution PepHawqx.xSetup_ _x86- _6.6.5.x.sh, only if it is a Pivotal distribution PepMapreducex.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepPigx.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepServer_Setup_Linux_ _6.6.5.x.sh PepSparkx.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepTalendSetup_x.x.x_6.6.5.x.sh Prepackaged_Policyx.x.x_Linux_ _6.6.5.x.sh ptyLogAnalyzer.sh ptyLog_Consolidator.sh samples-mapreduce.tar samples-spark.tar uninstall.sh XCPep2Jni_Setup_Linux_ _6.6.5.x.sh Confidential 27 Big Data Protector Guide 6.6.5 3.1.3 Installing and Uninstalling Big Data Protector Updating the BDP.config File Ensure that the BDP.config file is updated before the Big Data Protector is installed. Do not update the BDP.config file when the installation of the Big Data Protector is in progress. To update the BDP.config file: 1. Create a file containing a list of all nodes in the cluster, except the Lead node, and specify it in the BDP.config file. This file is used by the installer for installing Big Data Protector on the nodes. 2. Open the BDP.config file in any text editor and modify the following parameter values: • HADOOP_DIR – The installation home directory for the Hadoop distribution. • PROTEGRITY_DIR – The directory where the Big Data Protector will be installed. The samples and examples used in this document assume that the Big Data Protector is installed in the /opt/protegrity/ directory. • CLUSTERLIST_FILE – This file contains the host name or IP addresses all the nodes in the cluster, except the Lead node, listing one host name and IP address per line. Ensure that you specify the file name with the complete path. • INSTALL_DEMO – Specifies one of the following values, as required: o Yes – The installer installs the demo. o No – The installer does not install the demo. • HDFSFP – Specifies one of the following values, as required: o Yes – The installer installs HDFSFP. o No – The installer does not install HDFSFP. If HDFSFP is being installed, then XCPep2Jni is installed using the XCPep2Jni_Setup_Linux_ _6.6.5.x.sh script. • • • • SPARK_PROTECTOR – Specifies one of the following values, as required: Yes – The installer installs the Spark protector. This parameter also needs to be set to Yes, if the user needs to run Hive UDFs with Spark SQL, or use the Spark protector samples if the INSTALL_DEMO parameter is set to Yes. o No – The installer does not install the Spark protector. IP_NN – The IP address of the Lead node in the Hadoop cluster, which is required for the installation of HDFSFP. PROTEGRITY_CACHE_PORT – The Protegrity Cache port used in the cluster. This port should be open in the firewall across the cluster. On the Lead node, it should be open only for the corresponding ESA, which is used to manage the cluster protection. 
This is required for the installation of HDFSFP. Typical value for this port is 6379. AUTOCREATE_PROTEGRITY_IT_USR – This parameter determines the Protegrity service account. The service group and service user name specified in the PROTEGRITY_IT_USR_GROUP and PROTEGRITY_IT_USR parameters respectively will be created if this parameter is set to Yes. One of the following values can be specified, as required: o Yes – The installer creates a service group PROTEGRITY_IT_USR_GROUP containing the user PROTEGRITY_IT_USR for running the Protegrity services on all the nodes in the cluster. If the service group or service user are already present, then the installer exits. If you uninstall the Big Data Protector, then the service group and the service user are deleted. Confidential 28 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector No – The installer does not create a service group PROTEGRITY_IT_USR_GROUP with the service user PROTEGRITY_IT_USR for running the Protegrity services on all the nodes in the cluster. Ensure that a service group containing a service user for running Protegrity services has been created, as described in section 3.1.1 Verifying Prerequisites for Installing Big Data Protector. PROTEGRITY_IT_USR_GROUP – This service group is required for running the Protegrity services on all the nodes in the cluster. All the Protegrity installation directories are owned by this service group. PROTEGRITY_IT_USR – This service account user is required for running the Protegrity services on all the nodes in the cluster and is a part of the group PROTEGRITY_IT_USR_GROUP. All the Protegrity installation directories are owned by this service user. HADOOP_NATIVE_DIR – The Hadoop native directory. This parameter needs to be specified if you are using MapR. HADOOP_SUPER_USER – The Hadoop super user name. This parameter needs to be specified if you are using MapR. o • • • • 3.1.4 Installing Big Data Protector To install the Big Data Protector: 1. As a sudoer user, run BdpInstallx.x.x_Linux_ _6.6.5.x.sh from the folder where it is extracted. A prompt to confirm or cancel the Big Data Protector installation appears. 2. Type yes to continue with the installation. The Big Data Protector installation starts. If you are using a Cloudera or MapR distribution, then the presence of the HDFS connection is also verified. A prompt to enter the sudoer password for the ADMINISTRATOR user appears. 3. Enter the sudoer password. A prompt to enter the ESA user name or IP address appears. 4. Enter the ESA host name or IP address. A prompt to enter the ESA user name appears. 5. Enter the ESA user name (Security Officer). The PEP Server Installation wizard starts and a prompt to configure the host as ESA proxy appears. 6. Depending on the requirements, type Yes or No to configure the host as an ESA proxy. 7. If the ESA proxy is set to Yes, then enter the host password for the required ESA user. 8. When prompted, perform the following steps to download the ESA keys and certificates. a) Specify the Security Officer user with administrative privileges. b) Specify the Security Officer password for the ESA certificates and keys. The installer then installs the Big Data Protector on all the nodes in the cluster. The status of the installation of the individual components appears, and the log files for all the required components on all the nodes in the cluster are stored on the Lead node in the /cluster_utils/logs directory. 
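For reference, a hypothetical BDP.config prepared as described in section 3.1.3 might look like the following sketch. The parameter names are taken from that section; the values, paths, and exact file syntax shown here are placeholders and may differ in your environment.

# Hypothetical BDP.config excerpt (values are placeholders)
HADOOP_DIR=/usr/hdp/current                      # installation home of the Hadoop distribution
PROTEGRITY_DIR=/opt/protegrity                   # directory where Big Data Protector will be installed
CLUSTERLIST_FILE=/opt/bigdata/cluster_hosts.txt  # all nodes except the Lead node, one per line
INSTALL_DEMO=No
HDFSFP=Yes
SPARK_PROTECTOR=Yes
IP_NN=10.0.0.10                                  # Lead node IP address, required when HDFSFP=Yes
PROTEGRITY_CACHE_PORT=6379                       # typical value, per section 3.1.3
AUTOCREATE_PROTEGRITY_IT_USR=Yes
PROTEGRITY_IT_USR_GROUP=ptyitgrp                 # placeholder service group name
PROTEGRITY_IT_USR=ptyitusr                       # placeholder service user name
# HADOOP_NATIVE_DIR and HADOOP_SUPER_USER are required only for MapR distributions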
Verify the installation report, that is generated at /cluster_utils/installation_report.txt to ensure that the installation of all the components is successful on all the nodes in the cluster. Verify the bdp_setup.log file confirm if the Big Data Protector was installed successfully on all the nodes in the cluster. Confidential 29 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector 9. Restart the MapReduce (MRv1) or Yarn (MRv2) services on the Hadoop cluster. The installer installs the following components in the installation folder of the Big Data Protector: • PEP server in the /defiance_dps directory • XCPep2Jni in the /defiance_xc directory • JpepLite in the /jpeplite directory • MapReduce protector in the /pepmapreduce/lib directory • Hive protector in the /pephive/lib directory • Pig protector in the /peppig/lib directory • HBase protector in the /pephbase-protector/lib directory • Impala protector in the /pepimpala directory, if you are using a Cloudera or MapR distribution • HAWQ protector in the /pephawq directory, if you are using a Pivotal distribution • hdfsfp-xxx.jar in the /hdfsfp directory, only if the value of the HDFSFP parameter in the BDP.config file is specified as Yes • pepspark-xxx.jar in the /pepspark/lib directory, only if the value of the SPARK parameter in the BDP.config file is specified as Yes • Talend-related files in /etl/talend directory • Cluster Utilities in the /cluster_utils directory The following files and directories are present in the /cluster_utils folder: o BdpInstallx.x.x_Linux_ _6.6.5.x.sh utility to install the Big Data Protector on any node in the cluster. For more information about using the BdpInstallx.x.x_Linux_ _6.6.5.x.sh utility, refer to section 3.2.1 Installing Big Data Protector on New Nodes added to a Hadoop Cluster. o cluster_cachesrvctl.sh utility for monitoring the status of the Protegrity Cache on all the nodes in the cluster, only if the value of the HDFSFP parameter in the BDP.config file is specified as Yes. o cluster_pepsrvctl.sh utility for managing PEP servers on all nodes in the cluster. o uninstall.sh utility to uninstall the Big Data Protector from all the nodes in the cluster. o node_uninstall.sh to uninstall the Big Data Protector from any nodes in the cluster. For more information about using the node_uninstall.sh utility, refer to section 3.2.2 Uninstalling Big Data Protector from Selective Nodes in the Hadoop Cluster. o update_cluster_policy.sh utility for updating PEP servers when a new policy is deployed. o BDP.config file o CLUSTERLIST_FILE, which is a file containing a list of all the nodes, except the Lead node. o installation_report.txt file that contains the status of installation of all the components in the cluster. o logs directory that contains the consolidated setup logs from all the nodes in the cluster. Confidential 30 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector 10. Starting with the Big Data Protector, version 6.6.4, the Bulk APIs in the MapReduce protector will return the detailed error and return codes instead of 0 for failure and 1 for success. For more information about the error codes for Big Data Protector, version 6.6.5, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. 
If the older behaviour from the Big Data Protector, version 6.6.3 or lower with the Bulk APIs in the MapReduce protector is desired, then perform the following steps to enable the Backward compatibility mode to retain the same error handling capabilities. a) If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), then append the following entry to the mapreduce.admin.reduce.child.java.opts property in the mapred-site.xml file. -Dpty.mr.compatibility=old b) If you are using CDH, then add the following values to the Yarn Service Mapreduce Advanced Configuration Snippet (Safety Valve) parameter in the mapred-site.xml file. mapreduce.admin.map.child.java.opts -Dpty.mr.compatibility=old 11. If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), and you have installed HDFSFP, then perform the following steps. a) Ensure that the mapreduce.application.classpath property in the mapred-site.xml file contains the following entries in the order provided. mapreduce.admin.reduce.child.java.opts -Dpty.mr.compatibility=old /pepmapreduce/lib/* /pephive/lib/* /peppig/lib/* /hdfsfp/* Ensure that the above entries are mapreduce.application.classpath property. before all other entries in the b) Ensure that the mapred.min.split.size property in the hive-site.xml file is set to the following value. mapred.min.split.size=256000 c) Restart the Yarn service. d) Restart the MRv2 service. e) Ensure that the tez.cluster.additional.classpath.prefix property in the tez-site.xml file contains the following entries in the order provided. /pepmapreduce/lib/* /pephive/lib/* Confidential 31 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector /peppig/lib/* /hdfsfp/* Ensure that the above entries are before tez.cluster.additional.classpath.prefix property. f) all other entries in the Restart the Tez services. 12. If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), and you have not installed HDFSFP, then perform the following steps. a) Ensure that the mapreduce.application.classpath property in the mapred-site.xml file contains the following entries. /pepmapreduce/lib/* /pephive/lib/* /peppig/lib/* Ensure that the above entry is before mapreduce.application.classpath property. all other entries in the b) Ensure that the yarn.application.classpath property in the yarn-site.xml file contains the following entries. /pepmapreduce/lib/* /pephive/lib/* /peppig/lib/* Ensure that the above entry yarn.application.classpath property. c) is before all other entries in the Restart the Yarn service. d) Restart the MRv2 service. e) Ensure that the tez.cluster.additional.classpath.prefix property in the tez-site.xml file contains the following entries. /pepmapreduce/lib/* /pephive/lib/* /peppig/lib/* Ensure that the above entry is before tez.cluster.additional.classpath.prefix property. f) all other entries in the Restart the Tez services. 13. If HDFSFP is not installed and you need to use the Hive protector, then perform the following steps. a) Specify the following value for the hive.exec.pre.hooks property in the hive-site.xml file. hive.exec.pre.hooks=com.protegrity.hive.PtyHiveUserPreHook b) Restart the Hive services to ensure that the updates are propagated to all the nodes in the cluster. Confidential 32 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector 14. 
If HDFSFP is installed and you need to use the Hive protector with HDFSFP, then perform the following steps. a) Specify the following value for the hive.exec.pre.hooks property in the hive-site.xml file. hive.exec.pre.hooks=com.protegrity.hadoop.fileprotector.hive.PtyHivePr eHook b) Restart the Hive services to ensure that the updates are propagated to all the nodes in the cluster. If you are using Beeline or Hue, then ensure that Protegrity Big Data Protector is installed on the following machines: • • For Beeline: The machines where Hive Metastore, and HiveServer2 are running. For Hue: The machines where HueServer, Hive Metastore, HiveServer2 are running. It is recommended to use the Cluster Policy provider to deploy the policies in a multi-node cluster environment, such as Big Data, Teradata etc. If you require the PEP Server service to start automatically after every reboot of the system, then define the PEP Server service in the startup with the required run levels. For more info about starting the PEP Server service automatically, refer to Protection Enforcements Point Servers Installation Guide Release 6.6.5. 3.1.5 Applying Patches As the functionality of the ESA is extended, it should be updated through patches applied to ESA. The patches are available as .pty files, which should be loaded with the ESA user interface. Receive the ESA_PAP-ALL-64_x86-64_6.6.5.pty, or later patch from Protegrity. Upload this patch on the ESA using the Web UI. Then install this patch using the ESA CLI manager. For more information about applying patches, refer to section 4.4.6.2 Install Patches of Protegrity Appliances Overview. 3.1.6 Installing the DFSFP Service Using the Add/Remove Services tool on the ESA to install the DFSFP service. For more information about installing services, refer to Section 4.4.6 of Protegrity Appliances Overview. To install the DFSFP service using the ESA CLI Manager: 1. Login to the ESA CLI Manager. 2. Navigate to Administration Add/Remove Services. 3. Press ENTER. The root password prompt appears. 4. Enter the root password. 5. Press ENTER. The Add/Remove Services screen appears 6. Select Install applications. 7. Press ENTER. 8. Select DFSFP. 9. Press ENTER. The DFSFP service is installed. Confidential 33 Big Data Protector Guide 6.6.5 3.1.7 Installing and Uninstalling Big Data Protector Configuring HDFSFP If HDFSFP is used, then it should be configured after Big Data Protector is installed. To ensure that the user is able to access protected data in the Hadoop cluster, HDFSFP is globally configured so that it can perform checks for access control transparently. Ensure that you set the the value of the mapreduce.output.fileoutputformat.compress.type property to BLOCK in the mapredsite.xml file. 3.1.7.1 Configuring HDFSFP for Yarn (MRv2) To configure Yarn (MRv2) with HDFSFP: 1. Register the Protegrity codec in the Hadoop codec factory configuration. In the io.compression.codecs property in the core-site.xml file, add the codec com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 2. Modify the value of the mapreduce.output.fileoutputformat.compress property in the mapred-site.xml file to true. 3. Add the property mapreduce.output.fileoutputformat.compress.codec to the mapredfile and set the value to site.xml com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 
If the property is already present in the mapred-site.xml file, then ensure that the existing value of the property is replaced with com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 4. Include the /hdfsfp/* path as the first value in the yarn.application.classpath property in the yarn-site.xml file. 5. Restart the HDFS and Yarn services. 3.1.7.2 Configuring HDFSFP for MapReduce, v1 (MRv1) A MapReduce job processes large data sets stored in HDFS across the Hadoop cluster. The result of the MapReduce job is stored in HDFS. The HDFSFP stores protected data in encrypted form in HDFS. The Map job reads protected data and the Reduce job saves the result in protected form. This is done by configuring the Protegrity codec at global level for MapReduce jobs. To configure MRv1 with HDFSFP: 1. Register the Protegrity codec in the Hadoop codec factory configuration. In the io.compression.codecs property in the core-site.xml file, add the codec com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 2. Modify the value of the mapred.output.compress property in the mapred-site.xml file to true. 3. Modify the value of the mapred.output.compression.codec property in the mapred-site.xml file to com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 4. Restart the HDFS and MapReduce services. 3.1.7.3 Adding a Cluster to the ESA Before configuring the Cache Refresh Server, ensure that a cluster is added to the ESA. For more information about adding a cluster to the ESA, refer to section 5.14.1 Adding a Cluster for Protection. Confidential 34 Big Data Protector Guide 6.6.5 3.1.7.4 Installing and Uninstalling Big Data Protector Configuring the Cache Refresh Server If a cluster is added to the ESA, then the Cache Refresh server periodically validates the cache entries and takes corrective action, if necessary. This server should always be active. The Cache Refresh Server periodically validates the ACL entries in Protegrity Cache with the ACL entries in the ESA. If a Data store is created using ESA 6.5 SP2 Patch 3 with DFSFPv3 patch installed, then the Cluster configuration file (clusterconfig.xml), located in the /dfs/dfsadmin/config/ directory, contains the field names RedisPort and RedisAuth. • If a Data store is created using ESA 6.5 SP2 Patch 4 with DFSFPv8 patch installed, then the Cluster configuration file (clusterconfig.xml) contains the field names ProtegrityCachePort and ProtegrityCacheAuth. • If a migration of the ESA 6.5 SP2 Patch 3 with DFSFPv3 patch installed to the ESA 6.5 SP2 Patch 4 with DFSFPv8 patch installed is done, then the Cluster configuration file (clusterconfig.xml) contains the field name entries RedisPort and RedisAuth for the old Data stores, and the entries ProtegrityCachePort and ProtegrityCacheAuth for the new Data stores, created after the migration. If the ACL entries present in the appliance are not matching the ACL entries in Protegrity Cache, then logs are generated in the ESA. The logs can be viewed from the ESA Web Interface at the following path: Distributed File System File Protector Logs. • The various error codes are explained in Troubleshooting Guide 6.6.5. To configure the Cache Refresh Server time: 1. Navigate to the path /dfs/cacherefresh/data. 2. Open the dfscacherefresh.cfg file. 3. Modify the cacherefreshtime parameter as required based on the following guidelines: • Default value – 30 minutes • Minimum value – 10 minutes • Maximum value – 720 minutes (12 hours) The Cache Refresh Interval should be entered in minutes. 
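For illustration, the dfscacherefresh.cfg file might contain an entry similar to the following. Only the cacherefreshtime parameter is documented above; the key=value layout shown here is an assumption for illustration purposes.

# Hypothetical contents of dfscacherefresh.cfg in the /dfs/cacherefresh/data directory
cacherefreshtime=30    # interval in minutes; allowed range is 10 to 720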
To verify if the Cache Refresh Server is running: 1. Login to the ESA Web Interface. 2. Navigate to System Services DFS Cache Refresh. The Cache Refresh Server would be running. 3. If the Cache Refresh Server is not running, then click on the Start button ( Cache Refresh Server. 3.1.7.5 ) to start the Configuring Hive Support in HDFSFP If Hive is used with HDFSFP, then it should be configured after installing Big Data Protector. To configure Hive support in HDFSFP: 1. If you are using a Hadoop distribution that has a Management UI, then perform the following steps. a) In the hive-site.xml file, set the value of the mapreduce.job.maps property to 1, using the Management UI. Confidential 35 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector If the hive-site.xml file does not have any mapreduce.job.maps property, then perform the following tasks. a. Add the property with the name mapreduce.job.maps in the hive-site.xml file. b. Set the value of the mapreduce.job.maps property to 1. b) In the hive-site.xml file, add the value com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook to the hive.exec.pre.hooks property before any other existing value, using the Management UI. If the hive-site.xml file does not have any hive.exec.pre.hooks property, then perform the following tasks. a. Add the property with the name hive.exec.pre.hooks in the hive-site.xml file. b. Set the value of the hive.exec.pre.hooks property to com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook. 2. If you are using a Hadoop distribution without a Management UI, then perform the following steps. a) Add the following property in the hive-site.xml file on all nodes. If the property is already present in the hive-site.xml file, then ensure that the value of the property is set to 1. b) Add the following property in the hive-site.xml file on all nodes. mapreduce.job.maps 1 If the property is already present in the hive-site.xml file, then ensure that the value com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook is before any other existing value. For more information about using Hive with HDFSFP, refer to section 13 Appendix: Using Hive with HDFSFP. 3.1.8 Configuring HBase If HBase is used, then it should be configured after Big Data Protector is installed. Ensure that you configure the Protegrity HBase coprocessor on all the region servers. If the Protegrity HBase coprocessor is not configured in some region servers, then an inconsistent state might occur, where some records in a table are protected and some are not protected. This could potentially lead to data corruption, making it difficult to separate the protected data from clear text data. It is recommended to use HBase version 0.98 or above. Confidential 36 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector If you are using an HBase version lower than 0.98, then you would need a Java client to perform the protection of data. HBase versions lower than 0.98 do not support ATTRIBUTES, which controls the MIGRATION and BYPASS_COPROCESSOR parameters. To configure HBase: 1. If you are using a Hadoop distribution that has a Management UI, then add the following value to the HBase coprocessor region classes property in the hbase-site.xml file in all the respective region server groups, using the Management UI. com.protegrity.hbase.PTYRegionObserver If the hbase-site.xml file does not have any HBase coprocessor region classes property, then perform the following tasks. 
a) Add the property with the name hbase.coprocessor.region.classes in the hbase-site.xml file in all the respective region server groups. b) Set the following value for the hbase.coprocessor.region.classes property. com.protegrity.hbase.PTYRegionObserver If any coprocessors are already defined in the HBase coprocessor region class property, then ensure that the value of the Protegrity coprocessor is before any pre-existing coprocessors defined in the hbase-site.xml file. 2. If you are using a Hadoop distribution without a Management UI, then add the following property in the hbase-site.xml file on all region server nodes. hive.exec.pre.hooks com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook If the property is already present in the hbase-site.xml file, then ensure that the value of the Protegrity coprocessor region class is before any other coprocessor in the hbase-site.xml file. 3. Restart all HBase services. 3.1.9 Configuring Impala If Impala is used, then it should be configured after Big Data Protector is installed. To configure Impala: 1. Ensure that the Hadoop cluster is installed, configured, and running. 2. Navigate to the hbase.coprocessor.region.classes com.protegrity.hbase.PTYRegionObserver /pepimpala/sqlscripts/ folder. This folder contains the Protegrity UDFs for the Impala protector. 3. If you are not using a Kerberos-enabled Hadoop cluster, then execute the createobjects.sql script to load the Protegrity UDFs for the Impala protector. impala-shell -i -f /pepimpala/sqlscripts/createobjects.sql 4. If you are using a Kerberos-enabled Hadoop cluster, then execute the createobjects.sql script to load the Protegrity UDFs for the Impala protector. impala-shell -i -f /pepimpala/sqlscripts/createobjects.sql -k If the catalogd process is restarted at any point in time, then all the Protegrity UDFs for the Impala protector should be reloaded using the command in Step 3 or 4, as required. Confidential 37 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector 3.1.10 Configuring HAWQ If HAWQ is used, then it should be configured after Big Data Protector is installed. Ensure that you are logged as the gpadmin user for configuring HAWQ. To configure HAWQ: 1. Ensure that the Hadoop cluster is installed, configured, and running. 2. Navigate to the /pephawq/sqlscripts/ folder. This folder contains the Protegrity UDFs for the HAWQ protector. 3. Execute the createobjects.sql script to load the Protegrity UDFs for the HAWQ protector. psql -h -p 5432 -f /pephawq/sqlscripts/createobjects.sql where: HAWQ_Master_Hostname: Hostname or IP Address of the HAWQ Master Node 5432: Port number 3.1.11 Configuring Spark If Spark is used, then it should be configured after Big Data Protector is installed. To configure Spark: 1. Ensure that the Hadoop cluster is installed, configured, and running. 2. Update the spark-defaults.conf file to include the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment. spark.driver.extraClassPath= /pepspark/lib/* spark.executor.extraClassPath= /pepspark/lib/* 3. If HDFSFP is installed, then update the spark-defaults.conf file to include the following classpath entries. spark.driver.extraClassPath= /pepspark/lib/*: / hdfsfp/* spark.executor.extraClassPath= /pepspark/lib/*: /hdfsfp/* 4. Save the spark-defaults.conf file. 5. Deploy the configuration change to all the nodes in the Hadoop cluster. 6. 
Restart the Spark services. If the user needs to run Hive UDFs with Spark SQL, then the following steps need to be performed. To configure Spark SQL: 1. Ensure that the Hadoop cluster is installed, configured, and running. 2. Update the spark-defaults.conf file to include the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment. spark.driver.extraClassPath= /pephive/lib/*: /p epspark/lib/* spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/* 3. If HDFSFP is installed, then update the spark-defaults.conf file to include the following classpath entries. Confidential 38 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector spark.driver.extraClassPath= /pephive/lib/*: /p epspark/lib/*: /hdfsfp/* spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/*: /hdfsfp/* 4. Save the spark-defaults.conf file. 5. Deploy the configuration change to all the nodes in the Hadoop cluster. 6. Restart the Spark services. 3.2 Installing or Uninstalling Big Data Protector on Specific Nodes This section describes the following procedures: • • 3.2.1 Installing Big Data Protector on New Nodes added to a Hadoop cluster Uninstalling Big Data Protector from a Nodes in the Hadoop cluster Installing Big Data Protector on New Nodes added to a Hadoop Cluster If you need to install Big Data Protector on new nodes added to a Hadoop cluster, then use the BdpInstallx.x.x_Linux_ _6.6.5.x.sh utility in the /cluster_utils directory. Ensure that you install the Big Data Protector from an ADMINISTRATOR user having full sudoer privileges. To install Big Data Protector on New Nodes added to a Hadoop Cluster: 1. Login to the Lead Node. 2. Navigate to the /cluster_utils directory. 3. Add additional entries for each new node, on which the Big Data Protector needs to be installed, in the NEW_HOSTS_FILE file. The new nodes from the NEW_HOSTS_FILE file will be appended to the CLUSTERLIST_FILE. 4. Execute the following command utility to install Big Data Protector on the new nodes. ./BdpInstall1.0.1_Linux_ _6.6.5.X.sh –a The Protegrity Big Data Protector is installed on the new nodes. 3.2.2 Uninstalling Big Data Protector from Selective Nodes in the Hadoop Cluster If you need to uninstall Big Data Protector from selective nodes in the Hadoop cluster, then use the node_uninstall.sh utility in the /cluster_utils directory. Ensure that you uninstall the Big Data Protector from an ADMINISTRATOR user having full sudoer privileges. To uninstall Big Data Protector from Selective Nodes in the Hadoop Cluster: 1. Login to the Lead Node. 2. Navigate to the /cluster_utils directory. 3. Create a new hosts file (such as NEW_HOSTS_FILE). The NEW_HOSTS_FILE file contains the required nodes on which the Big Data Protector needs to be uninstalled. 4. Add the nodes from which the Big Data Protector needs to be uninstalled in the new hosts file. Confidential 39 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector 5. Execute the following command to remove the Big Data Protector from the nodes that are listed in the new hosts file. ./node_uninstall.sh -c NEW_HOSTS_FILE The Big Data Protector is uninstalled from the nodes listed in the new hosts file. 6. Remove the nodes from which the Big Data Protector is uninstalled in Step 5 from the CLUSTERLIST_FILE file. 
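As a combined illustration of sections 3.2.1 and 3.2.2, a typical add-and-remove sequence on the Lead node might look like the following sketch. The host names and the NEW_HOSTS_FILE contents are placeholders, and the asterisk in the installer file name stands for the distribution-specific portion shown elided in this guide.

cd /cluster_utils                                  # on the Lead node
# Add two new nodes: list them in NEW_HOSTS_FILE, then run the installer in append mode.
printf 'datanode09\ndatanode10\n' > NEW_HOSTS_FILE
./BdpInstall1.0.1_Linux_*_6.6.5.X.sh -a
# Remove a node: list it in a hosts file, uninstall, then drop it from CLUSTERLIST_FILE.
printf 'datanode03\n' > NEW_HOSTS_FILE
./node_uninstall.sh -c NEW_HOSTS_FILE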
3.3 Utilities
This section provides information about the following utilities:
• PEP Server Control (cluster_pepsrvctl.sh) – Manages the PEP servers across the cluster.
• Update Cluster Policy (update_cluster_policy.sh) – Updates the configurations of the PEP servers across the cluster.
• Protegrity Cache Control (cluster_cachesrvctl.sh) – Monitors the status of the Protegrity Cache on all the nodes in the cluster. This utility is available only for HDFSFP.
• Recover Utility – Recovers the contents from a protected path. This utility is available only for HDFSFP.
Ensure that you run the utilities as a user (OPERATOR_USER) having sudo privileges for impersonating the service account (EXECUTOR_USER or PROTEGRITY_IT_USR, as configured).
3.3.1 PEP Server Control
This utility (cluster_pepsrvctl.sh), in the /cluster_utils folder, manages the PEP server services on all the nodes in the cluster, except the Lead node. The utility provides the following options:
• Start – Starts the PEP servers in the cluster.
• Stop – Stops the PEP servers in the cluster.
• Restart – Restarts the PEP servers in the cluster.
• Status – Reports the status of the PEP servers.
The utility (pepsrvctrl.sh), in the /defiance_dps/bin/ folder, manages the PEP server services on the Lead node.
When you run the PEP Server Control utility, you are prompted to enter the OPERATOR_USER password, which is the same across all the nodes in the cluster.
3.3.2 Update Cluster Policy
This utility (update_cluster_policy.sh), in the /cluster_utils folder, updates the configurations of the PEP servers across the cluster. For example, if you need to make any changes to the PEP server configuration, make the changes on the Lead node and then propagate the change to all the PEP servers in the cluster using the update_cluster_policy.sh utility.
Ensure that all the PEP servers in the cluster are stopped before running the update_cluster_policy.sh utility.
When you run the Update Cluster Policy utility, you are prompted to enter the OPERATOR_USER password, which is the same across all the nodes in the cluster.
3.3.3 Protegrity Cache Control
This utility (cluster_cachesrvctl.sh), in the /cluster_utils folder, monitors the status of the Protegrity Cache on all the nodes in the cluster. This utility prompts for the OPERATOR_USER password. The utility provides the following options:
• Start – Starts the Protegrity Cache services in the cluster.
• Stop – Stops the Protegrity Cache services in the cluster.
• Restart – Restarts the Protegrity Cache services in the cluster.
• Status – Reports the status of the Protegrity Cache services.
3.3.4 Recover Utility
The Recover utility is available for HDFSFP only. This utility recovers the contents from protected files of types Text, RC, and Sequence, in the absence of ACLs or loss of ACL information. This ensures that the data is not lost under any circumstances.
Parameters
srcpath: The protected HDFS path containing the data to be unprotected.
destpath: The destination directory to store the unprotected data.
Result
• If srcpath is a file path, then the Recover utility recovers the file.
• If srcpath is a directory path, then the Recover utility recovers all files inside the directory.
Ensure that the user running the Recover utility has unprotect access on the data element which was used to protect the files in the HDFS path.
Ensure that an ADMINISTRATOR or OPERATOR_USER is running the Recover Utility and that the user has the required read and execute permissions for the /hdfsfp/recover.sh script.
Example
The following two ACLs are created:
1. /user/root/employee
2. /user/ptyitusr/prot/employee
Run the Recover Utility on these two paths, with the destination local directory /tmp/HDFSFP-recovered/, by using the recover.sh command with the -srcpath and -destpath parameters described above. The following would be recovered in the local directory:
1. /tmp/HDFSFP-recovered/user/root/employee - The files and sub-directories present in the HDFS location /user/root/employee are recovered in cleartext form.
2. /tmp/HDFSFP-recovered/user/ptyitusr/prot/employee - The files and sub-directories present in the HDFS location /user/ptyitusr/prot/employee are recovered in cleartext form.
To recover the protected data from a Hive warehouse directory to a local file system directory:
1. Execute the following command to retrieve the protected data from the Hive warehouse directory.
/hdfsfp/recover.sh -srcpath <srcpath> -destpath <destpath>
The cleartext data from the protected HDFS path is stored in the destination directory.
2. If you need to ensure that the existing Hive queries for the table continue to function, then perform the following steps.
a) Execute the following command to delete the warehouse directory for the table.
hadoop fs -rm -r /tablename
b) Move the destination directory with the cleartext data in the local file system to HDFS using the following command.
hadoop fs -put /tablename /user/hive/warehouse/table_name
c) To view the cleartext data in the table, use the following command.
Select * from tablename
3.4 Uninstalling Big Data Protector from a Cluster
This section describes the procedure for uninstalling the Big Data Protector from the cluster.
3.4.1 Verifying the Prerequisites for Uninstalling Big Data Protector
If you are configuring the Big Data Protector with a Kerberos-enabled Hadoop cluster, then ensure that the HDFS superuser (hdfs) has a valid Kerberos ticket.
3.4.2 Removing the Cluster from the ESA
Before uninstalling Big Data Protector from the cluster, the cluster should be deleted from the ESA. For more information about deleting the cluster from the ESA, refer to section 5.14.3 Removing a Cluster.
3.4.3 Uninstalling Big Data Protector from the Cluster
Depending on the requirements, perform the following tasks to uninstall the Big Data Protector from the cluster.
3.4.3.1 Removing HDFSFP Configuration for Yarn (MRv2)
If HDFSFP is configured for Yarn (MRv2), then the configuration should be removed before uninstalling Big Data Protector.
To remove HDFSFP configuration for Yarn (MRv2) after uninstalling Big Data Protector:
1. Remove the com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec codec from the io.compression.codecs property in the core-site.xml file.
2. Modify the value of the mapreduce.output.fileoutputformat.compress property in the mapred-site.xml file to false.
3. Remove the value of the mapreduce.output.fileoutputformat.compress.codec property in the mapred-site.xml file.
4. Remove the /hdfsfp/* path from the yarn.application.classpath property in the yarn-site.xml file.
5. Restart the HDFS and Yarn services.
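For reference, the properties touched by these steps appear in the Hadoop configuration files as standard name/value blocks. The following is a hedged sketch of how they might look before removal; the other codec listed in io.compression.codecs is an assumption for illustration, as the actual pre-existing value depends on the cluster.
<!-- core-site.xml (sketch): delete the Protegrity codec from the comma-separated codec list -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec</value>
</property>
<!-- mapred-site.xml (sketch): set compression to false and clear the codec value -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>false</value>
</property>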
3.4.3.2 Removing HDFSFP Configuration for MapReduce, v1 (MRv1)
If HDFSFP is configured for MapReduce, v1 (MRv1), then the configuration should be removed before uninstalling Big Data Protector.
To remove HDFSFP configuration for MRv1 after uninstalling Big Data Protector:
1. Remove the com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec codec from the io.compression.codecs property in the core-site.xml file.
2. Modify the value of the mapred.output.compress property in the mapred-site.xml file to false.
3. Remove the value of the mapred.output.compression.codec property in the mapred-site.xml file.
4. Restart the HDFS and MapReduce services.
3.4.3.3 Removing Configuration for Hive Protector if HDFSFP is not Installed
If the Hive protector is used and HDFSFP is not installed, then the configuration should be removed before uninstalling Big Data Protector.
To remove configuration for Hive protector if HDFSFP is not installed:
1. If you are using a Hadoop distribution with a Management UI, then remove the value com.protegrity.hive.PtyHiveUserPreHook from the hive.exec.pre.hooks property in the hive-site.xml file, using the configuration management UI.
2. If you are using a Hadoop distribution without a Management UI, then remove the following property from the hive-site.xml file on all nodes.
hive.exec.pre.hooks=com.protegrity.hive.PtyHiveUserPreHook
3.4.3.4 Removing Configurations for Hive Support in HDFSFP
If Hive is used with HDFSFP, then the configuration should be removed before uninstalling Big Data Protector.
To remove configurations for Hive support in HDFSFP:
1. If you are using a Hadoop distribution with a Management UI, then perform the following steps.
a) In the hive-site.xml file, remove the value of the mapreduce.job.maps property, using the Management UI.
b) In the hive-site.xml file, remove the value com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook from the hive.exec.pre.hooks property, using the configuration management UI.
2. If you are using a Hadoop distribution without a Management UI, then perform the following steps.
a) Remove the following property from the hive-site.xml file on all nodes.
hive.exec.pre.hooks
com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook
b) Remove the following property from the hive-site.xml file on all nodes.
mapreduce.job.maps
3.4.3.5 Removing the Configuration Properties when HDFSFP is not Installed
If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), and you have not installed HDFSFP, then the configuration should be removed before uninstalling Big Data Protector.
To remove the configuration properties:
1. Remove the following entries from the mapreduce.application.classpath property in the mapred-site.xml file.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
2. Remove the following entries from the yarn.application.classpath property in the yarn-site.xml file.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
3. Restart the Yarn service.
4. Restart the MRv2 service.
5. Remove the following entries from the tez.cluster.additional.classpath.prefix property in the tez-site.xml file.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
6. Restart the Tez services.
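As a reference, the Protegrity classpath entries are typically appended to the existing, comma-separated value of the classpath property; the following is a hedged sketch of the yarn-site.xml entry, where the pre-existing classpath value is an assumption for illustration. Remove only the Protegrity entries and keep the original classpath intact.
<!-- yarn-site.xml (sketch): remove only the Protegrity entries, keeping the original classpath -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,/pepmapreduce/lib/*,/pephive/lib/*,/peppig/lib/*</value>
</property>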
3.4.3.6 Removing HBase Configuration
If HBase is configured, then the configuration should be removed before uninstalling Big Data Protector.
To remove HBase configuration:
1. If you are using a Hadoop distribution that has a Management UI, then remove the following HBase coprocessor region classes property value from the hbase-site.xml file in all the respective region server groups, using the Management UI.
com.protegrity.hbase.PTYRegionObserver
2. If you are using a Hadoop distribution without a Management UI, then remove the following property from the hbase-site.xml file on all region server nodes.
hbase.coprocessor.region.classes
com.protegrity.hbase.PTYRegionObserver
3. Restart all HBase services.
3.4.3.7 Removing the Defined Impala UDFs
If Impala is configured, then the defined Protegrity UDFs for the Impala protector should be removed before uninstalling Big Data Protector.
To remove the defined Impala UDFs:
If you are not using a Kerberos-enabled Hadoop cluster, then run the following command to remove the defined Protegrity UDFs for the Impala protector using the dropobjects.sql script.
impala-shell -i <slave node> -f /pepimpala/sqlscripts/dropobjects.sql
If you are using a Kerberos-enabled Hadoop cluster, then run the following command to remove the defined Protegrity UDFs for the Impala protector using the dropobjects.sql script.
impala-shell -i <slave node> -f /pepimpala/sqlscripts/dropobjects.sql -k
3.4.3.8 Removing the Defined HAWQ UDFs
If HAWQ is configured, then the defined Protegrity UDFs for the HAWQ protector should be removed before uninstalling Big Data Protector.
To remove the defined HAWQ UDFs:
Run the following command to remove the defined Protegrity UDFs for the HAWQ protector using the dropobjects.sql script.
psql -h <HAWQ_Master_Hostname> -p 5432 -f /pephawq/sqlscripts/dropobjects.sql
3.4.3.9 Removing the Spark Protector Configuration
If the Spark protector is used, then the required configuration settings should be removed before uninstalling the Big Data Protector.
To remove the Spark protector configuration:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to remove the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment.
spark.driver.extraClassPath= /pepspark/lib/*
spark.executor.extraClassPath= /pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to remove the following classpath entries.
spark.driver.extraClassPath= /pepspark/lib/*: /hdfsfp/*
spark.executor.extraClassPath= /pepspark/lib/*: /hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
If Spark SQL is configured to run Hive UDFs, then the required configuration settings should be removed before uninstalling the Big Data Protector.
To remove the Spark SQL configuration:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to remove the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment.
spark.driver.extraClassPath= /pephive/lib/*: /pepspark/lib/*
spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to remove the following classpath entries.
spark.driver.extraClassPath= /pephive/lib/*: /pepspark/lib/*: /hdfsfp/*
spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/*: /hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
3.4.3.10 Running the Uninstallation Script
To run the scripts for uninstalling the Big Data Protector on all nodes in the cluster:
1. Log in as the sudoer user and navigate to the /cluster_utils directory on the Lead node.
2. Run the following script to stop the PEP servers on all the nodes in the cluster.
./cluster_pepsrvctl.sh
3. Run the uninstall.sh utility. A prompt to confirm or cancel the Big Data Protector uninstallation appears.
4. Type yes to continue with the uninstallation.
5. When prompted, enter the sudoer password. The uninstallation script continues with the uninstallation of Big Data Protector. If you are using a Cloudera or MapR distribution, then the presence of an HDFS connection and a valid Kerberos ticket is also verified.
The /cluster_utils directory continues to exist on the Lead node. This directory is retained to perform a cleanup in the event of the uninstallation failing on some nodes, due to unavoidable reasons, such as a host being down.
6. After Big Data Protector is successfully uninstalled from all nodes, manually delete the /cluster_utils directory from the Lead node.
7. If the /defiance_dps_old directory is present on any of the nodes in the cluster, then it can be manually deleted from the respective nodes.
8. Restart all Hadoop services.
4 Hadoop Application Protector
4.1 Using the Hadoop Application Protector
Various jobs written in the Hadoop cluster require data fields to be stored and retrieved. This data requires protection when it is at rest. The Hadoop Application Protector gives MapReduce, Hive, and Pig the ability to protect data while it is being processed and stored. Application programmers using these tools can include Protegrity software in their jobs to secure data.
For more information about using the protector APIs in various Hadoop applications, and for samples, refer to the following sections.
4.2 Prerequisites
Ensure that the following prerequisites are met before using the Hadoop Application Protector:
• The Big Data Protector is installed and configured in the Hadoop cluster.
• The security officer has created the necessary security policy, which defines data elements and user roles with appropriate permissions. For more information about creating security policies, data elements, and user roles, refer to Protection Enforcement Point Servers Installation Guide 6.6.5 and Enterprise Security Administrator Guide 6.6.5.
• The policy is deployed across the cluster.
For more information about the list of all APIs available to Hadoop applications, refer to sections 4.4 MapReduce APIs, 4.5 Hive UDFs, and 4.6 Pig UDFs.
4.3 Samples
To run the samples provided with the Big Data Protector, the pre-packaged policy should be deployed from the ESA. During installation, specify the INSTALL_DEMO parameter as Yes in the BDP.config file. The commands in the samples may require Hadoop-super-user permissions.
For more information about the samples, refer to section 11 Appendix: Samples.
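For example, assuming BDP.config uses a simple parameter=value layout (an assumption for illustration; check the file shipped with your installation for the exact syntax), enabling the samples during installation would be a single entry:
INSTALL_DEMO=Yes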
4.4 MapReduce APIs
This section describes the MapReduce APIs available for protection and unprotection in the Big Data Protector to build secure Big Data applications.
The Protegrity MapReduce protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that accepts bytes as input and provides bytes as output, then data corruption might occur.
If you are using the Bulk APIs for the MapReduce protector, then the following two modes for error handling and return codes are available:
• Default mode: Starting with the Big Data Protector, version 6.6.4, the Bulk APIs in the MapReduce protector return detailed error and return codes instead of 0 for failure and 1 for success. In addition, the MapReduce jobs involving Bulk APIs provide error codes instead of throwing exceptions. For more information about the error codes for Big Data Protector, version 6.6.5, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes.
• Backward compatibility mode: If you need to continue using the error handling capabilities provided with Big Data Protector, version 6.6.3 or lower, that is, 0 for failure and 1 for success, then you can set this mode.
4.4.1 openSession()
This method opens a new user session for protect and unprotect operations. It is a good practice to create one session per user thread.
public synchronized int openSession(String parameter)
Parameters
parameter: An internal API requirement that should be set to 0.
Result
1: If the session is successfully created
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
Exception (and Error Codes)
ptyMapRedProtectorException: If session creation fails
4.4.2 closeSession()
This function closes the current open user session. Every instance of ptyMapReduceProtector opens only one session, and a session ID is not required to close it.
public synchronized int closeSession()
Parameters
None
Result
1: If the session is successfully closed
0: If the session closure fails
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int closeSessionStatus = mapReduceProtector.closeSession();
Exception (and Error Codes)
None
4.4.3 getVersion()
This function returns the current version of the MapReduce protector.
public java.lang.String getVersion()
Parameters
None
Result
This function returns the current version of the MapReduce protector.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
String version = mapReduceProtector.getVersion();
int closeSessionStatus = mapReduceProtector.closeSession();
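To illustrate how these calls typically fit together in a job, the following is a minimal sketch of a Mapper that protects each input record using the openSession(), protect(), and closeSession() methods documented in this section. The Mapper class, its field handling, the error handling, and the data element name DE_PROTECT are assumptions for illustration only; the Protegrity classes ship with the Big Data Protector, and their package and import statements are assumed here.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// ptyMapReduceProtector is provided by the Big Data Protector; its import is assumed.

// Sketch only: protects each input line with the data element DE_PROTECT (illustrative name).
public class ProtectFieldMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private ptyMapReduceProtector protector;

    @Override
    protected void setup(Context context) {
        protector = new ptyMapReduceProtector();
        try {
            protector.openSession("0"); // one session per mapper task, as recommended above
        } catch (Exception e) {
            throw new RuntimeException("Unable to open protector session", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        byte[] protectedBytes;
        try {
            // Only bytes converted from string data are supported, as noted in this section.
            protectedBytes = protector.protect("DE_PROTECT", value.toString().getBytes());
        } catch (Exception e) {
            throw new IOException("Protect operation failed", e);
        }
        context.write(key, new Text(new String(protectedBytes)));
    }

    @Override
    protected void cleanup(Context context) {
        protector.closeSession(); // close the session when the task finishes
    }
}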
4.4.4 getCurrentKeyId()
This method returns the current Key ID for a data element that contains the KEY ID attribute, which is set while creating the data element, such as AES-256, AES-128, and so on.
public int getCurrentKeyId(java.lang.String dataElement)
Parameters
dataElement: Name of the data element
Result
This method returns the current Key ID for the data element containing the KEY ID attribute.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int currentKeyId = mapReduceProtector.getCurrentKeyId("ENCRYPTION_DE");
int closeSessionStatus = mapReduceProtector.closeSession();
4.4.5 checkAccess()
This method checks the access of the user for the specified data element.
public boolean checkAccess(java.lang.String dataElement, byte bAccessType)
Parameters
dataElement: Name of the data element
bAccessType: Type of access the user has for the data element. The following are the different values for the bAccessType variable:
DELETE 0x01
PROTECT 0x02
REPROTECT 0x04
UNPROTECT 0x08
CREATE 0x10
MANAGE 0x20
Result
1: If the user has access to the data element
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte bAccessType = 0x02;
boolean isAccess = mapReduceProtector.checkAccess("DE_PROTECT", bAccessType);
int closeSessionStatus = mapReduceProtector.closeSession();
4.4.6 getDefaultDataElement()
This method returns the default data element configured in the security policy.
public String getDefaultDataElement(String policyName)
Parameters
policyName: Name of the policy configured using Policy management in the ESA
Result
The default data element name configured in the given policy
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
String defaultDataElement = mapReduceProtector.getDefaultDataElement("my_policy");
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to return the default data element name
4.4.7 protect()
Protects the data provided as a byte array. The type of protection applied is defined by dataElement.
public byte[] protect(String dataElement, byte[] data)
Parameters
dataElement: Name of the data element used to protect the data
data: Byte array of data to be protected
The Protegrity MapReduce protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that accepts bytes as input and provides bytes as output, then data corruption might occur.
If you are using the Protect API which accepts bytes as input and provides bytes as output, then ensure that, when unprotecting the data, the Unprotect API with bytes as input and bytes as output is utilized. In addition, ensure that the byte data provided as input to the Protect API has been converted from a string data type only.
Result
Byte array of protected data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] bResult = mapReduceProtector.protect("DE_PROTECT", "protegrity".getBytes());
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to protect data
4.4.8 protect()
Protects the data provided as int. The type of protection applied is defined by dataElement.
public int protect(String dataElement, int data)
Parameters
dataElement: Name of the data element used to protect the data
data: int data to be protected
Result
Protected int data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int bResult = mapReduceProtector.protect("DE_PROTECT", 1234);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to protect data
4.4.9 protect()
Protects the data provided as long. The type of protection applied is defined by dataElement.
public long protect(String dataElement, long data)
Parameters
dataElement: Name of the data element used to protect the data
data: long data to be protected
Result
Protected long data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
long bResult = mapReduceProtector.protect("DE_PROTECT", 123412341234L);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to protect data
4.4.10 unprotect()
This function returns the data in its original form.
public byte[] unprotect(String dataElement, byte[] data)
Parameters
dataElement: Name of the data element used to unprotect the data
data: Byte array of data to be unprotected
The Protegrity MapReduce protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that accepts bytes as input and provides bytes as output, then data corruption might occur.
Result
Byte array of unprotected data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] protectedResult = mapReduceProtector.protect("DE_PROTECT_UNPROTECT", "protegrity".getBytes());
byte[] unprotectedResult = mapReduceProtector.unprotect("DE_PROTECT_UNPROTECT", protectedResult);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to unprotect data
4.4.11 unprotect()
This function returns the data in its original form.
public int unprotect(String dataElement, int data)
Parameters
dataElement: Name of the data element used to unprotect the data
data: int data to be unprotected
Result
Unprotected int data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int protectedResult = mapReduceProtector.protect("DE_PROTECT_UNPROTECT", 1234);
int unprotectedResult = mapReduceProtector.unprotect("DE_PROTECT_UNPROTECT", protectedResult);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to unprotect data
4.4.12 unprotect()
This function returns the data in its original form.
public long unprotect(String dataElement, long data)
Parameters
dataElement: Name of the data element used to unprotect the data
data: long data to be unprotected
Result
Unprotected long data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
long protectedResult = mapReduceProtector.protect("DE_PROTECT_UNPROTECT", 123412341234L);
long unprotectedResult = mapReduceProtector.unprotect("DE_PROTECT_UNPROTECT", protectedResult);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to unprotect data
4.4.13 bulkProtect()
This method is used when a set of data needs to be protected in a bulk operation. It helps to improve performance.
public byte[][] bulkProtect(String dataElement, List errorIndex, byte[][] inputDataItems)
Parameters
dataElement: Name of the data element used to protect the data
errorIndex: List used to store the error indices encountered while protecting each data entry in inputDataItems
inputDataItems: Two-dimensional byte array that stores the bulk data for protection
Result
Two-dimensional byte array of protected data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk protect operation:
• 1: The protect operation for the entry is successful.
• 0: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
• Any other value or garbage return value: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List errorIndex = new ArrayList();
byte[][] protectData = {"protegrity".getBytes(), "protegrity".getBytes(), "protegrity".getBytes(), "protegrity".getBytes()};
byte[][] protectedData = mapReduceProtector.bulkProtect("DE_PROTECT", errorIndex, protectData);
System.out.print("Protected Data: ");
for(int i = 0; i < protectedData.length; i++) {
    //THIS WILL PRINT THE PROTECTED DATA
    System.out.print(protectedData[i] == null ? null : new String(protectedData[i]));
    if(i < protectedData.length - 1) {
        System.out.print(",");
    }
}
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
//ABOVE CODE WILL PRINT THE ERROR INDEXES
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error is encountered during bulk protection of data
4.4.14 bulkProtect()
This method is used when a set of data needs to be protected in a bulk operation. It helps to improve performance.
public int[] bulkProtect(String dataElement, List errorIndex, int[] inputDataItems)
Parameters
dataElement: Name of the data element used to protect the data
errorIndex: List used to store the error indices encountered while protecting each data entry in inputDataItems
inputDataItems: int array that stores the bulk data for protection
Result
int array of protected data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk protect operation:
• 1: The protect operation for the entry is successful.
• 0: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
• Any other value or garbage return value: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List errorIndex = new ArrayList();
int[] protectData = {1234, 5678, 9012, 3456};
int[] protectedData = mapReduceProtector.bulkProtect("DE_PROTECT", errorIndex, protectData);
//CHECK THE ERROR INDEXES FOR ERRORS
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
//ABOVE CODE WILL ONLY PRINT THE ERROR INDEXES
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error is encountered during bulk protection of data
4.4.15 bulkProtect()
This method is used when a set of data needs to be protected in a bulk operation. It helps to improve performance.
public long[] bulkProtect(String dataElement, List errorIndex, long[] inputDataItems)
Parameters
dataElement: Name of the data element used to protect the data
errorIndex: List used to store the error indices encountered while protecting each data entry in inputDataItems
inputDataItems: long array that stores the bulk data for protection
Result
long array of protected data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk protect operation:
• 1: The protect operation for the entry is successful.
• 0: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
• Any other value or garbage return value: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List errorIndex = new ArrayList();
long[] protectData = {123412341234L, 567856785678L, 901290129012L, 345634563456L};
long[] protectedData = mapReduceProtector.bulkProtect("DE_PROTECT", errorIndex, protectData);
//CHECK THE ERROR INDEXES FOR ERRORS
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
//ABOVE CODE WILL ONLY PRINT THE ERROR INDEXES
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error is encountered during bulk protection of data
4.4.16 bulkUnprotect()
This method unprotects in bulk the inputDataItems with the required data element.
public byte[][] bulkUnprotect(String dataElement, List errorIndex, byte[][] inputDataItems)
Parameters
dataElement: Name of the data element used to unprotect the data
errorIndex: List used to store the error indices encountered while unprotecting each data entry in inputDataItems
inputDataItems: Two-dimensional byte array that stores the bulk data to be unprotected
Result
Two-dimensional byte array of unprotected data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk unprotect operation:
• 1: The unprotect operation for the entry is successful.
• 0: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
• Any other value or garbage return value: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List errorIndex = new ArrayList();
byte[][] protectData = {"protegrity".getBytes(), "protegrity".getBytes(), "protegrity".getBytes(), "protegrity".getBytes()};
byte[][] protectedData = mapReduceProtector.bulkProtect("DE_PROTECT", errorIndex, protectData);
//THIS WILL PRINT THE PROTECTED DATA
System.out.print("Protected Data: ");
for(int i = 0; i < protectedData.length; i++) {
    System.out.print(protectedData[i] == null ? null : new String(protectedData[i]));
    if(i < protectedData.length - 1) {
        System.out.print(",");
    }
}
//THIS WILL PRINT THE ERROR INDEX FOR THE PROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
byte[][] unprotectedData = mapReduceProtector.bulkUnprotect("DE_PROTECT", errorIndex, protectedData);
//THIS WILL PRINT THE UNPROTECTED DATA
System.out.print("UnProtected Data: ");
for(int i = 0; i < unprotectedData.length; i++) {
    System.out.print(unprotectedData[i] == null ? null : new String(unprotectedData[i]));
    if(i < unprotectedData.length - 1) {
        System.out.print(",");
    }
}
//THIS WILL PRINT THE ERROR INDEX FOR THE UNPROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors when unprotecting data
4.4.17 bulkUnprotect()
This method unprotects in bulk the inputDataItems with the required data element.
public int[] bulkUnprotect(String dataElement, List errorIndex, int[] inputDataItems)
Parameters
dataElement: Name of the data element used to unprotect the data
errorIndex: List used to store the error indices encountered while unprotecting each data entry in inputDataItems
inputDataItems: int array to be unprotected
Result
Unprotected int array data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk unprotect operation:
• 1: The unprotect operation for the entry is successful.
• 0: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
• Any other value or garbage return value: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List errorIndex = new ArrayList();
int[] protectData = {1234, 5678, 9012, 3456};
int[] protectedData = mapReduceProtector.bulkProtect("DE_PROTECT", errorIndex, protectData);
//THIS WILL PRINT THE ERROR INDEX FOR THE PROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
int[] unprotectedData = mapReduceProtector.bulkUnprotect("DE_PROTECT", errorIndex, protectedData);
//THIS WILL PRINT THE ERROR INDEX FOR THE UNPROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors when unprotecting data
4.4.18 bulkUnprotect()
This method unprotects in bulk the inputDataItems with the required data element.
public long[] bulkUnprotect(String dataElement, List errorIndex, long[] inputDataItems)
Parameters
dataElement: Name of the data element used to unprotect the data
errorIndex: List used to store the error indices encountered while unprotecting each data entry in inputDataItems
inputDataItems: long array to be unprotected
Result
Unprotected long array data
If the Backward Compatibility mode is not set, then the appropriate error code appears.
For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk unprotect operation:
• 1: The unprotect operation for the entry is successful.
• 0: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
• Any other value or garbage return value: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List errorIndex = new ArrayList();
long[] protectData = {123412341234L, 567856785678L, 901290129012L, 345634563456L};
long[] protectedData = mapReduceProtector.bulkProtect("DE_PROTECT", errorIndex, protectData);
//THIS WILL PRINT THE ERROR INDEX FOR THE PROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
long[] unprotectedData = mapReduceProtector.bulkUnprotect("DE_PROTECT", errorIndex, protectedData);
//THIS WILL PRINT THE ERROR INDEX FOR THE UNPROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++) {
    System.out.print(errorIndex.get(i));
    if(i < errorIndex.size() - 1) {
        System.out.print(",");
    }
}
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors when unprotecting data
4.4.19 reprotect()
Data that has been protected earlier is protected again with a separate data element.
public byte[] reprotect(String oldDataElement, String newDataElement, byte[] data)
Parameters
oldDataElement: Name of the data element with which the data was protected earlier
newDataElement: Name of the new data element with which the data is reprotected
data: Byte array of data to be reprotected
Result
Byte array of reprotected data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] protectedResult = mapReduceProtector.protect("DE_PROTECT_1", "protegrity".getBytes());
byte[] reprotectedResult = mapReduceProtector.reprotect("DE_PROTECT_1", "DE_PROTECT_2", protectedResult);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors while reprotecting data
4.4.20 reprotect()
Data that has been protected earlier is protected again with a separate data element.
public int reprotect(String oldDataElement, String newDataElement, int data)
Parameters
oldDataElement: Name of the data element with which the data was protected earlier
newDataElement: Name of the new data element with which the data is reprotected
data: int data to be reprotected
Result
Reprotected int data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int protectedResult = mapReduceProtector.protect("DE_PROTECT_1", 1234);
int reprotectedResult = mapReduceProtector.reprotect("DE_PROTECT_1", "DE_PROTECT_2", protectedResult);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors while reprotecting data
4.4.21 reprotect()
Data that has been protected earlier is protected again with a separate data element.
public long reprotect(String oldDataElement, String newDataElement, long data)
Parameters
oldDataElement: Name of the data element with which the data was protected earlier
newDataElement: Name of the new data element with which the data is reprotected
data: long data to be reprotected
Result
Reprotected long data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
long protectedResult = mapReduceProtector.protect("DE_PROTECT_1", 123412341234L);
long reprotectedResult = mapReduceProtector.reprotect("DE_PROTECT_1", "DE_PROTECT_2", protectedResult);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors while reprotecting data
4.4.22 hmac()
This method performs data hashing using the HMAC operation on a single data item with a data element that is associated with HMAC. It returns the HMAC value of the given data with the given data element.
public byte[] hmac(String dataElement, byte[] data)
Parameters
dataElement: Name of the data element for HMAC
data: Byte array of data for HMAC
Result
Byte array of HMAC data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] bResult = mapReduceProtector.hmac("DE_HMAC", "protegrity".getBytes());
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error occurs while performing HMAC
4.5 Hive UDFs
This section describes all Hive User Defined Functions (UDFs) that are available for protection and unprotection in Big Data Protector to build secure Big Data applications.
If you are using Ranger or Sentry, then ensure that your policy provides create access permissions to the required UDFs.
4.5.1 ptyGetVersion()
This UDF returns the current version of the PEP.
ptyGetVersion()
Parameters
None
Result
This UDF returns the current version of the PEP.
Example
create temporary function ptyGetVersion AS 'com.protegrity.hive.udf.ptyGetVersion';
drop table if exists test_data_table;
create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' OVERWRITE INTO TABLE test_data_table;
select ptyGetVersion() from test_data_table;
4.5.2 ptyWhoAmI()
This UDF returns the currently logged in user.
ptyWhoAmI()
Parameters
None
Result
This UDF returns the currently logged in user.
Example
create temporary function ptyWhoAmI AS 'com.protegrity.hive.udf.ptyWhoAmI';
drop table if exists test_data_table;
create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' OVERWRITE INTO TABLE test_data_table;
select ptyWhoAmI() from test_data_table;
4.5.3 ptyProtectStr()
This UDF protects string values.
ptyProtectStr(String input, String dataElement)
Parameters
String input: String value to protect
String dataElement: Name of the data element used to protect the string value
Result
This UDF returns the protected string value.
Example
create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
select ptyProtectStr(val, 'Token_alpha') from test_data_table;
4.5.4 ptyUnprotectStr()
This UDF unprotects the existing protected string value.
ptyUnprotectStr(String input, String dataElement)
Parameters
String input: Protected string value to unprotect
String dataElement: Name of the data element used to unprotect the string value
Result
This UDF returns the unprotected string value.
Example
create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr';
create temporary function ptyUnprotectStr AS 'com.protegrity.hive.udf.ptyUnprotectStr';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
insert overwrite table protected_data_table select ptyProtectStr(val, 'Token_alpha') from test_data_table;
select ptyUnprotectStr(protectedValue, 'Token_alpha') from protected_data_table;
4.5.5 ptyReprotect()
This UDF reprotects string format protected data, which was earlier protected using the ptyProtectStr UDF, with a different data element.
ptyReprotect(String input, String oldDataElement, String newDataElement)
Parameters
String input: String value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element used to reprotect the data
Result
This UDF returns the protected string value.
Example
create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectStr(val, 'Token_alpha') from test_data_table;
create table test_reprotected_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'Token_alpha', 'new_Token_alpha') from test_protected_data_table;
4.5.6 ptyProtectUnicode()
This UDF protects string (Unicode) values.
ptyProtectUnicode(String input, String dataElement)
Parameters
String input: String (Unicode) value to protect
String dataElement: Name of the data element used to protect the string (Unicode) value
This UDF should be used only if you need to tokenize Unicode data in Hive, migrate the tokenized data from Hive to a Teradata database, and detokenize the data using the Protegrity Database Protector. Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data to a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns the protected string value.
Example
create temporary function ptyProtectUnicode AS 'com.protegrity.hive.udf.ptyProtectUnicode';
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
select ptyProtectUnicode(val, 'Token_unicode') from temp_table;
4.5.7 ptyUnprotectUnicode()
This UDF unprotects the existing protected string value.
ptyUnprotectUnicode(String input, String dataElement)
Parameters
String input: Protected string value to unprotect
String dataElement: Name of the data element used to unprotect the string value
This UDF should be used only if you need to tokenize Unicode data in Teradata using the Protegrity Database Protector, migrate the tokenized data from a Teradata database to Hive, and detokenize the data using the Protegrity Big Data Protector for Hive. Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data from a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns the unprotected string (Unicode) value.
Example
create temporary function ptyProtectUnicode AS 'com.protegrity.hive.udf.ptyProtectUnicode';
create temporary function ptyUnprotectUnicode AS 'com.protegrity.hive.udf.ptyUnprotectUnicode';
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table protected_data_table select ptyProtectUnicode(val, 'Token_unicode') from temp_table;
select ptyUnprotectUnicode(protectedValue, 'Token_unicode') from protected_data_table;
4.5.8 ptyReprotectUnicode()
This UDF reprotects string format protected data, which was protected earlier using the ptyProtectUnicode UDF, with a different data element.
ptyReprotectUnicode(String input, String oldDataElement, String newDataElement)
Parameters
String input: String (Unicode) value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element used to reprotect the data
This UDF should be used only if you need to tokenize Unicode data in Hive, migrate the tokenized data from Hive to a Teradata database, and detokenize the data using the Protegrity Database Protector. Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data to a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns the protected string value.
Example
create temporary function ptyProtectUnicode AS 'com.protegrity.hive.udf.ptyProtectUnicode';
create temporary function ptyReprotectUnicode AS 'com.protegrity.hive.udf.ptyReprotectUnicode';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectUnicode(val, 'Unicode_Token') from test_data_table;
create table test_reprotected_data_table(val string) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotectUnicode(val, 'Unicode_Token', 'new_Unicode_Token') from test_protected_data_table;
4.5.9 ptyProtectInt()
This UDF protects integer values.
ptyProtectInt(int input, String dataElement)
Parameters
int input: Integer value to protect
String dataElement: Name of the data element used to protect the integer value
Result
This UDF returns the protected integer value.
Example
create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val int) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as int) from temp_table;
select ptyProtectInt(val, 'Token_numeric') from test_data_table;
4.5.10 ptyUnprotectInt()
This UDF unprotects the existing protected integer value.
ptyUnprotectInt(int input, String dataElement)
Parameters
int input: Protected integer value to unprotect
String dataElement: Name of the data element used to unprotect the integer value
Result
This UDF returns the unprotected integer value.
Example
create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt';
create temporary function ptyUnprotectInt AS 'com.protegrity.hive.udf.ptyUnprotectInt';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val int) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue int) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as int) from temp_table;
insert overwrite table protected_data_table select ptyProtectInt(val, 'Token_numeric') from test_data_table;
select ptyUnprotectInt(protectedValue, 'Token_numeric') from protected_data_table;
4.5.11 ptyReprotect()
This UDF reprotects integer format protected data with a different data element.
ptyReprotect(int input, String oldDataElement, String newDataElement)
Parameters
int input: Integer value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element used to reprotect the data
Result
This UDF returns the protected integer value.
Example
create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val int) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val int) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val int) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as int) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectInt(val, 'Token_Integer') from test_data_table;
create table test_reprotected_data_table(val int) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'Token_Integer', 'new_Token_Integer') from test_protected_data_table;
4.5.12 ptyProtectFloat()
This UDF protects float values.
ptyProtectFloat(Float input, String dataElement)

Parameters

Float input: Float value to protect
String dataElement: Name of the data element to protect the float value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns a protected float value.

Example

create temporary function ptyProtectFloat as 'com.protegrity.hive.udf.ptyProtectFloat';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
select ptyProtectFloat(val, 'FLOAT_DE') from test_data_table;

4.5.13 ptyUnprotectFloat()

This UDF unprotects a protected float value.

ptyUnprotectFloat(Float input, String dataElement)

Parameters

Float input: Protected float value to unprotect
String dataElement: Name of the data element to unprotect the float value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns an unprotected float value.

Example

create temporary function ptyProtectFloat as 'com.protegrity.hive.udf.ptyProtectFloat';
create temporary function ptyUnprotectFloat as 'com.protegrity.hive.udf.ptyUnprotectFloat';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue float) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
insert overwrite table protected_data_table select ptyProtectFloat(val, 'FLOAT_DE') from test_data_table;
select ptyUnprotectFloat(protectedValue, 'FLOAT_DE') from protected_data_table;

4.5.14 ptyReprotect()

This UDF reprotects float format protected data with a different data element.

ptyReprotect(Float input, String oldDataElement, String newDataElement)

Parameters

Float input: Float value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns a protected float value.
Example

create temporary function ptyProtectFloat AS 'com.protegrity.hive.udf.ptyProtectFloat';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val float) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectFloat(val, 'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'NoEncryption', 'NoEncryption') from test_protected_data_table;

4.5.15 ptyProtectDouble()

This UDF protects a double value.

ptyProtectDouble(Double input, String dataElement)

Parameters

Double input: Double value to protect
String dataElement: Name of the data element to protect the double value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns a protected double value.

Example

create temporary function ptyProtectDouble as 'com.protegrity.hive.udf.ptyProtectDouble';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
select ptyProtectDouble(val, 'DOUBLE_DE') from test_data_table;

4.5.16 ptyUnprotectDouble()

This UDF unprotects a protected double value.

ptyUnprotectDouble(Double input, String dataElement)

Parameters

Double input: Protected double value to unprotect
String dataElement: Name of the data element to unprotect the double value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns an unprotected double value.
Example

create temporary function ptyProtectDouble as 'com.protegrity.hive.udf.ptyProtectDouble';
create temporary function ptyUnprotectDouble as 'com.protegrity.hive.udf.ptyUnprotectDouble';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue double) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
insert overwrite table protected_data_table select ptyProtectDouble(val, 'DOUBLE_DE') from test_data_table;
select ptyUnprotectDouble(protectedValue, 'DOUBLE_DE') from protected_data_table;

4.5.17 ptyReprotect()

This UDF reprotects double format protected data with a different data element.

ptyReprotect(Double input, String oldDataElement, String newDataElement)

Parameters

Double input: Double value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns a protected double value.

Example

create temporary function ptyProtectDouble AS 'com.protegrity.hive.udf.ptyProtectDouble';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectDouble(val, 'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'NoEncryption', 'NoEncryption') from test_protected_data_table;

4.5.18 ptyProtectBigInt()

This UDF protects a BigInt value.

ptyProtectBigInt(BigInt input, String dataElement)

Parameters

BigInt input: Value to protect
String dataElement: Name of the data element to protect the value

Result

This UDF returns a protected BigInteger value.
Example

create temporary function ptyProtectBigInt as 'com.protegrity.hive.udf.ptyProtectBigInt';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
select ptyProtectBigInt(val, 'BIGINT_DE') from test_data_table;

4.5.19 ptyUnprotectBigInt()

This UDF unprotects a protected BigInt value.

ptyUnprotectBigInt(BigInt input, String dataElement)

Parameters

BigInt input: Protected value to unprotect
String dataElement: Name of the data element to unprotect the value

Result

This UDF returns an unprotected BigInteger value.

Example

create temporary function ptyProtectBigInt as 'com.protegrity.hive.udf.ptyProtectBigInt';
create temporary function ptyUnprotectBigInt as 'com.protegrity.hive.udf.ptyUnprotectBigInt';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue bigint) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
insert overwrite table protected_data_table select ptyProtectBigInt(val, 'BIGINT_DE') from test_data_table;
select ptyUnprotectBigInt(protectedValue, 'BIGINT_DE') from protected_data_table;

4.5.20 ptyReprotect()

This UDF reprotects BigInt format protected data with a different data element.

ptyReprotect(BigInt input, String oldDataElement, String newDataElement)

Parameters

BigInt input: BigInt value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data

Result

This UDF returns a protected BigInt value.
Example

create temporary function ptyProtectBigInt AS 'com.protegrity.hive.udf.ptyProtectBigInt';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectBigInt(val, 'BIGINT_DE') from test_data_table;
create table test_reprotected_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'BIGINT_DE', 'new_BIGINT_DE') from test_protected_data_table;

4.5.21 ptyProtectDec()

This UDF protects a decimal value.

This API works only with the CDH 4.3 distribution.

ptyProtectDec(Decimal input, String dataElement)

Parameters

Decimal input: Decimal value to protect
String dataElement: Name of the data element to protect the decimal value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns a protected decimal value.

Example

create temporary function ptyProtectDec as 'com.protegrity.hive.udf.ptyProtectDec';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
select ptyProtectDec(val, 'BIGDECIMAL_DE') from test_data_table;

4.5.22 ptyUnprotectDec()

This UDF unprotects a protected decimal value.

This API works only with the CDH 4.3 distribution.

ptyUnprotectDec(Decimal input, String dataElement)

Parameters

Decimal input: Protected decimal value to unprotect
String dataElement: Name of the data element to unprotect the decimal value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns an unprotected decimal value.
Example

create temporary function ptyProtectDec as 'com.protegrity.hive.udf.ptyProtectDec';
create temporary function ptyUnprotectDec as 'com.protegrity.hive.udf.ptyUnprotectDec';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table protected_data_table select ptyProtectDec(val, 'BIGDECIMAL_DE') from test_data_table;
select ptyUnprotectDec(protectedValue, 'BIGDECIMAL_DE') from protected_data_table;

4.5.23 ptyProtectHiveDecimal()

This UDF protects a decimal value.

This API works only for distributions which include Hive, version 0.11 and later.

ptyProtectHiveDecimal(Decimal input, String dataElement)

Parameters

Decimal input: Decimal value to protect
String dataElement: Name of the data element to protect the decimal value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Before the ptyProtectHiveDecimal() UDF is called, Hive rounds off the decimal value in the table to 18 digits in scale, irrespective of the length of the data.

Result

This UDF returns a protected decimal value.

Example

create temporary function ptyProtectHiveDecimal as 'com.protegrity.hive.udf.ptyProtectHiveDecimal';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
select ptyProtectHiveDecimal(val, 'BIGDECIMAL_DE') from test_data_table;

4.5.24 ptyUnprotectHiveDecimal()

This UDF unprotects a protected decimal value.

This API works only for distributions which include Hive, version 0.11 and later.

ptyUnprotectHiveDecimal(Decimal input, String dataElement)

Parameters

Decimal input: Protected decimal value to unprotect
String dataElement: Name of the data element to unprotect the decimal value

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Before the ptyUnprotectHiveDecimal() UDF is called, Hive rounds off the decimal value in the table to 18 digits in scale, irrespective of the length of the data.

Result

This UDF returns an unprotected decimal value.
Example

create temporary function ptyProtectHiveDecimal as 'com.protegrity.hive.udf.ptyProtectHiveDecimal';
create temporary function ptyUnprotectHiveDecimal as 'com.protegrity.hive.udf.ptyUnprotectHiveDecimal';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table protected_data_table select ptyProtectHiveDecimal(val, 'BIGDECIMAL_DE') from test_data_table;
select ptyUnprotectHiveDecimal(protectedValue, 'BIGDECIMAL_DE') from protected_data_table;

4.5.25 ptyReprotect()

This UDF reprotects decimal format protected data with a different data element.

This API works only for distributions which include Hive, version 0.11 and later.

ptyReprotect(Decimal input, String oldDataElement, String newDataElement)

Parameters

Decimal input: Decimal value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

Result

This UDF returns a protected decimal value.

Example

create temporary function ptyProtectHiveDecimal AS 'com.protegrity.hive.udf.ptyProtectHiveDecimal';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectHiveDecimal(val, 'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'NoEncryption', 'NoEncryption') from test_protected_data_table;

4.6 Pig UDFs

This section describes the Pig UDFs that are available in Big Data Protector for protecting and unprotecting data when building secure Big Data applications.

4.6.1 ptyGetVersion()

This UDF returns the current version of PEP.
ptyGetVersion()

Parameters

None

Result

chararray: Version number

Example

REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar; -- register the PEP Pig jar
DEFINE ptyGetVersion com.protegrity.pig.udf.ptyGetVersion; -- define the UDF
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray); -- load employee.csv from the HDFS path
version = FOREACH employees GENERATE ptyGetVersion();
DUMP version;

4.6.2 ptyWhoAmI()

This UDF returns the name of the currently logged in user.

ptyWhoAmI()

Parameters

None

Result

chararray: User name

Example

REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyWhoAmI com.protegrity.pig.udf.ptyWhoAmI;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray);
username = FOREACH employees GENERATE ptyWhoAmI();
DUMP username;

4.6.3 ptyProtectInt()

This UDF returns the protected value for integer data.

ptyProtectInt(int data, chararray dataElement)

Parameters

int data: Data to protect
chararray dataElement: Name of the data element to use for protection

Result

Protected value for the given integer data

Example

REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectInt com.protegrity.pig.udf.ptyProtectInt;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:int, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectInt(eid, 'token_integer');
DUMP data_p;

4.6.4 ptyUnprotectInt()

This UDF returns the unprotected value for protected integer data.

ptyUnprotectInt(int data, chararray dataElement)

Parameters

int data: Protected data
chararray dataElement: Name of the data element to use for unprotection

Result

Unprotected value for the given protected integer data

Example

REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectInt com.protegrity.pig.udf.ptyProtectInt;
DEFINE ptyUnprotectInt com.protegrity.pig.udf.ptyUnProtectInt;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:int, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectInt(eid, 'token_integer') AS eid:int;
data_u = FOREACH data_p GENERATE ptyUnprotectInt(eid, 'token_integer');
DUMP data_u;

4.6.5 ptyProtectStr()

This UDF protects a string value.

ptyProtectStr(chararray input, chararray dataElement)

Parameters

chararray input: String value to protect
chararray dataElement: Name of the data element to protect the string value

Result

chararray: Protected value

Example

REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectStr(name, 'token_alphanumeric');
DUMP data_p;

4.6.6 ptyUnprotectStr()

This UDF unprotects a protected string value.
ptyUnprotectStr(chararray input, chararray dataElement)

Parameters

chararray input: Protected string value to unprotect
chararray dataElement: Name of the data element to unprotect the string value

Result

chararray: Unprotected value

Example

REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
DEFINE ptyUnprotectStr com.protegrity.pig.udf.ptyUnProtectStr;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectStr(name, 'token_alphanumeric') AS name:chararray;
DUMP data_p;
data_u = FOREACH data_p GENERATE ptyUnprotectStr(name, 'token_alphanumeric');
DUMP data_u;

5 HDFS File Protector (HDFSFP)

5.1 Overview of HDFSFP

The files stored in HDFS are plain text files, guarded only by POSIX-based file system access control. These files may contain sensitive data, which is vulnerable if it is exposed to unwanted users. The HDFS File Protector (HDFSFP) helps to transparently protect these files as they are stored into HDFS and allows only authorized users to access the content of the files.

5.2 Features of HDFSFP

The following are the features of HDFSFP:

• Protects and stores files in HDFS, and retrieves the protected files in the clear from HDFS, as per the centrally defined security policy and access control.
• Stores and retrieves files from HDFS transparently for the user, depending on their access control rights.
• Preserves Hadoop distributed data processing, ensuring that protected content is processed on data nodes independently.
• Passes through files that are not addressed by the defined access control transparently, without any protection or unprotection.
• Protects temporary data, such as intermediate files generated by a MapReduce job.
• Provides recursive access control for HDFS directories and files. Protects directories, their subdirectories, and files, as per the defined security policy and access control.
• Protects files at rest so that unauthorized users can view only the protected content.
• Adds minimal overhead for data processing in HDFS.
• Can be accessed using the command shell and the Java API.

5.3 Protector Usage

Files stored in HDFS are plain text files. Access controls for HDFS are implemented using file-based permissions that follow the UNIX permissions model. These files may contain sensitive data, making them vulnerable when exposed to unwanted users. Such files should be transparently protected as they are stored into HDFS, and their content should be exposed only to authorized users. Files are stored in and retrieved from HDFS using Hadoop ecosystem products, such as file shell commands, MapReduce jobs, and so on; a brief usage sketch of the shell commands follows after section 5.4.

Any user or application with write access to protected data at rest in HDFS can delete, update, or move the protected data. Therefore, although the protected data can be lost, it is not compromised, because the user or application cannot access the original data in the clear. Ensure that the Hadoop administrator assigns file permissions in HDFS cautiously.

5.4 File Recover Utility

The File Recover utility recovers the contents of a protected file. For more information about the File Recover utility, refer to section 3.4.3 Recover Utility.
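The following is a minimal usage sketch of the transparent ingest and retrieve flow described above, using the HDFSFP shell commands that are documented in the next section. The directory /user/demo/protected, the file names, and the assumption that this directory is covered by an HDFSFP access control policy (with the executing user holding the create/protect and unprotect permissions) are illustrative assumptions, not values taken from this guide.

# Ingest a local file; because the destination directory is assumed to be protected,
# the data lands in HDFS in encrypted form.
hadoop ptyfs -copyFromLocal /tmp/customers.csv /user/demo/protected/customers.csv

# Retrieve the file; a user with unprotect permissions receives it in the clear.
hadoop ptyfs -copyToLocal /user/demo/protected/customers.csv /tmp/customers_clear.csv

# Reading the same file through the standard HDFS shell shows only the protected content.
hdfs dfs -cat /user/demo/protected/customers.csv

If a user without unprotect permissions runs the retrieval command, the copy operation fails, consistent with the command behavior described in the next section.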
5.5 HDFSFP Commands

Hadoop provides shell commands for modifying and administering HDFS. HDFSFP extends the modification commands to control access to files and directories in HDFS. This section describes the commands supported by HDFSFP.

5.5.1 copyFromLocal

This command ingests local data into HDFS.

hadoop ptyfs -copyFromLocal

Result

• If the destination HDFS directory path is protected and the user executing the command has permissions to create and protect, then the data is ingested in encrypted form.
• If the destination HDFS directory path is protected and the user does not have permissions to create and protect, then the copy operation fails.
• If the destination HDFS directory path is not protected, then the data is ingested in clear form.

5.5.2 put

This command ingests local data into HDFS.

hadoop ptyfs -put

Result

• If the destination HDFS directory path is protected and the user executing the command has permissions to create and protect, then the data is ingested in encrypted form.
• If the destination HDFS directory path is protected and the user does not have permissions to create and protect, then the copy operation fails.
• If the destination HDFS directory path is not protected, then the data is ingested in clear form.

5.5.3 copyToLocal

This command copies an HDFS file to a local directory.

hadoop ptyfs -copyToLocal

Result

• If the source HDFS file is protected and the user has unprotect permissions, then the file is copied to the destination directory in clear form.
• If the source HDFS file is not protected, then the file is copied to the destination directory.
• If the source HDFS file is protected and the user does not have unprotect permissions, then the copy operation fails.

5.5.4 get

This command copies an HDFS file to a local directory.

hadoop ptyfs -get

Result

• If the source HDFS file is protected and the user has unprotect permissions, then the file is copied to the destination directory in clear form.
• If the source HDFS file is not protected, then the file is copied to the destination directory.
• If the source HDFS file is protected and the user does not have unprotect permissions, then the copy operation fails.

5.5.5 cp

This command copies a file from one HDFS directory to another HDFS directory.

hadoop ptyfs -cp