Big Data Protector Guide 6.6.5
Copyright

Copyright © 2004-2017 Protegrity Corporation. All rights reserved.

Protegrity products are protected by and subject to patent protections. Patent: http://www.protegrity.com/patents

The Protegrity logo is the trademark of Protegrity Corporation.

NOTICE TO ALL PERSONS RECEIVING THIS DOCUMENT

Some of the product names mentioned herein are used for identification purposes only and may be trademarks and/or registered trademarks of their respective owners.

Windows, MS-SQL Server, Internet Explorer and Internet Explorer logo, Active Directory, and Hyper-V are registered trademarks of Microsoft Corporation in the United States and/or other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
SCO and SCO UnixWare are registered trademarks of The SCO Group.
Sun, Oracle, Java, and Solaris, and their logos are the trademarks or registered trademarks of Oracle Corporation and/or its affiliates in the United States and other countries.
Teradata and the Teradata logo are the trademarks or registered trademarks of Teradata Corporation or its affiliates in the United States and other countries.
Hadoop or Apache Hadoop, the Hadoop elephant logo, HDFS, Hive, Pig, HBase, and Spark are trademarks of the Apache Software Foundation.
Cloudera, Impala, and the Cloudera logo are trademarks of Cloudera and its suppliers or licensors.
Hortonworks and the Hortonworks logo are the trademarks of Hortonworks, Inc. in the United States and other countries.
Greenplum is the registered trademark of EMC Corporation in the U.S. and other countries.
Pivotal HD and HAWQ are the registered trademarks of Pivotal, Inc. in the U.S. and other countries.
The MapR logo is a registered trademark of MapR Technologies, Inc.
PostgreSQL or Postgres is the copyright of The PostgreSQL Global Development Group and The Regents of the University of California.
IBM and the IBM logo, z/OS, AIX, DB2, Netezza, and BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.
Utimaco Safeware AG is a member of the Sophos Group.
Jaspersoft, the Jaspersoft logo, and JasperServer products are trademarks and/or registered trademarks of Jaspersoft Corporation in the United States and in jurisdictions throughout the world.
Xen, XenServer, and XenSource are trademarks or registered trademarks of Citrix Systems, Inc. and/or one or more of its subsidiaries, and may be registered in the United States Patent and Trademark Office and in other countries.
VMware, the VMware "boxes" logo and design, Virtual SMP, and VMotion are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions.
HP is a registered trademark of the Hewlett-Packard Company.
Dell is a registered trademark of Dell Inc.
Novell is a registered trademark of Novell, Inc. in the United States and other countries.
POSIX is a registered trademark of the Institute of Electrical and Electronics Engineers, Inc.
Mozilla and Firefox are registered trademarks of the Mozilla Foundation.
Chrome is a registered trademark of Google Inc.
Contents

Copyright

1 Introduction to this Guide
    1.1 Sections contained in this Guide
    1.2 Protegrity Documentation Suite
    1.5 Glossary

2 Overview of the Big Data Protector
    2.1 Components of Hadoop
        2.1.1 Hadoop Distributed File System (HDFS)
        2.1.2 MapReduce
        2.1.3 Hive
        2.1.4 Pig
        2.1.5 HBase
        2.1.6 Impala
        2.1.7 HAWQ
        2.1.8 Spark
    2.2 Features of Protegrity Big Data Protector
    2.3 Using Protegrity Data Security Platform with Hadoop
    2.4 Overview of Hadoop Application Protection
        2.4.1 Protection in MapReduce Jobs
        2.4.2 Protection in Hive Queries
        2.4.3 Protection in Pig Jobs
        2.4.4 Protection in HBase
        2.4.5 Protection in Impala
        2.4.6 Protection in HAWQ
        2.4.7 Protection in Spark
    2.5 HDFS File Protection (HDFSFP)
    2.6 Ingesting Data Securely
        2.6.1 Ingesting Data Using ETL Tools and File Protector Gateway (FPG)
        2.6.2 Ingesting Files Using Hive Staging
        2.6.3 Ingesting Files into HDFS by HDFSFP
    2.7 Data Security Policy and Protection Methods

3 Installing and Uninstalling Big Data Protector
    3.1 Installing Big Data Protector on a Cluster
        3.1.1 Verifying Prerequisites for Installing Big Data Protector
        3.1.2 Extracting Files from the Installation Package
        3.1.3 Updating the BDP.config File
        3.1.4 Installing Big Data Protector
        3.1.5 Applying Patches
        3.1.6 Installing the DFSFP Service
        3.1.7 Configuring HDFSFP
        3.1.8 Configuring HBase
        3.1.9 Configuring Impala
        3.1.10 Configuring HAWQ
        3.1.11 Configuring Spark
    3.2 Installing or Uninstalling Big Data Protector on Specific Nodes
        3.2.1 Installing Big Data Protector on New Nodes added to a Hadoop Cluster
        3.2.2 Uninstalling Big Data Protector from Selective Nodes in the Hadoop Cluster
    3.3 Utilities
        3.3.1 PEP Server Control
        3.3.2 Update Cluster Policy
        3.3.3 Protegrity Cache Control
        3.3.4 Recover Utility
    3.4 Uninstalling Big Data Protector from a Cluster
        3.4.1 Verifying the Prerequisites for Uninstalling Big Data Protector
        3.4.2 Removing the Cluster from the ESA
        3.4.3 Uninstalling Big Data Protector from the Cluster

4 Hadoop Application Protector
    4.1 Using the Hadoop Application Protector
    4.2 Prerequisites
    4.3 Samples
    4.4 MapReduce APIs
        4.4.1 openSession()
        4.4.2 closeSession()
        4.4.3 getVersion()
        4.4.4 getCurrentKeyId()
        4.4.5 checkAccess()
        4.4.6 getDefaultDataElement()
        4.4.7 protect()
        4.4.8 protect()
        4.4.9 protect()
        4.4.10 unprotect()
        4.4.11 unprotect()
        4.4.12 unprotect()
        4.4.13 bulkProtect()
        4.4.14 bulkProtect()
        4.4.15 bulkProtect()
        4.4.16 bulkUnprotect()
        4.4.17 bulkUnprotect()
        4.4.18 bulkUnprotect()
        4.4.19 reprotect()
        4.4.20 reprotect()
        4.4.21 reprotect()
        4.4.22 hmac()
    4.5 Hive UDFs
        4.5.1 ptyGetVersion()
        4.5.2 ptyWhoAmI()
        4.5.3 ptyProtectStr()
        4.5.4 ptyUnprotectStr()
        4.5.5 ptyReprotect()
        4.5.6 ptyProtectUnicode()
        4.5.7 ptyUnprotectUnicode()
        4.5.8 ptyReprotectUnicode()
        4.5.9 ptyProtectInt()
        4.5.10 ptyUnprotectInt()
        4.5.11 ptyReprotect()
        4.5.12 ptyProtectFloat()
        4.5.13 ptyUnprotectFloat()
        4.5.14 ptyReprotect()
        4.5.15 ptyProtectDouble()
        4.5.16 ptyUnprotectDouble()
        4.5.17 ptyReprotect()
        4.5.18 ptyProtectBigInt()
        4.5.19 ptyUnprotectBigInt()
        4.5.20 ptyReprotect()
        4.5.21 ptyProtectDec()
        4.5.22 ptyUnprotectDec()
        4.5.23 ptyProtectHiveDecimal()
        4.5.24 ptyUnprotectHiveDecimal()
        4.5.25 ptyReprotect()
    4.6 Pig UDFs
        4.6.1 ptyGetVersion()
        4.6.2 ptyWhoAmI()
        4.6.3 ptyProtectInt()
        4.6.4 ptyUnprotectInt()
        4.6.5 ptyProtectStr()
        4.6.6 ptyUnprotectStr()

5 HDFS File Protector (HDFSFP)
    5.1 Overview of HDFSFP
    5.2 Features of HDFSFP
    5.3 Protector Usage
    5.4 File Recover Utility
    5.5 HDFSFP Commands
        5.5.1 copyFromLocal
        5.5.2 put
        5.5.3 copyToLocal
        5.5.4 get
        5.5.5 cp
        5.5.6 mkdir
        5.5.7 mv
        5.5.8 rm
        5.5.9 rmr
    5.6 Ingesting Files Securely
    5.7 Extracting Files Securely
    5.8 HDFSFP Java API
        5.8.1 copy
        5.8.2 copyFromLocal
        5.8.3 copyToLocal
        5.8.4 deleteFile
        5.8.5 deleteDir
        5.8.6 mkdir
        5.8.7 move
    5.9 Developing Applications using HDFSFP Java API
        5.9.1 Setting up the Development Environment
        5.9.2 Protecting Data using the Class file
        5.9.3 Protecting Data using the JAR file
        5.9.4 Sample Program for the HDFSFP Java API
    5.10 Quick Reference Tasks
        5.10.1 Protecting Existing Data
        5.10.2 Reprotecting Files
    5.11 Sample Demo Use Case
    5.12 Appliance components of HDFSFP
        5.12.1 Dfsdatastore Utility
        5.12.2 Dfsadmin Utility
    5.13 Access Control Rules for Files and Folders
    5.14 Using the DFS Cluster Management Utility (dfsdatastore)
        5.14.1 Adding a Cluster for Protection
        5.14.2 Updating a Cluster
        5.14.3 Removing a Cluster
        5.14.4 Monitoring a Cluster
        5.14.5 Searching a Cluster
        5.14.6 Listing all Clusters
    5.15 Using the ACL Management Utility (dfsadmin)
        5.15.1 Adding an ACL Entry for Protecting Directories in HDFS
        5.15.2 Updating an ACL Entry
        5.15.3 Reprotecting Files or Folders
        5.15.4 Deleting an ACL Entry to Unprotect Files or Directories
        5.15.5 Activating Inactive ACL Entries
        5.15.6 Viewing the ACL Activation Job Progress Information in the Interactive Mode
        5.15.7 Viewing the ACL Activation Job Progress Information in the Non Interactive Mode
        5.15.8 Searching ACL Entries
        5.15.9 Listing all ACL Entries
    5.16 HDFS Codec for Encryption and Decryption

6 HBase
    6.1 Overview of the HBase Protector
    6.2 HBase Protector Usage
    6.3 Adding Data Elements and Column Qualifier Mappings to a New Table
    6.4 Adding Data Elements and Column Qualifier Mappings to an Existing Table
    6.5 Inserting Protected Data into a Protected Table
    6.6 Retrieving Protected Data from a Table
    6.7 Protecting Existing Data
    6.8 HBase Commands
        6.8.1 put
        6.8.2 get
        6.8.3 scan
    6.9 Ingesting Files Securely
    6.10 Extracting Files Securely
    6.11 Sample Use Cases

7 Impala
    7.1 Overview of the Impala Protector
    7.2 Impala Protector Usage
    7.3 Impala UDFs
        7.3.1 pty_GetVersion()
        7.3.2 pty_WhoAmI()
        7.3.3 pty_GetCurrentKeyId()
        7.3.4 pty_GetKeyId()
        7.3.5 pty_StringEnc()
        7.3.6 pty_StringDec()
        7.3.7 pty_StringIns()
        7.3.8 pty_StringSel()
        7.3.9 pty_UnicodeStringIns()
        7.3.10 pty_UnicodeStringSel()
        7.3.11 pty_IntegerEnc()
        7.3.12 pty_IntegerDec()
        7.3.13 pty_IntegerIns()
        7.3.14 pty_IntegerSel()
        7.3.15 pty_FloatEnc()
        7.3.16 pty_FloatDec()
        7.3.17 pty_FloatIns()
        7.3.18 pty_FloatSel()
        7.3.19 pty_DoubleEnc()
        7.3.20 pty_DoubleDec()
        7.3.21 pty_DoubleIns()
        7.3.22 pty_DoubleSel()
    7.4 Inserting Data from a File into a Table
    7.5 Protecting Existing Data
    7.6 Unprotecting Protected Data
    7.7 Retrieving Data from a Table
    7.8 Sample Use Cases

8 HAWQ
    8.1 Overview of the HAWQ Protector
    8.2 HAWQ Protector Usage
    8.3 HAWQ UDFs
        8.3.1 pty_GetVersion()
        8.3.2 pty_WhoAmI()
        8.3.3 pty_GetCurrentKeyId()
        8.3.4 pty_GetKeyId()
        8.3.5 pty_VarcharEnc()
        8.3.6 pty_VarcharDec()
        8.3.7 pty_VarcharHash()
        8.3.8 pty_VarcharIns()
        8.3.9 pty_VarcharSel()
        8.3.10 pty_UnicodeVarcharIns()
        8.3.11 pty_UnicodeVarcharSel()
        8.3.12 pty_IntegerEnc()
        8.3.13 pty_IntegerDec()
        8.3.14 pty_IntegerHash()
        8.3.15 pty_IntegerIns()
        8.3.16 pty_IntegerSel()
        8.3.17 pty_DateEnc()
        8.3.18 pty_DateDec()
        8.3.19 pty_DateHash()
        8.3.20 pty_DateIns()
        8.3.21 pty_DateSel()
        8.3.22 pty_RealEnc()
        8.3.23 pty_RealDec()
        8.3.24 pty_RealHash()
        8.3.25 pty_RealIns()
        8.3.26 pty_RealSel()
    8.4 Inserting Data from a File into a Table
    8.5 Protecting Existing Data
    8.6 Unprotecting Protected Data
    8.7 Retrieving Data from a Table
    8.8 Sample Use Cases

9 Spark
    9.1 Overview of the Spark Protector
    9.2 Spark Protector Usage
    9.3 Spark APIs
        9.3.1 getVersion()
        9.3.2 getCurrentKeyId()
        9.3.3 checkAccess()
        9.3.4 getDefaultDataElement()
        9.3.5 hmac()
        9.3.6 protect()
        9.3.7 protect()
        9.3.8 protect()
        9.3.9 protect()
        9.3.10 protect()
        9.3.11 protect()
        9.3.12 protect()
        9.3.13 protect()
        9.3.14 protect()
        9.3.15 protect()
        9.3.16 protect()
        9.3.17 protect()
        9.3.18 protect()
        9.3.19 unprotect()
        9.3.20 unprotect()
        9.3.21 unprotect()
        9.3.22 unprotect()
        9.3.23 unprotect()
        9.3.24 unprotect()
        9.3.25 unprotect()
        9.3.26 unprotect()
        9.3.27 unprotect()
        9.3.28 unprotect()
        9.3.29 unprotect()
        9.3.30 unprotect()
        9.3.31 unprotect()
        9.3.32 reprotect()
        9.3.33 reprotect()
        9.3.34 reprotect()
        9.3.35 reprotect()
        9.3.36 reprotect()
        9.3.37 reprotect()
        9.3.38 reprotect()
    9.4 Displaying the Cleartext Data from a File
    9.5 Protecting Existing Data
    9.6 Unprotecting Protected Data
    9.7 Retrieving the Unprotected Data from a File
    9.8 Spark APIs and Supported Protection Methods
    9.9 Sample Use Cases
    9.10 Spark SQL
        9.10.1 DataFrames
        9.10.2 SQLContext
        9.10.3 Accessing the Hive Protector UDFs
        9.10.4 Sample Use Cases
    9.11 Spark Scala
        9.11.1 Sample Use Cases

10 Data Node and Name Node Security with File Protector
    10.1 Features of the Protegrity File Protector
        10.1.1 Protegrity File Encryption
        10.1.2 Protegrity Volume Encryption
        10.1.3 Protegrity Access Control

11 Appendix: Return Codes

12 Appendix: Samples
    12.1 Roles in the Samples
    12.2 Data Elements in the Security Policy
    12.3 Role-based Permissions for Data Elements in the Sample
    12.4 Data Used by the Samples
    12.5 Protecting Data using MapReduce
        12.5.1 Basic Use Case
        12.5.2 Role-based Use Cases
        12.5.3 Sample Code Usage
    12.6 Protecting Data using Hive
        12.6.1 Basic Use Case
        12.6.2 Role-based Use Cases
    12.7 Protecting Data using Pig
        12.7.1 Basic Use Case
        12.7.2 Role-based Use Cases
    12.8 Protecting Data using HBase
        12.8.1 Basic Use Case
        12.8.2 Role-based Use Cases
    12.9 Protecting Data using Impala
        12.9.1 Basic Use Case
        12.9.2 Role-based Use Cases
    12.10 Protecting Data using HAWQ
        12.10.1 Basic Use Case
        12.10.2 Role-based Use Cases
    12.11 Protecting Data using Spark
        12.11.1 Basic Use Case
        12.11.2 Role-based Use Cases
        12.11.3 Sample Code Usage for Spark (Java)
        12.11.4 Sample Code Usage for Spark (Scala)

13 Appendix: HDFSFP Demo
    13.1 Roles in the Demo
    13.2 HDFS Directories used in Demo
    13.3 User Permissions for HDFS Directories
    13.4 Prerequisites for the Demo
    13.5 Running the Demo
        13.5.1 Protecting Existing Data in HDFS
        13.5.2 Ingesting Data into a Protected Directory
        13.5.3 Ingesting Data into an Unprotected Public Directory
        13.5.4 Reading the Data by Authorized Users
        13.5.5 Reading the Data by Unauthorized Users
        13.5.6 Copying Data from One Directory to Another by Authorized Users
        13.5.7 Copying Data from One Directory to Another by Unauthorized Users
        13.5.8 Deleting Data by Authorized Users
        13.5.9 Deleting Data by Unauthorized Users
        13.5.10 Copying Data to a Public Directory by Authorized Users
        13.5.11 Running MapReduce Job by Authorized Users
        13.5.12 Reading Data for Analysis by Authorized Users

14 Appendix: Using Hive with HDFSFP
    14.1 Data Used by the Samples
    14.2 Ingesting Data to Hive Table
        14.2.1 Ingesting Data from HDFSFP Protected External Hive Table to HDFSFP Protected Internal Hive Table
        14.2.2 Ingesting Protected Data from HDFSFP Protected Hive Table to another HDFSFP Protected Hive Table
    14.3 Tokenization and Detokenization with HDFSFP
        14.3.1 Verifying Prerequisites for Using Hadoop Application Protector
        14.3.2 Ingesting Data from HDFSFP Protected External Hive Table to HDFSFP Protected Internal Hive Table in Tokenized Form
        14.3.3 Ingesting Detokenized Data from HDFSFP Protected Internal Hive Table to HDFSFP Protected External Hive Table
        14.3.4 Ingesting Data from HDFSFP Protected External Hive Table to Internal Hive Table not protected by HDFSFP in Tokenized Form
        14.3.5 Ingesting Detokenized Data from Internal Hive Table not protected by HDFSFP to HDFSFP Protected External Hive Table

15 Appendix: Configuring Talend with HDFSFP
    15.1 Verifying Prerequisites before Configuring Talend with HDFSFP
    15.2 Verifying the Talend Packages
    15.3 Configuring Talend with HDFSFP
    15.4 Starting a Project in Talend
    15.5 Configuring the Preferences for Talend
    15.6 Ingesting Data in the Target HDFS Directory in Protected Form
    15.7 Accessing the Data from the Protected Directory in HDFS
    15.8 Configuring Talend Jobs to run with HDFSFP with Target Exec as Remote
    15.9 Using Talend with HDFSFP and MapReduce
        15.9.1 Protecting Data Using Talend with HDFSFP and MapReduce
        15.9.2 Unprotecting Data Using Talend with HDFSFP and MapReduce
        15.9.3 Sample Code Usage

16 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database
    16.1 Migrating Tokenized Unicode Data from a Teradata Database
    16.2 Migrating Tokenized Unicode Data to a Teradata Database
1 Introduction to this Guide

This guide provides information about installing, configuring, and using the Protegrity Big Data Protector (BDP) for Hadoop.

1.1 Sections contained in this Guide

The guide is broadly divided into the following sections:

• Section 1 Introduction to this Guide defines the purpose and scope of this guide and explains how the information in it is organized.
• Section 2 Overview of the Big Data Protector provides a general idea of Hadoop and how it is integrated with the Big Data Protector. It also describes the protection coverage of the various Hadoop ecosystem applications, such as MapReduce, Hive, and Pig, and provides information about HDFS File Protection (HDFSFP).
• Section 3 Installing and Uninstalling Big Data Protector includes information common to all distributions, such as the prerequisites for installation, the installation procedure, and uninstallation of the product from the cluster. It also provides information about the tools and utilities.
• Section 4 Hadoop Application Protector provides information about the Hadoop Application Protector, including the MapReduce APIs and the Hive and Pig UDFs.
• Section 5 HDFS File Protector (HDFSFP) provides information about the protection of files stored in HDFS and the commands supported.
• Section 6 HBase provides information about the Protegrity HBase protector.
• Section 7 Impala provides information about the Protegrity Impala protector.
• Section 8 HAWQ provides information about the Protegrity HAWQ protector.
• Section 9 Spark provides information about the Protegrity Spark protector, including Spark SQL and Spark Scala.
• Section 10 Data Node and Name Node Security with File Protector provides information about protecting the Data and Name nodes using the File Protector.
• Section 11 Appendix: Return Codes provides information about the possible return codes and their descriptions for the Big Data Protector.
• Section 12 Appendix: Samples provides information about sample data protection for MapReduce, Hive, Pig, HBase, Impala, HAWQ, and Spark using the Big Data Protector.
• Section 13 Appendix: HDFSFP Demo provides information about sample data protection for HDFSFP using the Big Data Protector.
• Section 14 Appendix: Using Hive with HDFSFP provides information about using Hive with HDFSFP.
• Section 15 Appendix: Configuring Talend with HDFSFP provides the procedures for configuring Talend with HDFSFP.
• Section 16 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database describes the procedures for migrating tokenized Unicode data from and to a Teradata database.
1.2 Protegrity Documentation Suite

The Protegrity Documentation Suite comprises the following documents:

• Protegrity Documentation Master Index Release 6.6.5
• Protegrity Appliances Overview Release 6.6.5
• Protegrity Enterprise Security Administrator Guide Release 6.6.5
• Protegrity File Protector Gateway Server User Guide Release 6.6.4
• Protegrity Protection Server Guide Release 6.6.5
• Protegrity Data Security Platform Feature Guide Release 6.6.5
• Protegrity Data Security Platform Licensing Guide Release 6.6
• Protegrity Data Security Platform Upgrade Guide Release 6.6.5
• Protegrity Reports Guide Release 6.6.5
• Protegrity Troubleshooting Guide Release 6.6.5
• Protegrity Application Protector Guide Release 6.5 SP2
• Protegrity Big Data Protector Guide Release 6.6.5
• Protegrity Database Protector Guide Release 6.6.5
• Protegrity File Protector Guide Release 6.6.4
• Protegrity Protection Enforcement Point Servers Installation Guide Release 6.6.5
• Protegrity Protection Methods Reference Release 6.6.5
• Protegrity Row Level Protector Guide Release 6.6.5
• Protegrity Enterprise Security Administrator Quick Start Guide Release 6.6
• Protegrity File Protector Gateway Server Quick Start Guide Release 6.6.2
• Protegrity Protection Server Quick Start Guide Release 6.6

1.5 Glossary

This section includes Protegrity-specific terms, products, and abbreviations used in this document.

• BDP – The Big Data Protector (BDP) is the API for protecting data on platforms such as Hive, Impala, and HBase.
• ESA – Enterprise Security Administrator (ESA).
• DPS roles – The DPS roles relate to the security policy in the ESA and control the access permissions to the Access Keys. For instance, if a user does not have the required DPS role, then the user does not have access to the Access Keys.
• DPS – Protegrity Data Protection System (DPS) is the entire system where security policies are defined and enforced, including the ESA and the protectors.

2 Overview of the Big Data Protector

The Protegrity Big Data Protector for Apache Hadoop is based on the Protegrity Application Protector. Data is split and distributed across the data nodes in the Hadoop cluster. The Big Data Protector is deployed on each of these nodes together with the PEP Server, through which the protection enforcement policies are shared.

The Protegrity Big Data Protector is scalable, and new nodes can be added as required. It is cost effective, since massively parallel computing is done on commodity servers, and it is flexible, as it can work with data from any number of sources. The Big Data Protector is fault tolerant, because the system redirects work to another node if a node is lost. It can handle all types of data, both structured and unstructured, irrespective of their native formats.

The Big Data Protector protects data handled by the various Hadoop applications and protects files stored in the cluster. MapReduce, Hive, Pig, HBase, and Impala can use the Protegrity protection interfaces to protect data as it is stored in or retrieved from the Hadoop cluster. All standard protection techniques offered by Protegrity are applicable to the Big Data Protector. For more information about the available protection options, such as data types, tokenization or encryption types, or length-preserving and non-length-preserving tokens, refer to the Protection Methods Reference Guide 6.6.5.
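The protection interfaces mentioned above are session based: an application opens a session, calls protect() or unprotect() for a named data element defined in the security policy, and closes the session (for example, openSession(), protect(), unprotect(), and closeSession() in the MapReduce APIs, Section 4.4). The following Java sketch only illustrates that open-protect-close pattern. The ProtectorSession interface, the openSession() factory shown here, the method signatures, and the data element name CCN_DE are assumptions made for this example and are not the shipped Protegrity classes; refer to Section 4.4 for the actual API.

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    /*
     * Illustrative only: mirrors the open-protect-close pattern of the session-based
     * APIs documented later in this guide (Section 4.4). Every class, method signature,
     * and data element name below is an assumption made for this example; none of it
     * is the shipped Protegrity API.
     */
    public class ProtectionFlowSketch {

        /* Hypothetical stand-in for a protector session; not a Protegrity class. */
        interface ProtectorSession extends AutoCloseable {
            String protect(String dataElement, String clearText);
            String unprotect(String dataElement, String protectedValue);
            @Override
            void close();
        }

        /* Toy stand-in "protection" (Base64) used only so the sketch runs end to end. */
        static ProtectorSession openSession() {
            return new ProtectorSession() {
                public String protect(String dataElement, String clearText) {
                    return Base64.getEncoder()
                            .encodeToString(clearText.getBytes(StandardCharsets.UTF_8));
                }
                public String unprotect(String dataElement, String protectedValue) {
                    return new String(Base64.getDecoder().decode(protectedValue),
                            StandardCharsets.UTF_8);
                }
                public void close() {
                    // A real protector session would release policy and native resources here.
                }
            };
        }

        public static void main(String[] args) {
            // "CCN_DE" is a hypothetical data element name from a security policy.
            try (ProtectorSession session = openSession()) {
                String token = session.protect("CCN_DE", "4111111111111111");
                String clear = session.unprotect("CCN_DE", token);
                System.out.println(token + " -> " + clear);
            }
        }
    }

In the real APIs, the data element name comes from the security policy deployed from the ESA, and each call is subject to the caller's permissions in that policy.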
2.1 Components of Hadoop The Big Data Protector works on the Hadoop framework as shown in the following figure. BI Applications Data Access Framework HBase Hive Pig Data Storage Framework (HDFS) Other Data Processing Framework (MapReduce) Figure 2-1 Hadoop Components The illustration of Hadoop components is an example. Based on requirements, the components of Hadoop might be different. Hadoop interfaces have been used extensively to develop the Big Data Protector. It is a common deployment practice to utilize Hadoop Distributed File System (HDFS) to store the data, and let MapReduce process the data and store the result back in HDFS. Confidential 16 Big Data Protector Guide 6.6.5 2.1.1 Overview of the Big Data Protector Hadoop Distributed File System (HDFS) Hadoop Distributed File System (HDFS) spans across all nodes in a Hadoop cluster for data storage. It links together the file systems on many nodes to make them into one big file system. HDFS assumes that nodes will fail, so data is replicated across multiple nodes to achieve reliability. 2.1.2 MapReduce The MapReduce framework assigns work to every node in large clusters of commodity machines. MapReduce programs are sets of instructions to parse the data, create a map or index, and aggregate the results. Since data is distributed across multiple nodes, MapReduce programs run in parallel, working on smaller sets of data. A MapReduce job is executed by splitting each job into small Map tasks, and these tasks are executed on the node where a portion of the data is stored. If a node containing the required data is saturated and not able to execute a task, then MapReduce shifts the task to the least busy node by replicating the data to that node. A Reduce task combines results from multiple Map tasks, and store all of them back to the HDFS. 2.1.3 Hive The Hive framework resides above Hadoop to enable ad hoc queries on the data in Hadoop. Hive supports HiveQL, which is similar to SQL. Hive translates a HiveQL query into a MapReduce program and then sends it to the Hadoop cluster. 2.1.4 Pig Pig is a high-level platform for creating MapReduce programs used with Hadoop. 2.1.5 HBase HBase is a column-oriented datastore, meaning it stores data by columns rather than by rows. This makes certain data access patterns much less expensive than with traditional row-oriented relational database systems. The data in HBase is protected transparently using Protegrity HBase coprocessors. 2.1.6 Impala Impala is an MPP SQL query engine for querying the data stored in a cluster. It provides the flexibility of the SQL format and is capable of running the queries on HDFS in HBase. The Impala daemon runs on each node in the cluster, reading and writing to data in the files, and accepts queries from the Impala shell command. The following are the core components of Impala: • • • Impala daemon (impalad) – This component is the Impala daemon which runs on each node in the cluster. It reads and writes the data in the files and accepts queries from the Impala shell command. Impala Statestore (statestored) – This component checks the health of the Impala daemons on all the nodes contained in the cluster. If a node is unavailable due to any error or failure, then the Impala statestore component informs all other nodes about the failed node to ensure that new queries are not sent to the failed node. 
Impala Catalog (catalogd) – This component is responsible for communicating any changes in the metadata received from the Impala SQL statements to all the nodes in the cluster. Confidential 17 Big Data Protector Guide 6.6.5 2.1.7 Overview of the Big Data Protector HAWQ HAWQ is an MPP database, which uses several Postgres database instances and HDFS storage. The database is distributed across HAWQ segments, which enable it to achieve data and processing parallelism. Since HAWQ uses the Postgres engine for processing queries, the query language is similar to PostgresSQL. Users connect to the HAWQ Master and interact using SQL statements, similar to the Postgres database. The following are the core components of HAWQ: HAWQ Master Server: Enables users to interact with HAWQ using client programs, such as PSQL or APIs, such as JDBC or ODBC Name Node: Enables client applications to locate a file HAWQ Segments: Are the units which process the individual data modules simultaneously HAWQ Storage: Is HDFS, which stores all the table data Interconnect Switch: Is the networking layer of HAWQ, which handles the communication between the segments • • • • • 2.1.8 Spark Spark is an execution engine that carries out batch processing of jobs in-memory and handles a wider range of computational workloads. In addition to processing a batch of stored data, Spark is capable of manipulating data in real time. Spark leverages the physical memory of the Hadoop system and utilizes Resilient Distributed Datasets (RDDs) to store the data in-memory and lowers latency, if the data fits in the memory size. The data is saved on the hard drive only if required. 2.2 Features of Protegrity Big Data Protector The Protegrity Big Data Protector (Big Data Protector) uses patent-pending vaultless tokenization and central policy control for access management and secures sensitive data at rest in the following areas: • Data in HDFS • Data used during MapReduce, Hive and Pig processing, and with HBase, Impala, HAWQ, and Spark • Data traversing enterprise data systems The data is protected from internal and external threats, and users and business processes can continue to utilize the secured data. Data protection may be by encryption or tokenization. In tokenization, data is converted to similar looking inert data known as tokens where the data format and type can be preserved. These tokens can be detokenized back to the original values when it is required. Protegrity secures files with volume encryption and also protects data inside files using tokenization and strong encryption protection methods. Depending on the user access rights and the policies set using Policy management in ESA, this data is unprotected. The Protegrity Hadoop Big Data Protector provides the following features: • Provides fine grained field-level protection within the MapReduce, Hive, Pig, HBase, and Spark frameworks. Confidential 18 Big Data Protector Guide 6.6.5 Overview of the Big Data Protector • Provides directory and file level protection (encryption). • Retains distributed processing capability as field-level protection is applied to the data. • Protects data in the Hadoop cluster using role-based administration with a centralized security policy. • Provides logging and viewing data access activities and real-time alerts with a centralized monitoring system. • Ensures minimal overhead for processing secured data, with minimal consumption of resources, threads and processes, and network bandwidth. 
• Performs file and volume encryption including the protection of files on the local file system of Hadoop nodes. • Provides transparent data protection and row level filtering based on the user profile with Protegrity HBase protectors. • Transparently protects files processed by MapReduce and Hive in HDFS using HDFSFP. The following figure illustrates the various components in an Enterprise Hadoop ecosystem. Figure 2-2 Enterprise Hadoop Components Currently, Protegrity supports MapReduce, Hive, Pig, and HBase which utilize HDFS as the data storage layer. The following points can be referred to as general guidelines: • Sqoop: Sqoop can be used for ingestion into HDFSFP protected zone (For Hortonworks, Cloudera and Pivotal HD). • Beeline, Beeswax, and Hue on Cloudera: Beeline, Beeswax, and Hue are certified with Hive protector and Hive with HDFSFP integrations. • Beeline, Beeswax, and Hue on Hortonworks & Pivotal HD: Beeline, Beeswax, and Hue are certified with Hive protector and Hive with HDFSFP integrations. • Ranger (Hortonworks): Ranger is certified to work with the Hive protector and Hive with HDFSFP integrations only. • Sentry (Cloudera): Sentry is certified with Hive protector, Hive with HDFSFP integrations, and Impala protector only. • MapReduce and HDFSFP integration is certified with TEXTFILE format only. Confidential 19 Big Data Protector Guide 6.6.5 • Overview of the Big Data Protector Hive and HDFSFP integration is certified with TEXTFILE, RCFile, and SEQUENCEFILE formats only. • Pig and HDFSFP integration is certified with TEXTFILE format only. We neither support nor have certified other components in the Hadoop stack. We strongly recommend consulting Protegrity, before using any unsupported components from the Hadoop ecosystem with our products. 2.3 Using Protegrity Data Security Platform with Hadoop To protect data, the components of the Protegrity Data Security Platform are integrated into the Hadoop cluster as shown in the following figure. Figure 2-3 Protegrity Data Security Platform with Hadoop The Enterprise Security Administrator (ESA) is a soft appliance that needs to be pre-installed on a separate server, which is used to create and manage policies. The following figure illustrates the inbound and outbound ports that need to be allowed on the network for communication between the ESA and the Big Data Protector nodes in a Hadoop cluster. Figure 2-4 Inbound and Outbound Ports between the ESA and Big Data Protector Nodes Confidential 20 Big Data Protector Guide 6.6.5 Overview of the Big Data Protector For more information about installing the ESA, and creating and managing policies, refer Protegrity Enterprise Security Administrator Guide Release 6.6.5. To achieve a parallel nature for the system, a PEP Server is installed on every data node. It is synchronized with the connection properties of ESA. Each task runs on a node under the same Hadoop user. Every user has a policy deployed for running their jobs on this system. Hadoop manages the accounts and users. You can get the Hadoop user information from the actual job configuration. HDFS implements a permission model for files and directories, based on the Portable Operating System Interface (POSIX) for Unix model. Each file and directory is associated with an owner and a group. 
Depending on the permissions granted, users for the file and directory can be classified into one of these three groups: • • • 2.4 Owner Other users of the group All other users Overview of Hadoop Application Protection This section describes the various levels of protection provided by Hadoop Application Protection. 2.4.1 Protection in MapReduce Jobs A MapReduce job in the Hadoop cluster involves sensitive data. You can use Protegrity interfaces to protect data when it is saved or retrieved from a protected source. The output data written by the job can be encrypted or tokenized. The protected data can be subsequently used by other jobs in the cluster in a secured manner. Field level data can be secured and ingested into HDFS by independent Hadoop jobs or other ETL tools. For more information about secure ingestion of data in Hadoop, refer to section 2.6.2 Ingesting Files Using Hive Staging. For more information on the list of available APIs, refer to section 4.4 MapReduce APIs. If Hive queries are created to operate on sensitive data, then you can use Protegrity Hive UDFs for securing data. While inserting data to Hive tables, or retrieving data from protected Hive table columns, you can call Protegrity UDFs loaded into Hive during installation. The UDFs protect data based on the input parameters provided. Secure ingestion of data into HDFS to operate Hive queries can be achieved by independent Hadoop jobs or other ETL tools. For more information about securely ingesting data in Hadoop, refer to section 2.6 Ingesting Data Securely. 2.4.2 Protection in Hive Queries Protection in Hive queries is done by Protegrity Hive UDFs, which translates a HiveQL query into a MapReduce program and then sends it to the Hadoop cluster. For more information on the list of available UDFs, refer to section 4.5 Hive UDFs. Confidential 21 Big Data Protector Guide 6.6.5 2.4.3 Overview of the Big Data Protector Protection in Pig Jobs Protection in Pig jobs is done by Protegrity Pig UDFs, which are similar in function to the Protegrity UDFs in Hive. For more information on the list of available UDFs, refer to section 4.6 Pig UDFs. 2.4.4 Protection in HBase HBase is a database which provides random read and write access to tables, consisting of rows and columns, in real-time. HBase is designed to run on commodity servers, to automatically scale as more servers are added, and is fault tolerant as data is divided across servers in the cluster. HBase tables are partitioned into multiple regions. Each region stores a range of rows in the table. Regions contain a datastore in memory and a persistent datastore(HFile). The Name node assigns multiple regions to a region server. The Name node manages the cluster and the region servers store portions of the HBase tables and perform the work on the data. The Protegrity HBase protector extends the functionality of the data storage framework and provides transparent data protection and unprotection using coprocessors, which provide the functionality to run code directly on region servers. The Protegrity coprocessor for HBase runs on the region servers and protects the data stored in the servers. All clients which work with HBase are supported. The data is transparently protected or unprotected, as required, utilizing the coprocessor framework. For more information about HBase, refer to section 6 HBase. 2.4.5 Protection in Impala Impala is an MPP SQL query engine for querying the data stored in a cluster. 
It provides the flexibility of the SQL format and is capable of running the queries on HDFS in HBase. The Protegrity Impala protector extends the functionality of the Impala query engine and provides UDFs which protect or unprotect the data as it is stored or retrieved. For more information about the Impala protector, refer to section 7 Impala. 2.4.6 Protection in HAWQ HAWQ is an MPP database, which enable it to achieve data and processing parallelism. The Protegrity HAWQ protector provides UDFs for protecting data using encryption or tokenization, and unprotecting data by using decryption or detokenization. For more information about the HAWQ protector, refer to section 8 HAWQ. 2.4.7 Protection in Spark Spark is an execution engine that carries out batch processing of jobs in-memory and handles a wider range of computational workloads. In addition to processing a batch of stored data, Spark is capable of manipulating data in real time. The Protegrity Spark protector extends the functionality of the Spark engine and provides APIs that protect or unprotect the data as it is stored or retrieved. For more information about the Spark protector, refer to section 9 Spark. Confidential 22 Big Data Protector Guide 6.6.5 2.5 Overview of the Big Data Protector HDFS File Protection (HDFSFP) Files are stored and retrieved by Hadoop system elements, such as file shell commands, MapReduce, Hive, Pig, HBase and so on. The stored files reside in HDFS and span multiple cluster nodes. Most of the files in HDFS are plain text files and stored in the clear, with access control like a POSIX file system. These files contain sensitive data, making it vulnerable with exposure to unwanted users. These files are transparently protected as they are stored in HDFS. In addition, the content is exposed only to authorized users. The content in the files is unprotected transparently to processes or users, authorized to view and process the files. The user is automatically detected from the job information provided by HDFSFP. The job accessing secured files must be initialized by an authorized user having the required privileges in ACL. The files encrypted by HDFSFP are suitable for distributed processing by Hadoop distributed jobs like MapReduce. HDFSFP protects individual files or files stored in a directory. The access control is governed by the security policy and ACL supplied by the security officer. The access control and security policy is controlled through ESA interfaces. Command line and UI options are available to control ACL entries for file paths and directories. 2.6 Ingesting Data Securely This section describes the ways in which data can be secured and ingested by various jobs in Hadoop at a field or file level. 2.6.1 Ingesting Data Using ETL Tools and File Protector Gateway (FPG) Protegrity provides the File Protector Gateway (FPG) for secure field level protection to ingest data through ETL tools. The ingested files data can be used by Hadoop applications for analytics and processing. The sensitive fields are secured by the FPG before Hadoop jobs operate on it. For more information about field level ingestion by custom MapReduce job for data at rest in HDFS, refer to File Protector Gateway Server Guide 6.6.4. 2.6.2 Ingesting Files Using Hive Staging Semi-structured data files can be loaded into a Hive staging table for ingestion into a Hive table with Hive queries and Protegrity UDFs. After loading data in the table, the data will be stored in protected form. 
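The staging pattern described above can be illustrated with a short HiveQL sketch. The UDF name pty_protect_str and the data element name TokCCN below are illustrative placeholders only; the actual Protegrity Hive UDF names and their parameters are listed in section 4.5 Hive UDFs.

-- Load the semi-structured source file into a staging table as-is
CREATE TABLE customer_staging (name STRING, ccn STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/landing/customer.csv' INTO TABLE customer_staging;

-- Protect the sensitive column while inserting into the target table
CREATE TABLE customer (name STRING, ccn STRING);
INSERT INTO TABLE customer
SELECT name, pty_protect_str(ccn, 'TokCCN') FROM customer_staging;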
2.6.3 Ingesting Files into HDFS by HDFSFP The HDFSFP component of Big Data Protector can be used for ingesting files securely in HDFS. It provides granular access control for the files in HDFS. You can ingest files using the command shell and Java API in HDFSFP. For more information about using HDFSFP, refer to section 5 HDFS File Protector (HDFSFP). 2.7 Data Security Policy and Protection Methods A data security policy establishes processes to ensure the security and confidentiality of sensitive information. In addition, the data security policy establishes administrative and technical safeguards against unauthorized access or use of the sensitive information. Depending on the requirements, the data security policy typically performs the following functions: • Classifies the data that is sensitive for the organization. Confidential 23 Big Data Protector Guide 6.6.5 Overview of the Big Data Protector Defines the methods to protect sensitive data, such as encryption and tokenization. Defines the methods to present the sensitive data, such as masking the display of sensitive information. • Defines the access privileges of the users that would be able to access the data. • Defines the time frame for privileged users to access the sensitive data. • Enforces the security policies at the location where sensitive data is stored. • Provides a means of auditing authorized and unauthorized accesses to the sensitive data. In addition, it can also provide a means of auditing operations to protect and unprotect the sensitive data. The data security policy contains a number of components, such as, data elements, datastores, member sources, masks, and roles. The following list describes the functions of each of these entities: • • Data elements define the data protection properties for protecting sensitive data, consisting of the data securing method, data element type and its description. In addition, Data elements describe the tokenization or encryption properties, which can be associated with roles. • Datastores consist of enterprise systems, which might contain the data that needs to be processed, where the policy is deployed and the data protection function is utilized. • Member sources are the external sources from which users (or members) and groups of users are accessed. Examples are a file, database, LDAP, and Active Directory. • Masks are a pattern of symbols and characters, that when imposed on a data field, obscures its actual value to the user. Masks effectively aid in hiding sensitive data. • Roles define the levels of member access that are appropriate for various types of information. Combined with a data element, roles determine and define the unique data access privileges for each member. For more information about the data security policies, protection methods, and the data elements supported by the components of the Big Data Protector, refer to Protection Methods Reference Guide 6.6.5. • Confidential 24 Big Data Protector Guide 6.6.5 3 Installing and Uninstalling Big Data Protector Installing and Uninstalling Big Data Protector This section describes the procedure to install and uninstall the Big Data Protector. 3.1 Installing Big Data Protector on a Cluster This section describes the tasks for installing Big Data Protector on a cluster. Starting from the Big Data Protector 6.6.4 release, you do not require root access to install Big Data Protector on a cluster. You need a sudoer user account to install Big Data Protector on a cluster. 
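Before starting the installation, the account-related prerequisites listed in the next section can be checked quickly from the Lead node. The commands below are a minimal, illustrative sketch only.

sudo -l        # confirm that the ADMINISTRATOR_USER account has the required sudo privileges
lsb_release    # the lsb_release library must be present, at least on the Lead node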
3.1.1 Verifying Prerequisites for Installing Big Data Protector Ensure that the following prerequisites are met, before installing Big Data Protector: • • • • • The Hadoop cluster is installed, configured, and running. ESA appliance version 6.6.5 is installed, configured, and running. A sudoer user account with privileges to perform the following tasks: o Update the system by modifying the configuration, permissions, or ownership of directories and files. o Perform third party configuration. o Create directories and files. o Modify the permissions and ownership for the created directories and files. o Set the required permissions to the create directories and files for the Protegrity Service Account. o Permissions for using the SSH service. The sudoer password is the same across the cluster. The following user accounts to perform the required tasks: o ADMINISTRATOR_USER: It is the sudoer user account that is responsible to install and uninstall the Big Data Protector on the cluster. This user account needs to have sudo access to install the product. o EXECUTOR_USER: It is a user that has ownership of all Protegrity files, folders, and services. o OPERATOR_USER: It is responsible for performing tasks such as, starting or stopping tasks, monitoring services, updating the configuration, and maintaining the cluster while the Big Data Protector is installed on it. If you need to start, stop, or restart the Protegrity services, then you need sudoer privileges for this user to impersonate the EXECUTOR_USER. Depending on the requriements, a single user on the system may perform multiple roles. If a single user is performing multiple roles, then ensure that the following conditions are met: • • The user has the required permissions and privileges to impersonate the other user accounts, for performing their roles, and perform tasks as the impersonated user. The user is assigned the highest set of privileges, from the required roles that it needs to perform, to execute the required tasks. For instance, if a single user is performing tasks as ADMINISTRATOR_USER, EXECUTOR_USER, and Confidential 25 Big Data Protector Guide 6.6.5 • • • • • • • • Installing and Uninstalling Big Data Protector OPERATOR_USER, then ensure that the user is assigned the privileges of the ADMINISTRATOR_USER. The management scripts provided by the installer in the cluster_utils directory should be run only by the user (OPERATOR_USER) having privileges to impersonate the EXECUTOR_USER. o If the value of the AUTOCREATE_PROTEGRITY_IT_USR parameter in the BDP.config file is set to No, then ensure that a service group containing a user for running the Protegrity services on all the nodes in the cluster already exists. o If the Hadoop cluster is configured with LDAP or AD for user management, then ensure that the AUTOCREATE_PROTEGRITY_IT_USR parameter in the BDP.config file is set to No and that the required service account user is created on all the nodes in the cluster. If the Big Data Protector with versions lower than 6.6.3 was previously installed with HDFSFP, then ensure that you create the backup of DFSFP on the ESA. For more information about creating the DFSFP backup, refer to section 4.1.4 Backing Up DFSFP before Installing Big Data Protector 6.6.3 in Data Security Platform Upgrade Guide 6.6.5. 
If Big Data Protector, version 6.6.3, with build version 6.6.3.15, or lower, was previously installed and the following Spark protector APIs for Encryption/Decryption are utilized: o public void protect(String dataElement, ListerrorIndex, short[] input, byte[][] output) o public void protect(String dataElement, List errorIndex, int[] input, byte[][] output) o public void protect(String dataElement, List errorIndex, long[] input, byte[][] output) o public void unprotect(String dataElement, List errorIndex, byte[][] input, short[] output) o public void unprotect(String dataElement, List errorIndex, byte[][] input, int[] output) o public void unprotect(String dataElement, List errorIndex, byte[][] input, long[] output) For more information, refer to the Advisory for Spark Protector APIs, before installing Big Data Protector, version 6.6.5. If the Big Data Protector was previously installed then uninstall it. In addition, delete the directory from the Lead node. If the /var/log/protegrity/ directory exists on any node in the cluster, then ensure that it is empty. Password based authentication is enabled in the sshd_config file before installation. After the installation is completed, this setting might be reverted back by the system administrator. The lsb_release library is present on the client machine, at least on the Lead node. The Lead node can be any node, such as the Name node, Data node, or Edge node, that can access the Hadoop cluster. The Lead node would be driving the installation of the Big Data Protector across the Hadoop cluster and is responsible for managing the Big Data Protector services throughout the cluster. If the lsb_release library is not present, then the installation of the Big Data Protector fails. This can be verified by running the following command. lsb_release If you are configuring the Big Data Protector with a Kerberos-enabled Hadoop cluster, then ensure that the HDFS superuser (hdfs) has a valid Kerberos ticket. If you are configuring HDFSFP with Big Data Protector, then ensure that the following prerequisites are met: o Ensure that an unstructured policy is created in the ESA, containing the data elements to be linked with the ACL. Confidential 26 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector If a sticky bit is set for an HDFS directory, which is required to be protected by HDFSFP, then the user needs to remove the sticky bit before creating ACLs (for Protect/Reprotect/Unprotect/Update) for that HDFS directory. If required, then the user can set the sticky bit again after activating the ACLs. For more information about creating data elements, security policies, and user roles, refer to Enterprise Security Administrator Guide 6.6.5 and Protection Enforcement Point Servers Installation Guide 6.6.5. o 3.1.2 Extracting Files from the Installation Package To extract the files from the installation package: 1. After receiving the installation package from Protegrity, copy it to the Lead node in any temporary folder, such as /opt/bigdata. 2. 
Extract the files from the installation package using the following command: tar –xf BigDataProtector_ - -nCPU_64_6.6.5.x.tgz The following files are extracted: • • • • • • • • • • • • • • • • • • • • • • • • • • • • BDP.config BdpInstallx.x.x_Linux_ _6.6.5.x.sh FileProtector_ _x86- _AccessControl_6.6.x.x.sh FileProtector_ _x86- _ClusterDeploy_6.6.x.x.sh FileProtector_ _x86- _FileEncryption_6.6.x.x.sh FileProtector_ _x86- _PreInstallCheck_6.6.x.x.sh FileProtector_ _x86- _VolumeEncryption_6.6.x.x.sh FP_ClusterDeploy_hosts INSTALL.txt JpepLiteSetup_Linux_ _6.6.5.x.sh node_uninstall.sh PepHbaseProtectorx.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepHdfsFp_Setup_ -x.x_6.6.5.x.sh PepHivex.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepImpalax.xSetup_ _x86- _6.6.5.x.sh, only if it is a Cloudera or MapR distribution PepHawqx.xSetup_ _x86- _6.6.5.x.sh, only if it is a Pivotal distribution PepMapreducex.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepPigx.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepServer_Setup_Linux_ _6.6.5.x.sh PepSparkx.x.xSetup_Linux_ _ -x.x_6.6.5.x.sh PepTalendSetup_x.x.x_6.6.5.x.sh Prepackaged_Policyx.x.x_Linux_ _6.6.5.x.sh ptyLogAnalyzer.sh ptyLog_Consolidator.sh samples-mapreduce.tar samples-spark.tar uninstall.sh XCPep2Jni_Setup_Linux_ _6.6.5.x.sh Confidential 27 Big Data Protector Guide 6.6.5 3.1.3 Installing and Uninstalling Big Data Protector Updating the BDP.config File Ensure that the BDP.config file is updated before the Big Data Protector is installed. Do not update the BDP.config file when the installation of the Big Data Protector is in progress. To update the BDP.config file: 1. Create a file containing a list of all nodes in the cluster, except the Lead node, and specify it in the BDP.config file. This file is used by the installer for installing Big Data Protector on the nodes. 2. Open the BDP.config file in any text editor and modify the following parameter values: • HADOOP_DIR – The installation home directory for the Hadoop distribution. • PROTEGRITY_DIR – The directory where the Big Data Protector will be installed. The samples and examples used in this document assume that the Big Data Protector is installed in the /opt/protegrity/ directory. • CLUSTERLIST_FILE – This file contains the host name or IP addresses all the nodes in the cluster, except the Lead node, listing one host name and IP address per line. Ensure that you specify the file name with the complete path. • INSTALL_DEMO – Specifies one of the following values, as required: o Yes – The installer installs the demo. o No – The installer does not install the demo. • HDFSFP – Specifies one of the following values, as required: o Yes – The installer installs HDFSFP. o No – The installer does not install HDFSFP. If HDFSFP is being installed, then XCPep2Jni is installed using the XCPep2Jni_Setup_Linux_ _6.6.5.x.sh script. • • • • SPARK_PROTECTOR – Specifies one of the following values, as required: Yes – The installer installs the Spark protector. This parameter also needs to be set to Yes, if the user needs to run Hive UDFs with Spark SQL, or use the Spark protector samples if the INSTALL_DEMO parameter is set to Yes. o No – The installer does not install the Spark protector. IP_NN – The IP address of the Lead node in the Hadoop cluster, which is required for the installation of HDFSFP. PROTEGRITY_CACHE_PORT – The Protegrity Cache port used in the cluster. This port should be open in the firewall across the cluster. On the Lead node, it should be open only for the corresponding ESA, which is used to manage the cluster protection. 
This is required for the installation of HDFSFP. Typical value for this port is 6379. AUTOCREATE_PROTEGRITY_IT_USR – This parameter determines the Protegrity service account. The service group and service user name specified in the PROTEGRITY_IT_USR_GROUP and PROTEGRITY_IT_USR parameters respectively will be created if this parameter is set to Yes. One of the following values can be specified, as required: o Yes – The installer creates a service group PROTEGRITY_IT_USR_GROUP containing the user PROTEGRITY_IT_USR for running the Protegrity services on all the nodes in the cluster. If the service group or service user are already present, then the installer exits. If you uninstall the Big Data Protector, then the service group and the service user are deleted. Confidential 28 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector No – The installer does not create a service group PROTEGRITY_IT_USR_GROUP with the service user PROTEGRITY_IT_USR for running the Protegrity services on all the nodes in the cluster. Ensure that a service group containing a service user for running Protegrity services has been created, as described in section 3.1.1 Verifying Prerequisites for Installing Big Data Protector. PROTEGRITY_IT_USR_GROUP – This service group is required for running the Protegrity services on all the nodes in the cluster. All the Protegrity installation directories are owned by this service group. PROTEGRITY_IT_USR – This service account user is required for running the Protegrity services on all the nodes in the cluster and is a part of the group PROTEGRITY_IT_USR_GROUP. All the Protegrity installation directories are owned by this service user. HADOOP_NATIVE_DIR – The Hadoop native directory. This parameter needs to be specified if you are using MapR. HADOOP_SUPER_USER – The Hadoop super user name. This parameter needs to be specified if you are using MapR. o • • • • 3.1.4 Installing Big Data Protector To install the Big Data Protector: 1. As a sudoer user, run BdpInstallx.x.x_Linux_ _6.6.5.x.sh from the folder where it is extracted. A prompt to confirm or cancel the Big Data Protector installation appears. 2. Type yes to continue with the installation. The Big Data Protector installation starts. If you are using a Cloudera or MapR distribution, then the presence of the HDFS connection is also verified. A prompt to enter the sudoer password for the ADMINISTRATOR user appears. 3. Enter the sudoer password. A prompt to enter the ESA user name or IP address appears. 4. Enter the ESA host name or IP address. A prompt to enter the ESA user name appears. 5. Enter the ESA user name (Security Officer). The PEP Server Installation wizard starts and a prompt to configure the host as ESA proxy appears. 6. Depending on the requirements, type Yes or No to configure the host as an ESA proxy. 7. If the ESA proxy is set to Yes, then enter the host password for the required ESA user. 8. When prompted, perform the following steps to download the ESA keys and certificates. a) Specify the Security Officer user with administrative privileges. b) Specify the Security Officer password for the ESA certificates and keys. The installer then installs the Big Data Protector on all the nodes in the cluster. The status of the installation of the individual components appears, and the log files for all the required components on all the nodes in the cluster are stored on the Lead node in the /cluster_utils/logs directory. 
Verify the installation report, that is generated at /cluster_utils/installation_report.txt to ensure that the installation of all the components is successful on all the nodes in the cluster. Verify the bdp_setup.log file confirm if the Big Data Protector was installed successfully on all the nodes in the cluster. Confidential 29 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector 9. Restart the MapReduce (MRv1) or Yarn (MRv2) services on the Hadoop cluster. The installer installs the following components in the installation folder of the Big Data Protector: • PEP server in the /defiance_dps directory • XCPep2Jni in the /defiance_xc directory • JpepLite in the /jpeplite directory • MapReduce protector in the /pepmapreduce/lib directory • Hive protector in the /pephive/lib directory • Pig protector in the /peppig/lib directory • HBase protector in the /pephbase-protector/lib directory • Impala protector in the /pepimpala directory, if you are using a Cloudera or MapR distribution • HAWQ protector in the /pephawq directory, if you are using a Pivotal distribution • hdfsfp-xxx.jar in the /hdfsfp directory, only if the value of the HDFSFP parameter in the BDP.config file is specified as Yes • pepspark-xxx.jar in the /pepspark/lib directory, only if the value of the SPARK parameter in the BDP.config file is specified as Yes • Talend-related files in /etl/talend directory • Cluster Utilities in the /cluster_utils directory The following files and directories are present in the /cluster_utils folder: o BdpInstallx.x.x_Linux_ _6.6.5.x.sh utility to install the Big Data Protector on any node in the cluster. For more information about using the BdpInstallx.x.x_Linux_ _6.6.5.x.sh utility, refer to section 3.2.1 Installing Big Data Protector on New Nodes added to a Hadoop Cluster. o cluster_cachesrvctl.sh utility for monitoring the status of the Protegrity Cache on all the nodes in the cluster, only if the value of the HDFSFP parameter in the BDP.config file is specified as Yes. o cluster_pepsrvctl.sh utility for managing PEP servers on all nodes in the cluster. o uninstall.sh utility to uninstall the Big Data Protector from all the nodes in the cluster. o node_uninstall.sh to uninstall the Big Data Protector from any nodes in the cluster. For more information about using the node_uninstall.sh utility, refer to section 3.2.2 Uninstalling Big Data Protector from Selective Nodes in the Hadoop Cluster. o update_cluster_policy.sh utility for updating PEP servers when a new policy is deployed. o BDP.config file o CLUSTERLIST_FILE, which is a file containing a list of all the nodes, except the Lead node. o installation_report.txt file that contains the status of installation of all the components in the cluster. o logs directory that contains the consolidated setup logs from all the nodes in the cluster. Confidential 30 Big Data Protector Guide 6.6.5 Installing and Uninstalling Big Data Protector 10. Starting with the Big Data Protector, version 6.6.4, the Bulk APIs in the MapReduce protector will return the detailed error and return codes instead of 0 for failure and 1 for success. For more information about the error codes for Big Data Protector, version 6.6.5, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. 
If the older behaviour of the Bulk APIs in the MapReduce protector from Big Data Protector, version 6.6.3 or lower, is desired, then perform the following steps to enable the backward compatibility mode and retain the same error handling capabilities.
a) If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), then append the following entry to the mapreduce.admin.reduce.child.java.opts property in the mapred-site.xml file.
-Dpty.mr.compatibility=old
b) If you are using CDH, then add the following values to the Yarn Service MapReduce Advanced Configuration Snippet (Safety Valve) parameter for the mapred-site.xml file.
mapreduce.admin.map.child.java.opts -Dpty.mr.compatibility=old
mapreduce.admin.reduce.child.java.opts -Dpty.mr.compatibility=old
11. If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), and you have installed HDFSFP, then perform the following steps.
a) Ensure that the mapreduce.application.classpath property in the mapred-site.xml file contains the following entries in the order provided.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
/hdfsfp/*
Ensure that the above entries are before all other entries in the mapreduce.application.classpath property.
b) Ensure that the mapred.min.split.size property in the hive-site.xml file is set to the following value.
mapred.min.split.size=256000
c) Restart the Yarn service.
d) Restart the MRv2 service.
e) Ensure that the tez.cluster.additional.classpath.prefix property in the tez-site.xml file contains the following entries in the order provided.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
/hdfsfp/*
Ensure that the above entries are before all other entries in the tez.cluster.additional.classpath.prefix property.
f) Restart the Tez services.
12. If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), and you have not installed HDFSFP, then perform the following steps.
a) Ensure that the mapreduce.application.classpath property in the mapred-site.xml file contains the following entries.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
Ensure that the above entries are before all other entries in the mapreduce.application.classpath property.
b) Ensure that the yarn.application.classpath property in the yarn-site.xml file contains the following entries.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
Ensure that the above entries are before all other entries in the yarn.application.classpath property.
c) Restart the Yarn service.
d) Restart the MRv2 service.
e) Ensure that the tez.cluster.additional.classpath.prefix property in the tez-site.xml file contains the following entries.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
Ensure that the above entries are before all other entries in the tez.cluster.additional.classpath.prefix property.
f) Restart the Tez services.
13. If HDFSFP is not installed and you need to use the Hive protector, then perform the following steps.
a) Specify the following value for the hive.exec.pre.hooks property in the hive-site.xml file.
hive.exec.pre.hooks=com.protegrity.hive.PtyHiveUserPreHook
b) Restart the Hive services to ensure that the updates are propagated to all the nodes in the cluster.
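For reference, the setting from step 13 a) corresponds to a hive-site.xml entry similar to the following sketch; confirm the exact form against your distribution's configuration tooling.

<property>
  <name>hive.exec.pre.hooks</name>
  <value>com.protegrity.hive.PtyHiveUserPreHook</value>
</property>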
14. If HDFSFP is installed and you need to use the Hive protector with HDFSFP, then perform the following steps.
a) Specify the following value for the hive.exec.pre.hooks property in the hive-site.xml file.
hive.exec.pre.hooks=com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook
b) Restart the Hive services to ensure that the updates are propagated to all the nodes in the cluster.
If you are using Beeline or Hue, then ensure that the Protegrity Big Data Protector is installed on the following machines:
• For Beeline: The machines where the Hive Metastore and HiveServer2 are running.
• For Hue: The machines where the Hue Server, Hive Metastore, and HiveServer2 are running.
It is recommended to use the Cluster Policy provider to deploy the policies in a multi-node cluster environment, such as Big Data, Teradata, and so on.
If you require the PEP Server service to start automatically after every reboot of the system, then define the PEP Server service in the startup configuration with the required run levels. For more information about starting the PEP Server service automatically, refer to Protection Enforcement Point Servers Installation Guide Release 6.6.5.
3.1.5 Applying Patches
As the functionality of the ESA is extended, it should be updated through patches applied to the ESA. The patches are available as .pty files, which are loaded through the ESA user interface.
Receive the ESA_PAP-ALL-64_x86-64_6.6.5.pty patch, or later, from Protegrity. Upload the patch to the ESA using the Web UI, and then install it using the ESA CLI Manager. For more information about applying patches, refer to section 4.4.6.2 Install Patches of Protegrity Appliances Overview.
3.1.6 Installing the DFSFP Service
Use the Add/Remove Services tool on the ESA to install the DFSFP service. For more information about installing services, refer to section 4.4.6 of Protegrity Appliances Overview.
To install the DFSFP service using the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Administration > Add/Remove Services.
3. Press ENTER. The root password prompt appears.
4. Enter the root password.
5. Press ENTER. The Add/Remove Services screen appears.
6. Select Install applications.
7. Press ENTER.
8. Select DFSFP.
9. Press ENTER. The DFSFP service is installed.
3.1.7 Configuring HDFSFP
If HDFSFP is used, then it should be configured after Big Data Protector is installed. To ensure that the user is able to access protected data in the Hadoop cluster, HDFSFP is configured globally so that it can perform the checks for access control transparently.
Ensure that you set the value of the mapreduce.output.fileoutputformat.compress.type property to BLOCK in the mapred-site.xml file.
3.1.7.1 Configuring HDFSFP for Yarn (MRv2)
To configure Yarn (MRv2) with HDFSFP:
1. Register the Protegrity codec in the Hadoop codec factory configuration. In the io.compression.codecs property in the core-site.xml file, add the codec com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec.
2. Modify the value of the mapreduce.output.fileoutputformat.compress property in the mapred-site.xml file to true.
3. Add the property mapreduce.output.fileoutputformat.compress.codec to the mapred-site.xml file and set the value to com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec.
If the property is already present in the mapred-site.xml file, then ensure that the existing value of the property is replaced with com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 4. Include the /hdfsfp/* path as the first value in the yarn.application.classpath property in the yarn-site.xml file. 5. Restart the HDFS and Yarn services. 3.1.7.2 Configuring HDFSFP for MapReduce, v1 (MRv1) A MapReduce job processes large data sets stored in HDFS across the Hadoop cluster. The result of the MapReduce job is stored in HDFS. The HDFSFP stores protected data in encrypted form in HDFS. The Map job reads protected data and the Reduce job saves the result in protected form. This is done by configuring the Protegrity codec at global level for MapReduce jobs. To configure MRv1 with HDFSFP: 1. Register the Protegrity codec in the Hadoop codec factory configuration. In the io.compression.codecs property in the core-site.xml file, add the codec com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 2. Modify the value of the mapred.output.compress property in the mapred-site.xml file to true. 3. Modify the value of the mapred.output.compression.codec property in the mapred-site.xml file to com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec. 4. Restart the HDFS and MapReduce services. 3.1.7.3 Adding a Cluster to the ESA Before configuring the Cache Refresh Server, ensure that a cluster is added to the ESA. For more information about adding a cluster to the ESA, refer to section 5.14.1 Adding a Cluster for Protection. Confidential 34 Big Data Protector Guide 6.6.5 3.1.7.4 Installing and Uninstalling Big Data Protector Configuring the Cache Refresh Server If a cluster is added to the ESA, then the Cache Refresh server periodically validates the cache entries and takes corrective action, if necessary. This server should always be active. The Cache Refresh Server periodically validates the ACL entries in Protegrity Cache with the ACL entries in the ESA. If a Data store is created using ESA 6.5 SP2 Patch 3 with DFSFPv3 patch installed, then the Cluster configuration file (clusterconfig.xml), located in the /dfs/dfsadmin/config/ directory, contains the field names RedisPort and RedisAuth. • If a Data store is created using ESA 6.5 SP2 Patch 4 with DFSFPv8 patch installed, then the Cluster configuration file (clusterconfig.xml) contains the field names ProtegrityCachePort and ProtegrityCacheAuth. • If a migration of the ESA 6.5 SP2 Patch 3 with DFSFPv3 patch installed to the ESA 6.5 SP2 Patch 4 with DFSFPv8 patch installed is done, then the Cluster configuration file (clusterconfig.xml) contains the field name entries RedisPort and RedisAuth for the old Data stores, and the entries ProtegrityCachePort and ProtegrityCacheAuth for the new Data stores, created after the migration. If the ACL entries present in the appliance are not matching the ACL entries in Protegrity Cache, then logs are generated in the ESA. The logs can be viewed from the ESA Web Interface at the following path: Distributed File System File Protector Logs. • The various error codes are explained in Troubleshooting Guide 6.6.5. To configure the Cache Refresh Server time: 1. Navigate to the path /dfs/cacherefresh/data. 2. Open the dfscacherefresh.cfg file. 3. Modify the cacherefreshtime parameter as required based on the following guidelines: • Default value – 30 minutes • Minimum value – 10 minutes • Maximum value – 720 minutes (12 hours) The Cache Refresh Interval should be entered in minutes. 
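As an illustration of the procedure above, the edit to the dfscacherefresh.cfg file might look like the following sketch. Only the cacherefreshtime parameter name and its allowed range come from this guide; the key=value form shown here is an assumption.

# /dfs/cacherefresh/data/dfscacherefresh.cfg
# Validate the ACL entries in Protegrity Cache against the ESA every 60 minutes (allowed range: 10-720)
cacherefreshtime=60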
To verify if the Cache Refresh Server is running:
1. Login to the ESA Web Interface.
2. Navigate to System > Services > DFS Cache Refresh. The Cache Refresh Server should be running.
3. If the Cache Refresh Server is not running, then click the Start button to start the Cache Refresh Server.
3.1.7.5 Configuring Hive Support in HDFSFP
If Hive is used with HDFSFP, then it should be configured after installing Big Data Protector.
To configure Hive support in HDFSFP:
1. If you are using a Hadoop distribution that has a Management UI, then perform the following steps.
a) In the hive-site.xml file, set the value of the mapreduce.job.maps property to 1, using the Management UI.
If the hive-site.xml file does not have any mapreduce.job.maps property, then perform the following tasks.
a. Add the property with the name mapreduce.job.maps in the hive-site.xml file.
b. Set the value of the mapreduce.job.maps property to 1.
b) In the hive-site.xml file, add the value com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook to the hive.exec.pre.hooks property before any other existing value, using the Management UI.
If the hive-site.xml file does not have any hive.exec.pre.hooks property, then perform the following tasks.
a. Add the property with the name hive.exec.pre.hooks in the hive-site.xml file.
b. Set the value of the hive.exec.pre.hooks property to com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook.
2. If you are using a Hadoop distribution without a Management UI, then perform the following steps.
a) Add the following property in the hive-site.xml file on all nodes.
mapreduce.job.maps 1
If the property is already present in the hive-site.xml file, then ensure that the value of the property is set to 1.
b) Add the following property in the hive-site.xml file on all nodes.
hive.exec.pre.hooks com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook
If the property is already present in the hive-site.xml file, then ensure that the value com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook is before any other existing value.
For more information about using Hive with HDFSFP, refer to section 13 Appendix: Using Hive with HDFSFP.
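The two settings from step 2 correspond to hive-site.xml entries similar to the following sketch. The comma-separated hook list and the com.example.ExistingPreHook entry are assumptions shown only to illustrate placing the Protegrity hook first.

<property>
  <name>mapreduce.job.maps</name>
  <value>1</value>
</property>
<property>
  <name>hive.exec.pre.hooks</name>
  <value>com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook,com.example.ExistingPreHook</value>
</property>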
3.1.8 Configuring HBase
If HBase is used, then it should be configured after Big Data Protector is installed.
Ensure that you configure the Protegrity HBase coprocessor on all the region servers. If the Protegrity HBase coprocessor is not configured on some region servers, then an inconsistent state might occur, where some records in a table are protected and some are not protected. This could potentially lead to data corruption, making it difficult to separate the protected data from clear text data.
It is recommended to use HBase version 0.98 or above. If you are using an HBase version lower than 0.98, then you would need a Java client to perform the protection of data. HBase versions lower than 0.98 do not support ATTRIBUTES, which controls the MIGRATION and BYPASS_COPROCESSOR parameters.
To configure HBase:
1. If you are using a Hadoop distribution that has a Management UI, then add the following value to the HBase coprocessor region classes property in the hbase-site.xml file in all the respective region server groups, using the Management UI.
com.protegrity.hbase.PTYRegionObserver
If the hbase-site.xml file does not have any HBase coprocessor region classes property, then perform the following tasks.
a) Add the property with the name hbase.coprocessor.region.classes in the hbase-site.xml file in all the respective region server groups.
b) Set the following value for the hbase.coprocessor.region.classes property.
com.protegrity.hbase.PTYRegionObserver
If any coprocessors are already defined in the HBase coprocessor region class property, then ensure that the value of the Protegrity coprocessor is before any pre-existing coprocessors defined in the hbase-site.xml file.
2. If you are using a Hadoop distribution without a Management UI, then add the following property in the hbase-site.xml file on all region server nodes.
hbase.coprocessor.region.classes com.protegrity.hbase.PTYRegionObserver
If the property is already present in the hbase-site.xml file, then ensure that the value of the Protegrity coprocessor region class is before any other coprocessor in the hbase-site.xml file.
3. Restart all HBase services.
3.1.9 Configuring Impala
If Impala is used, then it should be configured after Big Data Protector is installed.
To configure Impala:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Navigate to the /pepimpala/sqlscripts/ folder. This folder contains the Protegrity UDFs for the Impala protector.
3. If you are not using a Kerberos-enabled Hadoop cluster, then execute the createobjects.sql script to load the Protegrity UDFs for the Impala protector.
impala-shell -i -f /pepimpala/sqlscripts/createobjects.sql
4. If you are using a Kerberos-enabled Hadoop cluster, then execute the createobjects.sql script to load the Protegrity UDFs for the Impala protector.
impala-shell -i -f /pepimpala/sqlscripts/createobjects.sql -k
If the catalogd process is restarted at any point in time, then all the Protegrity UDFs for the Impala protector should be reloaded using the command in Step 3 or 4, as required.
3.1.10 Configuring HAWQ
If HAWQ is used, then it should be configured after Big Data Protector is installed. Ensure that you are logged in as the gpadmin user for configuring HAWQ.
To configure HAWQ:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Navigate to the /pephawq/sqlscripts/ folder. This folder contains the Protegrity UDFs for the HAWQ protector.
3. Execute the createobjects.sql script to load the Protegrity UDFs for the HAWQ protector.
psql -h HAWQ_Master_Hostname -p 5432 -f /pephawq/sqlscripts/createobjects.sql
where:
HAWQ_Master_Hostname: Hostname or IP address of the HAWQ Master node
5432: Port number
3.1.11 Configuring Spark
If Spark is used, then it should be configured after Big Data Protector is installed.
To configure Spark:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to include the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment.
spark.driver.extraClassPath= /pepspark/lib/*
spark.executor.extraClassPath= /pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to include the following classpath entries.
spark.driver.extraClassPath= /pepspark/lib/*: /hdfsfp/*
spark.executor.extraClassPath= /pepspark/lib/*: /hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
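Expanded with a concrete installation prefix, the spark-defaults.conf entries from steps 2 and 3 would look similar to the following sketch, assuming the default /opt/protegrity installation directory used by the samples in this guide and that HDFSFP is installed.

spark.driver.extraClassPath=/opt/protegrity/pepspark/lib/*:/opt/protegrity/hdfsfp/*
spark.executor.extraClassPath=/opt/protegrity/pepspark/lib/*:/opt/protegrity/hdfsfp/*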
If the user needs to run Hive UDFs with Spark SQL, then the following steps need to be performed.
To configure Spark SQL:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to include the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment.
spark.driver.extraClassPath= /pephive/lib/*: /pepspark/lib/*
spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to include the following classpath entries.
spark.driver.extraClassPath= /pephive/lib/*: /pepspark/lib/*: /hdfsfp/*
spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/*: /hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
3.2 Installing or Uninstalling Big Data Protector on Specific Nodes
This section describes the following procedures:
• Installing Big Data Protector on new nodes added to a Hadoop cluster
• Uninstalling Big Data Protector from selective nodes in the Hadoop cluster
3.2.1 Installing Big Data Protector on New Nodes added to a Hadoop Cluster
If you need to install Big Data Protector on new nodes added to a Hadoop cluster, then use the BdpInstallx.x.x_Linux_ _6.6.5.x.sh utility in the /cluster_utils directory. Ensure that you install the Big Data Protector as an ADMINISTRATOR user having full sudoer privileges.
To install Big Data Protector on new nodes added to a Hadoop cluster:
1. Login to the Lead node.
2. Navigate to the /cluster_utils directory.
3. Add entries for each new node, on which the Big Data Protector needs to be installed, in the NEW_HOSTS_FILE file. The new nodes from the NEW_HOSTS_FILE file will be appended to the CLUSTERLIST_FILE.
4. Execute the following command to install Big Data Protector on the new nodes.
./BdpInstall1.0.1_Linux_ _6.6.5.X.sh -a
The Protegrity Big Data Protector is installed on the new nodes.
3.2.2 Uninstalling Big Data Protector from Selective Nodes in the Hadoop Cluster
If you need to uninstall Big Data Protector from selective nodes in the Hadoop cluster, then use the node_uninstall.sh utility in the /cluster_utils directory. Ensure that you uninstall the Big Data Protector as an ADMINISTRATOR user having full sudoer privileges.
To uninstall Big Data Protector from selective nodes in the Hadoop cluster:
1. Login to the Lead node.
2. Navigate to the /cluster_utils directory.
3. Create a new hosts file (such as NEW_HOSTS_FILE). The NEW_HOSTS_FILE file contains the nodes from which the Big Data Protector needs to be uninstalled.
4. Add the nodes from which the Big Data Protector needs to be uninstalled in the new hosts file.
5. Execute the following command to remove the Big Data Protector from the nodes that are listed in the new hosts file.
./node_uninstall.sh -c NEW_HOSTS_FILE
The Big Data Protector is uninstalled from the nodes listed in the new hosts file.
6. Remove the nodes from which the Big Data Protector is uninstalled in Step 5 from the CLUSTERLIST_FILE file.
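A minimal sketch of the uninstallation procedure above, assuming the default /opt/protegrity installation directory; the host names are placeholders, and the one-entry-per-line format is assumed to match the CLUSTERLIST_FILE convention.

cd /opt/protegrity/cluster_utils
# list the nodes to be uninstalled, one per line
printf "datanode07.example.com\ndatanode08.example.com\n" > NEW_HOSTS_FILE
./node_uninstall.sh -c NEW_HOSTS_FILE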
3.3 Utilities
This section provides information about the following utilities:
• PEP Server Control (cluster_pepsrvctl.sh) – Manages the PEP servers across the cluster.
• Update Cluster Policy (update_cluster_policy.sh) – Updates the configurations of the PEP servers across the cluster.
• Protegrity Cache Control (cluster_cachesrvctl.sh) – Monitors the status of the Protegrity Cache on all the nodes in the cluster. This utility is available only for HDFSFP.
• Recover Utility – Recovers the contents from a protected path. This utility is available only for HDFSFP.
Ensure that you run the utilities with a user (OPERATOR_USER) having sudo privileges for impersonating the service account (EXECUTOR_USER or PROTEGRITY_IT_USR, as configured).

3.3.1 PEP Server Control
This utility (cluster_pepsrvctl.sh), in the /cluster_utils folder, manages the PEP server services on all the nodes in the cluster, except the Lead node. The utility provides the following options:
• Start – Starts the PEP servers in the cluster.
• Stop – Stops the PEP servers in the cluster.
• Restart – Restarts the PEP servers in the cluster.
• Status – Reports the status of the PEP servers.
The utility (pepsrvctrl.sh), in the /defiance_dps/bin/ folder, manages the PEP server services on the Lead node.
When you run the PEP Server Control utility, you are prompted to enter the OPERATOR_USER password, which is the same across all the nodes in the cluster.

3.3.2 Update Cluster Policy
This utility (update_cluster_policy.sh), in the /cluster_utils folder, updates the configurations of the PEP servers across the cluster. For example, if you need to make any changes to the PEP server configuration, then make the changes on the Lead node and propagate them to all the PEP servers in the cluster using the update_cluster_policy.sh utility.
Ensure that all the PEP servers in the cluster are stopped before running the update_cluster_policy.sh utility.
When you run the Update Cluster Policy utility, you are prompted to enter the OPERATOR_USER password, which is the same across all the nodes in the cluster.

3.3.3 Protegrity Cache Control
This utility (cluster_cachesrvctl.sh), in the /cluster_utils folder, monitors the status of the Protegrity Cache on all the nodes in the cluster. This utility prompts for the OPERATOR_USER password. The utility provides the following options:
• Start – Starts the Protegrity Cache services in the cluster.
• Stop – Stops the Protegrity Cache services in the cluster.
• Restart – Restarts the Protegrity Cache services in the cluster.
• Status – Reports the status of the Protegrity Cache services.
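For example, a minimal sketch of invoking the cluster utilities from the /cluster_utils folder on the Lead node, assuming each script prompts for the option to perform (Start, Stop, Restart, or Status, where applicable) and for the OPERATOR_USER password:
./cluster_pepsrvctl.sh       # manage the PEP servers on all nodes except the Lead node
./update_cluster_policy.sh   # propagate PEP server configuration changes made on the Lead node (stop the PEP servers first)
./cluster_cachesrvctl.sh     # HDFSFP only: manage the Protegrity Cache services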
3.3.4 Recover Utility
The Recover utility is available for HDFSFP only. This utility recovers the contents of protected files of the Text, RC, and Sequence types when ACL information is absent or lost. This ensures that the data is not lost under any circumstances.
Parameters
srcpath: The protected HDFS path containing the data to be unprotected.
destpath: The destination directory to store the unprotected data.
Result
• If srcpath is a file path, then the Recover utility recovers the file.
• If srcpath is a directory path, then the Recover utility recovers all the files inside the directory.
Ensure that the user running the Recover utility has unprotect access on the data element that was used to protect the files in the HDFS path.
Ensure that an ADMINISTRATOR or OPERATOR_USER runs the Recover utility and that the user has the required read and execute permissions on the /hdfsfp/recover.sh script.
Example
The following two ACLs are created:
1. /user/root/employee
2. /user/ptyitusr/prot/employee
Run the Recover utility on these two paths, with /tmp/HDFSFP-recovered/ as the destination local directory, by using the following commands.
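Based on the recover.sh syntax used later in this section, the commands would be similar to the following:
/hdfsfp/recover.sh -srcpath /user/root/employee -destpath /tmp/HDFSFP-recovered/
/hdfsfp/recover.sh -srcpath /user/ptyitusr/prot/employee -destpath /tmp/HDFSFP-recovered/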
The following would be recovered in the local directory:
1. /tmp/HDFSFP-recovered/user/root/employee – The files and sub-directories present in the HDFS location /user/root/employee are recovered in cleartext form.
2. /tmp/HDFSFP-recovered/user/ptyitusr/prot/employee – The files and sub-directories present in the HDFS location /user/ptyitusr/prot/employee are recovered in cleartext form.
To recover the protected data from a Hive warehouse directory to a local file system directory:
1. Execute the following command to retrieve the protected data from the Hive warehouse directory.
/hdfsfp/recover.sh -srcpath <srcpath> -destpath <destpath>
The cleartext data from the protected HDFS path is stored in the destination directory.
2. If you need to ensure that the existing Hive queries for the table continue to function, then perform the following steps.
a) Execute the following command to delete the warehouse directory for the table.
hadoop fs -rm -r /user/hive/warehouse/table_name
b) Move the destination directory with the cleartext data from the local file system to the warehouse directory in HDFS using the following command.
hadoop fs -put <destpath>/table_name /user/hive/warehouse/table_name
c) To view the cleartext data in the table, use the following command.
select * from table_name;

3.4 Uninstalling Big Data Protector from a Cluster
This section describes the procedure for uninstalling the Big Data Protector from the cluster.
3.4.1 Verifying the Prerequisites for Uninstalling Big Data Protector
If the Big Data Protector is configured with a Kerberos-enabled Hadoop cluster, then ensure that the HDFS superuser (hdfs) has a valid Kerberos ticket.
3.4.2 Removing the Cluster from the ESA
Before uninstalling Big Data Protector from the cluster, the cluster should be deleted from the ESA.
For more information about deleting the cluster from the ESA, refer to section 5.14.3 Removing a Cluster.
3.4.3 Uninstalling Big Data Protector from the Cluster
Depending on the requirements, perform the following tasks to uninstall the Big Data Protector from the cluster.
3.4.3.1 Removing HDFSFP Configuration for Yarn (MRv2)
If HDFSFP is configured for Yarn (MRv2), then the configuration should be removed before uninstalling Big Data Protector.
To remove the HDFSFP configuration for Yarn (MRv2) after uninstalling Big Data Protector:
1. Remove the com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec codec from the io.compression.codecs property in the core-site.xml file.
2. Modify the value of the mapreduce.output.fileoutputformat.compress property in the mapred-site.xml file to false.
3. Remove the value of the mapreduce.output.fileoutputformat.compress.codec property in the mapred-site.xml file.
4. Remove the /hdfsfp/* path from the yarn.application.classpath property in the yarn-site.xml file.
5. Restart the HDFS and Yarn services.
3.4.3.2 Removing HDFSFP Configuration for MapReduce v1 (MRv1)
If HDFSFP is configured for MapReduce v1 (MRv1), then the configuration should be removed before uninstalling Big Data Protector.
To remove the HDFSFP configuration for MRv1 after uninstalling Big Data Protector:
1. Remove the com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec codec from the io.compression.codecs property in the core-site.xml file.
2. Modify the value of the mapred.output.compress property in the mapred-site.xml file to false.
3. Remove the value of the mapred.output.compression.codec property in the mapred-site.xml file.
4. Restart the HDFS and MapReduce services.
3.4.3.3 Removing Configuration for the Hive Protector if HDFSFP is not Installed
If the Hive protector is used and HDFSFP is not installed, then the configuration should be removed before uninstalling Big Data Protector.
To remove the configuration for the Hive protector if HDFSFP is not installed:
1. If you are using a Hadoop distribution with a Management UI, then remove the value com.protegrity.hive.PtyHiveUserPreHook from the hive.exec.pre.hooks property in the hive-site.xml file, using the configuration management UI.
2. If you are using a Hadoop distribution without a Management UI, then remove the following property from the hive-site.xml file on all nodes.
hive.exec.pre.hooks=com.protegrity.hive.PtyHiveUserPreHook
3.4.3.4 Removing Configurations for Hive Support in HDFSFP
If Hive is used with HDFSFP, then the configuration should be removed before uninstalling Big Data Protector.
To remove the configurations for Hive support in HDFSFP:
1. If you are using a Hadoop distribution with a Management UI, then perform the following steps.
a) In the hive-site.xml file, remove the value of the mapreduce.job.maps property, using the Management UI.
b) In the hive-site.xml file, remove the value com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook from the hive.exec.pre.hooks property, using the configuration management UI.
2. If you are using a Hadoop distribution without a Management UI, then perform the following steps.
a) Remove the following property from the hive-site.xml file on all nodes.
hive.exec.pre.hooks
com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook
b) Remove the following property from the hive-site.xml file on all nodes.
mapreduce.job.maps
3.4.3.5 Removing the Configuration Properties when HDFSFP is not Installed
If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal Hadoop), and you have not installed HDFSFP, then the configuration should be removed before uninstalling Big Data Protector.
To remove the configuration properties:
1. Remove the following entries from the mapreduce.application.classpath property in the mapred-site.xml file.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
2. Remove the following entries from the yarn.application.classpath property in the yarn-site.xml file.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
3. Restart the Yarn service.
4. Restart the MRv2 service.
5. Remove the following entries from the tez.cluster.additional.classpath.prefix property in the tez-site.xml file.
/pepmapreduce/lib/*
/pephive/lib/*
/peppig/lib/*
6. Restart the Tez services.
3.4.3.6 Removing HBase Configuration
If HBase is configured, then the configuration should be removed before uninstalling Big Data Protector.
To remove the HBase configuration:
1. If you are using a Hadoop distribution with a Management UI, then remove the following HBase coprocessor region classes property value from the hbase-site.xml file in all the respective region server groups, using the Management UI.
com.protegrity.hbase.PTYRegionObserver
2. If you are using a Hadoop distribution without a Management UI, then remove the following property from the hbase-site.xml file on all region server nodes.
hbase.coprocessor.region.classes
com.protegrity.hbase.PTYRegionObserver
3. Restart all HBase services.
3.4.3.7 Removing the Defined Impala UDFs
If Impala is configured, then the defined Protegrity UDFs for the Impala protector should be removed before uninstalling Big Data Protector.
To remove the defined Impala UDFs:
If you are not using a Kerberos-enabled Hadoop cluster, then run the following command to remove the defined Protegrity UDFs for the Impala protector using the dropobjects.sql script.
impala-shell -i <impala daemon slave node> -f /pepimpala/sqlscripts/dropobjects.sql
If you are using a Kerberos-enabled Hadoop cluster, then run the following command to remove the defined Protegrity UDFs for the Impala protector using the dropobjects.sql script.
impala-shell -i <impala daemon slave node> -f /pepimpala/sqlscripts/dropobjects.sql -k
3.4.3.8 Removing the Defined HAWQ UDFs
If HAWQ is configured, then the defined Protegrity UDFs for the HAWQ protector should be removed before uninstalling Big Data Protector.
To remove the defined HAWQ UDFs:
Run the following command to remove the defined Protegrity UDFs for the HAWQ protector using the dropobjects.sql script.
psql -h <HAWQ_Master_Hostname> -p 5432 -f /pephawq/sqlscripts/dropobjects.sql
3.4.3.9 Removing the Spark Protector Configuration
If the Spark protector is used, then the required configuration settings should be removed before uninstalling the Big Data Protector.
To remove the Spark protector configuration:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to remove the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment.
spark.driver.extraClassPath= /pepspark/lib/*
spark.executor.extraClassPath= /pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to remove the following classpath entries.
spark.driver.extraClassPath= /pepspark/lib/*: /hdfsfp/*
spark.executor.extraClassPath= /pepspark/lib/*: /hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
If Spark SQL is configured to run Hive UDFs, then the required configuration settings should be removed before uninstalling the Big Data Protector.
To remove the Spark SQL configuration:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to remove the following classpath entries, using Hadoop services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or Pivotal distributions, depending on the environment.
spark.driver.extraClassPath= /pephive/lib/*: /pepspark/lib/*
spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to remove the following classpath entries.
spark.driver.extraClassPath= /pephive/lib/*: /pepspark/lib/*: /hdfsfp/*
spark.executor.extraClassPath= /pephive/lib/*: /pepspark/lib/*: /hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
3.4.3.10 Running the Uninstallation Script
To run the scripts for uninstalling the Big Data Protector on all nodes in the cluster:
1. Log in as the sudoer user and navigate to the /cluster_utils directory on the Lead node.
2. Run the following script to stop the PEP servers on all the nodes in the cluster.
./cluster_pepsrvctl.sh
3. Run the uninstall.sh utility. A prompt to confirm or cancel the Big Data Protector uninstallation appears.
4. Type yes to continue with the uninstallation.
5. When prompted, enter the sudoer password. The uninstallation script continues with the uninstallation of Big Data Protector. If you are using a Cloudera or MapR distribution, then the presence of an HDFS connection and a valid Kerberos ticket is also verified.
The /cluster_utils directory continues to exist on the Lead node. This directory is retained to perform a cleanup if the uninstallation fails on some nodes due to unavoidable reasons, such as a host being down.
6. After Big Data Protector is successfully uninstalled from all nodes, manually delete the /cluster_utils directory from the Lead node.
7. If the /defiance_dps_old directory is present on any of the nodes in the cluster, then it can be manually deleted from the respective nodes.
8. Restart all Hadoop services.

4 Hadoop Application Protector
4.1 Using the Hadoop Application Protector
Various jobs written in the Hadoop cluster require data fields to be stored and retrieved. This data requires protection when it is at rest. The Hadoop Application Protector gives MapReduce, Hive, and Pig the ability to protect data while it is being processed and stored. Application programmers using these tools can include Protegrity software in their jobs to secure data.
For more information about using the protector APIs in various Hadoop applications, including samples, refer to the following sections.
4.2 Prerequisites
Ensure that the following prerequisites are met before using the Hadoop Application Protector:
• The Big Data Protector is installed and configured in the Hadoop cluster.
• The security officer has created the necessary security policy, which creates data elements and user roles with appropriate permissions. For more information about creating security policies, data elements, and user roles, refer to Protection Enforcement Point Servers Installation Guide 6.6.5 and Enterprise Security Administrator Guide 6.6.5.
• The policy is deployed across the cluster.
For more information about the list of all APIs available to Hadoop applications, refer to sections 4.4 MapReduce APIs, 4.5 Hive UDFs, and 4.6 Pig UDFs.
4.3 Samples
To run the samples provided with the Big Data Protector, the pre-packaged policy should be deployed from the ESA. During installation, specify the INSTALL_DEMO parameter as Yes in the BDP.config file.
The commands in the samples may require Hadoop superuser permissions.
For more information about the samples, refer to section 11 Appendix: Samples.
4.4 MapReduce APIs This section describes the MapReduce APIs available for protection and unprotection in the Big Data Protector to build secure Big Data applications. The Protegrity MapReduce protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that supports byte as input and provides byte as output, then data corruption might occur. If you are using the Bulk APIs for the MapReduce protector, then the following two modes for error handling and return codes are available: • Default mode: Starting with the Big Data Protector, version 6.6.4, the Bulk APIs in the MapReduce protector will return the detailed error and return codes instead of 0 for failure and 1 for success. In addition, the MapReduce jobs involving Bulk APIs will provide error codes instead of throwing exceptions. Confidential 47 Big Data Protector Guide 6.6.5 Hadoop Application Protector For more information about the error codes for Big Data Protector, version 6.6.5, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. • 4.4.1 Backward compatibility mode: If you need to continue using the error handling capabilities provided with Big Data Protector, version 6.6.3 or lower, that is 0 for failure and 1 for success, then you can set this mode. openSession() This method opens a new user session for protect and unprotect operations. It is a good practice to create one session per user thread. public synchronized int openSession(String parameter) Parameters parameter: An internal API requirement that should be set to 0. Result 1: If session is successfully created Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); Exception (and Error Codes) ptyMapRedProtectorException: if session creation fails 4.4.2 closeSession() This function closes the current open user session. Every instance of ptyMapReduceProtector opens only one session, and a session ID is not required to close it. public synchronized int closeSession() Parameters None Result 1: If session is successfully closed 0: If session closure is a failure Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); int closeSessionStatus = mapReduceProtector.closeSession(); Exception (and Error Codes) None 4.4.3 getVersion() This function returns the current version of the MapReduce protector. public java.lang.String getVersion() Parameters None Confidential 48 Big Data Protector Guide 6.6.5 Hadoop Application Protector Result This function returns the current version of MapReduce protector. Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); String version = mapReduceProtector.getVersion(); int closeSessionStatus = mapReduceProtector.closeSession(); 4.4.4 getCurrentKeyId() This method returns the current Key ID for the data element which contains the KEY ID attribute, while creating the data element, such as ASE-256, ASE-128, and so on. public int getCurrentKeyId(java.lang.String dataElement) Parameters dataElement: Name of the data element Result This method returns the current Key ID for the data element containing the KEY ID attribute. 
Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); int currentKeyId = mapReduceProtector.getCurrentKeyId("ENCRYPTION_DE"); int closeSessionStatus = mapReduceProtector.closeSession(); 4.4.5 checkAccess() This method checks the access of the user for the specified data element. public boolean checkAccess(java.lang.String dataElement, byte bAccessType) Parameters dataElement: Name of the data element bAccessType: Type of the access of the user for the data element. The following are the different values for the bAccessType variable: DELETE 0x01 PROTECT 0x02 REPROTECT 0x04 UNPROTECT 0x08 CREATE 0x10 MANAGE 0x20 Result 1: If the user has access to the data element Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); byte bAccessType = 0x02; boolean isAccess = mapReduceProtector.checkAccess("DE_PROTECT" , bAccessType ); int closeSessionStatus = mapReduceProtector.closeSession(); Confidential 49 Big Data Protector Guide 6.6.5 4.4.6 Hadoop Application Protector getDefaultDataElement() This method returns default data element configured in security policy. public String getDefaultDataElement(String policyName) Parameters policyName: Name of policy configured using Policy management in ESA. Result Default data element name configured in a given policy. Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); String defaultDataElement = mapReduceProtector.getDefaultDataElement("my_policy"); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If unable to return default data element name 4.4.7 protect() Protects the data provided as a byte array. The type of protection applied is defined by dataElement. public byte[] protect(String dataElement, byte[] data) Parameters dataElement: Name of the data element to be protected data: Byte array of data to be protected The Protegrity MapReduce protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that supports byte as input and provides byte as output, then data corruption might occur. If you are using the Protect API which accepts byte as input and provides byte as output, then ensure that when unprotecting the data, the Unprotect API, with byte as input and byte as output is utilized. In addition, ensure that the byte data being provided as input to the Protect API has been converted from a string data type only. Result Byte array of protected data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); byte[] bResult = mapReduceProtector.protect( "DE_PROTECT","protegrity".getBytes()); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If unable to protect data Confidential 50 Big Data Protector Guide 6.6.5 4.4.8 Hadoop Application Protector protect() Protects the data provided as int. The type of protection applied is defined by dataElement. 
public int protect(String dataElement, int data) Parameters dataElement: Name of the data element to be protected data: int to be protected Result Protected int data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); int bResult = mapReduceProtector.protect( "DE_PROTECT",1234); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If unable to protect data 4.4.9 protect() Protects the data provided as long. The type of protection applied is defined by dataElement. public long protect(String dataElement, long data) Parameters dataElement: Name of the data element to be protected data: long data to be protected Result Protected long data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); long bResult = mapReduceProtector.protect( "DE_PROTECT",123412341234); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If unable to protect data 4.4.10 unprotect() This function returns the data in its original form. public byte[] unprotect(String dataElement, byte[] data) Parameters dataElement: Name of data element to be unprotected data: array of data to be unprotected Confidential 51 Big Data Protector Guide 6.6.5 Hadoop Application Protector The Protegrity MapReduce protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that supports byte as input and provides byte as output, then data corruption might occur. Result Byte array of unprotected data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); byte[] protectedResult = mapReduceProtector.protect( "DE_PROTECT_UNPROTECT", "protegrity".getBytes() ); byte[] unprotectedResult = mapReduceProtector.unprotect( "DE_PROTECT_UNPROTECT", protectedResult ); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If unable to unprotect data 4.4.11 unprotect() This function returns the data in its original form. public int unprotect(String dataElement, int data) Parameters dataElement: Name of data element to be unprotected data: int to be unprotected Result Unprotected int data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); int protectedResult = mapReduceProtector.protect( "DE_PROTECT_UNPROTECT", 1234 ); int unprotectedResult = mapReduceProtector.unprotect( "DE_PROTECT_UNPROTECT", protectedResult ); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If unable to unprotect data 4.4.12 unprotect() This function returns the data in its original form. 
public long unprotect(String dataElement, long data) Parameters dataElement: Name of data element to be unprotected data: long data to be unprotected Result Unprotected long data Confidential 52 Big Data Protector Guide 6.6.5 Hadoop Application Protector Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); long protectedResult = mapReduceProtector.protect( "DE_PROTECT_UNPROTECT", 123412341234 ); long unprotectedResult = mapReduceProtector.unprotect( "DE_PROTECT_UNPROTECT", protectedResult ); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If unable to unprotect data 4.4.13 bulkProtect() This is used when a set of data needs to be protected in a bulk operation. It helps to improve performance. public byte[][] bulkProtect(String dataElement, List errorIndex, byte[][] inputDataItems) Parameters dataElement: Name of data element to be protected errorIndex: array used to store all error indices encountered while protecting each data entry in inputDataItems inputDataItems: Two-dimensional array to store bulk data for protection Result Two-dimensional byte array of protected data. If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk protect operation: • • • 1: The protect operation for the entry is successful. 0: The protect operation for the entry is unsuccessful. o For more information about the failed entry, view the logs available in ESA Forensics. Any other value or garbage return value: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics. Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); List errorIndex = new ArrayList (); byte[][] protectData = {"protegrity".getBytes{}, "protegrity".getBytes(), "protegrity".getBytes(), "protegrity".getBytes()}; byte[][] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT", errorIndex, protectData ); System.out.print("Protected Data: "); for(int i = 0; i < protectedData.length; i++) { Confidential 53 Big Data Protector Guide 6.6.5 Hadoop Application Protector //THIS WILL PRINT THE PROTECTED DATA System.out.print(protectedData[i] == null ? null : new String(protectedData[i])); if(i < protectedData.length - 1) { System.out.print(","); } } System.out.println(""); System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } //ABOVE CODE WILL PRINT THE ERROR INDEXES int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If an error is encountered during bulk protection of data 4.4.14 bulkProtect() This is used when a set of data needs to be protected in a bulk operation. It helps to improve performance. 
public int[] bulkProtect(String dataElement, List errorIndex, int[] inputDataItems) Parameters dataElement: Name of data element to be protected errorIndex: array used to store all error indices encountered while protecting each data entry in input Data Items inputDataItems: array to store bulk int data for protection Result int array of protected data If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk protect operation: • • • 1: The protect operation for the entry is successful. 0: The protect operation for the entry is unsuccessful. o For more information about the failed entry, view the logs available in ESA Forensics. Any other value or garbage return value: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics. Confidential 54 Big Data Protector Guide 6.6.5 Hadoop Application Protector Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); List errorIndex = new ArrayList (); int[] protectData = {1234, 5678, 9012, 3456}; int[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT", errorIndex, protectData ); //CHECK THE ERROR INDEXES FOR ERRORS System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } //ABOVE CODE WILL ONLY PRINT THE ERROR INDEXES int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If an error is encountered during bulk protection of data 4.4.15 bulkProtect() This is used when a set of data needs to be protected in a bulk operation. It helps to improve performance. public long[] bulkProtect(String dataElement, List errorIndex, long[] inputDataItems) Parameters dataElement: Name of data element to be protected errorIndex: array used to store all error indices encountered while protecting each data entry in input Data Items inputDataItems: array to store bulk long data for protection Result Long array of protected data If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk protect operation: • • 1: The protect operation for the entry is successful. 0: The protect operation for the entry is unsuccessful. o For more information about the failed entry, view the logs available in ESA Forensics. Confidential 55 Big Data Protector Guide 6.6.5 • Hadoop Application Protector Any other value or garbage return value: The protect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics. 
Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); List errorIndex = new ArrayList (); long[] protectData = {123412341234, 567856785678, 901290129012, 345634563456}; long[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT", errorIndex, protectData ); //CHECK THE ERROR INDEXES FOR ERRORS System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } //ABOVE CODE WILL ONLY PRINT THE ERROR INDEXES int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If an error is encountered during bulk protection of data 4.4.16 bulkUnprotect() This method unprotects in bulk the inputDataItems with the required data element. public byte[][] bulkUnprotect(String dataElement, List errorIndex, byte[][] inputDataItems) Parameters String dataElement: Name of data element to be unprotected int[] error index: array of the error indices encountered while unprotecting each data entry in inputDataItems byte[][] inputDataItems: two-dimensional array to help store bulk data to be unprotected Result Two-dimensional byte array of unprotected data If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk unprotect operation: • • 1: The unprotect operation for the entry is successful. 0: The unprotect operation for the entry is unsuccessful. Confidential 56 Big Data Protector Guide 6.6.5 Hadoop Application Protector For more information about the failed entry, view the logs available in ESA Forensics. Any other value or garbage return value: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics. o • Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); List errorIndex = new ArrayList (); byte[][] protectData = {"protegrity".getBytes{}, "protegrity".getBytes(), "protegrity".getBytes(), "protegrity".getBytes()}; byte[][] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT", errorIndex, protectData ); //THIS WILL PRINT THE UNPROTECTED DATA System.out.print("Protected Data: "); for(int i = 0; i < protectedData.length; i++) { System.out.print(protectedData[i] == null ? null : new String(protectedData[i])); if(i < protectedData.length - 1) { System.out.print(","); } } //THIS WILL PRINT THE ERROR INDEX FOR PROTECT OPERATION System.out.println(""); System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } byte[][] unprotectedData = mapReduceProtector.bulkUnprotect( "DE_PROTECT", errorIndex, protectedData ); //THIS WILL PRINT THE PROTECTED DATA System.out.print("UnProtected Data: "); for(int i = 0; i < unprotectedData.length; i++) { System.out.print(unprotectedData[i] == null ? 
null : new String(unprotectedData[i])); if(i < unprotectedData.length - 1) { System.out.print(","); } } //THIS WILL PRINT THE ERROR INDEX FOR UNPROTECT OPERATION System.out.println(""); Confidential 57 Big Data Protector Guide 6.6.5 Hadoop Application Protector System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: For errors when unprotecting data 4.4.17 bulkUnprotect() This method unprotects in bulk the inputDataItems with the required data element. public int[] bulkUnprotect(String dataElement, List errorIndex, int[] inputDataItems) Parameters String dataElement: Name of data element to be unprotected int[] error index: array of the error indices encountered while unprotecting each data entry in inputDataItems int[] inputDataItems: int array to be unprotected Result unprotected int array data If the Backward Compatibility mode is not set, then the appropriate error code appears. For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk unprotect operation: • • • 1: The unprotect operation for the entry is successful. 0: The unprotect operation for the entry is unsuccessful. o For more information about the failed entry, view the logs available in ESA Forensics. Any other value or garbage return value: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics. Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); List errorIndex = new ArrayList (); int[] protectData = {1234, 5678,9012,3456 }; int[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT", errorIndex, protectData ); //THIS WILL PRINT THE ERROR INDEX FOR PROTECT OPERATION System.out.println(""); Confidential 58 Big Data Protector Guide 6.6.5 Hadoop Application Protector System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } int[] unprotectedData = mapReduceProtector.bulkUnprotect( "DE_PROTECT", errorIndex, protectedData ); //THIS WILL PRINT THE ERROR INDEX FOR UNPROTECT OPERATION System.out.println(""); System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: For errors when unprotecting data 4.4.18 bulkUnprotect() This method unprotects in bulk the inputDataItems with the required data element. public long[] bulkUnprotect(String dataElement, List errorIndex, long[] inputDataItems) Parameters String dataElement: Name of data element to be unprotected int[] error index: array of the error indices encountered while unprotecting each data entry in inputDataItems long[] inputDataItems: long array to be unprotected Result Unprotected long array data If the Backward Compatibility mode is not set, then the appropriate error code appears. 
For more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix: Return Codes. If the Backward Compatibility mode is set, then the Error Index includes one of the following values, per entry in the bulk unprotect operation: • • 1: The unprotect operation for the entry is successful. 0: The unprotect operation for the entry is unsuccessful. o For more information about the failed entry, view the logs available in ESA Forensics. Confidential 59 Big Data Protector Guide 6.6.5 • Hadoop Application Protector Any other value or garbage return value: The unprotect operation for the entry is unsuccessful. For more information about the failed entry, view the logs available in ESA Forensics. Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); List errorIndex = new ArrayList (); long[] protectData = { 123412341234, 567856785678, 901290129012, 345634563456 }; long[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT", errorIndex, protectData ); //THIS WILL PRINT THE ERROR INDEX FOR PROTECT OPERATION System.out.println(""); System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } long[] unprotectedData = mapReduceProtector.bulkUnprotect( "DE_PROTECT", errorIndex, protectedData ); //THIS WILL PRINT THE ERROR INDEX FOR UNPROTECT OPERATION System.out.println(""); System.out.print("Error Index: "); for(int i = 0; i < errorIndex.size(); i++) { System.out.print(errorIndex.get( i )); if(i < errorIndex.size() - 1) { System.out.print(","); } } int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: For errors when unprotecting data 4.4.19 reprotect() Data that has been protected earlier is protected again with a separate data element. public byte[] reprotect(String oldDataElement, String newDataElement, byte[] data) Parameters String oldDataElement: Name of data element with which data was protected earlier String newDataElement: Name of new data element with which data is reprotected byte[] data: array of data to be protected Confidential 60 Big Data Protector Guide 6.6.5 Hadoop Application Protector Result Byte array of reprotected data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); byte[] protectedResult = mapReduceProtector.protect( "DE_PROTECT_1", "protegrity".getBytes() ); byte[] reprotectedResult = mapReduceProtector.reprotect( "DE_PROTECT_1", "DE_PROTECT_2", protectedResult ); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: For errors while reprotecting data 4.4.20 reprotect() Data that has been protected earlier is protected again with a separate data element. 
public int reprotect(String oldDataElement, String newDataElement, int data) Parameters String oldDataElement: Name of data element with which data was protected earlier String newDataElement: Name of new data element with which data is reprotected int data: array of data to be protected Result Reprotected int data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); int protectedResult = mapReduceProtector.protect( "DE_PROTECT_1", 1234 ); int reprotectedResult = mapReduceProtector.reprotect( "DE_PROTECT_1", "DE_PROTECT_2", protectedResult ); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: For errors while reprotecting data 4.4.21 reprotect() Data that has been protected earlier is protected again with a separate data element. public long reprotect(String oldDataElement, String newDataElement, long data) Parameters String oldDataElement: Name of data element with which data was protected earlier String newDataElement: Name of new data element with which data is reprotected long data: array of data to be protected Result Reprotected long data Confidential 61 Big Data Protector Guide 6.6.5 Hadoop Application Protector Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); long protectedResult = mapReduceProtector.protect( "DE_PROTECT_1", 123412341234 ); int reprotectedResult = mapReduceProtector.reprotect( "DE_PROTECT_1", "DE_PROTECT_2", protectedResult ); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: For errors while reprotecting data 4.4.22 hmac() This method performs data hashing using the HMAC operation on a single data item with a data element, which is associated with hmac. It returns hmac value of the given data with the given data element. public byte[] hmac(String dataElement, byte[] data) Parameters String dataElement: Name of data element for HMAC byte[] data: array of data for HMAC Result Byte array of HMAC data Example ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector(); int openSessionStatus = mapReduceProtector.openSession("0"); byte[] bResult = mapReduceProtector.hmac( "DE_HMAC", "protegrity".getBytes() ); int closeSessionStatus = mapReduceProtector.closeSession(); Exception ptyMapRedProtectorException: If an error occurs while doing HMAC 4.5 Hive UDFs This section describes all Hive User Defined Functions (UDFs) that are available for protection and unprotection in Big Data Protector to build secure Big Data applications. If you are using Ranger or Sentry, then ensure that your policy provides create access permissions to the required UDFs. 4.5.1 ptyGetVersion() This UDF returns the current version of PEP. ptyGetVersion() Parameters None Result This UDF returns the current version of PEP. Confidential 62 Big Data Protector Guide 6.6.5 Hadoop Application Protector Example create temporary function ptyGetVersion AS 'com.protegrity.hive.udf.ptyGetVersion'; drop table if exists test_data_table; create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile; load data local inpath 'test_data.csv' OVERWRITE INTO TABLE test_data_table; select ptyGetVersion() from test_data_table; 4.5.2 ptyWhoAmI() This UDF returns the current logged in user. ptyWhoAmI() Parameters None Result This UDF returns the current logged in user. 
Example create temporary function ptyWhoAmI AS 'com.protegrity.hive.udf.ptyWhoAmI'; drop table if exists test_data_table; create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile; load data local inpath 'test_data.csv' OVERWRITE INTO TABLE test_data_table; select ptyWhoAmI() from test_data_table; 4.5.3 ptyProtectStr() This UDF protects string values. ptyProtectStr(String input, String dataElement) Parameters String input: String value to protect String dataElement: Name of data element to protect string value Result This UDF returns protected string value. Example create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr'; drop table if exists test_data_table; drop table if exists temp_table; create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile; Confidential 63 Big Data Protector Guide 6.6.5 Hadoop Application Protector LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table test_data_table select trim(val) from temp_table; select ptyProtectStr(val, 'Token_alpha') from test_data_table; 4.5.4 ` ptyUnprotectStr() This UDF unprotects the existing protected string value. ptyUnprotectStr(String input, String dataElement) Parameters String input: Protected string value to unprotect String dataElement: Name of data element to unprotect string value Result This UDF returns unprotected string value. Example create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr'; create temporary function ptyUnprotectStr AS 'com.protegrity.hive.udf.ptyUnprotectStr'; drop table if exists test_data_table; drop table if exists temp_table; drop table if exists protected_data_table; create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile; create table protected_data_table(protectedValue string) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table test_data_table select trim(val) from temp_table; insert overwrite table protected_data_table select ptyProtectStr(val, 'Token_alpha') from test_data_table; select ptyUnprotectStr(protectedValue, 'Token_alpha') from protected_data_table; 4.5.5 ptyReprotect() This UDF reprotects string format protected data, which was earlier protected using the ptyProtectStr UDF, with a different data element. ptyReprotect(String input, String oldDataElement, String newDataElement) Parameters String input: String value to reprotect String oldDataElement: Name of data element used to protect the data earlier String newDataElement: Name of new data element to reprotect the data Result This UDF returns protected string value. 
Confidential 64 Big Data Protector Guide 6.6.5 Hadoop Application Protector Example create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr'; create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect'; drop table if exists test_data_table; drop table if exists temp_table; create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_protected_data_table(val string) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table test_data_table select trim(val) from temp_table; insert overwrite table test_protected_data_table select ptyProtectStr(val, 'Token_alpha') from test_data_table; create table test_reprotected_data_table(val string) row format delimited fields terminated by ',' stored as textfile; insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'Token_alpha', 'new_Token_alpha') from test_protected_data_table; 4.5.6 ptyProtectUnicode() This UDF protects string (Unicode) values. ptyProtectUnicode(String input, String dataElement) Parameters String input: String (Unicode) value to protect String dataElement: Name of data element to protect string (Unicode) value This UDF should be used only if you need to tokenize Unicode data in Hive, and migrate the tokenized data from Hive to a Teradata database and detokenize the data using the Protegrity Database Protector. Ensure that you use this UDF with a Unicode tokenization data element only. For more information about migrating tokenized Unicode data to a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database. Result This UDF returns protected string value. Example create temporary function ptyProtectUnicode AS 'com.protegrity.hive.udf.ptyProtectUnicode'; drop table if exists temp_table; Confidential 65 Big Data Protector Guide 6.6.5 Hadoop Application Protector create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; select ptyProtectUnicode(val, 'Token_unicode') from temp_table; 4.5.7 ptyUnprotectUnicode() This UDF unprotects the existing protected string value. ptyUnprotectUnicode(String input, String dataElement) Parameters String input: Protected string value to unprotect String dataElement: Name of data element to unprotect string value This UDF should be used only if you need to tokenize Unicode data in Teradata using the Protegrity Database Protector, and migrate the tokenized data from a Teradata database to Hive and detokenize the data using the Protegrity Big Data Protector for Hive. Ensure that you use this UDF with a Unicode tokenization data element only. For more information about migrating tokenized Unicode data from a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database. Result This UDF returns unprotected string (Unicode) value. 
Example create temporary function ptyProtectUnicode AS 'com.protegrity.hive.udf.ptyProtectUnicode'; create temporary function ptyUnprotectUnicode AS 'com.protegrity.hive.udf.ptyUnprotectUnicode'; drop table if exists temp_table; drop table if exists protected_data_table; create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; create table protected_data_table(protectedValue string) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table protected_data_table select ptyProtectUnicode(val, 'Token_unicode') from temp_table; 4.5.8 ptyReprotectUnicode() This UDF reprotects string format protected data, which was protected earlier using the ptyProtectUnicode UDF, with a different data element. Confidential 66 Big Data Protector Guide 6.6.5 Hadoop Application Protector ptyReprotectUnicode(String input, String oldDataElement, String newDataElement) Parameters String input: String (Unicode) value to reprotect String oldDataElement: Name of data element used to protect the data earlier String newDataElement: Name of new data element to reprotect the data This UDF should be used only if you need to tokenize Unicode data in Hive, and migrate the tokenized data from Hive to a Teradata database and detokenize the data using the Protegrity Database Protector. Ensure that you use this UDF with a Unicode tokenization data element only. For more information about migrating tokenized Unicode data to a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database. Result This UDF returns protected string value. Example create temporary function ptyProtectUnicode AS 'com.protegrity.hive.udf.ptyProtectUnicode'; create temporary function ptyReprotectUnicode AS 'com.protegrity.hive.udf.ptyReprotectUnicode'; drop table if exists test_data_table; drop table if exists temp_table; create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_data_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_protected_data_table(val string) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table test_data_table select cast(trim(val)) from temp_table; insert overwrite table test_protected_data_table select ptyProtectUnicode(val, 'Unicode_Token') from test_data_table; create table test_reprotected_data_table(val string) row format delimited fields terminated by ',' stored as textfile; insert overwrite table test_reprotected_data_table select ptyReprotectUnicode(val, 'Unicode_Token',’new_Unicode_Token’) from test_data_table; 4.5.9 ptyProtectInt() This UDF protects integer values. ptyProtectInt(int input, String dataElement) Confidential 67 Big Data Protector Guide 6.6.5 Hadoop Application Protector Parameters int input: Integer value to protect String dataElement: Name of data element to protect integer value Result This UDF returns protected integer value. 
Example create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt'; drop table if exists test_data_table; drop table if exists temp_table; create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_data_table(val int) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table test_data_table select cast(trim(val) as int) from temp_table; select ptyProtectInt(val, 'Token_numeric') from test_data_table; 4.5.10 ptyUnprotectInt() ` This UDF unprotects the existing protected integer value. ptyUnprotectInt(int input, String dataElement) Parameters int input: Protected integer value to unprotect String dataElement: Name of data element to unprotect integer value Result This UDF returns unprotected integer value. Example create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt'; create temporary function ptyUnprotectInt AS 'com.protegrity.hive.udf.ptyUnprotectInt'; drop table if exists test_data_table; drop table if exists temp_table; drop table if exists protected_data_table; create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile; create table test_data_table(val int) row format delimited fields terminated by ',' stored as textfile; create table protected_data_table(protectedValue int) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table test_data_table select cast(trim(val) as int) from temp_table; Confidential 68 Big Data Protector Guide 6.6.5 Hadoop Application Protector insert overwrite table protected_data_table select ptyProtectInt(val, 'Token_numeric') from test_data_table; select ptyUnprotectInt(protectedValue, 'Token_numeric') from protected_data_table; 4.5.11 ptyReprotect() This UDF reprotects integer format protected data with a different data element. ptyReprotect(int input, String oldDataElement, String newDataElement) Parameters int input: Integer value to reprotect String oldDataElement: Name of data element used to protect the data earlier String newDataElement: Name of new data element to reprotect the data Result This UDF returns protected integer value. Example create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt'; create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect'; drop table if exists test_data_table; drop table if exists temp_table; create table temp_table(val int) row format delimited fields terminated by ',' stored as textfile; create table test_data_table(val int) row format delimited fields terminated by ',' stored as textfile; create table test_protected_data_table(val int) row format delimited fields terminated by ',' stored as textfile; LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table; insert overwrite table test_data_table select cast(trim(val) as int) from temp_table; insert overwrite table test_protected_data_table select ptyProtectInt(val, 'Token_Integer') from test_data_table; create table test_reprotected_data_table(val int) row format delimited fields terminated by ',' stored as textfile; insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'Token_Integer', 'new_Token_Integer') from test_protected_data_table; 4.5.12 ptyProtectFloat() This UDF protects float value. 
ptyProtectFloat(Float input, String dataElement)
Parameters
Float input: Float value to protect
String dataElement: Name of the data element to protect the float value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the protected float value.
Example
create temporary function ptyProtectFloat as 'com.protegrity.hive.udf.ptyProtectFloat';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
select ptyProtectFloat(val, 'FLOAT_DE') from test_data_table;
4.5.13 ptyUnprotectFloat()
This UDF unprotects a protected float value.
ptyUnprotectFloat(Float input, String dataElement)
Parameters
Float input: Protected float value to unprotect
String dataElement: Name of the data element to unprotect the float value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the unprotected float value.
Example
create temporary function ptyProtectFloat as 'com.protegrity.hive.udf.ptyProtectFloat';
create temporary function ptyUnprotectFloat as 'com.protegrity.hive.udf.ptyUnprotectFloat';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue float) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
insert overwrite table protected_data_table select ptyProtectFloat(val, 'FLOAT_DE') from test_data_table;
select ptyUnprotectFloat(protectedValue, 'FLOAT_DE') from protected_data_table;
4.5.14 ptyReprotect()
This UDF reprotects float format protected data with a different data element.
ptyReprotect(Float input, String oldDataElement, String newDataElement)
Parameters
Float input: Float value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the protected float value.
Example
create temporary function ptyProtectFloat AS 'com.protegrity.hive.udf.ptyProtectFloat';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val float) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectFloat(val, 'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val float) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'NoEncryption', 'NoEncryption') from test_protected_data_table;
4.5.15 ptyProtectDouble()
This UDF protects a double value.
ptyProtectDouble(Double input, String dataElement)
Parameters
Double input: Double value to protect
String dataElement: Name of the data element to protect the double value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the protected double value.
Example
create temporary function ptyProtectDouble as 'com.protegrity.hive.udf.ptyProtectDouble';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
select ptyProtectDouble(val, 'DOUBLE_DE') from test_data_table;
4.5.16 ptyUnprotectDouble()
This UDF unprotects a protected double value.
ptyUnprotectDouble(Double input, String dataElement)
Parameters
Double input: Protected double value to unprotect
String dataElement: Name of the data element to unprotect the double value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the unprotected double value.
Example
create temporary function ptyProtectDouble as 'com.protegrity.hive.udf.ptyProtectDouble';
create temporary function ptyUnprotectDouble as 'com.protegrity.hive.udf.ptyUnprotectDouble';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue double) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
insert overwrite table protected_data_table select ptyProtectDouble(val, 'DOUBLE_DE') from test_data_table;
select ptyUnprotectDouble(protectedValue, 'DOUBLE_DE') from protected_data_table;
4.5.17 ptyReprotect()
This UDF reprotects double format protected data with a different data element.
ptyReprotect(Double input, String oldDataElement, String newDataElement)
Parameters
Double input: Double value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the protected double value.
Example
create temporary function ptyProtectDouble AS 'com.protegrity.hive.udf.ptyProtectDouble';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectDouble(val, 'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val double) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'NoEncryption', 'NoEncryption') from test_protected_data_table;
4.5.18 ptyProtectBigInt()
This UDF protects a BigInt value.
ptyProtectBigInt(BigInt input, String dataElement)
Parameters
BigInt input: Value to protect
String dataElement: Name of the data element to protect the value
Result
This UDF returns the protected BigInteger value.
Example
create temporary function ptyProtectBigInt as 'com.protegrity.hive.udf.ptyProtectBigInt';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
select ptyProtectBigInt(val, 'BIGINT_DE') from test_data_table;
4.5.19 ptyUnprotectBigInt()
This UDF unprotects a protected BigInt value.
ptyUnprotectBigInt(BigInt input, String dataElement)
Parameters
BigInt input: Protected value to unprotect
String dataElement: Name of the data element to unprotect the value
Result
This UDF returns the unprotected BigInteger value.
Example
create temporary function ptyProtectBigInt as 'com.protegrity.hive.udf.ptyProtectBigInt';
create temporary function ptyUnprotectBigInt as 'com.protegrity.hive.udf.ptyUnprotectBigInt';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue bigint) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
insert overwrite table protected_data_table select ptyProtectBigInt(val, 'BIGINT_DE') from test_data_table;
select ptyUnprotectBigInt(protectedValue, 'BIGINT_DE') from protected_data_table;
4.5.20 ptyReprotect()
This UDF reprotects BigInt format protected data with a different data element.
ptyReprotect(BigInt input, String oldDataElement, String newDataElement)
Parameters
BigInt input: BigInt value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data
Result
This UDF returns the protected BigInt value.
Example
create temporary function ptyProtectBigInt AS 'com.protegrity.hive.udf.ptyProtectBigInt';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectBigInt(val, 'BIGINT_DE') from test_data_table;
create table test_reprotected_data_table(val bigint) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'BIGINT_DE', 'new_BIGINT_DE') from test_protected_data_table;
4.5.21 ptyProtectDec()
This UDF protects a decimal value.
This API works only with the CDH 4.3 distribution.
ptyProtectDec(Decimal input, String dataElement)
Parameters
Decimal input: Decimal value to protect
String dataElement: Name of the data element to protect the decimal value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the protected decimal value.
Example
create temporary function ptyProtectDec as 'com.protegrity.hive.udf.ptyProtectDec';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
select ptyProtectDec(val, 'BIGDECIMAL_DE') from test_data_table;
4.5.22 ptyUnprotectDec()
This UDF unprotects a protected decimal value.
This API works only with the CDH 4.3 distribution.
ptyUnprotectDec(Decimal input, String dataElement)
Parameters
Decimal input: Protected decimal value to unprotect
String dataElement: Name of the data element to unprotect the decimal value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the unprotected decimal value.
Example
create temporary function ptyProtectDec as 'com.protegrity.hive.udf.ptyProtectDec';
create temporary function ptyUnprotectDec as 'com.protegrity.hive.udf.ptyUnprotectDec';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table protected_data_table select ptyProtectDec(val, 'BIGDECIMAL_DE') from test_data_table;
select ptyUnprotectDec(protectedValue, 'BIGDECIMAL_DE') from protected_data_table;
4.5.23 ptyProtectHiveDecimal()
This UDF protects a decimal value.
This API works only with distributions that include Hive version 0.11 or later.
ptyProtectHiveDecimal(Decimal input, String dataElement)
Parameters
Decimal input: Decimal value to protect
String dataElement: Name of the data element to protect the decimal value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Before the ptyProtectHiveDecimal() UDF is called, Hive rounds off the decimal value in the table to 18 digits in scale, irrespective of the length of the data.
Result
This UDF returns the protected decimal value.
Example
create temporary function ptyProtectHiveDecimal as 'com.protegrity.hive.udf.ptyProtectHiveDecimal';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
select ptyProtectHiveDecimal(val, 'BIGDECIMAL_DE') from test_data_table;
4.5.24 ptyUnprotectHiveDecimal()
This UDF unprotects a protected decimal value.
This API works only with distributions that include Hive version 0.11 or later.
ptyUnprotectHiveDecimal(Decimal input, String dataElement)
Parameters
Decimal input: Protected decimal value to unprotect
String dataElement: Name of the data element to unprotect the decimal value
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Before the ptyUnprotectHiveDecimal() UDF is called, Hive rounds off the decimal value in the table to 18 digits in scale, irrespective of the length of the data.
Result
This UDF returns the unprotected decimal value.
Example
create temporary function ptyProtectHiveDecimal as 'com.protegrity.hive.udf.ptyProtectHiveDecimal';
create temporary function ptyUnprotectHiveDecimal as 'com.protegrity.hive.udf.ptyUnprotectHiveDecimal';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table protected_data_table(protectedValue decimal) row format delimited fields terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table protected_data_table select ptyProtectHiveDecimal(val, 'BIGDECIMAL_DE') from test_data_table;
select ptyUnprotectHiveDecimal(protectedValue, 'BIGDECIMAL_DE') from protected_data_table;
4.5.25 ptyReprotect()
This UDF reprotects decimal format protected data with a different data element.
This API works only with distributions that include Hive version 0.11 or later.
ptyReprotect(Decimal input, String oldDataElement, String newDataElement)
Parameters
Decimal input: Decimal value to reprotect
String oldDataElement: Name of the data element used to protect the data earlier
String newDataElement: Name of the new data element to reprotect the data
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause data corruption.
Result
This UDF returns the protected decimal value.
Example
create temporary function ptyProtectHiveDecimal AS 'com.protegrity.hive.udf.ptyProtectHiveDecimal';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
create table test_protected_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectHiveDecimal(val, 'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val decimal) row format delimited fields terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val, 'NoEncryption', 'NoEncryption') from test_protected_data_table;
4.6 Pig UDFs
This section describes the Pig UDFs that are available in Big Data Protector for protecting and unprotecting data when building secure Big Data applications.
4.6.1 ptyGetVersion()
This UDF returns the current version of the PEP.
ptyGetVersion()
Parameters
None
Result
chararray: Version number
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar; -- register the PEP Pig jar
DEFINE ptyGetVersion com.protegrity.pig.udf.ptyGetVersion; -- define the UDF
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray); -- load employee.csv from the HDFS path
version = FOREACH employees GENERATE ptyGetVersion();
DUMP version;
4.6.2 ptyWhoAmI()
This UDF returns the user name of the currently logged in user.
ptyWhoAmI()
Parameters
None
Result
chararray: User name
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyWhoAmI com.protegrity.pig.udf.ptyWhoAmI;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray);
username = FOREACH employees GENERATE ptyWhoAmI();
DUMP username;
4.6.3 ptyProtectInt()
This UDF returns the protected value for integer data.
ptyProtectInt(int data, chararray dataElement)
Parameters
int data: Data to protect
chararray dataElement: Name of the data element to use for protection
Result
Protected value for the given integer data
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectInt com.protegrity.pig.udf.ptyProtectInt;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:int, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectInt(eid, 'token_integer');
DUMP data_p;
4.6.4 ptyUnprotectInt()
This UDF returns the unprotected value for protected integer data.
ptyUnprotectInt(int data, chararray dataElement)
Parameters
int data: Protected data
chararray dataElement: Name of the data element to use for unprotection
Result
Unprotected value for the given protected integer data
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectInt com.protegrity.pig.udf.ptyProtectInt;
DEFINE ptyUnprotectInt com.protegrity.pig.udf.ptyUnProtectInt;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:int, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectInt(eid, 'token_integer') AS eid:int;
data_u = FOREACH data_p GENERATE ptyUnprotectInt(eid, 'token_integer');
DUMP data_u;
4.6.5 ptyProtectStr()
This UDF protects a string value.
ptyProtectStr(chararray input, chararray dataElement)
Parameters
chararray input: String value to protect
chararray dataElement: Name of the data element to protect the string value
Result
chararray: Protected value
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectStr(name, 'token_alphanumeric');
DUMP data_p;
4.6.6 ptyUnprotectStr()
This UDF unprotects a protected string value.
ptyUnprotectStr(chararray input, chararray dataElement)
Parameters
chararray input: Protected string value to unprotect
chararray dataElement: Name of the data element to unprotect the string value
Result
chararray: Unprotected value
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
DEFINE ptyUnprotectStr com.protegrity.pig.udf.ptyUnProtectStr;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray, ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectStr(name, 'token_alphanumeric') AS name:chararray;
DUMP data_p;
data_u = FOREACH data_p GENERATE ptyUnprotectStr(name, 'token_alphanumeric');
DUMP data_u;
5 HDFS File Protector (HDFSFP)
5.1 Overview of HDFSFP
Files stored in HDFS are plain text files that are guarded only by POSIX-based file system access control. These files may contain sensitive data that is vulnerable when exposed to unauthorized users. The HDFS File Protector (HDFSFP) transparently protects these files as they are stored in HDFS and allows only authorized users to access their content.
5.2 Features of HDFSFP
The following are the features of HDFSFP:
• Protects and stores files in HDFS, and retrieves the protected files in the clear from HDFS, as per the centrally defined security policy and access control.
• Stores files in and retrieves files from HDFS transparently for the user, depending on their access control rights.
• Preserves Hadoop distributed data processing by ensuring that protected content is processed on data nodes independently.
• Allows files not addressed by the defined access control to pass through transparently, without any protection or unprotection.
• Protects temporary data, such as intermediate files generated by MapReduce jobs.
• Provides recursive access control for HDFS directories and files, protecting directories, their subdirectories, and files, as per the defined security policy and access control.
• Protects files at rest so that unauthorized users can view only the protected content.
• Adds minimal overhead to data processing in HDFS.
• Can be accessed using the command shell and the Java API.
5.3 Protector Usage
Files stored in HDFS are plain text files. Access controls for HDFS are implemented using file-based permissions that follow the UNIX permissions model. These files may contain sensitive data, making them vulnerable when exposed to unauthorized users. Such files should be transparently protected as they are stored in HDFS, and their content should be exposed only to authorized users. Files are stored in and retrieved from HDFS using Hadoop ecosystem products, such as file shell commands, MapReduce jobs, and so on.
Any user or application with write access to protected data at rest in HDFS can delete, update, or move the protected data. Therefore, although protected data can be lost, it is not compromised, because the user or application cannot access the original data in the clear. Ensure that the Hadoop administrator assigns file permissions in HDFS cautiously.
5.4 File Recover Utility
The File Recover utility recovers the contents of a protected file.
For more information about the File Recover utility, refer to section 3.4.3 Recover Utility.
5.5 HDFSFP Commands
Hadoop provides shell commands for modifying and administering HDFS. HDFSFP extends the modification commands to control access to files and directories in HDFS. This section describes the commands supported by HDFSFP.
5.5.1 copyFromLocal
This command ingests local data into HDFS.
hadoop ptyfs -copyFromLocal
Result
• If the destination HDFS directory path is protected and the user executing the command has permissions to create and protect, then the data is ingested in encrypted form.
• If the destination HDFS directory path is protected and the user does not have permissions to create and protect, then the copy operation fails.
• If the destination HDFS directory path is not protected, then the data is ingested in clear form.
5.5.2 put
This command ingests local data into HDFS.
hadoop ptyfs -put
Result
• If the destination HDFS directory path is protected and the user executing the command has permissions to create and protect, then the data is ingested in encrypted form.
• If the destination HDFS directory path is protected and the user does not have permissions to create and protect, then the copy operation fails.
• If the destination HDFS directory path is not protected, then the data is ingested in clear form.
5.5.3 copyToLocal
This command copies an HDFS file to a local directory.
hadoop ptyfs -copyToLocal
Result
• If the source HDFS file is protected and the user has unprotect permissions, then the file is copied to the destination directory in clear form.
• If the source HDFS file is not protected, then the file is copied to the destination directory.
• If the HDFS file is protected and the user does not have unprotect permissions, then the copy operation fails.
5.5.4 get
This command copies an HDFS file to a local directory.
hadoop ptyfs -get
Result
• If the source HDFS file is protected and the user has unprotect permissions, then the file is copied to the destination directory in clear form.
• If the source HDFS file is not protected, then the file is copied to the destination directory.
• If the HDFS file is protected and the user does not have unprotect permissions, then the copy operation fails.
5.5.5 cp
This command copies a file from one HDFS directory to another HDFS directory.
hadoop ptyfs -cp
Result
• If the source HDFS file is protected and the user has unprotect permissions for the source HDFS file, the destination directory is protected, and the user has permissions to protect and create on the destination HDFS directory path, then the file is copied in encrypted form.
• If the source HDFS file is protected and the user does not have permissions to unprotect, then the copy operation fails.
• If the destination directory is protected and the user does not have permissions to protect and create, then the copy operation fails.
• If the source HDFS file is unprotected, the destination directory is protected, and the user has permissions to protect and create on the destination HDFS directory path, then the file is copied in encrypted form.
• If the source HDFS file is protected, the user has permissions to unprotect the source HDFS file, and the destination HDFS directory path is not protected, then the file is copied in clear form.
• If the source HDFS file and the destination HDFS directory path are unprotected, then the command works like the default Hadoop file shell -cp command.
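The following is a minimal usage sketch of the ingest, extract, and copy commands described above. The file name, the HDFS paths, and the assumption that /protected/sales and /protected/archive are covered by activated ACL entries are hypothetical; substitute the paths and ACLs defined for your own cluster.
# Ingest a local file into a protected HDFS directory; the data is written in encrypted form
# when the executing user has create and protect permissions on the destination path
hadoop ptyfs -copyFromLocal /tmp/customers.csv /protected/sales
# Equivalent ingestion using put
hadoop ptyfs -put /tmp/customers.csv /protected/sales
# Extract the file to the local file system in clear form (requires unprotect permission on the source)
hadoop ptyfs -get /protected/sales/customers.csv /tmp/out
# Copy between HDFS directories; the copy remains encrypted when the destination is protected
hadoop ptyfs -cp /protected/sales/customers.csv /protected/archive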
5.5.6 mkdir
This command creates a new directory in HDFS.
hadoop ptyfs -mkdir
Result
• If the new directory is protected and the user has permissions to create, then the new directory is created.
• If the new directory is not protected, then this command runs like the default HDFS file shell -mkdir command.
5.5.7 mv
This command moves an HDFS file from one HDFS directory to another HDFS directory.
hadoop ptyfs -mv
Result
• If the source HDFS file is protected, the user has unprotect and delete permissions, the destination directory is also protected, and the user has permissions to protect and create on the destination HDFS directory path, then the file is moved to the destination directory in encrypted form.
• If the HDFS file is protected and the user does not have unprotect and delete permissions, or the destination directory is protected and the user does not have permissions to protect and create, then the move operation fails.
• If the source HDFS file is unprotected, the destination directory is protected, and the user has permissions to protect and create on the destination HDFS directory path, then the file is moved in encrypted form.
• If the source HDFS file is protected, the user has permissions to unprotect, and the destination HDFS directory path is not protected, then the file is moved in clear form.
• If the source HDFS file and the destination HDFS directory path are unprotected, then the command works like the default Hadoop file shell -mv command.
5.5.8 rm
This command deletes HDFS files.
hadoop ptyfs -rm
Result
• If the HDFS file is protected and the user has permissions to delete on the HDFS file path, then the file is deleted.
• If the HDFS file is protected and the user does not have permissions to delete on the HDFS file path, then the delete operation fails.
• If the HDFS file is not protected, then the command works like the default Hadoop file shell -rm command.
5.5.9 rmr
This command deletes an HDFS directory, its subdirectories, and its files.
hadoop ptyfs -rmr
Result
• If the HDFS directory path is protected and the user has permissions to delete on the HDFS directory path, then the directory and its contents are deleted.
• If the HDFS directory path is protected and the user does not have permissions to delete on the HDFS directory path, then the delete operation fails.
• If the HDFS directory path is not protected, then the command works like the default Hadoop recursive remove (hadoop fs -rmr or hadoop fs -rm -r) command.
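A similar sketch for the directory-level commands described above; the paths are hypothetical and assume that /protected/sales is covered by an activated ACL entry.
# Create a directory; it is created under HDFSFP protection if it falls within an activated ACL path
hadoop ptyfs -mkdir /protected/sales/2017
# Move a protected file into another protected directory; it remains in encrypted form
hadoop ptyfs -mv /protected/sales/customers.csv /protected/sales/2017
# Delete a single protected file (requires delete permission on the file path)
hadoop ptyfs -rm /protected/sales/2017/customers.csv
# Recursively delete a protected directory, its subdirectories, and files
hadoop ptyfs -rmr /protected/sales/2017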
5.6 Ingesting Files Securely
To ingest files into HDFS securely, use the put and copyFromLocal commands. For more information, refer to sections 5.5.2 put and 5.5.1 copyFromLocal.
If you need to ingest data into a protected ACL path in HDFS using Sqoop, then specify the -D target.output.dir parameter before any tool-specific arguments, as shown in the following command.
sqoop import -D target.output.dir="/tmp/src" --driver com.mysql.jdbc.Driver --connect "jdbc:mysql://master.localdomain/test" --username root --table test --target-dir /tmp/src -m 1
In addition, if you need to append data to existing data, then use the --append parameter, as shown in the following command.
sqoop import -D target.output.dir="/tmp/src" --driver com.mysql.jdbc.Driver --connect "jdbc:mysql://master.localdomain/test" --username root --table test --target-dir /tmp/src -m 1 --append
5.7 Extracting Files Securely
To extract files from HDFS securely, use the get and copyToLocal commands. For more information, refer to sections 5.5.4 get and 5.5.3 copyToLocal.
5.8 HDFSFP Java API
Protegrity provides a Java API for working with files and directories using HDFSFP. The Java API provides an alternative to the HDFSFP shell commands (hadoop ptyfs) and enables you to integrate HDFSFP with Java applications.
This section describes the Java API methods supported by HDFSFP.
5.8.1 copy
This method copies a file from one HDFS directory to another HDFS directory.
copy(java.lang.String srcs, java.lang.String dst)
Parameters
srcs: HDFS file path
dst: HDFS file or directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API throws an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException) if any of the following conditions are met:
• The input is null.
• The path does not exist.
• The destination path is protected and the user does not have protect and write permissions on it, or the user does not have unprotect permission on the source path, or both.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The Javadoc can be found in /protegrity/hdfsfp/doc on the Data Ingestion Node.
Result
• If the source HDFS file is protected and the user has unprotect permission for the source HDFS file, the destination directory is protected, the ACL entry for the directory is activated, and the user has permissions to protect and create on the destination HDFS directory path, then the file is copied in encrypted form.
• If the source HDFS file is protected and the user does not have permission to unprotect, then the copy operation fails.
• If the destination directory is protected and the user does not have permissions to protect and create, then the copy operation fails.
• If the source HDFS file is unprotected, the destination directory is protected, the ACL entry for the directory is activated, and the user has permissions to protect and create on the destination HDFS directory path, then the file is copied in encrypted form.
• If the source HDFS file is protected, the user has permissions to unprotect the source HDFS file, and the destination HDFS directory path is not protected, then the file is copied in clear form.
5.8.2 copyFromLocal
This method ingests local data into HDFS.
copyFromLocal(java.lang.String[] srcs, java.lang.String dst)
Parameters
srcs: Array of local file paths
dst: HDFS directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API throws an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException) if any of the following conditions are met:
• The input is null.
• The path does not exist.
• The destination path is protected and the user does not have protect and write permissions on it.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The Javadoc can be found in /protegrity/hdfsfp/doc on the Data Ingestion Node.
Result
• If the destination directory path is protected, the ACL entry for the directory is activated, and the user executing the command has permissions to create and protect, then the data is ingested in encrypted form.
• If the destination directory path is protected and the user does not have permissions to create and protect, then the copy operation fails.
• If the destination HDFS directory path is not protected, then the data is ingested in clear form.
5.8.3 copyToLocal
This method copies an HDFS file or directory to a local directory.
copyToLocal(java.lang.String srcs, java.lang.String dst)
Parameters
srcs: HDFS file or directory path
dst: Local directory or file path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API throws an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException) if any of the following conditions are met:
• The input is null.
• The path does not exist.
• The source path is protected and the user does not have unprotect and read permissions on it.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The Javadoc can be found in /protegrity/hdfsfp/doc on the Data Ingestion Node.
Result
• If the source HDFS file is protected, the ACL entry for the directory is activated, and the user has unprotect permission, then the file is copied to the destination directory in clear form.
• If the source HDFS file is not protected, then the file is copied to the destination directory.
• If the HDFS file is protected and the user does not have unprotect permissions, then the copy operation fails.
5.8.4 deleteFile
This method deletes files from HDFS.
deleteFile(java.lang.String srcf, boolean skipTrash)
Parameters
srcf: HDFS file path
skipTrash: Boolean value that decides whether the file is moved to trash. If true, then the file is not moved to trash; if false, then the file is moved to trash.
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API throws an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException) if any of the following conditions are met:
• The input is null.
• The path does not exist.
• The user does not have delete permission on the path.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The Javadoc can be found in /protegrity/hdfsfp/doc on the Data Ingestion Node.
Result
• If the HDFS file is protected and the user has permission to delete on the HDFS file path, then the file is deleted.
• If the HDFS file is protected and the user does not have permission to delete on the HDFS file path, then the delete operation fails.
5.8.5 deleteDir
This method recursively deletes an HDFS directory, its subdirectories, and its files.
deleteDir(java.lang.String srcdir, boolean skipTrash)
Parameters
srcdir: HDFS directory path
skipTrash: Boolean value that decides whether the directory is moved to trash. If true, then the directory and its contents are not moved to trash; if false, then they are moved to trash.
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API throws an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException) if any of the following conditions are met:
• The input is null.
• The path does not exist.
• The user does not have delete permission on the path.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The Javadoc can be found in /protegrity/hdfsfp/doc on the Data Ingestion Node.
Result
• If the HDFS directory path is protected and the user has permission to delete on the HDFS directory path, then the directory and its contents are deleted.
• If the HDFS directory path is protected and the user does not have permission to delete on the HDFS directory path, then the delete operation fails.
5.8.6 mkdir
This method creates a new directory in HDFS.
mkdir(java.lang.String dir)
Parameters
dir: HDFS directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Result
• If the new directory path exists in the ACL, or the ACL path for the parent directory is activated recursively, and the user has permissions to create, then the new directory is created with an activated ACL path.
• If the new directory path or its parent directory path is not present in the ACL recursively, then the new directory is created without HDFSFP protection.
5.8.7 move
This method moves an HDFS file from one HDFS directory to another HDFS directory.
move(java.lang.String src, java.lang.String dst)
Parameters
src: HDFS file path
dst: HDFS file or directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API throws an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException) if any of the following conditions are met:
• The input is null.
• The path does not exist.
• The user does not have unprotect and read, protect and write, or create permissions on the path.
• The destination path is protected and the user does not have protect and write permissions on it, or the user does not have unprotect permission on the source path, or both.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The Javadoc can be found in /protegrity/hdfsfp/doc on the Data Ingestion Node.
Result
• If the source HDFS file is protected, the ACL entry for the directory is activated, the user has unprotect and delete permissions, the destination directory is also protected, and the user has permissions to protect and create on the destination HDFS directory path, then the file is moved to the destination directory in encrypted form.
• If the HDFS file is protected and the user does not have unprotect and delete permissions, or the destination directory is protected and the user does not have permissions to protect and create, then the move operation fails.
• If the source HDFS file is unprotected, the destination directory is protected, the ACL entry for the directory is activated, and the user has permissions to protect and create on the destination HDFS directory path, then the file is moved in encrypted form.
• If the source HDFS file is protected, the user has permission to unprotect, and the destination HDFS directory path is not protected, then the file is moved in clear form.
5.9 Developing Applications using the HDFSFP Java API
This section describes the guidelines to follow when developing applications using the HDFSFP Java API.
The guidelines in this section are a sample and assume that /opt/protegrity is the base installation directory of Big Data Protector. Modify these guidelines based on your requirements.
5.9.1 Setting up the Development Environment
Ensure that the following steps are completed before you begin to develop applications using the HDFSFP Java API:
• Add the required HDFSFP Java API jar, hdfsfp-x.x.x.jar, to the classpath.
• Instantiate the HDFSFP Java API class using the following statement:
PtyHdfsProtector protector = new PtyHdfsProtector();
After successful instantiation, you are ready to call the HDFSFP Java API methods.
5.9.2 Protecting Data using the Class file
To protect data using the Class file:
1. Compile the Java file to create a Class file with the following command.
javac -cp .:/opt/protegrity/hdfsfp/hdfsfp-x.x.x.jar ProtectData.java -d .
2. Protect data using the Class file with the following command.
hadoop ProtectData
5.9.3 Protecting Data using the JAR file
To protect data using the JAR file:
1. Compile the Java file to create a Class file with the following command.
javac -cp .:/opt/protegrity/hdfsfp/hdfsfp-x.x.x.jar ProtectData.java -d .
2. Create the JAR file from the Class file with the following command.
jar -cvf protectData.jar ProtectData.class
3. Protect data using the JAR file with the following command.
hadoop jar protectData.jar ProtectData
5.9.4 Sample Program for the HDFSFP Java API
public class ProtectData {

    public static PtyHdfsProtector protector = new PtyHdfsProtector();

    public void copyFromLocalTest(String[] srcs, String dstf) {
        boolean result;
        try {
            result = protector.copyFromLocal(srcs, dstf);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }

    public void copyToLocalTest(String srcs, String dstf) {
        boolean result;
        try {
            result = protector.copyToLocal(srcs, dstf);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }

    public void copyTest(String srcs, String dstf) {
        boolean result;
        try {
            result = protector.copy(srcs, dstf);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }

    public void mkdirTest(String dir) {
        boolean result;
        try {
            result = protector.mkdir(dir);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }

    public void moveTest(String srcs, String dstf) {
        boolean result;
        try {
            result = protector.move(srcs, dstf);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }

    public void deleteFileTest(String file, boolean skipTrash) {
        boolean result;
        try {
            result = protector.deleteFile(file, skipTrash);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }

    public void deleteDirTest(String dir, boolean skipTrash) {
        boolean result;
        try {
            result = protector.deleteDir(dir, skipTrash);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }

    public static void main(String[] args) {
        ProtectData protect = new ProtectData();

        // Ingest local data into HDFS
        String srcsCFL[] = new String[2];
        srcsCFL[0] = " ";
        srcsCFL[1] = " ";
        String dstfCFL = " ";
        protect.copyFromLocalTest(srcsCFL, dstfCFL);

        // Extract an HDFS file to the local file system
        String srcsCTL = " ";
        String dstfCTL = " ";
        protect.copyToLocalTest(srcsCTL, dstfCTL);

        // Copy a file from HDFS to HDFS
        String srcsCopy = " ";
        String dstfCopy = " ";
        protect.copyTest(srcsCopy, dstfCopy);

        // Create an HDFS sub-directory
        String dir = " ";
        protect.mkdirTest(dir);

        // Move a file from HDFS to HDFS
        String srcsMove = " ";
        String dstfMove = " ";
        protect.moveTest(srcsMove, dstfMove);

        // Delete a file from HDFS
        String fileDelete = " ";
        boolean skipTrashFile = false;
        protect.deleteFileTest(fileDelete, skipTrashFile);

        // Delete a sub-directory and its children from HDFS
        String dirDelete = " ";
        boolean skipTrashDir = false;
        protect.deleteDirTest(dirDelete, skipTrashDir);
    }
}
5.10 Quick Reference Tasks
This section provides a quick reference for the tasks that can be performed by users.
5.10.1 Protecting Existing Data
The dfsadmin utility protects existing data after an ACL is created for the HDFS path. This is a two-step process: first, the user creates new ACL entries; second, the user activates the newly created ACL entries. Activation can be performed for a single entry or for multiple entries at a time. After the ACL entries are activated, the HDFSFP infrastructure automatically protects the HDFS paths in those entries.
While installing HDFSFP, you configure the ingestion user in the BDP.config file. The HDFS administrator must ensure that the ingestion user has full access to the directories that are to be protected with HDFSFP. This user is the authorized user for protection. Permissions to protect or create are configured in the security policy. After the dfsadmin utility activates an ACL entry using the preconfigured ingestion user, the HDFS File Protector protects the ACL path.
For more information about adding and activating an ACL entry, refer to sections 5.15.1 Adding an ACL Entry for Protecting Directories in HDFS and 5.15.5 Activating Inactive ACL Entries.
5.10.2 Reprotecting Files
For more information about reprotecting files, refer to section 5.15.3 Reprotecting Files or Folders.
5.11 Sample Demo Use Case
For information about the sample demo use cases, refer to 12 Appendix: HDFSFP Demo.
HDFSFP can monitor policy and file or folder activity. This auditing can be done on a per-policy basis. The following event types can be audited when an event is generated on the action listed:
• Create or update an ACL entry to protect or reprotect a file or folder
• Read or write an ACL-encrypted file or folder
• Update or delete an ACL entry
Auditing qualifiers include success, failure, and auditing only when the user is audited for the same action.
5.12 Appliance Components of HDFSFP
This section describes the active components that are shipped with the ESA and required to run HDFSFP.
5.12.1 Dfsdatastore Utility
This utility adds the Hadoop cluster under protection for HDFSFP.
5.12.2 Dfsadmin Utility
This utility manages access control entries for files and folders.
5.13 Access Control Rules for Files and Folders
Rules for files and folders stored or accessed in HDFS are managed by Access Control Lists (ACLs). HDFS files and folders are protected after the corresponding ACL entry has been created. ACLs for multiple Hadoop clusters can be managed only from the ESA. Protegrity Cache is used to store and propagate secured ACLs across the clusters.
If you need to add, delete, search, update, or list a cluster, then use the DFS Cluster Management Utility (dfsdatastore). If you need to protect, unprotect, reprotect, activate, search, or update ACLs, or get information about a job, then use the ACL Management Utility (dfsadmin).
For more information about managing access control entries across clusters, refer to sections 5.14 Using the DFS Cluster Management Utility (dfsdatastore) and 5.15 Using the ACL Management Utility (dfsadmin).
5.14 Using the DFS Cluster Management Utility (dfsdatastore)
The dfsdatastore utility enables you to manage the configuration of clusters on the ESA. The options supported by this utility are described in this section.
5.14.1 Adding a Cluster for Protection
To add a cluster for protection using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER. The root password screen appears.
4. Enter the root password.
5. Press ENTER. The dfsdatastore UI appears.
6. Select the option Add.
7. Select Next.
8. Press ENTER. The dfsdatastore credentials screen appears.
9. Enter the following parameters:
• Datastore name – The name for the datastore or cluster. This name will be used for managing ACLs for the cluster.
• Hostname/IP of the Lead node within the cluster – The hostname or IP address of the Lead node of the cluster.
• Port number – The Protegrity Cache port that was specified in the BDP.config file during installation.
10. Select OK.
11. Press ENTER. The cluster with the specified parameters is added.
12. If the DfsCacheRefresh service is already running, then the datastore is added in an activated state. If the DfsCacheRefresh service is not running, then the datastore is added in an inactive state. The datastore can be activated by starting the DfsCacheRefresh service.
If you are using a Big Data Protector version lower than 6.6.3, then a prompt for the Protegrity Cache password appears on the dfsdatastore credentials screen. Specify the Protegrity Cache password that was provided during the installation of Big Data Protector.
To start the DfsCacheRefresh service:
1. Login to the ESA Web UI.
2. Navigate to System > Services.
3. Start the DfsCacheRefresh service.
5.14.2 Updating a Cluster
Ensure that you use the Update option in the dfsdatastore UI only to modify the parameters of an existing datastore.
To update a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER. The root password screen appears.
4. Enter the root password.
5. Press ENTER. The dfsdatastore UI appears.
6. Select the option Update.
7. Select Next.
8. Press ENTER. The dfsdatastore update screen appears.
9. Update the following parameters as required:
• Hostname/IP of the Lead node within the cluster – The hostname or IP address of the Lead node of the cluster.
• Port number – The Protegrity Cache port that was specified in the BDP.config file during installation.
10. Select OK.
11. Press ENTER.
If you are using a Big Data Protector version lower than 6.6.3, then a prompt for the Protegrity Cache password appears on the dfsdatastore credentials screen. Specify the Protegrity Cache password that was provided during the installation of Big Data Protector.
The cluster is modified with the required updates.
5.14.3 Removing a Cluster
Ensure that the Cache Refresh Service is running in the ESA Web UI before removing a cluster.
To remove a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER. The root password screen appears.
4. Enter the root password.
5. Press ENTER. The dfsdatastore UI appears.
6. Select the option Remove.
7. Select Next.
8. Press ENTER. The dfsdatastore remove screen appears.
9. Enter the following parameter:
• Datastore name – The name for the datastore or cluster. This name will be used for managing ACLs for the cluster.
10. Select OK.
11. Press ENTER. The required cluster is removed.
5.14.4 Monitoring a Cluster
To monitor a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER. The root password screen appears.
4. Enter the root password.
5. Press ENTER. The dfsdatastore UI appears.
6. Select the option Execute Command.
7. Select Next.
8. Press ENTER. The dfsdatastore execute command screen appears.
9. Enter the following parameters:
• Datastore name – The name of the datastore or cluster. This name is used for managing ACLs for the cluster.
• Command – The command to execute on the datastore. In this release, the only supported command is TEST. The TEST command is executed on the cluster and retrieves the statuses of the following servers:
o Cache Refresh Server, running on the ESA
o Cache Monitor Server, running on the Lead node of the cluster
o Distributed Cache Server, running on the Lead and slave nodes of the cluster
10. Select OK.
11. Press ENTER. The dfsdatastore UI executes the TEST command on the cluster.
If you are using a Big Data Protector version lower than 6.6.3, then the Cluster Monitoring feature is not supported.
5.14.5 Searching a Cluster
To search for a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER. The root password screen appears.
4. Enter the root password.
5. Press ENTER. The dfsdatastore UI appears.
6. Select the option Search.
7. Select Next.
8. Press ENTER. The dfsdatastore search screen appears.
9. Enter the following parameter:
• Datastore name – The name of the datastore or cluster. This name is used for managing ACLs for the cluster.
10. Select OK.
11. Press ENTER. The dfsdatastore UI searches for the required cluster.
5.14.6 Listing all Clusters
To list all clusters using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER. The root password screen appears.
4. Enter the root password.
5. Press ENTER. The dfsdatastore UI appears.
6. Select the option List.
7. Select Next.
8. Press ENTER. A list of all the clusters appears.
Each cluster description contains one of the following cluster statuses:
• 1: The cluster is in the active state
• 0: The cluster is in the inactive state
5.15 Using the ACL Management Utility (dfsadmin)
The dfsadmin utility enables you to manage ACLs for a cluster. Managing ACLs is a two-step process: creating or modifying ACL entries, and then activating them. The protection of file or folder paths does not take effect until the ACL entries are verified, confirmed, and activated.
Ensure that an unstructured policy, which is to be linked with the ACL, is created in the ESA.
The options supported by this utility are described in this section.
5.15.1 Adding an ACL Entry for Protecting Directories in HDFS
It is recommended not to create ACLs for individual file paths. If the ACL for a directory that contains a file with its own ACL is unprotected, a decryption failure might occur if there is a mismatch between the data elements used to protect the directory and the file contained in it.
To add an ACL entry for protecting files or folders using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option Protect.
7. Select Next.
8. Press ENTER.
   The dfsadmin protection screen appears.
9. Enter the following parameters:
   • File Path – The directory path to protect.
   • Data Element Name – The unstructured data element name to protect the HDFS directory path with.
   • Datastore name – The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
   • Recursive (yes or no) – Select one of the following options:
     o Yes – Protect all files, sub-directories and their files, in the directory path.
     o No – Protect only the files in the directory path.
10. Select OK.
11. Press ENTER.
    The ACL entry required for protecting the directory path is added to the Inactive list.

The ACL entries can be activated by selecting the Activate option. After the ACL entries are activated, the following actions occur, as required:
• If the recursive flag is not set, then all files inside the directory path are protected.
• If the recursive flag is set, then all the files, sub-directories and their files, in the directory path are protected.

If any MapReduce jobs or HDFS file shell commands are initiated on the ACL paths before the ACLs are activated, then the jobs or commands will fail. After the ACLs are activated, any new files that are ingested in the respective ACL directory path are protected.

5.15.2 Updating an ACL Entry

To update an ACL entry using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option Update.
7. Select Next.
8. Press ENTER.
   The dfsadmin update screen appears.
9. Update the following parameters as required:
   • File Path – The directory path to protect.
   • Datastore name – The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
   • Recursive (yes or no) – Select one of the following options:
     o Yes – Protect all files, sub-directories and their files, in the directory path.
     o No – Protect only the files in the directory path.
10. Select OK.
11. Press ENTER.
    The ACL entry is updated as required.

5.15.3 Reprotecting Files or Folders

To reprotect files or folders using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option Reprotect.
7. Select Next.
8. Press ENTER.
   The dfsadmin reprotection screen appears.
9. Enter the following parameters:
   • File Path – The directory path to reprotect.
   • Datastore name – The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
   • Data Element Name – The data element name to protect the directory path with. If the user has rotated the data element key and needs to reprotect the data, then this field is optional.
10. Select OK.
11. Press ENTER.
    The files inside the ACL entry are reprotected.

5.15.4 Deleting an ACL Entry to Unprotect Files or Directories

To delete an ACL entry to unprotect files or directories using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option Unprotect.
7. Select Next.
8. Press ENTER.
   The dfsadmin unprotection screen appears.
9. Enter the following parameters as required:
   • File Path – The directory path which is protected.
   • Datastore Name – The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
    The files inside the ACL entry are unprotected.

5.15.5 Activating Inactive ACL Entries

If you are using a Kerberos-enabled Hadoop cluster, then ensure that the user ptyitusr has a valid Kerberos ticket and write access permissions on the HDFS path for which the ACL is being created.

To activate inactive ACL entries using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option Activate.
7. Select Next.
8. Press ENTER.
   The dfsadmin activation screen appears.
9. Enter the following parameter as required:
   • Datastore name – The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
    The inactive ACL entries in the datastore are activated.

If an ACL entry for a directory containing files is activated (for Protect, Unprotect, or Reprotect), then the ownerships and permissions for only the files contained in the directory are changed. To avoid this issue, ensure that the user configured in the PROTEGRITY_IT_USR property in the BDP.config file during the Big Data Protector installation is added to the HDFS superuser group by running the following command on the Lead node, where <user> is the configured PROTEGRITY_IT_USR user:

usermod -a -G hdfs <user>

If the protect or unprotect operation fails on the files or folders that are part of the ACL entry being activated and the message ACL is locked appears, then monitor the beuler.log file for any exceptions and take the required corrective action.

To monitor the beuler.log file:
1. Login to the Lead node with root permissions.
2. Switch the user to PROTEGRITY_IT_USR, as configured in the BDP.config file.
3. Navigate to the hdfsfp/ptyitusr directory.
4. Monitor the beuler.log file for any exceptions.
5. If any exceptions appear in the beuler.log file, then resolve the exceptions as required.
6. Login to the Lead node and run the beuler.sh script.
   The following is a sample beuler.sh command, where the values in angle brackets are placeholders:
   sh beuler.sh -path <path> -datastore <datastore> -activationid <activation id> -beulerjobid <beuler job id>

Alternatively, you can restart the DfsCacheRefresh service.

To restart the DfsCacheRefresh service:
1. Login to the ESA Web UI.
2. Navigate to System > Services.
3. Restart the DfsCacheRefresh service.

5.15.6 Viewing the ACL Activation Job Progress Information in the Interactive Mode

To view the ACL Activation Job Progress Information in the Interactive mode using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option JobProgressInfo.
7. Select Next.
8. Press ENTER.
   The Activation ID screen appears.
9. Enter the Activation ID.
10. Press ENTER.
    The filter search criteria screen appears.
11. If you need to specify the filtering criteria, then perform the following steps:
    a) Type Y or y.
    b) Select one of the following filtering criteria:
       o Start Time
       o Status
       o ACL Path
12. Select Next.
13. Press ENTER.
14. If you do not need to specify the search criteria, then type N or n.
    The dfsadmin job progress information screen appears, listing all the jobs against the required Activation ID with the following information:
    • State: One of the following states of the job:
      o Started
      o Failed
      o In progress
      o Completed
      o Yet to start
      o Failed as Path Does not Exist
    • Percentage Complete: The percentage completion for the directory encryption
    • Job Start Time: The time when the directory encryption started
    • Job End Time: The time when the directory encryption ended
    • Processed Data: The amount of data that is encrypted
    • Total Data: The total directory size being encrypted
    • ACL Path: The directory path being encrypted

5.15.7 Viewing the ACL Activation Job Progress Information in the Non Interactive Mode

To view the ACL Activation Job Progress Information in the Non Interactive mode using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option JobProgressInfo against ActivationId.
   The JobProgressInfo Activation ID screen appears.
7. Enter the Activation ID.
8. Select OK.
9. Press ENTER.
   The dfsadmin job progress information screen for the required Activation ID appears.

5.15.8 Searching ACL Entries

To search for ACL entries using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option Search.
7. Select Next.
8. Press ENTER.
   The dfsadmin search screen appears.
9. Enter the following parameters as required:
   • File Path – The directory path of the ACL entry.
   • Datastore name – The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
    The dfsadmin UI searches for the required ACL entry.

5.15.9 Listing all ACL Entries

To list ACL entries using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
   The root password screen appears.
4. Enter the root password.
5. Press ENTER.
   The dfsadmin UI appears.
6. Select the option List.
7. Select OK.
8. Press ENTER.
   The dfsadmin list screen appears.
9. Enter the following parameter as required:
   • Datastore name – The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
    A list of all the ACL entries appears.
If you are using a Big Data Protector version lower than 6.6.3, then the List option does not show the Activation ID and the Beuler Job ID for the respective ACLs.

5.16 HDFS Codec for Encryption and Decryption

A codec is an algorithm that provides compression and decompression. Hadoop provides a codec framework to compress blocks of data before storage. The codec compresses data while writing the blocks and decompresses data while reading the blocks. A splittable codec is an algorithm that is applied after splitting a file, making it possible to recover the original data from any part of the split.

The Protegrity HDFS codec is a splittable cryptographic codec. It uses encryption, such as AES-128, AES-256, DES, and so on. It utilizes the infrastructure of the Protegrity Application Protector for applying cryptographic support. The protection is governed by the policy deployed by the ESA, as defined by the Security Officer.

6 HBase

HBase is a database that provides random read and write access to tables, consisting of rows and columns, in real time. HBase is designed to run on commodity servers, scales automatically as more servers are added, and is fault tolerant because data is divided across servers in the cluster.

HBase tables are partitioned into multiple regions. Each region stores a range of rows in the table. Regions contain a datastore in memory and a persistent datastore (HFile). The Name node assigns multiple regions to a region server. The Name node manages the cluster, and the region servers store portions of the HBase tables and perform the work on the data.

6.1 Overview of the HBase Protector

The Protegrity HBase protector extends the functionality of the data storage framework and provides transparent data protection and unprotection using coprocessors, which provide the functionality to run code directly on region servers.

The Protegrity coprocessor for HBase runs on the region servers and protects the data stored in the servers. All clients which work with HBase are supported. The data is transparently protected or unprotected, as required, utilizing the coprocessor framework.

6.2 HBase Protector Usage

The Protegrity HBase protector utilizes the get, put, and scan commands and calls the Protegrity coprocessor for the HBase protector. The Protegrity coprocessor for the HBase protector locates the metadata associated with the requested column qualifier and the currently logged in user. If the data element is associated with the column qualifier and the currently logged in user, then the HBase protector processes the data in a row based on the data elements defined by the security policy deployed in the Big Data Protector.

The Protegrity HBase coprocessor only supports bytes converted from the string data type. If any other data type is directly converted to bytes and inserted in an HBase table which is configured with the Protegrity HBase coprocessor, then data corruption might occur.

6.3 Adding Data Elements and Column Qualifier Mappings to a New Table

In an HBase table, every column family stores metadata for that family, which contains the column qualifier and data element mappings. Users need to add metadata to the column families to define mappings between the data elements and column qualifiers when a new HBase table is created.

The following command creates a new HBase table with one column family.
create 'table', { NAME => 'column_family_1', METADATA => { 'DATA_ELEMENT:credit_card'=>'CC_NUMBER','DATA_ELEMENT:name'=>'TOK_CUSTOMER_NAME' } }

Parameters
table: Name of the table.
column_family_1: Name of the column family.
METADATA: Data associated with the column family.
DATA_ELEMENT: Contains the column qualifier name. In the example, the column qualifier names credit_card and name correspond to the data elements CC_NUMBER and TOK_CUSTOMER_NAME respectively.

6.4 Adding Data Elements and Column Qualifier Mappings to an Existing Table

Users can add data elements and column qualifiers to an existing HBase table. Users need to alter the table to add metadata to the column families for defining mappings between the data elements and column qualifiers.

The following command adds data element and column qualifier mappings to a column family in an existing HBase table.

alter 'table', { NAME => 'column_family_1', METADATA => {'DATA_ELEMENT:credit_card'=>'CC_NUMBER', 'DATA_ELEMENT:name'=>'TOK_CUSTOMER_NAME' } }

Parameters
table: Name of the table.
column_family_1: Name of the column family.
METADATA: Data associated with the column family.
DATA_ELEMENT: Contains the column qualifier name. In the example, the column qualifier names credit_card and name correspond to the data elements CC_NUMBER and TOK_CUSTOMER_NAME respectively.

6.5 Inserting Protected Data into a Protected Table

Users can ingest protected data into a protected table in HBase using the BYPASS_COPROCESSOR flag. If the BYPASS_COPROCESSOR flag is set while inserting data in the HBase table, then the Protegrity coprocessor for HBase is bypassed.

The following command bypasses the Protegrity coprocessor for HBase and ingests protected data into an HBase table.

put 'table', 'row_2', 'column_family:credit_card', '3603144224586181', { ATTRIBUTES => {'BYPASS_COPROCESSOR'=>'1'}}

Parameters
table: Name of the table.
column_family: Name of the column family, followed by the column in which the protected data is inserted.
ATTRIBUTES: Additional parameters to consider when ingesting the protected data. In the example, the flag to bypass the Protegrity coprocessor for HBase is set.

6.6 Retrieving Protected Data from a Table

If users need to retrieve protected data from an HBase table, then they need to set the BYPASS_COPROCESSOR flag to retrieve the data. This is necessary to retain the protected data as is, since HBase protects and unprotects the data transparently.

The following command bypasses the Protegrity coprocessor for HBase and retrieves protected data from an HBase table.

scan 'table', { ATTRIBUTES => {'BYPASS_COPROCESSOR'=>'1'}}

Parameters
table: Name of the table.
ATTRIBUTES: Additional parameters to consider when retrieving the protected data. In the example, the flag to bypass the Protegrity coprocessor for HBase is set.

6.7 Protecting Existing Data

Users should define the mappings between the data elements and column qualifiers in the respective column families, which are used by the coprocessor to protect or unprotect the data.

The following command protects the existing data in an HBase table by setting the MIGRATION flag. Data from the table is read, protected, and inserted back into the table.

scan 'table', { ATTRIBUTES => {'MIGRATION'=>'1'}}

Parameters
table: Name of the table.
ATTRIBUTES: Additional parameters to consider when protecting the existing data.
In the example, the MIGRATION flag is set to protect the existing data in the HBase table.

6.8 HBase Commands

HBase provides shell commands to ingest, extract, and display the data in an HBase table. This section describes the commands supported by HBase.

6.8.1 put

This command ingests the data provided by the user in protected form, using the configured data elements, into the required row and column of an HBase table. You can use this command to ingest data into all the columns for the required row of the HBase table.

put '<table_name>','<row_number>', 'column_family:<column_name>', '<value>'

Parameters
table_name: Name of the table.
row_number: Number of the row in the HBase table.
column_family: Name of the column family, followed by the column in which the data is inserted in protected form.

6.8.2 get

This command displays the protected data from the required row and column of an HBase table in cleartext form. You can use this command to display the data contained in all the columns of the required row of the HBase table.

get '<table_name>','<row_number>', 'column_family:<column_name>'

Parameters
table_name: Name of the table.
row_number: Number of the row in the HBase table.
column_family: Name of the column family.

Ensure that the logged in user has the permissions to view the protected data in cleartext form. If the user does not have the permissions to view the protected data, then only the protected data appears.

6.8.3 scan

This command displays the data from the HBase table in protected or unprotected form.

View the protected data using the following command.

scan '<table_name>', { ATTRIBUTES => {'BYPASS_COPROCESSOR'=>'1'}}

View the unprotected data using the following command.

scan '<table_name>'

Parameters
table_name: Name of the table.
ATTRIBUTES: Additional parameters to consider when displaying the protected or unprotected data.

Ensure that the logged in user has the permissions to unprotect the protected data. If the user does not have the permissions to unprotect the protected data, then only the protected data appears.

6.9 Ingesting Files Securely

To ingest data into HBase securely, use the put command. For more information, refer to section 6.8.1 put.

6.10 Extracting Files Securely

To extract data from HBase securely, use the get command. For more information, refer to section 6.8.2 get.

6.11 Sample Use Cases

For information about the HBase protector sample use cases, refer to section 12.8 Protecting Data using HBase.

7 Impala

Impala is an MPP SQL query engine for querying the data stored in a cluster. It provides the flexibility of the SQL format and is capable of running queries on data stored in HDFS and HBase. This section provides information about the Impala protector, the UDFs provided, and the commands for protecting and unprotecting data in an Impala table.

7.1 Overview of the Impala Protector

Impala is an MPP SQL query engine for querying the data stored in a cluster. The Protegrity Impala protector extends the functionality of the Impala query engine and provides UDFs which protect or unprotect the data as it is stored or retrieved.

7.2 Impala Protector Usage

The Protegrity Impala protector provides UDFs for protecting data using encryption or tokenization, and unprotecting data by using decryption or detokenization.

Ensure that the /user/impala path exists in HDFS with the Impala supergroup permissions.
You can verify this by running the following command:

# hadoop fs -ls /user

To create the /user/impala path in HDFS with supergroup permissions:

If the /user/impala path does not exist or does not have supergroup permissions, then perform the following steps.
1. Create the /user/impala directory in HDFS using the following command.
   # sudo -u hdfs hadoop fs -mkdir /user/impala
2. Assign Impala supergroup permissions to the /user/impala path using the following command.
   # sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/impala

7.3 Impala UDFs

This section describes all Impala UDFs that are available for protection and unprotection in Big Data Protector to build secure Big Data applications.

7.3.1 pty_GetVersion()

This UDF returns the PEP version number.

pty_GetVersion()

Parameters
None

Result
This UDF returns the current version of the PEP.

Example
select pty_GetVersion();

7.3.2 pty_WhoAmI()

This UDF returns the logged in user name.

pty_WhoAmI()

Parameters
None

Result
Text: Returns the logged in user name

Example
select pty_WhoAmI();

7.3.3 pty_GetCurrentKeyId()

This UDF returns the current active key identification number of the encryption type data element.

pty_GetCurrentKeyId(dataElement string)

Parameters
dataElement: Variable specifying the protection method

Result
integer: Returns the current key identification number

Example
select pty_GetCurrentKeyId('enc_3des_kid');

7.3.4 pty_GetKeyId()

This UDF returns the key ID used for each row in a table.

pty_GetKeyId(dataElement string, col string)

Parameters
dataElement: Variable specifying the protection method
col: String column of the data in the table

Result
integer: Returns the key identification number for the row

Example
select pty_GetKeyId('enc_3des_kid',column_name) from table_name;

7.3.5 pty_StringEnc()

This UDF returns the encrypted value for a column containing String format data.

pty_StringEnc(data string, dataElement string)

Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method

Result
string: Returns a string value

Example
select pty_StringEnc(column_name,'enc_3des') from table_name;

7.3.6 pty_StringDec()

This UDF returns the decrypted value for a column containing String format data.

pty_StringDec(data string, dataElement string)

Parameters
data: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method

Result
string: Returns a string value

Example
select pty_StringDec(column_name,'enc_3des') from table_name;

7.3.7 pty_StringIns()

This UDF returns the tokenized value for a column containing String format data.

pty_StringIns(data string, dataElement string)

Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method

Result
string: Returns the tokenized string value

Example
select pty_StringIns(column_name, 'TOK_NAME') from table_name;

7.3.8 pty_StringSel()

This UDF returns the detokenized value for a column containing String format data.
pty_StringSel(data string, dataElement string)

Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method

Result
string: Returns the detokenized string value

Example
select pty_StringSel(column_name, 'TOK_NAME') from table_name;

7.3.9 pty_UnicodeStringIns()

This UDF returns the tokenized value for a column containing String (Unicode) format data.

pty_UnicodeStringIns(data string, dataElement string)

Parameters
data: Column name of the string (Unicode) format data to tokenize in the table
dataElement: Name of the data element to protect the string (Unicode) value

This UDF should be used only if you need to tokenize Unicode data in Impala, and migrate the tokenized data from Impala to a Teradata database and detokenize the data using the Protegrity Database Protector. Ensure that you use this UDF with a Unicode tokenization data element only.

For more information about migrating tokenized Unicode data to a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.

Result
This UDF returns the protected string value.

Example
select pty_UnicodeStringIns(val, 'Token_unicode') from temp_table;

7.3.10 pty_UnicodeStringSel()

This UDF unprotects an existing protected String (Unicode) value.

pty_UnicodeStringSel(data string, dataElement string)

Parameters
data: Column name of the string format data to detokenize in the table
dataElement: Name of the data element to unprotect the string value

This UDF should be used only if you need to tokenize Unicode data in Teradata using the Protegrity Database Protector, and migrate the tokenized data from a Teradata database to Impala and detokenize the data using the Protegrity Big Data Protector for Impala. Ensure that you use this UDF with a Unicode tokenization data element only.

For more information about migrating tokenized Unicode data from a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.

Result
This UDF returns the detokenized string (Unicode) value.

Example
select pty_UnicodeStringSel(val, 'Token_unicode') from temp_table;

7.3.11 pty_IntegerEnc()

This UDF returns the encrypted value for a column containing Integer format data.

pty_IntegerEnc(data integer, dataElement string)

Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method

Result
string: Returns a string value

Example
select pty_IntegerEnc(column_name,'enc_3des') from table_name;

7.3.12 pty_IntegerDec()

This UDF returns the decrypted value for a column containing Integer format data.

pty_IntegerDec(data string, dataElement string)

Parameters
data: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method

Result
integer: Returns an integer value

Example
select pty_IntegerDec(column_name,'enc_3des') from table_name;

7.3.13 pty_IntegerIns()

This UDF returns the tokenized value for a column containing Integer format data.

pty_IntegerIns(data integer, dataElement string)

Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method

Result
integer: Returns the tokenized integer value

Example
select pty_IntegerIns(column_name,'integer_de') from table_name;

7.3.14 pty_IntegerSel()

This UDF returns the detokenized value for a column containing Integer format data.
pty_IntegerSel(data integer, dataElement string)

Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method

Result
integer: Returns the detokenized integer value

Example
select pty_IntegerSel(column_name,'integer_de') from table_name;

7.3.15 pty_FloatEnc()

This UDF returns the encrypted value for a column containing Float format data.

pty_FloatEnc(data float, dataElement string)

Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method

Result
string: Returns a string value

Example
select pty_FloatEnc(column_name,'enc_3des') from table_name;

7.3.16 pty_FloatDec()

This UDF returns the decrypted value for a column containing Float format data.

pty_FloatDec(data string, dataElement string)

Parameters
data: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method

Result
float: Returns a float value

Example
select pty_FloatDec(column_name,'enc_3des') from table_name;

7.3.17 pty_FloatIns()

This UDF returns the tokenized value for a column containing Float format data.

pty_FloatIns(data float, dataElement string)

Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method

Result
float: Returns the tokenized float value

Example
select pty_FloatIns(cast(12.3 as float), 'no_enc');

Ensure that you use a data element with the No Encryption method only. Using any other data element returns an error mentioning that the operation is not supported for that data type. If you need to tokenize the Float column, then load the Float column into a String column and use the pty_StringIns UDF to tokenize the column. For more information about the pty_StringIns UDF, refer to section 7.3.7 pty_StringIns().

7.3.18 pty_FloatSel()

This UDF returns the detokenized value for a column containing Float format data.

pty_FloatSel(data float, dataElement string)

Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method

Result
float: Returns the detokenized float value

Example
select pty_FloatSel(tokenized_value, 'no_enc');

Ensure that you use a data element with the No Encryption method only. Using any other data element returns an error mentioning that the operation is not supported for that data type.

7.3.19 pty_DoubleEnc()

This UDF returns the encrypted value for a column containing Double format data.

pty_DoubleEnc(data double, dataElement string)

Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method

Result
string: Returns a string value

Example
select pty_DoubleEnc(column_name,'enc_3des') from table_name;

7.3.20 pty_DoubleDec()

This UDF returns the decrypted value for a column containing Double format data.

pty_DoubleDec(data string, dataElement string)

Parameters
data: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method

Result
double: Returns a double value

Example
select pty_DoubleDec(column_name,'enc_3des') from table_name;

7.3.21 pty_DoubleIns()

This UDF returns the tokenized value for a column containing Double format data.
pty_DoubleIns(data double, dataElement string)

Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method

Result
double: Returns a double value

Example
select pty_DoubleIns(cast(1.2 as double), 'no_enc');

Ensure that you use a data element with the No Encryption method only. Using any other data element returns an error mentioning that the operation is not supported for that data type. If you need to tokenize the Double column, then load the Double column into a String column and use the pty_StringIns UDF to tokenize the column. For more information about the pty_StringIns UDF, refer to section 7.3.7 pty_StringIns().

7.3.22 pty_DoubleSel()

This UDF returns the detokenized value for a column containing Double format data.

pty_DoubleSel(data double, dataElement string)

Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method

Result
double: Returns the detokenized double value

Example
select pty_DoubleSel(tokenized_value, 'no_enc');

Ensure that you use a data element with the No Encryption method only. Using any other data element returns an error mentioning that the operation is not supported for that data type.

7.4 Inserting Data from a File into a Table

To insert data from a file into an Impala table, ensure that the required user permissions for the directory path in HDFS are assigned for the Impala table.

To prepare the environment for the basic_sample.csv file:
1. Assign permissions to the path where data from the basic_sample.csv file needs to be copied using the following command:
   sudo -u hdfs hadoop fs -chown root:root /tmp/basic_sample/sample/
2. Copy the data from the basic_sample.csv file into HDFS using the following command:
   hdfs dfs -put basic_sample.csv /tmp/basic_sample/sample/
3. Verify the presence of the basic_sample.csv file in the HDFS path using the following command:
   hdfs dfs -ls /tmp/basic_sample/sample/
4. Assign permissions for Impala to the path where the basic_sample.csv file is located using the following command:
   sudo -u hdfs hadoop fs -chown impala:supergroup /path/

To populate the table sample_table from the basic_sample_data.csv file:

The following commands populate the table sample_table with the data from the basic_sample_data.csv file.

create table sample_table(colname1 colname1_format, colname2 colname2_format, colname3 colname3_format) row format delimited fields terminated by ',';

LOAD DATA INPATH '/tmp/basic_sample/sample/' INTO TABLE sample_table;

Parameters
sample_table: Name of the Impala table created to load the data from the input CSV file from the required path.
colname1, colname2, colname3: Name of the columns.
colname1_format, colname2_format, colname3_format: The data types contained in the respective columns. The data types can only be of types STRING, INT, DOUBLE, or FLOAT.

In the example, the row format is delimited using the ',' character because the row format in the input file is comma separated. If the input file is tab separated, then the row format is delimited using '\t'.

7.5 Protecting Existing Data

To protect existing data, users should define the mappings between the columns and their respective data elements in the data security policy.
The following commands ingest cleartext data from the basic_sample table into the basic_sample_protected table in protected form using Impala UDFs.

create table basic_sample_protected (colname1 colname1_format, colname2 colname2_format, colname3 colname3_format);

insert into basic_sample_protected(colname1, colname2, colname3) select pty_stringins(colname1, dataElement1), pty_stringins(colname2, dataElement2), pty_stringins(colname3, dataElement3) from basic_sample;

Parameters
basic_sample_protected: Table to store protected data.
colname1, colname2, colname3: Name of the columns.
dataElement1, dataElement2, dataElement3: The data elements corresponding to the columns.
basic_sample: Table containing the original data in cleartext form.

7.6 Unprotecting Protected Data

To unprotect protected data, you need to specify the name of the table which contains the protected data, the table which will store the unprotected data, and the columns and their respective data elements. Ensure that the user performing the task has permissions to unprotect the data, as required, in the data security policy.

The following commands unprotect the protected data in a table and store the data in cleartext form in a different table, if the user has the required permissions.

create table table_unprotected (colname1 colname1_format, colname2 colname2_format, colname3 colname3_format);

insert into table_unprotected (colname1, colname2, colname3) select pty_stringsel(colname1, dataElement1), pty_stringsel(colname2, dataElement2), pty_stringsel(colname3, dataElement3) from table_protected;

Parameters
table_unprotected: Table to store unprotected data.
colname1, colname2, colname3: Name of the columns.
dataElement1, dataElement2, dataElement3: The data elements corresponding to the columns.
table_protected: Table containing protected data.

7.7 Retrieving Data from a Table

To retrieve data from a table, the user needs to have access to the table.

The following command displays the data contained in the table.

select * from table;

Parameters
table: Name of the table.

7.8 Sample Use Cases

For information about the Impala protector sample use cases, refer to section 11.9 Protecting Data using Impala.

8 HAWQ

HAWQ is an MPP SQL processing engine for querying the data stored in a Hadoop cluster. It breaks complex queries into smaller tasks and distributes their execution to the query processing units. HAWQ is an MPP database which uses HDFS to store data. It has the following components:
• HAWQ Master Server: Enables users to interact with HAWQ using client programs, such as PSQL, or APIs, such as JDBC or ODBC. The HAWQ Master Server performs the following functions:
  o Authenticates client connections
  o Processes incoming SQL commands
  o Distributes workload among HAWQ segments
  o Coordinates the results returned by each segment
  o Presents the final results to the client application
• Name Node: Enables client applications to locate a file.
• HAWQ Segments: The units which process the individual data modules simultaneously.
• HAWQ Storage: HDFS, which stores all the table data.
• Interconnect Switch: The networking layer of HAWQ, which handles the communication between the segments.

This section provides information about the HAWQ protector, the UDFs provided, and the commands for protecting and unprotecting data in a HAWQ table.
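All client access to HAWQ, including the Protegrity HAWQ UDFs described in the rest of this chapter, goes through the HAWQ Master Server over a standard SQL connection. As a quick, hypothetical sanity check before running larger protect or unprotect jobs, you can confirm from a psql session on the master that the Protegrity UDFs respond; the database name and user below are placeholders for your environment, and pty_GetVersion() and pty_WhoAmI() are described in section 8.3.

psql -d <database> -U <hawq_user> -c "select pty_GetVersion(), pty_WhoAmI();"

If the UDFs are installed and the policy is deployed, this query returns the PEP version and the logged in user name.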
8.1 Overview of the HAWQ Protector

The Protegrity HAWQ protector extends the functionality of the HAWQ processing engine and provides UDFs which protect or unprotect the data as it is stored or retrieved.

8.2 HAWQ Protector Usage

The Protegrity HAWQ protector provides UDFs for protecting data using encryption or tokenization, and unprotecting data by using decryption or detokenization.

Ensure that the format of the data is either Varchar, Integer, Date, or Real.

Ensure that HAWQ is configured after the Big Data Protector is installed. For more information about configuring HAWQ, refer to section 3.1.11 Configuring HAWQ.

8.3 HAWQ UDFs

This section describes all HAWQ UDFs that are available for protection and unprotection in Big Data Protector to build secure Big Data applications.

8.3.1 pty_GetVersion()

This UDF returns the PEP version number.

pty_GetVersion()

Parameters
None

Returns
This UDF returns the current PEP server version.

Example
select pty_GetVersion();

8.3.2 pty_WhoAmI()

This UDF returns the logged in user name.

pty_WhoAmI()

Parameters
None

Returns
This UDF returns the currently logged in user name.

Example
select pty_WhoAmI();

8.3.3 pty_GetCurrentKeyId()

This UDF returns the current active key identification number of the encryption type data element.

pty_GetCurrentKeyId(dataElement varchar)

Parameters
dataElement: Variable specifying the protection method

Returns
This UDF returns the current key identification number of the encryption type data element, which is passed as the parameter.

Example
select pty_GetCurrentKeyId('enc_de');

8.3.4 pty_GetKeyId()

This UDF returns the key ID for the encryption data element, used for protecting each row in a table.

pty_GetKeyId(dataElement string, col byte[])

Parameters
dataElement: Variable specifying the protection method
col: Byte array of the column in the table

Returns
This UDF returns the key ID for the encryption data element, used for protecting each row in the table.

Example
select pty_GetKeyId('enc_de',table_name.c) from table_name;

8.3.5 pty_VarcharEnc()

This UDF returns the encrypted value for a column containing varchar format data.

pty_VarcharEnc(col varchar, dataElement varchar)

Parameters
col: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the encrypted value as a byte array.

Example
select pty_VarcharEnc(column_name,'enc_de') from table_name;

8.3.6 pty_VarcharDec()

This UDF returns the decrypted value for a column containing varchar format protected data.

pty_VarcharDec(col byte[], dataElement varchar)

Parameters
col: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the decrypted value.

Example
select pty_VarcharDec(column_name,'enc_de') from table_name;

8.3.7 pty_VarcharHash()

This UDF returns the hashed value for a column containing varchar format data.

pty_VarcharHash(col varchar, dataElement varchar)

Parameters
col: Column name of the data to hash in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the hashed value as a byte array.

Example
select pty_VarcharHash(column_name,'hash_de') from table_name;

8.3.8 pty_VarcharIns()

This UDF returns the tokenized value for a column containing varchar format data.
pty_VarcharIns(col varchar, dataElement varchar)

Parameters
col: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the tokenized value as a byte array.

Example
select pty_VarcharIns(column_name,'alpha_num_tk_de') from table_name;

8.3.9 pty_VarcharSel()

This UDF returns the detokenized value for a column containing varchar format tokenized data.

pty_VarcharSel(col varchar, dataElement varchar)

Parameters
col: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the detokenized value.

Example
select pty_VarcharSel(column_name,'alpha_num_tk_de') from table_name;

8.3.10 pty_UnicodeVarcharIns()

This UDF protects varchar (Unicode) values.

pty_UnicodeVarcharIns(col varchar, dataElement varchar)

Parameters
col: Column name of the varchar (Unicode) data to protect
dataElement: Name of the data element to protect the varchar (Unicode) data

This UDF should be used only if you need to tokenize Unicode data in HAWQ, and migrate the tokenized data from HAWQ to a Teradata database and detokenize the data using the Protegrity Database Protector. Ensure that you use this UDF with a Unicode tokenization data element only.

For more information about migrating tokenized Unicode data to a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.

Returns
This UDF returns the protected varchar value.

Example
select pty_UnicodeVarcharIns(column_name, 'Token_unicode') from temp_table;

8.3.11 pty_UnicodeVarcharSel()

This UDF unprotects varchar (Unicode) values.

pty_UnicodeVarcharSel(col varchar, dataElement varchar)

Parameters
col: Column name of the varchar data to unprotect
dataElement: Name of the data element to unprotect the varchar data

This UDF should be used only if you need to tokenize Unicode data in Teradata using the Protegrity Database Protector, and migrate the tokenized data from a Teradata database to HAWQ and detokenize the data using the Protegrity Big Data Protector for HAWQ. Ensure that you use this UDF with a Unicode tokenization data element only.

For more information about migrating tokenized Unicode data to a Teradata database, refer to section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.

Returns
This UDF returns the unprotected varchar (Unicode) value.

Example
select pty_UnicodeVarcharSel(column_name, 'Token_unicode') from temp_table;

8.3.12 pty_IntegerEnc()

This UDF returns the encrypted value for a column containing integer format data.

pty_IntegerEnc(col integer, dataElement varchar)

Parameters
col: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the encrypted value as a byte array.

Example
select pty_IntegerEnc(column_name,'enc_de') from table_name;

8.3.13 pty_IntegerDec()

This UDF returns the decrypted value for a column containing encrypted data in byte array format.

pty_IntegerDec(col byte[], dataElement varchar)

Parameters
col: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the decrypted value.

Example
select pty_IntegerDec(column_name,'enc_de') from table_name;

8.3.14 pty_IntegerHash()

This UDF returns the hashed value for a column, containing integer format data, as a byte array.
pty_IntegerHash(col integer, dataElement varchar)

Parameters
col: Column name of the data to hash in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the hashed value as a byte array.

Example
select pty_IntegerHash(column_name,'hash_de') from table_name;

8.3.15 pty_IntegerIns()

This UDF returns the tokenized value for a column containing integer format data.

pty_IntegerIns(col integer, dataElement varchar)

Parameters
col: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the tokenized value.

Example
select pty_IntegerIns(column_name,'int_tk_de') from table_name;

8.3.16 pty_IntegerSel()

This UDF returns the detokenized value for a column containing integer format data.

pty_IntegerSel(col integer, dataElement varchar)

Parameters
col: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the detokenized value.

Example
select pty_IntegerSel(column_name,'int_tk_de') from table_name;

8.3.17 pty_DateEnc()

This UDF returns the encrypted value for a column containing date format data.

pty_DateEnc(col date, dataElement varchar)

Parameters
col: Date column to encrypt in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the encrypted value as a byte array.

Example
select pty_DateEnc(column_name,'enc_de') from table_name;

8.3.18 pty_DateDec()

This UDF returns the decrypted value for a column containing encrypted data in byte array format.

pty_DateDec(col byte[], dataElement varchar)

Parameters
col: Date column to decrypt in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the decrypted value.

Example
select pty_DateDec(column_name,'enc_de') from table_name;

8.3.19 pty_DateHash()

This UDF returns the hashed value for a column containing data in date format.

pty_DateHash(col date, dataElement varchar)

Parameters
col: Date column to hash in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the hashed value as a byte array.

Example
select pty_DateHash(column_name,'hash_de') from table_name;

8.3.20 pty_DateIns()

This UDF returns the tokenized value for a column containing data in date format.

pty_DateIns(col date, dataElement varchar)

Parameters
col: Date column to tokenize in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the tokenized value as a date.

If the date provided is outside the range described in the Protection Methods Reference 6.5.2, then an error message appears in the psql shell and the transaction is aborted. An audit log entry is not generated for this issue.

Example
select pty_DateIns(column_name,'date_tk_de') from table_name;

8.3.21 pty_DateSel()

This UDF returns the detokenized value for a column containing data in date format.

pty_DateSel(col date, dataElement varchar)

Parameters
col: Date column to detokenize in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the detokenized value as a date.

Example
select pty_DateSel(column_name,'date_tk_de') from table_name;

8.3.22 pty_RealEnc()

This UDF returns the encrypted value for a column containing data in decimal format.
pty_RealEnc(col real, dataElement varchar)

Parameters
col: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the encrypted value in byte array format.

Example
select pty_RealEnc(column_name,'enc_de') from table_name;

8.3.23 pty_RealDec()

This UDF returns the decrypted value for a column containing encrypted data in byte array format.

pty_RealDec(col real, dataElement varchar)

Parameters
col: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the decrypted value in real format.

Example
select pty_RealDec(column_name,'enc_de') from table_name;

8.3.24 pty_RealHash()

This UDF returns the hashed value for a column containing data in real format.

pty_RealHash(col real, dataElement varchar)

Parameters
col: Column name of the data to hash in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the hashed value as a byte array.

Example
select pty_RealHash(column_name,'hash_de') from table_name;

8.3.25 pty_RealIns()

This UDF returns the tokenized value for a column containing data in real format.

If a decimal value is used, then it is tokenized by first loading the decimal type column into a varchar type column and then using pty_VarcharIns() to tokenize this column. Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.

pty_RealIns(col real, dataElement varchar)

Parameters
col: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method

Returns
This UDF returns the tokenized value in real format.

Example
select pty_RealIns(column_name,'noenc_de') from table_name;

8.3.26 pty_RealSel()

This UDF returns the detokenized value for a column containing data in real format.

Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data. If it is used with any other type of data, then an error is reported explaining that the data type is unsupported.

pty_RealSel(col real, dataElement varchar)

Parameters
col: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method

Returns
This UDF returns the detokenized value.

Example
select pty_RealSel(column_name,'noenc_de') from table_name;

8.4 Inserting Data from a File into a Table

To populate the table sample_table from the basic_sample_data.csv file:

The following command creates the table sample_table, with the required number of columns.

create table sample_table (colname1 colname1_format, colname2 colname2_format, colname3 colname3_format) distributed randomly;

The following command grants permissions for the table sample_table to the required user (shown as the placeholder <user>), which will be used to perform the protect or unprotect operations.

grant all on sample_table to <user>;

The following command enables you to populate the table sample_table with the data from the basic_sample_data.csv file from the /samples/data directory.

\copy sample_table from '/opt/protegrity/samples/data/basic_sample_data.csv' with delimiter ','

Parameters
sample_table: Name of the HAWQ table created to load the data from the input CSV file from the required path.
colname1, colname2, colname3: Name of the columns.
colname1_format, colname2_format, colname3_format: The data types contained in the respective columns. The data types can only be of types VARCHAR, INTEGER, DATE, or REAL.

In the example, the delimiter is the ',' character because the rows in the input file are comma separated. If the input file is tab separated, then the delimiter is '\t'.

8.5 Protecting Existing Data

To protect existing data, users should define the mappings between the columns and their respective data elements in the data security policy.

The following commands create the table basic_sample_protected to store the protected data.

drop table if exists basic_sample_protected;

create table basic_sample_protected (colname1 colname1_format, colname2 colname2_format, colname3 colname3_format) distributed randomly;

Ensure that the user performing the task has the permissions to protect the data, as required, in the data security policy.

The following command ingests cleartext data from the basic_sample table into the basic_sample_protected table in protected form using HAWQ UDFs.

insert into basic_sample_protected(colname1, colname2, colname3) select colname1, pty_varcharins(colname2,dataElement2), pty_varcharins(colname3,dataElement3) from basic_sample;

Parameters
basic_sample_protected: Table to store protected data.
colname1, colname2, colname3: Name of the columns.
dataElement2, dataElement3: The data elements corresponding to the columns.
basic_sample: Table containing the original data in cleartext form.

8.6 Unprotecting Protected Data

To unprotect protected data, you need to specify the name of the table which contains the protected data, the table which will store the unprotected data, and the columns and their respective data elements. Ensure that the user performing the task has permissions to unprotect the data, as required, in the data security policy.

The following commands create the table table_unprotected to store the unprotected data.

drop table if exists table_unprotected;

create table table_unprotected (colname1 colname1_format, colname2 colname2_format, colname3 colname3_format) distributed randomly;

The following command retrieves the unprotected data and saves it in the table_unprotected table.

insert into table_unprotected (colname1, colname2, colname3) select colname1, pty_varcharsel(colname2,dataElement2), pty_varcharsel(colname3,dataElement3) from table_protected;

Parameters
table_unprotected: Table to store unprotected data.
colname1, colname2, colname3: Name of the columns.
dataElement2, dataElement3: The data elements corresponding to the columns.
table_protected: Table containing protected data.

8.7 Retrieving Data from a Table

To retrieve data from a table, the user needs to have access to the table.

The following command displays the data contained in the table.

select * from table;

Parameters
table: Name of the table.

8.8 Sample Use Cases

For information about the HAWQ protector sample use cases, refer to section 11.10 Protecting Data using HAWQ.

9 Spark

Spark is an execution engine that carries out batch processing of jobs in-memory and handles a wider range of computational workloads. In addition to processing a batch of stored data, Spark is capable of manipulating data in real time.
Spark leverages the physical memory of the Hadoop system and utilizes Resilient Distributed Datasets (RDDs) to store the data in-memory, which lowers latency if the data fits in memory. The data is saved on the hard drive only if required.

As RDDs are the basic units of abstraction and computation in Spark, you can use the protection and unprotection APIs provided by the Spark protector when performing transformation operations on an RDD. If you need to use the Spark Protector API in a Spark Java job, then you must implement the function interface as per the Spark Java programming specifications and subsequently use it in the required transformation of an RDD to tokenize the data.

This section provides information about the Spark protector, the APIs provided, and the commands for protecting and unprotecting data in a file by using the respective Spark APIs for protection or unprotection. In addition, it provides information about Spark SQL, which is a module that adds relational data processing capabilities to the Spark APIs, and a sample program for Spark Scala.

9.1 Overview of the Spark Protector

The Protegrity Spark protector extends the functionality of the Spark engine and provides APIs that protect or unprotect the data as it is stored or retrieved.

9.2 Spark Protector Usage

The Protegrity Spark protector provides APIs for protecting and reprotecting the data using encryption or tokenization, and unprotecting data by using decryption or detokenization.

Ensure that Spark is configured after the Big Data Protector is installed. For more information about configuring Spark, refer to section 3.1.12 Configuring Spark.

9.3 Spark APIs

This section describes the Spark APIs (Java) available for protection and unprotection in the Big Data Protector to build secure Big Data applications.

The Protegrity Spark protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that supports byte as input and provides byte as output, then data corruption might occur.

9.3.1 getVersion()

This function returns the current version of the Spark protector.

public String getVersion()

Parameters
None

Result
This function returns the current version of the Spark protector.

Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String version = protector.getVersion();

Exception
PtySparkProtectorException: If unable to return the current version of the Spark protector

9.3.2 getCurrentKeyId()

This method returns the current Key ID for a data element that was created with the KEY ID attribute, such as AES-256, AES-128, and so on.

public int getCurrentKeyId(String dataElement)

Parameters
dataElement: Name of the data element

Result
This method returns the current Key ID for the data element containing the KEY ID attribute.

Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
int keyId = protector.getCurrentKeyId("AES-256");

Exception
PtySparkProtectorException: If unable to return the current Key ID for the data element

9.3.3 checkAccess()

This method checks the access of the user for the specified data element.
public boolean checkAccess(String dataElement, Permission permission)
Parameters
dataElement: Name of the data element
permission: Type of access the user has to the data element
Result
true: If the user has access to the data element
false: If the user does not have access to the data element
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
boolean accessType = protector.checkAccess(dataElement, Permission.PROTECT);
Exception
PtySparkProtectorException: If unable to verify the access of the user for the data element

9.3.4 getDefaultDataElement()
This method returns the default data element configured in the security policy.
public String getDefaultDataElement(String policyName)
Parameters
policyName: Name of the policy configured using Policy management in the ESA
Result
Default data element name configured in the security policy.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = protector.getDefaultDataElement("sample_policy");
Exception
PtySparkProtectorException: If unable to return the default data element name

9.3.5 hmac()
This method hashes a single data item using the HMAC operation with a data element that is associated with HMAC. It returns the HMAC value of the data for that data element.
public byte[] hmac(String dataElement, byte[] input)
Parameters
dataElement: Name of the data element for HMAC
input: Byte array of data for HMAC
Result
Byte array of HMAC data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
byte[] output = protector.hmac("HMAC-SHA1", "test1".getBytes());
Exception
PtySparkProtectorException: If unable to protect data

9.3.6 protect()
Protects the data provided as a byte array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, byte[][] input, byte[][] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the error indexes
input: Array of byte arrays of data to be protected
output: Array of byte arrays containing the protected data
The Protegrity Spark protector only supports bytes converted from the string data type. If int, short, or long format data is directly converted to bytes and passed as input to the API that accepts byte input and provides byte output, data corruption might occur.
If you use the protect API that accepts byte input and provides byte output, ensure that the corresponding unprotect API with byte input and byte output is used when unprotecting the data. In addition, ensure that the byte data provided as input to the protect API has been converted from the string data type only (see the sketch after this subsection).
Result
The output variable in the method signature contains the protected data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "Binary";
byte[][] input = new byte[][]{"test1".getBytes(), "test2".getBytes()};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
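Since the note above restricts byte-based protection to bytes derived from strings, the following minimal sketch illustrates that conversion. The data element name "Token_Element" and the choice of UTF-8 are illustrative assumptions and are not prescribed by the product; the unprotect call is not shown here.

// Illustrative sketch only.
// Assumed imports: java.nio.charset.StandardCharsets, java.util.ArrayList, java.util.List
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);

String[] values = {"test1", "test2"};

// Convert string data to bytes with an explicit charset, as the protector only
// supports bytes converted from the string data type.
byte[][] input = new byte[values.length][];
for (int i = 0; i < values.length; i++) {
    input[i] = values[i].getBytes(StandardCharsets.UTF_8);
}

byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect("Token_Element", errorIndexList, input, output);   // assumed data element name

// When the data is later unprotected, the corresponding byte-in/byte-out unprotect API
// must be used, and the resulting bytes converted back to strings with the same charset.

Using an explicit charset on both the protect and unprotect sides avoids mismatches caused by platform-dependent default charsets.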
9.3.7 protect()
Protects the short format data provided as a short array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, short[] input, short[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the error indexes
input: Short array of data to be protected
output: Short array containing the protected data
Result
The output variable in the method signature contains the protected data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "short";
short[] input = new short[]{1234, 4545};
short[] output = new short[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data

9.3.8 protect()
Encrypts the short format data provided as a short array. The type of encryption applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, short[] input, byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the error indexes
input: Short array of data to be encrypted
output: Array of byte arrays containing the encrypted data
Result
The output variable in the method signature contains the encrypted data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "AES-256";
short[] input = new short[]{1234, 4545};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data

9.3.9 protect()
Protects the data provided as an int array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, int[] input, int[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the error indexes
input: Int array of data to be protected
output: Int array containing the protected data
Result
The output variable in the method signature contains the protected int data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "int";
int[] input = new int[]{1234, 4545};
int[] output = new int[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
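Each protect() overload above populates the errorIndex list passed by the caller. Assuming the list collects the positions of input items that could not be protected (an interpretation for illustration, not a statement taken from this guide), a caller might inspect it as follows; the data element name "int" is reused from the example above.

// Illustrative sketch only.
// Assumption (not confirmed by this guide): the list holds the zero-based positions
// of input items that could not be protected.
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
int[] input = new int[]{1234, 4545};
int[] output = new int[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect("int", errorIndexList, input, output);

if (errorIndexList.isEmpty()) {
    // All items were protected; output is safe to use.
} else {
    for (Integer index : errorIndexList) {
        System.err.println("Protection failed for input at position " + index);
    }
}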
9.3.10 protect()
Encrypts the data provided as an int array. The type of encryption applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, int[] input, byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the error indexes
input: Int array of data to be encrypted
output: Array of byte arrays containing the encrypted data
Result
The output variable in the method signature contains the encrypted data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "AES-256";
int[] input = new int[]{1234, 4545};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data

9.3.11 protect()
Protects the data provided as a long array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, long[] input, long[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the error indexes
input: Long array of data to be protected
output: Long array containing the protected data
Result
The output variable in the method signature contains the protected data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "long";
long[] input = new long[]{1234, 4545};
long[] output = new long[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data

9.3.12 protect()
Encrypts the data provided as a long array. The type of encryption applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, long[] input, byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the error indexes
input: Long array of data to be encrypted
output: Array of byte arrays containing the encrypted data
Result
The output variable in the method signature contains the encrypted data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "AES-256";
long[] input = new long[]{1234, 4545};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data

9.3.13 protect()
Protects the data provided as a float array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, float[] input, float[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the error indexes
input: Float array of data to be protected
output: Float array containing the protected data
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.
Result
The output variable in the method signature contains the protected float data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "float";
float[] input = new float[]{123.4f, 454.5f};
float[] output = new float[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data

9.3.14 protect()
Encrypts the data provided as a float array. The type of encryption applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, float[] input, byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the error indexes
input: Float array of data to be encrypted
output: Array of byte arrays containing the encrypted data
Ensure that you use a data element with either the No Encryption method or an Encryption data element only. Using any other data element might cause corruption of data.
Result
The output variable in the method signature contains the encrypted data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "AES-256";
float[] input = new float[]{123.4f, 454.5f};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data

9.3.15 protect()
Protects the data provided as a double array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, double[] input, double[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the error indexes
input: Double array of data to be protected
output: Double array containing the protected data
Ensure that you use a data element with the No Encryption method only. Using any other data element might cause corruption of data.
Result
The output variable in the method signature contains the protected double data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "double";
double[] input = new double[]{123.4, 454.5};
double[] output = new double[input.length];
List