Protegrity Big Data Protector Guide
Release 6.6.5
Copyright
Copyright © 2004-2017 Protegrity Corporation. All rights reserved.
Protegrity products are protected by and subject to patent protections.
Patent: http://www.protegrity.com/patents
The Protegrity logo is a trademark of Protegrity Corporation.
NOTICE TO ALL PERSONS RECEIVING THIS DOCUMENT
Some of the product names mentioned herein are used for identification purposes only and may be
trademarks and/or registered trademarks of their respective owners.
Windows, MS-SQL Server, Internet Explorer and Internet Explorer logo, Active Directory, and Hyper-V
are registered trademarks of Microsoft Corporation in the United States and/or other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
SCO and SCO UnixWare are registered trademarks of The SCO Group.
Sun, Oracle, Java, and Solaris, and their logos are the trademarks or registered trademarks of Oracle
Corporation and/or its affiliates in the United States and other countries.
Teradata and the Teradata logo are the trademarks or registered trademarks of Teradata Corporation
or its affiliates in the United States and other countries.
Hadoop or Apache Hadoop, Hadoop elephant logo, HDFS, Hive, Pig, HBase, and Spark are trademarks
of Apache Software Foundation.
Cloudera, Impala, and the Cloudera logo are trademarks of Cloudera and its suppliers or licensors.
Hortonworks and the Hortonworks logo are the trademarks of Hortonworks, Inc. in the United States
and other countries.
Greenplum is the registered trademark of EMC Corporation in the U.S. and other countries.
Pivotal HD and HAWQ are the registered trademarks of Pivotal, Inc. in the U.S. and other countries.
MapR logo is a registered trademark of MapR Technologies, Inc.
PostgreSQL or Postgres is the copyright of The PostgreSQL Global Development Group and The
Regents of the University of California.
IBM and the IBM logo, z/OS, AIX, DB2, Netezza, and BigInsights are trademarks or registered
trademarks of International Business Machines Corporation in the United States, other countries, or
both.
Utimaco Safeware AG is a member of the Sophos Group.
Jaspersoft, the Jaspersoft logo, and JasperServer products are trademarks and/or registered
trademarks of Jaspersoft Corporation in the United States and in jurisdictions throughout the world.
Xen, XenServer, and Xen Source are trademarks or registered trademarks of Citrix Systems, Inc.
and/or one or more of its subsidiaries, and may be registered in the United States Patent and
Trademark Office and in other countries.
VMware, the VMware “boxes” logo and design, Virtual SMP and VMotion are registered trademarks or
trademarks of VMware, Inc. in the United States and/or other jurisdictions.
HP is a registered trademark of the Hewlett-Packard Company.
Dell is a registered trademark of Dell Inc.
Novell is a registered trademark of Novell, Inc. in the United States and other countries.
POSIX is a registered trademark of the Institute of Electrical and Electronics Engineers, Inc.
Mozilla and Firefox are registered trademarks of the Mozilla Foundation.
Chrome is a registered trademark of Google Inc.
Contents
Copyright ............................................................................................................................. I
1 Introduction to this Guide ....................................................................................... 14
1.1. Sections contained in this Guide .................................................................................... 14
1.2. Protegrity Documentation Suite .................................................................................... 14
1.5 Glossary..................................................................................................................... 15
2 Overview of the Big Data Protector ......................................................................... 16
2.1 Components of Hadoop ................................................................................................ 16
2.1.1 Hadoop Distributed File System (HDFS) ..................................................................... 17
2.1.2 MapReduce ............................................................................................................. 17
2.1.3 Hive ...................................................................................................................... 17
2.1.4 Pig ........................................................................................................................ 17
2.1.5 HBase .................................................................................................................... 17
2.1.6 Impala ................................................................................................................... 17
2.1.7 HAWQ .................................................................................................................... 18
2.1.8 Spark .................................................................................................................... 18
2.2 Features of Protegrity Big Data Protector........................................................................ 18
2.3 Using Protegrity Data Security Platform with Hadoop ....................................................... 20
2.4 Overview of Hadoop Application Protection ..................................................................... 21
2.4.1 Protection in MapReduce Jobs ................................................................................... 21
2.4.2 Protection in Hive Queries ........................................................................................ 21
2.4.3 Protection in Pig Jobs ............................................................................................... 22
2.4.4 Protection in HBase ................................................................................................. 22
2.4.5 Protection in Impala ................................................................................................ 22
2.4.6 Protection in HAWQ ................................................................................................. 22
2.4.7 Protection in Spark .................................................................................................. 22
2.5 HDFS File Protection (HDFSFP)...................................................................................... 23
2.6 Ingesting Data Securely ............................................................................................... 23
2.6.1 Ingesting Data Using ETL Tools and File Protector Gateway (FPG) ................................. 23
2.6.2 Ingesting Files Using Hive Staging ............................................................................. 23
2.6.3 Ingesting Files into HDFS by HDFSFP ......................................................................... 23
2.7 Data Security Policy and Protection Methods ................................................................... 23
3 Installing and Uninstalling Big Data Protector ........................................................ 25
3.1 Installing Big Data Protector on a Cluster ....................................................................... 25
3.1.1 Verifying Prerequisites for Installing Big Data Protector ................................................ 25
3.1.2 Extracting Files from the Installation Package ............................................................. 27
3.1.3 Updating the BDP.config File ..................................................................................... 28
3.1.4 Installing Big Data Protector ..................................................................................... 29
3.1.5 Applying Patches ..................................................................................................... 33
3.1.6 Installing the DFSFP Service ..................................................................................... 33
3.1.7 Configuring HDFSFP................................................................................................. 34
3.1.8 Configuring HBase ................................................................................................... 36
3.1.9 Configuring Impala .................................................................................................. 37
3.1.10 Configuring HAWQ ................................................................................................... 38
3.1.11 Configuring Spark ................................................................................................... 38
3.2 Installing or Uninstalling Big Data Protector on Specific Nodes .......................................... 39
3.2.1 Installing Big Data Protector on New Nodes added to a Hadoop Cluster .......................... 39
3.2.2 Uninstalling Big Data Protector from Selective Nodes in the Hadoop Cluster .................... 39
3.3 Utilities ...................................................................................................................... 40
3.3.1 PEP Server Control .................................................................................................. 40
3.3.2 Update Cluster Policy ............................................................................................... 40
3.3.3 Protegrity Cache Control .......................................................................................... 41
3.3.4 Recover Utility ........................................................................................................ 41
3.4 Uninstalling Big Data Protector from a Cluster ................................................................. 42
3.4.1 Verifying the Prerequisites for Uninstalling Big Data Protector ....................................... 42
3.4.2 Removing the Cluster from the ESA ........................................................................... 42
3.4.3 Uninstalling Big Data Protector from the Cluster .......................................................... 42
4 Hadoop Application Protector .................................................................................. 47
4.1 Using the Hadoop Application Protector .......................................................................... 47
4.2 Prerequisites............................................................................................................... 47
4.3 Samples ..................................................................................................................... 47
4.4 MapReduce APIs ......................................................................................................... 47
4.4.1 openSession()......................................................................................................... 48
4.4.2 closeSession() ........................................................................................................ 48
4.4.3 getVersion() ........................................................................................................... 48
4.4.4 getCurrentKeyId() ................................................................................................... 49
4.4.5 checkAccess() ......................................................................................................... 49
4.4.6 getDefaultDataElement().......................................................................................... 50
4.4.7 protect() ................................................................................................................ 50
4.4.8 protect() ................................................................................................................ 51
4.4.9 protect() ................................................................................................................ 51
4.4.10 unprotect() ............................................................................................................. 51
4.4.11 unprotect() ............................................................................................................. 52
4.4.12 unprotect() ............................................................................................................. 52
4.4.13 bulkProtect() .......................................................................................................... 53
4.4.14 bulkProtect() .......................................................................................................... 54
4.4.15 bulkProtect() .......................................................................................................... 55
4.4.16 bulkUnprotect() ...................................................................................................... 56
4.4.17 bulkUnprotect() ...................................................................................................... 58
4.4.18 bulkUnprotect() ...................................................................................................... 59
4.4.19 reprotect() ............................................................................................................. 60
4.4.20 reprotect() ............................................................................................................. 61
4.4.21 reprotect() ............................................................................................................. 61
4.4.22 hmac() .................................................................................................................. 62
4.5 Hive UDFs .................................................................................................................. 62
4.5.1 ptyGetVersion() ...................................................................................................... 62
4.5.2 ptyWhoAmI() .......................................................................................................... 63
4.5.3 ptyProtectStr()........................................................................................................ 63
4.5.4 ptyUnprotectStr() .................................................................................................... 64
4.5.5 ptyReprotect() ........................................................................................................ 64
4.5.6 ptyProtectUnicode() ................................................................................................. 65
4.5.7 ptyUnprotectUnicode() ............................................................................................. 66
4.5.8 ptyReprotectUnicode() ............................................................................................. 66
4.5.9 ptyProtectInt() ........................................................................................................ 67
4.5.10 ptyUnprotectInt() .................................................................................................... 68
4.5.11 ptyReprotect() ........................................................................................................ 69
4.5.12 ptyProtectFloat() ..................................................................................................... 69
4.5.13 ptyUnprotectFloat() ................................................................................................. 70
4.5.14 ptyReprotect() ........................................................................................................ 71
4.5.15 ptyProtectDouble() .................................................................................................. 71
4.5.16 ptyUnprotectDouble() .............................................................................................. 72
4.5.17 ptyReprotect() ........................................................................................................ 73
4.5.18 ptyProtectBigInt() ................................................................................................... 74
4.5.19 ptyUnprotectBigInt() ............................................................................................... 74
4.5.20 ptyReprotect() ........................................................................................................ 75
4.5.21 ptyProtectDec() ...................................................................................................... 76
4.5.22 ptyUnprotectDec() ................................................................................................... 76
4.5.23 ptyProtectHiveDecimal() .......................................................................................... 77
4.5.24 ptyUnprotectHiveDecimal() ....................................................................................... 78
4.5.25 ptyReprotect() ........................................................................................................ 78
4.6 Pig UDFs .................................................................................................................... 79
4.6.1 ptyGetVersion() ...................................................................................................... 79
4.6.2 ptyWhoAmI() .......................................................................................................... 80
4.6.3 ptyProtectInt() ........................................................................................................ 80
4.6.4 ptyUnprotectInt() .................................................................................................... 81
4.6.5 ptyProtectStr()........................................................................................................ 81
4.6.6 ptyUnprotectStr() .................................................................................................... 81
5 HDFS File Protector (HDFSFP) ................................................................................. 83
5.1 Overview of HDFSFP .................................................................................................... 83
5.2 Features of HDFSFP ..................................................................................................... 83
5.3 Protector Usage .......................................................................................................... 83
5.4 File Recover Utility ...................................................................................................... 83
5.5 HDFSFP Commands ..................................................................................................... 84
5.5.1 copyFromLocal ........................................................................................................ 84
5.5.2 put ........................................................................................................................ 84
5.5.3 copyToLocal............................................................................................................ 84
5.5.4 get ........................................................................................................................ 85
5.5.5 cp ......................................................................................................................... 85
5.5.6 mkdir..................................................................................................................... 85
5.5.7 mv ........................................................................................................................ 86
5.5.8 rm ......................................................................................................................... 86
5.5.9 rmr ....................................................................................................................... 86
5.6 Ingesting Files Securely ............................................................................................... 87
5.7 Extracting Files Securely .............................................................................................. 87
5.8 HDFSFP Java API ......................................................................................................... 87
5.8.1 copy ...................................................................................................................... 87
5.8.2 copyFromLocal ........................................................................................................ 88
5.8.3 copyToLocal............................................................................................................ 89
5.8.4 deleteFile ............................................................................................................... 89
5.8.5 deleteDir ................................................................................................................ 90
5.8.6 mkdir..................................................................................................................... 90
5.8.7 move ..................................................................................................................... 91
5.9 Developing Applications using HDFSFP Java API .............................................................. 92
5.9.1 Setting up the Development Environment .................................................................. 92
5.9.2 Protecting Data using the Class file ............................................................................ 92
5.9.3 Protecting Data using the JAR file .............................................................................. 92
5.9.4 Sample Program for the HDFSFP Java API .................................................................. 92
5.10 Quick Reference Tasks ................................................................................................. 94
5.10.1 Protecting Existing Data ........................................................................................... 94
5.10.2 Reprotecting Files .................................................................................................... 95
5.11 Sample Demo Use Case ............................................................................................... 95
5.12 Appliance components of HDFSFP .................................................................................. 95
5.12.1 Dfsdatastore Utility .................................................................................................. 95
5.12.2 Dfsadmin Utility ...................................................................................................... 95
5.13 Access Control Rules for Files and Folders ...................................................................... 95
5.14 Using the DFS Cluster Management Utility (dfsdatastore) ................................................. 95
5.14.1 Adding a Cluster for Protection .................................................................................. 96
5.14.2 Updating a Cluster ................................................................................................... 97
5.14.3 Removing a Cluster ................................................................................................. 98
5.14.4 Monitoring a Cluster ................................................................................................ 99
5.14.5 Searching a Cluster ............................................................................................... 100
5.14.6 Listing all Clusters ................................................................................................. 101
5.15 Using the ACL Management Utility (dfsadmin) ............................................................... 101
5.15.1 Adding an ACL Entry for Protecting Directories in HDFS .............................................. 101
5.15.2 Updating an ACL Entry ........................................................................................... 103
5.15.3 Reprotecting Files or Folders ................................................................................... 104
5.15.4 Deleting an ACL Entry to Unprotect Files or Directories .............................................. 104
5.15.5 Activating Inactive ACL Entries ............................................................................... 105
5.15.6 Viewing the ACL Activation Job Progress Information in the Interactive Mode................ 106
5.15.7 Viewing the ACL Activation Job Progress Information in the Non Interactive Mode ......... 107
5.15.8 Searching ACL Entries ............................................................................................ 108
5.15.9 Listing all ACL Entries ............................................................................................ 108
5.16 HDFS Codec for Encryption and Decryption................................................................... 109
6 HBase .................................................................................................................... 110
6.1 Overview of the HBase Protector ................................................................................. 110
6.2 HBase Protector Usage ............................................................................................... 110
6.3 Adding Data Elements and Column Qualifier Mappings to a New Table ............................. 110
6.4 Adding Data Elements and Column Qualifier Mappings to an Existing Table ...................... 111
6.5 Inserting Protected Data into a Protected Table ............................................................. 111
6.6 Retrieving Protected Data from a Table ........................................................................ 111
6.7 Protecting Existing Data ............................................................................................. 112
6.8 HBase Commands ..................................................................................................... 112
6.8.1 put ...................................................................................................................... 112
6.8.2 get ...................................................................................................................... 112
6.8.3 scan .................................................................................................................... 113
6.9 Ingesting Files Securely ............................................................................................. 113
6.10 Extracting Files Securely ............................................................................................ 113
6.11 Sample Use Cases ..................................................................................................... 113
7 Impala .................................................................................................................. 114
7.1 Overview of the Impala Protector ................................................................................ 114
7.2 Impala Protector Usage .............................................................................................. 114
7.3 Impala UDFs ............................................................................................................. 114
7.3.1 pty_GetVersion() .................................................................................................. 114
7.3.2 pty_WhoAmI() ...................................................................................................... 115
7.3.3 pty_GetCurrentKeyId() .......................................................................................... 115
7.3.4 pty_GetKeyId() ..................................................................................................... 115
7.3.5 pty_StringEnc() .................................................................................................... 115
7.3.6 pty_StringDec() .................................................................................................... 116
7.3.7 pty_StringIns() ..................................................................................................... 116
7.3.8 pty_StringSel() ..................................................................................................... 116
7.3.9 pty_UnicodeStringIns() .......................................................................................... 117
7.3.10 pty_UnicodeStringSel() .......................................................................................... 117
7.3.11 pty_IntegerEnc() ................................................................................................... 118
7.3.12 pty_IntegerDec() .................................................................................................. 118
7.3.13 pty_IntegerIns() ................................................................................................... 118
7.3.14 pty_IntegerSel() ................................................................................................... 118
7.3.15 pty_FloatEnc() ...................................................................................................... 119
7.3.16 pty_FloatDec() ...................................................................................................... 119
7.3.17 pty_FloatIns() ....................................................................................................... 119
7.3.18 pty_FloatSel() ....................................................................................................... 120
7.3.19 pty_DoubleEnc() ................................................................................................... 120
7.3.20 pty_DoubleDec() ................................................................................................... 121
7.3.21 pty_DoubleIns() .................................................................................................... 121
7.3.22 pty_DoubleSel() .................................................................................................... 121
7.4 Inserting Data from a File into a Table ......................................................................... 122
7.5 Protecting Existing Data ............................................................................................. 123
7.6 Unprotecting Protected Data ....................................................................................... 123
7.7 Retrieving Data from a Table ...................................................................................... 123
7.8 Sample Use Cases ..................................................................................................... 124
8 HAWQ .................................................................................................................... 125
8.1 Overview of the HAWQ Protector ................................................................................. 125
8.2 HAWQ Protector Usage .............................................................................................. 125
8.3 HAWQ UDFs ............................................................................................................. 125
8.3.1 pty_GetVersion() .................................................................................................. 125
8.3.2 pty_WhoAmI() ...................................................................................................... 126
8.3.3 pty_GetCurrentKeyId() .......................................................................................... 126
8.3.4 pty_GetKeyId() ..................................................................................................... 126
8.3.5 pty_VarcharEnc() .................................................................................................. 126
8.3.6 pty_VarcharDec() .................................................................................................. 127
8.3.7 pty_VarcharHash() ................................................................................................ 127
8.3.8 pty_VarcharIns() ................................................................................................... 127
8.3.9 pty_VarcharSel() ................................................................................................... 128
8.3.10 pty_UnicodeVarcharIns() ....................................................................................... 128
8.3.11 pty_UnicodeVarcharSel() ........................................................................................ 128
8.3.12 pty_IntegerEnc() ................................................................................................... 129
8.3.13 pty_IntegerDec() .................................................................................................. 129
8.3.14 pty_IntegerHash() ................................................................................................. 129
8.3.15 pty_IntegerIns() ................................................................................................... 130
8.3.16 pty_IntegerSel() ................................................................................................... 130
8.3.17 pty_DateEnc() ...................................................................................................... 130
8.3.18 pty_DateDec() ...................................................................................................... 130
8.3.19 pty_DateHash() .................................................................................................... 131
8.3.20 pty_DateIns() ....................................................................................................... 131
8.3.21 pty_DateSel() ....................................................................................................... 131
8.3.22 pty_RealEnc() ....................................................................................................... 132
8.3.23 pty_RealDec()....................................................................................................... 132
8.3.24 pty_RealHash() ..................................................................................................... 132
8.3.25 pty_RealIns() ....................................................................................................... 132
8.3.26 pty_RealSel() ....................................................................................................... 133
8.4 Inserting Data from a File into a Table ......................................................................... 133
8.5 Protecting Existing Data ............................................................................................. 134
8.6 Unprotecting Protected Data ....................................................................................... 134
8.7 Retrieving Data from a Table ...................................................................................... 135
8.8 Sample Use Cases ..................................................................................................... 135
9 Spark..................................................................................................................... 136
9.1 Overview of the Spark Protector .................................................................................. 136
9.2 Spark Protector Usage ............................................................................................... 136
9.3 Spark APIs ............................................................................................................... 136
9.3.1 getVersion() ......................................................................................................... 136
9.3.2 getCurrentKeyId() ................................................................................................. 137
9.3.3 checkAccess() ....................................................................................................... 137
9.3.4 getDefaultDataElement()........................................................................................ 138
9.3.5 hmac() ................................................................................................................ 138
9.3.6 protect() .............................................................................................................. 138
9.3.7 protect() .............................................................................................................. 139
9.3.8 protect() .............................................................................................................. 140
9.3.9 protect() .............................................................................................................. 140
9.3.10 protect() .............................................................................................................. 141
9.3.11 protect() .............................................................................................................. 141
9.3.12 protect() .............................................................................................................. 142
9.3.13 protect() .............................................................................................................. 142
9.3.14 protect() .............................................................................................................. 143
9.3.15 protect() .............................................................................................................. 143
9.3.16 protect() .............................................................................................................. 144
9.3.17 protect() .............................................................................................................. 145
9.3.18 protect() .............................................................................................................. 145
9.3.19 unprotect() ........................................................................................................... 146
9.3.20 unprotect() ........................................................................................................... 146
9.3.21 unprotect() ........................................................................................................... 147
9.3.22 unprotect() ........................................................................................................... 148
9.3.23 unprotect() ........................................................................................................... 148
9.3.24 unprotect() ........................................................................................................... 149
9.3.25 unprotect() ........................................................................................................... 149
9.3.26 unprotect() ........................................................................................................... 150
9.3.27 unprotect() ........................................................................................................... 151
9.3.28 unprotect() ........................................................................................................... 151
9.3.29 unprotect() ........................................................................................................... 152
9.3.30 unprotect() ........................................................................................................... 152
9.3.31 unprotect() ........................................................................................................... 153
9.3.32 reprotect() ........................................................................................................... 154
9.3.33 reprotect() ........................................................................................................... 154
9.3.34 reprotect() ........................................................................................................... 155
9.3.35 reprotect() ........................................................................................................... 155
9.3.36 reprotect() ........................................................................................................... 156
9.3.37 reprotect() ........................................................................................................... 157
9.3.38 reprotect() ........................................................................................................... 157
9.4 Displaying the Cleartext Data from a File ..................................................................... 158
9.5 Protecting Existing Data ............................................................................................. 158
9.6 Unprotecting Protected Data ....................................................................................... 158
9.7 Retrieving the Unprotected Data from a File ................................................................. 159
9.8 Spark APIs and Supported Protection Methods .............................................................. 159
9.9 Sample Use Cases ..................................................................................................... 160
9.10 Spark SQL ................................................................................................................ 160
9.10.1 DataFrames .......................................................................................................... 161
9.10.2 SQLContext .......................................................................................................... 161
9.10.3 Accessing the Hive Protector UDFs .......................................................................... 161
9.10.4 Sample Use Cases ................................................................................................. 162
9.11 Spark Scala .............................................................................................................. 162
9.11.1 Sample Use Cases ................................................................................................. 162
10 Data Node and Name Node Security with File Protector ........................................ 163
10.1 Features of the Protegrity File Protector ....................................................................... 163
10.1.1 Protegrity File Encryption ....................................................................................... 163
10.1.2 Protegrity Volume Encryption .................................................................................. 163
10.1.3 Protegrity Access Control ....................................................................................... 163
11 Appendix: Return Codes ........................................................................................ 164
12 Appendix: Samples ................................................................................................ 169
12.1 Roles in the Samples ................................................................................................. 170
12.2 Data Elements in the Security Policy ............................................................................ 170
12.3 Role-based Permissions for Data Elements in the Sample ............................................... 171
12.4 Data Used by the Samples ......................................................................................... 171
12.5 Protecting Data using MapReduce ................................................................................ 171
12.5.1 Basic Use Case ..................................................................................................... 172
12.5.2 Role-based Use Cases ............................................................................................ 173
12.5.3 Sample Code Usage ............................................................................................... 176
12.6 Protecting Data using Hive ......................................................................................... 179
12.6.1 Basic Use Case ..................................................................................................... 179
12.6.2 Role-based Use Cases ............................................................................................ 181
12.7 Protecting Data using Pig ........................................................................................... 183
12.7.1 Basic Use Case ..................................................................................................... 184
12.7.2 Role-based Use Cases ............................................................................................ 185
12.8 Protecting Data using HBase ....................................................................................... 189
12.8.1 Basic Use Case ..................................................................................................... 189
12.8.2 Role-based Use Cases ............................................................................................ 190
12.9 Protecting Data using Impala ...................................................................................... 195
12.9.1 Basic Use Case ..................................................................................................... 195
12.9.2 Role-based Use Cases ............................................................................................ 197
12.10 Protecting Data using HAWQ .................................................................................... 201
12.10.1 Basic Use Case ..................................................................................................... 201
12.10.2 Role-based Use Cases ............................................................................................ 203
12.11 Protecting Data using Spark ..................................................................................... 207
12.11.1 Basic Use Case ..................................................................................................... 208
12.11.2 Role-based Use Cases ............................................................................................ 209
12.11.3 Sample Code Usage for Spark (Java) ....................................................................... 212
12.11.4 Sample Code Usage for Spark (Scala) ...................................................................... 217
13 Appendix: HDFSFP Demo ....................................................................................... 221
13.1 Roles in the Demo ..................................................................................................... 221
13.2 HDFS Directories used in Demo ................................................................................... 221
13.3 User Permissions for HDFS Directories ......................................................................... 221
13.4 Prerequisites for the Demo ......................................................................................... 222
13.5 Running the Demo .................................................................................................... 224
13.5.1 Protecting Existing Data in HDFS ............................................................................. 224
13.5.2 Ingesting Data into a Protected Directory ................................................................. 225
13.5.3 Ingesting Data into an Unprotected Public Directory .................................................. 225
13.5.4 Reading the Data by Authorized Users ..................................................................... 225
13.5.5 Reading the Data by Unauthorized Users .................................................................. 226
13.5.6 Copying Data from One Directory to Another by Authorized Users ............................... 226
13.5.7 Copying Data from One Directory to Another by Unauthorized Users ........................... 227
13.5.8 Deleting Data by Authorized Users .......................................................................... 227
13.5.9 Deleting Data by Unauthorized Users ....................................................................... 228
13.5.10 Copying Data to a Public Directory by Authorized Users ............................................. 228
13.5.11 Running MapReduce Job by Authorized Users ........................................................... 228
13.5.12 Reading Data for Analysis by Authorized Users.......................................................... 229
14 Appendix: Using Hive with HDFSFP ....................................................................... 230
14.1 Data Used by the Samples ......................................................................................... 230
14.2 Ingesting Data to Hive Table ...................................................................................... 230
14.2.1 Ingesting Data from HDFSFP Protected External Hive Table to HDFSFP Protected Internal Hive Table ............. 230
14.2.2 Ingesting Protected Data from HDFSFP Protected Hive Table to another HDFSFP Protected Hive Table ............. 231
14.3 Tokenization and Detokenization with HDFSFP .............................................................. 232
14.3.1 Verifying Prerequisites for Using Hadoop Application Protector .................................... 232
14.3.2 Ingesting Data from HDFSFP Protected External Hive Table to HDFSFP Protected Internal Hive Table in Tokenized Form ............. 232
14.3.3 Ingesting Detokenized Data from HDFSFP Protected Internal Hive Table to HDFSFP Protected External Hive Table ............. 233
14.3.4 Ingesting Data from HDFSFP Protected External Hive Table to Internal Hive Table not protected by HDFSFP in Tokenized Form ............. 233
14.3.5 Ingesting Detokenized Data from Internal Hive Table not protected by HDFSFP to HDFSFP Protected External Hive Table ............. 234
15 Appendix: Configuring Talend with HDFSFP .......................................................... 235
15.1 Verifying Prerequisites before Configuring Talend with HDFSFP ....................................... 235
15.2 Verifying the Talend Packages .................................................................................... 235
15.3 Configuring Talend with HDFSFP ................................................................................. 235
15.4 Starting a Project in Talend ........................................................................................ 236
15.5 Configuring the Preferences for Talend ......................................................................... 237
15.6 Ingesting Data in the Target HDFS Directory in Protected Form....................................... 238
15.7 Accessing the Data from the Protected Directory in HDFS ............................................... 243
15.8 Configuring Talend Jobs to run with HDFSFP with Target Exec as Remote ......................... 247
15.9 Using Talend with HDFSFP and MapReduce ................................................................... 249
15.9.1 Protecting Data Using Talend with HDFSFP and MapReduce ........................................ 249
15.9.2 Unprotecting Data Using Talend with HDFSFP and MapReduce .................................... 252
15.9.3 Sample Code Usage ............................................................................................... 254
16 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database ... 257
16.1 Migrating Tokenized Unicode Data from a Teradata Database ......................................... 257
16.2 Migrating Tokenized Unicode Data to a Teradata Database ............................................. 258
1 Introduction to this Guide
This guide provides information about installing, configuring, and using the Protegrity Big Data
Protector (BDP) for Hadoop.
1.1. Sections contained in this Guide
The guide is broadly divided into the following sections:
Section 1 Introduction to this Guide defines the purpose and scope for this guide. In
addition, it explains how information is organized in this guide.
Section 2 Overview of the Big Data Protector provides an overview of Hadoop and how it
integrates with the Big Data Protector. In addition, it describes the protection coverage
for the various Hadoop ecosystem applications, such as MapReduce, Hive, and Pig, and
provides information about HDFS File Protection (HDFSFP).
Section 3 Installing and Uninstalling Big Data Protector includes information common to all
distributions, such as the installation prerequisites, the installation procedure, and
uninstalling the product from the cluster. In addition, it provides information about the
tools and utilities.
Section 4 Hadoop Application Protector provides information about Hadoop Application
Protector. In addition, it covers information about MapReduce APIs and Hive and Pig UDFs.
Section 5 HDFS File Protector (HDFSFP) provides information about protecting files
stored in HDFS using HDFSFP and the commands it supports.
Section 6 HBase provides information about the Protegrity HBase protector.
Section 7 Impala provides information about the Protegrity Impala protector.
Section 8 HAWQ provides information about the Protegrity HAWQ protector.
Section 9 Spark provides information about the Protegrity Spark protector. In addition, it
provides information about Spark SQL and Spark Scala.
Section 10 Data Node and Name Node Security with File Protector provides information
about the protection of the Data and Name nodes using the File Protector.
Section 11 Appendix: Return Codes provides information about all possible error codes and
error descriptions for Big Data Protector.
Section 12 Appendix: Samples provides information about sample data protection for
MapReduce, Hive, Pig, HBase, Impala, HAWQ, and Spark using Big Data Protector.
Section 13 Appendix: HDFSFP Demo provides information about sample data protection for
HDFSFP using Big Data Protector.
Section 14 Appendix: Using Hive with HDFSFP provides information about using Hive with
HDFSFP.
Section 15 Appendix: Configuring Talend with HDFSFP provides the procedures for
configuring Talend with HDFSFP.
Section 16 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database
describes procedures for migrating tokenized Unicode data from and to a Teradata
database.
1.2. Protegrity Documentation Suite
The Protegrity Documentation Suite comprises the following documents:
Protegrity Documentation Master Index Release 6.6.5
Protegrity Appliances Overview Release 6.6.5
Protegrity Enterprise Security Administrator Guide Release 6.6.5
Protegrity File Protector Gateway Server User Guide Release 6.6.4
Protegrity Protection Server Guide Release 6.6.5
Protegrity Data Security Platform Feature Guide Release 6.6.5
Protegrity Data Security Platform Licensing Guide Release 6.6
Protegrity Data Security Platform Upgrade Guide Release 6.6.5
Protegrity Reports Guide Release 6.6.5
Protegrity Troubleshooting Guide Release 6.6.5
Protegrity Application Protector Guide Release 6.5 SP2
Protegrity Big Data Protector Guide Release 6.6.5
Protegrity Database Protector Guide Release 6.6.5
Protegrity File Protector Guide Release 6.6.4
Protegrity Protection Enforcement Point Servers Installation Guide Release 6.6.5
Protegrity Protection Methods Reference Release 6.6.5
Protegrity Row Level Protector Guide Release 6.6.5
Protegrity Enterprise Security Administrator Quick Start Guide Release 6.6
Protegrity File Protector Gateway Server Quick Start Guide Release 6.6.2
Protegrity Protection Server Quick Start Guide Release 6.6
1.5 Glossary
This section includes Protegrity-specific terms, products, and abbreviations used in this document.
Big Data Protector (BDP): The API for protecting data on platforms such as Hive, Impala, and HBase.
Enterprise Security Administrator (ESA): The soft appliance used to create and manage security
policies. The DPS roles relate to the security policy in the ESA and control the access
permissions to the Access Keys. For instance, if a user does not have the required
DPS role, then the user does not have access to the Access Keys.
Protegrity Data Protection System (DPS): The entire system where security policies
are defined and enforced, including the ESA and the Protectors.
2 Overview of the Big Data Protector
The Protegrity Big Data Protector for Apache Hadoop is based on the Protegrity Application Protector.
Data is split and distributed across all the data nodes in the Hadoop cluster. The Big Data Protector is
deployed on each of these nodes along with the PEP Server, with which the protection enforcement
policies are shared.
The Protegrity Big Data Protector is scalable and new nodes can be added as required. It is cost
effective since massively parallel computing is done on commodity servers, and it is flexible as it can
work with data from any number of sources. The Big Data Protector is fault tolerant as the system
redirects the work to another node if a node is lost. It can handle all types of data, such as structured
and unstructured data, irrespective of their native formats.
The Big Data Protector protects data handled by various Hadoop applications and protects
files stored in the cluster. MapReduce, Hive, Pig, HBase, and Impala can use Protegrity protection
interfaces to protect data as it is stored in or retrieved from the Hadoop cluster. All standard protection
techniques offered by Protegrity are applicable to the Big Data Protector.
For more information about the available protection options, such as data types, Tokenization or
Encryption types, or length preserving and non-preserving tokens, refer to Protection Methods
Reference Guide 6.6.5.
2.1 Components of Hadoop
The Big Data Protector works on the Hadoop framework as shown in the following figure.
Figure 2-1 Hadoop Components
The illustration of Hadoop components is an example.
Based on requirements, the components of Hadoop might be different.
Hadoop interfaces have been used extensively to develop the Big Data Protector. It is a common
deployment practice to utilize Hadoop Distributed File System (HDFS) to store the data, and let
MapReduce process the data and store the result back in HDFS.
[Figure 2-1 depicts BI applications accessing data through a data access framework (Hive, Pig, HBase, and other components), a data processing framework (MapReduce), and the data storage framework (HDFS).]
2.1.1 Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) spans across all nodes in a Hadoop cluster for data storage.
It links together the file systems on many nodes to make them into one big file system. HDFS
assumes that nodes will fail, so data is replicated across multiple nodes to achieve reliability.
2.1.2 MapReduce
The MapReduce framework assigns work to every node in large clusters of commodity machines.
MapReduce programs are sets of instructions to parse the data, create a map or index, and aggregate
the results. Since data is distributed across multiple nodes, MapReduce programs run in parallel,
working on smaller sets of data.
A MapReduce job is executed by splitting it into small Map tasks, and these tasks are executed
on the node where a portion of the data is stored. If a node containing the required data is saturated
and not able to execute a task, then MapReduce shifts the task to the least busy node by replicating
the data to that node. A Reduce task combines the results from multiple Map tasks and stores them
back in HDFS.
2.1.3 Hive
The Hive framework resides above Hadoop to enable ad hoc queries on the data in Hadoop. Hive
supports HiveQL, which is similar to SQL. Hive translates a HiveQL query into a MapReduce program
and then sends it to the Hadoop cluster.
2.1.4 Pig
Pig is a high-level platform for creating MapReduce programs used with Hadoop.
2.1.5 HBase
HBase is a column-oriented datastore, meaning it stores data by columns rather than by rows. This
makes certain data access patterns much less expensive than with traditional row-oriented relational
database systems. The data in HBase is protected transparently using Protegrity HBase coprocessors.
2.1.6 Impala
Impala is an MPP SQL query engine for querying the data stored in a cluster. It provides the flexibility
of the SQL format and is capable of running queries on data stored in HDFS and HBase.
The following are the core components of Impala:
Impala daemon (impalad): This daemon runs on each node in the cluster. It reads and writes
the data in the files and accepts queries from the Impala shell command.
Impala Statestore (statestored): This component checks the health of the Impala daemons
on all the nodes in the cluster. If a node is unavailable due to an error or failure,
then the Impala Statestore informs all the other nodes about the failed node to ensure
that new queries are not sent to it.
Impala Catalog (catalogd): This component is responsible for communicating any changes
in the metadata received from Impala SQL statements to all the nodes in the cluster.
2.1.7 HAWQ
HAWQ is an MPP database, which uses several Postgres database instances and HDFS storage. The
database is distributed across HAWQ segments, which enable it to achieve data and processing
parallelism.
Since HAWQ uses the Postgres engine for processing queries, the query language is similar to
PostgreSQL. Users connect to the HAWQ Master and interact using SQL statements, similar to a
Postgres database.
The following are the core components of HAWQ:
HAWQ Master Server: Enables users to interact with HAWQ using client programs, such as
psql, or APIs, such as JDBC or ODBC.
Name Node: Enables client applications to locate a file.
HAWQ Segments: The units that process the individual data portions in parallel.
HAWQ Storage: HDFS, which stores all the table data.
Interconnect Switch: The networking layer of HAWQ, which handles the communication
between the segments.
2.1.8 Spark
Spark is an execution engine that carries out batch processing of jobs in-memory and handles a
wider range of computational workloads. In addition to processing a batch of stored data, Spark is
capable of manipulating data in real time.
Spark leverages the physical memory of the Hadoop cluster and utilizes Resilient Distributed
Datasets (RDDs) to store data in memory, which lowers latency when the data fits in the available
memory. The data is saved to disk only if required.
2.2 Features of Protegrity Big Data Protector
The Protegrity Big Data Protector (Big Data Protector) uses patent-pending vaultless tokenization
and central policy control for access management and secures sensitive data at rest in the following
areas:
Data in HDFS
Data used during MapReduce, Hive and Pig processing, and with HBase, Impala, HAWQ, and
Spark
Data traversing enterprise data systems
The data is protected from internal and external threats, and users and business processes can
continue to utilize the secured data.
Data protection may use encryption or tokenization. In tokenization, data is converted into similar-looking
inert data, known as tokens, where the data format and type can be preserved. For example, a
format-preserving token for a 16-digit payment card number is itself a 16-digit number. These tokens
can be detokenized back to the original values when required.
Protegrity secures files with volume encryption and also protects data inside files using tokenization
and strong encryption protection methods. Depending on the user's access rights and the policies set
using Policy management in the ESA, this data can be unprotected for authorized use.
The Protegrity Hadoop Big Data Protector provides the following features:
Provides fine grained field-level protection within the MapReduce, Hive, Pig, HBase, and Spark
frameworks.
Provides directory and file level protection (encryption).
Retains distributed processing capability as field-level protection is applied to the data.
Protects data in the Hadoop cluster using role-based administration with a centralized security
policy.
Provides logging and viewing of data access activities, and real-time alerts, through a centralized
monitoring system.
Ensures minimal overhead for processing secured data, with minimal consumption of
resources, threads and processes, and network bandwidth.
Performs file and volume encryption including the protection of files on the local file system
of Hadoop nodes.
Provides transparent data protection and row level filtering based on the user profile with
Protegrity HBase protectors.
Transparently protects files processed by MapReduce and Hive in HDFS using HDFSFP.
The following figure illustrates the various components in an Enterprise Hadoop ecosystem.
Figure 2-2 Enterprise Hadoop Components
Currently, Protegrity supports MapReduce, Hive, Pig, and HBase, which utilize HDFS as the data
storage layer. The following points can be used as general guidelines:
Sqoop: Sqoop can be used for ingestion into an HDFSFP-protected zone (for Hortonworks,
Cloudera, and Pivotal HD).
Beeline, Beeswax, and Hue on Cloudera: Beeline, Beeswax, and Hue are certified with Hive
protector and Hive with HDFSFP integrations.
Beeline, Beeswax, and Hue on Hortonworks & Pivotal HD: Beeline, Beeswax, and Hue are
certified with Hive protector and Hive with HDFSFP integrations.
Ranger (Hortonworks): Ranger is certified to work with the Hive protector and Hive with
HDFSFP integrations only.
Sentry (Cloudera): Sentry is certified with Hive protector, Hive with HDFSFP integrations, and
Impala protector only.
MapReduce and HDFSFP integration is certified with TEXTFILE format only.
Hive and HDFSFP integration is certified with TEXTFILE, RCFile, and SEQUENCEFILE formats
only.
Pig and HDFSFP integration is certified with TEXTFILE format only.
We neither support nor have certified other components in the Hadoop stack. We strongly
recommend consulting Protegrity, before using any unsupported components from the Hadoop
ecosystem with our products.
2.3 Using Protegrity Data Security Platform with Hadoop
To protect data, the components of the Protegrity Data Security Platform are integrated into the
Hadoop cluster as shown in the following figure.
Figure 2-3 Protegrity Data Security Platform with Hadoop
The Enterprise Security Administrator (ESA) is a soft appliance that is used to create and manage
policies. It must be pre-installed on a separate server.
The following figure illustrates the inbound and outbound ports that need to be allowed on the
network for communication between the ESA and the Big Data Protector nodes in a Hadoop cluster.
Figure 2-4 Inbound and Outbound Ports between the ESA and Big Data Protector Nodes
For more information about installing the ESA, and creating and managing policies, refer to Protegrity
Enterprise Security Administrator Guide Release 6.6.5.
To preserve the parallel nature of the system, a PEP Server is installed on every data node and
synchronized with the connection properties of the ESA.
Each task runs on a node under the same Hadoop user, and every user has a policy deployed for
running their jobs on this system. Hadoop manages the accounts and users. You can obtain the
Hadoop user information from the job configuration.
HDFS implements a permission model for files and directories based on the Portable Operating
System Interface (POSIX) model. Each file and directory is associated with an owner and a
group. Depending on the permissions granted, users of a file or directory are classified into
one of the following three groups (an example using the HDFS shell follows this list):
Owner
Other users in the group
All other users
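As a brief, hedged illustration of this model, the standard HDFS shell can be used to inspect and set the owner, group, and permission bits of a directory. The paths, user name, and group name below are hypothetical examples only.
hdfs dfs -ls /data/secure
# Lists the contents of the directory, showing the permission bits, owner, and group of each entry.
hdfs dfs -chown etluser:analytics /data/secure/in
# Assigns the owner (etluser) and group (analytics) for the directory.
hdfs dfs -chmod 750 /data/secure/in
# Grants full access to the owner, read and execute access to the group, and no access to other users.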
2.4 Overview of Hadoop Application Protection
This section describes the various levels of protection provided by Hadoop Application Protection.
2.4.1 Protection in MapReduce Jobs
A MapReduce job in the Hadoop cluster can involve sensitive data. You can use Protegrity interfaces to
protect data when it is saved to or retrieved from a protected source. The output data written by the
job can be encrypted or tokenized. The protected data can subsequently be used by other jobs in
the cluster in a secure manner. Field-level data can be secured and ingested into HDFS by
independent Hadoop jobs or other ETL tools.
For more information about secure ingestion of data in Hadoop, refer to section 2.6.2 Ingesting Files
Using Hive Staging.
For more information on the list of available APIs, refer to section 4.4 MapReduce APIs.
If Hive queries are created to operate on sensitive data, then you can use Protegrity Hive UDFs for
securing the data. While inserting data into Hive tables, or retrieving data from protected Hive table
columns, you can call the Protegrity UDFs loaded into Hive during installation. The UDFs protect data
based on the input parameters provided.
Secure ingestion of data into HDFS for Hive queries can be achieved by independent Hadoop
jobs or other ETL tools.
For more information about securely ingesting data in Hadoop, refer to section 2.6 Ingesting Data
Securely.
2.4.2 Protection in Hive Queries
Protection in Hive queries is performed by Protegrity Hive UDFs. Hive translates the HiveQL query,
including the UDF calls, into a MapReduce program and then sends it to the Hadoop cluster.
For more information on the list of available UDFs, refer to section 4.5 Hive UDFs.
2.4.3 Protection in Pig Jobs
Protection in Pig jobs is done by Protegrity Pig UDFs, which are similar in function to the Protegrity
UDFs in Hive.
For more information on the list of available UDFs, refer to section 4.6 Pig UDFs.
2.4.4 Protection in HBase
HBase is a database which provides random read and write access to tables, consisting of rows and
columns, in real time. HBase is designed to run on commodity servers, to scale automatically as
more servers are added, and is fault tolerant as data is divided across the servers in the cluster. HBase
tables are partitioned into multiple regions, and each region stores a range of rows in the table. Regions
contain an in-memory datastore and a persistent datastore (HFile). The HBase Master assigns multiple
regions to a region server. The Master manages the cluster, and the region servers store portions
of the HBase tables and perform the work on the data.
The Protegrity HBase protector extends the functionality of the data storage framework and provides
transparent data protection and unprotection using coprocessors, which provide the functionality to
run code directly on region servers. The Protegrity coprocessor for HBase runs on the region servers
and protects the data stored in the servers. All clients which work with HBase are supported.
The data is transparently protected or unprotected, as required, utilizing the coprocessor framework.
For more information about HBase, refer to section 6 HBase.
2.4.5 Protection in Impala
Impala is an MPP SQL query engine for querying the data stored in a cluster. It provides the flexibility
of the SQL format and is capable of running queries on data stored in HDFS and HBase.
The Protegrity Impala protector extends the functionality of the Impala query engine and provides
UDFs which protect or unprotect the data as it is stored or retrieved.
For more information about the Impala protector, refer to section 7 Impala.
2.4.6 Protection in HAWQ
HAWQ is an MPP database that distributes data across HAWQ segments, which enables it to achieve data and processing parallelism.
The Protegrity HAWQ protector provides UDFs for protecting data using encryption or tokenization,
and unprotecting data by using decryption or detokenization.
For more information about the HAWQ protector, refer to section 8 HAWQ.
2.4.7 Protection in Spark
Spark is an execution engine that carries out batch processing of jobs in-memory and handles a
wider range of computational workloads. In addition to processing a batch of stored data, Spark is
capable of manipulating data in real time.
The Protegrity Spark protector extends the functionality of the Spark engine and provides APIs that
protect or unprotect the data as it is stored or retrieved.
For more information about the Spark protector, refer to section 9 Spark.
2.5 HDFS File Protection (HDFSFP)
Files are stored and retrieved by Hadoop system elements, such as file shell commands, MapReduce,
Hive, Pig, HBase and so on. The stored files reside in HDFS and span multiple cluster nodes.
Most of the files in HDFS are plain text files stored in the clear, with POSIX-like access control. When
these files contain sensitive data, that data is vulnerable to exposure to unwanted users.
With HDFSFP, files are transparently protected as they are stored in HDFS, and their content is exposed
only to authorized users. The content of the files is transparently unprotected for processes or users
authorized to view and process the files. The user is automatically detected by HDFSFP from the job
information. The job accessing secured files must be initiated by an authorized user having
the required privileges in the ACL. The files encrypted by HDFSFP remain suitable for distributed
processing by Hadoop distributed jobs, such as MapReduce.
HDFSFP protects individual files or files stored in a directory. The access control is governed by the
security policy and ACL supplied by the security officer. The access control and security policy is
controlled through ESA interfaces. Command line and UI options are available to control ACL entries
for file paths and directories.
2.6 Ingesting Data Securely
This section describes the ways in which data can be secured and ingested by various jobs in Hadoop
at a field or file level.
2.6.1 Ingesting Data Using ETL Tools and File Protector Gateway
(FPG)
Protegrity provides the File Protector Gateway (FPG) for secure field-level protection when ingesting
data through ETL tools. The ingested file data can be used by Hadoop applications for analytics and
processing. The sensitive fields are secured by the FPG before Hadoop jobs operate on them.
For more information about field-level ingestion, including by custom MapReduce jobs for data at rest
in HDFS, refer to File Protector Gateway Server Guide 6.6.4.
2.6.2 Ingesting Files Using Hive Staging
Semi-structured data files can be loaded into a Hive staging table and then ingested into a Hive table
with Hive queries and Protegrity UDFs, as illustrated in the sketch below. After the data is loaded into
the target table, it is stored in protected form.
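The following HiveQL sketch illustrates this staging pattern. The table names, column names, the UDF name pty_protect, and the data element name SSN_Token are hypothetical placeholders only; refer to section 4.5 Hive UDFs for the actual Protegrity UDF names and signatures.
-- Load the semi-structured file into a clear-text staging table (hypothetical table name).
LOAD DATA INPATH '/landing/customers.csv' INTO TABLE customers_staging;
-- Insert into the target table, protecting the sensitive column with a Protegrity Hive UDF.
-- pty_protect and the data element name 'SSN_Token' are placeholders; see section 4.5 Hive UDFs.
INSERT INTO TABLE customers
SELECT id, pty_protect(ssn, 'SSN_Token'), name
FROM customers_staging;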
2.6.3 Ingesting Files into HDFS by HDFSFP
The HDFSFP component of Big Data Protector can be used for ingesting files securely in HDFS. It
provides granular access control for the files in HDFS. You can ingest files using the command shell
and Java API in HDFSFP.
For more information about using HDFSFP, refer to section 5 HDFS File Protector (HDFSFP).
2.7 Data Security Policy and Protection Methods
A data security policy establishes processes to ensure the security and confidentiality of sensitive
information. In addition, the data security policy establishes administrative and technical safeguards
against unauthorized access or use of the sensitive information.
Depending on the requirements, the data security policy typically performs the following functions:
Classifies the data that is sensitive for the organization.
Defines the methods to protect sensitive data, such as encryption and tokenization.
Defines the methods to present the sensitive data, such as masking the display of
sensitive information.
Defines the access privileges of the users that would be able to access the data.
Defines the time frame for privileged users to access the sensitive data.
Enforces the security policies at the location where sensitive data is stored.
Provides a means of auditing authorized and unauthorized accesses to the sensitive data.
In addition, it can also provide a means of auditing operations to protect and unprotect
the sensitive data.
The data security policy contains a number of components, such as data elements, datastores,
member sources, masks, and roles. The following list describes the functions of each of these entities:
Data elements define the data protection properties for protecting sensitive data,
consisting of the data securing method, data element type and its description. In addition,
Data elements describe the tokenization or encryption properties, which can be associated
with roles.
Datastores consist of enterprise systems, which might contain the data that needs to be
processed, where the policy is deployed and the data protection function is utilized.
Member sources are the external sources from which users (or members) and groups
of users are accessed. Examples are a file, database, LDAP, and Active Directory.
Masks are patterns of symbols and characters that, when imposed on a data field,
obscure its actual value from the user. Masks effectively aid in hiding sensitive data.
Roles define the levels of member access that are appropriate for various types of
information. Combined with a data element, roles determine and define the unique data
access privileges for each member.
For more information about the data security policies, protection methods, and the data elements
supported by the components of the Big Data Protector, refer to Protection Methods Reference Guide
6.6.5.
3 Installing and Uninstalling Big Data Protector
This section describes the procedure to install and uninstall the Big Data Protector.
3.1 Installing Big Data Protector on a Cluster
This section describes the tasks for installing Big Data Protector on a cluster.
Starting from the Big Data Protector 6.6.4 release, you do not require root access to install
Big Data Protector on a cluster.
You need a sudoer user account to install Big Data Protector on a cluster.
3.1.1 Verifying Prerequisites for Installing Big Data Protector
Ensure that the following prerequisites are met, before installing Big Data Protector:
The Hadoop cluster is installed, configured, and running.
ESA appliance version 6.6.5 is installed, configured, and running.
A sudoer user account with privileges to perform the following tasks:
o Update the system by modifying the configuration, permissions, or ownership
of directories and files.
o Perform third party configuration.
o Create directories and files.
o Modify the permissions and ownership for the created directories and files.
o Set the required permissions on the created directories and files for the Protegrity
Service Account.
o Permissions for using the SSH service.
The sudoer password is the same across the cluster.
The following user accounts are required to perform the necessary tasks:
o ADMINISTRATOR_USER: The sudoer user account that is responsible for
installing and uninstalling the Big Data Protector on the cluster.
This user account needs to have sudo access to install the product.
o EXECUTOR_USER: The user that has ownership of all Protegrity files,
folders, and services.
o OPERATOR_USER: The user responsible for performing tasks such as starting or
stopping tasks, monitoring services, updating the configuration, and
maintaining the cluster while the Big Data Protector is installed on it.
If you need to start, stop, or restart the Protegrity services, then this user needs
sudoer privileges to impersonate the EXECUTOR_USER.
Depending on the requirements, a single user on the system may
perform multiple roles.
If a single user is performing multiple roles, then ensure that the
following conditions are met:
The user has the required permissions and privileges to
impersonate the other user accounts, for performing their
roles, and perform tasks as the impersonated user.
The user is assigned the highest set of privileges, from the
required roles that it needs to perform, to execute the required
tasks.
For instance, if a single user is performing tasks as
ADMINISTRATOR_USER, EXECUTOR_USER, and
OPERATOR_USER, then ensure that the user is assigned the
privileges of the ADMINISTRATOR_USER.
The management scripts provided by the installer in the cluster_utils directory should
be run only by the user (OPERATOR_USER) having privileges to impersonate the
EXECUTOR_USER.
o If the value of the AUTOCREATE_PROTEGRITY_IT_USR parameter in the
BDP.config file is set to No, then ensure that a service group containing a user
for running the Protegrity services on all the nodes in the cluster already
exists.
o If the Hadoop cluster is configured with LDAP or AD for user management,
then ensure that the AUTOCREATE_PROTEGRITY_IT_USR parameter in the
BDP.config file is set to No and that the required service account user is
created on all the nodes in the cluster.
If the Big Data Protector with versions lower than 6.6.3 was previously installed with
HDFSFP, then ensure that you create the backup of DFSFP on the ESA.
For more information about creating the DFSFP backup, refer to section 4.1.4
Backing Up DFSFP before Installing Big Data Protector 6.6.3 in Data Security
Platform Upgrade Guide 6.6.5.
If Big Data Protector, version 6.6.3, with build version 6.6.3.15 or lower, was
previously installed, and any of the following Spark protector APIs for encryption or
decryption are utilized:
o public void protect(String dataElement, List<Integer> errorIndex, short[]
input, byte[][] output)
o public void protect(String dataElement, List<Integer> errorIndex, int[] input,
byte[][] output)
o public void protect(String dataElement, List<Integer> errorIndex, long[]
input, byte[][] output)
o public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, short[] output)
o public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, int[] output)
o public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, long[] output)
then refer to the Advisory for Spark Protector APIs before installing
Big Data Protector, version 6.6.5.
If the Big Data Protector was previously installed then uninstall it. In addition, delete
the <PROTEGRITY_DIR> directory from the Lead node. If the /var/log/protegrity/
directory exists on any node in the cluster, then ensure that it is empty.
Ensure that password-based authentication is enabled in the sshd_config file before installation
(an example sshd_config entry is shown after this list). After the installation is completed,
this setting can be reverted by the system administrator.
The lsb_release utility is present on the client machines, at least on the Lead node.
The Lead node can be any node, such as the Name node, Data node, or Edge node,
that can access the Hadoop cluster. The Lead node would be driving the installation of
the Big Data Protector across the Hadoop cluster and is responsible for managing the
Big Data Protector services throughout the cluster.
If the lsb_release utility is not present, then the installation of the Big Data Protector
fails. Verify that it is present by running the following command.
lsb_release
If you are configuring the Big Data Protector with a Kerberos-enabled Hadoop cluster,
then ensure that the HDFS superuser (hdfs) has a valid Kerberos ticket.
If you are configuring HDFSFP with Big Data Protector, then ensure that the following
prerequisites are met:
o Ensure that an unstructured policy is created in the ESA, containing the data
elements to be linked with the ACL.
o If a sticky bit is set for an HDFS directory, which is required to be protected by
HDFSFP, then the user needs to remove the sticky bit before creating ACLs (for
Protect/Reprotect/Unprotect/Update) for that HDFS directory. If required, then the
user can set the sticky bit again after activating the ACLs.
For more information about creating data elements, security policies, and user roles,
refer to Enterprise Security Administrator Guide 6.6.5 and Protection Enforcement
Point Servers Installation Guide 6.6.5.
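For the password-based SSH authentication prerequisite listed above, the relevant directive in the sshd_config file (typically /etc/ssh/sshd_config) is shown below. Restart the SSH service after changing it; the system administrator can revert the setting once the installation is complete.
# /etc/ssh/sshd_config
PasswordAuthentication yes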
3.1.2 Extracting Files from the Installation Package
To extract the files from the installation package:
1. After receiving the installation package from Protegrity, copy it to the Lead node in any
temporary folder, such as /opt/bigdata.
2. Extract the files from the installation package using the following command:
tar xf BigDataProtector_<OS>-<arch>-nCPU_<Big data distribution>-64_6.6.5.x.tgz
The following files are extracted:
BDP.config
BdpInstallx.x.x_Linux_<arch>_6.6.5.x.sh
FileProtector_<OS>_x86-<arch>_AccessControl_6.6.x.x.sh
FileProtector_<OS>_x86-<arch>_ClusterDeploy_6.6.x.x.sh
FileProtector_<OS>_x86-<arch>_FileEncryption_6.6.x.x.sh
FileProtector_<OS>_x86-<arch>_PreInstallCheck_6.6.x.x.sh
FileProtector_<OS>_x86-<arch>_VolumeEncryption_6.6.x.x.sh
FP_ClusterDeploy_hosts
INSTALL.txt
JpepLiteSetup_Linux_<arch>_6.6.5.x.sh
node_uninstall.sh
PepHbaseProtectorx.x.xSetup_Linux_<arch>_<distribution>-x.x_6.6.5.x.sh
PepHdfsFp_Setup_<distribution>-x.x_6.6.5.x.sh
PepHivex.x.xSetup_Linux_<arch>_<distribution>-x.x_6.6.5.x.sh
PepImpalax.xSetup_<OS>_x86-<arch>_6.6.5.x.sh, only if it is a Cloudera or MapR
distribution
PepHawqx.xSetup_<OS>_x86-<arch>_6.6.5.x.sh, only if it is a Pivotal distribution
PepMapreducex.x.xSetup_Linux_<arch>_<distribution>-x.x_6.6.5.x.sh
PepPigx.x.xSetup_Linux_<arch>_<distribution>-x.x_6.6.5.x.sh
PepServer_Setup_Linux_<arch>_6.6.5.x.sh
PepSparkx.x.xSetup_Linux_<arch>_<distribution>-x.x_6.6.5.x.sh
PepTalendSetup_x.x.x_6.6.5.x.sh
Prepackaged_Policyx.x.x_Linux_<arch>_6.6.5.x.sh
ptyLogAnalyzer.sh
ptyLog_Consolidator.sh
samples-mapreduce.tar
samples-spark.tar
uninstall.sh
XCPep2Jni_Setup_Linux_<arch>_6.6.5.x.sh
3.1.3 Updating the BDP.config File
Ensure that the BDP.config file is updated before the Big Data Protector is installed.
Do not update the BDP.config file when the installation of the Big Data Protector is in
progress.
To update the BDP.config file:
1. Create a file containing a list of all nodes in the cluster, except the Lead node, and specify it
in the BDP.config file.
This file is used by the installer for installing Big Data Protector on the nodes.
2. Open the BDP.config file in any text editor and modify the following parameter values:
HADOOP_DIR: The installation home directory for the Hadoop distribution.
PROTEGRITY_DIR: The directory where the Big Data Protector will be installed.
The samples and examples used in this document assume that the Big Data Protector
is installed in the /opt/protegrity/ directory.
CLUSTERLIST_FILE: This file contains the host names or IP addresses of all the nodes
in the cluster, except the Lead node, listing one host name or IP address per line.
Ensure that you specify the file name with the complete path.
INSTALL_DEMO: Specifies one of the following values, as required:
o Yes: The installer installs the demo.
o No: The installer does not install the demo.
HDFSFP: Specifies one of the following values, as required:
o Yes: The installer installs HDFSFP.
o No: The installer does not install HDFSFP.
If HDFSFP is being installed, then XCPep2Jni is installed using the
XCPep2Jni_Setup_Linux_<arch>_6.6.5.x.sh script.
SPARK_PROTECTOR: Specifies one of the following values, as required:
o Yes: The installer installs the Spark protector. This parameter also needs to
be set to Yes if the user needs to run Hive UDFs with Spark SQL, or use the
Spark protector samples if the INSTALL_DEMO parameter is set to Yes.
o No: The installer does not install the Spark protector.
IP_NN: The IP address of the Lead node in the Hadoop cluster, which is required
for the installation of HDFSFP.
PROTEGRITY_CACHE_PORT: The Protegrity Cache port used in the cluster. This port
should be open in the firewall across the cluster. On the Lead node, it should be open
only for the corresponding ESA, which is used to manage the cluster protection. This
is required for the installation of HDFSFP. The typical value for this port is 6379.
AUTOCREATE_PROTEGRITY_IT_USR: This parameter determines the Protegrity
service account. The service group and service user name specified in the
PROTEGRITY_IT_USR_GROUP and PROTEGRITY_IT_USR parameters, respectively,
are created if this parameter is set to Yes. One of the following values can be
specified, as required:
o Yes: The installer creates a service group PROTEGRITY_IT_USR_GROUP
containing the user PROTEGRITY_IT_USR for running the Protegrity services
on all the nodes in the cluster.
If the service group or service user is already present, then the installer
exits.
If you uninstall the Big Data Protector, then the service group and the service
user are deleted.
o No: The installer does not create a service group
PROTEGRITY_IT_USR_GROUP with the service user PROTEGRITY_IT_USR for
running the Protegrity services on all the nodes in the cluster.
Ensure that a service group containing a service user for running the Protegrity
services has been created, as described in section 3.1.1 Verifying
Prerequisites for Installing Big Data Protector.
PROTEGRITY_IT_USR_GROUP: This service group is required for running the
Protegrity services on all the nodes in the cluster. All the Protegrity installation
directories are owned by this service group.
PROTEGRITY_IT_USR: This service account user is required for running the
Protegrity services on all the nodes in the cluster and is a part of the group
PROTEGRITY_IT_USR_GROUP. All the Protegrity installation directories are owned
by this service user.
HADOOP_NATIVE_DIR: The Hadoop native directory. This parameter needs to be
specified if you are using MapR.
HADOOP_SUPER_USER: The Hadoop super user name. This parameter needs to
be specified if you are using MapR.
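The following is a minimal, illustrative BDP.config fragment, assuming the file uses simple name=value assignments. Every value shown is an example assumption and must be replaced with the values for your environment; the MapR-only parameters (HADOOP_NATIVE_DIR and HADOOP_SUPER_USER) are omitted.
HADOOP_DIR=/usr/hdp/current
PROTEGRITY_DIR=/opt/protegrity
CLUSTERLIST_FILE=/opt/bigdata/clusterlist
INSTALL_DEMO=Yes
HDFSFP=Yes
SPARK_PROTECTOR=Yes
IP_NN=192.0.2.10
PROTEGRITY_CACHE_PORT=6379
AUTOCREATE_PROTEGRITY_IT_USR=Yes
PROTEGRITY_IT_USR_GROUP=ptyitgrp
PROTEGRITY_IT_USR=ptyitusr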
3.1.4 Installing Big Data Protector
To install the Big Data Protector:
1. As a sudoer user, run BdpInstallx.x.x_Linux_<arch>_6.6.5.x.sh from the folder where it is
extracted.
A prompt to confirm or cancel the Big Data Protector installation appears.
2. Type yes to continue with the installation.
The Big Data Protector installation starts.
If you are using a Cloudera or MapR distribution, then the presence of the HDFS connection
is also verified.
A prompt to enter the sudoer password for the ADMINISTRATOR user appears.
3. Enter the sudoer password.
A prompt to enter the ESA host name or IP address appears.
4. Enter the ESA host name or IP address.
A prompt to enter the ESA user name appears.
5. Enter the ESA user name (Security Officer).
The PEP Server Installation wizard starts and a prompt to configure the host as ESA proxy
appears.
6. Depending on the requirements, type Yes or No to configure the host as an ESA proxy.
7. If the ESA proxy is set to Yes, then enter the host password for the required ESA user.
8. When prompted, perform the following steps to download the ESA keys and certificates.
a) Specify the Security Officer user with administrative privileges.
b) Specify the Security Officer password for the ESA certificates and keys.
The installer then installs the Big Data Protector on all the nodes in the cluster.
The status of the installation of the individual components appears, and the log files for all
the required components on all the nodes in the cluster are stored on the Lead node in the
<PROTEGRITY_DIR>/cluster_utils/logs directory.
Verify the installation report, which is generated at
<PROTEGRITY_DIR>/cluster_utils/installation_report.txt, to ensure that the installation of
all the components is successful on all the nodes in the cluster.
Verify the bdp_setup.log file to confirm that the Big Data Protector was installed successfully on
all the nodes in the cluster.
9. Restart the MapReduce (MRv1) or Yarn (MRv2) services on the Hadoop cluster.
The installer installs the following components in the installation folder of the Big Data
Protector:
PEP server in the <PROTEGRITY_DIR>/defiance_dps directory
XCPep2Jni in the <PROTEGRITY_DIR>/defiance_xc directory
JpepLite in the <PROTEGRITY_DIR>/jpeplite directory
MapReduce protector in the <PROTEGRITY_DIR>/pepmapreduce/lib directory
Hive protector in the <PROTEGRITY_DIR>/pephive/lib directory
Pig protector in the <PROTEGRITY_DIR>/peppig/lib directory
HBase protector in the <PROTEGRITY_DIR>/pephbase-protector/lib directory
Impala protector in the <PROTEGRITY_DIR>/pepimpala directory, if you are using a
Cloudera or MapR distribution
HAWQ protector in the <PROTEGRITY_DIR>/pephawq directory, if you are using a
Pivotal distribution
hdfsfp-xxx.jar in the <PROTEGRITY_DIR>/hdfsfp directory, only if the value of the
HDFSFP parameter in the BDP.config file is specified as Yes
pepspark-xxx.jar in the <PROTEGRITY_DIR>/pepspark/lib directory, only if the
value of the SPARK_PROTECTOR parameter in the BDP.config file is specified as Yes
Talend-related files in <PROTEGRITY_DIR>/etl/talend directory
Cluster Utilities in the <PROTEGRITY_DIR>/cluster_utils directory
The following files and directories are present in the
<PROTEGRITY_DIR>/cluster_utils folder:
o BdpInstallx.x.x_Linux_<arch>_6.6.5.x.sh utility to install the Big Data
Protector on any node in the cluster.
For more information about using the
BdpInstallx.x.x_Linux_<arch>_6.6.5.x.sh utility, refer to section 3.2.1
Installing Big Data Protector on New Nodes added to a Hadoop Cluster.
o cluster_cachesrvctl.sh utility for monitoring the status of the Protegrity Cache
on all the nodes in the cluster, only if the value of the HDFSFP parameter in the
BDP.config file is specified as Yes.
o cluster_pepsrvctl.sh utility for managing PEP servers on all nodes in the cluster.
o uninstall.sh utility to uninstall the Big Data Protector from all the nodes in the
cluster.
o node_uninstall.sh to uninstall the Big Data Protector from any nodes in the
cluster.
For more information about using the node_uninstall.sh utility, refer to
section 3.2.2 Uninstalling Big Data Protector from Selective Nodes in the
Hadoop Cluster.
o update_cluster_policy.sh utility for updating PEP servers when a new policy is
deployed.
o BDP.config file
o CLUSTERLIST_FILE, which is a file containing a list of all the nodes, except the
Lead node.
o installation_report.txt file that contains the status of installation of all the
components in the cluster.
o logs directory that contains the consolidated setup logs from all the nodes in
the cluster.
10. Starting with the Big Data Protector, version 6.6.4, the Bulk APIs in the MapReduce
protector will return the detailed error and return codes instead of 0 for failure and 1 for
success.
For more information about the error codes for Big Data Protector, version 6.6.5, refer
to Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11
Appendix: Return Codes.
If the older behavior of the Bulk APIs in the MapReduce protector from the Big Data
Protector, version 6.6.3 or lower, is desired, then perform the following steps to enable
the backward compatibility mode and retain the same error handling capabilities.
a) If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or
higher (Pivotal Hadoop), then append the following entry to the
mapreduce.admin.reduce.child.java.opts property in the mapred-site.xml file.
-Dpty.mr.compatibility=old
b) If you are using CDH, then add the following values to the Yarn Service Mapreduce
Advanced Configuration Snippet (Safety Valve) parameter in the mapred-site.xml
file.
<property>
<name>mapreduce.admin.map.child.java.opts</name>
<value>-Dpty.mr.compatibility=old</value>
</property>
<property>
<name>mapreduce.admin.reduce.child.java.opts</name>
<value>-Dpty.mr.compatibility=old</value>
</property>
11. If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher
(Pivotal Hadoop), and you have installed HDFSFP, then perform the following steps.
a) Ensure that the mapreduce.application.classpath property in the mapred-site.xml file
contains the following entries in the order provided.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
<PROTEGRITY_DIR>/hdfsfp/*
Ensure that the above entries are before all other entries in the
mapreduce.application.classpath property.
b) Ensure that the mapred.min.split.size property in the hive-site.xml file is set to the
following value.
mapred.min.split.size=256000
c) Restart the Yarn service.
d) Restart the MRv2 service.
e) Ensure that the tez.cluster.additional.classpath.prefix property in the tez-site.xml file
contains the following entries in the order provided.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
<PROTEGRITY_DIR>/hdfsfp/*
Ensure that the above entries are before all other entries in the
tez.cluster.additional.classpath.prefix property.
f) Restart the Tez services.
12. If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher
(Pivotal Hadoop), and you have not installed HDFSFP, then perform the following steps.
a) Ensure that the mapreduce.application.classpath property in the mapred-site.xml file
contains the following entries.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
Ensure that the above entry is before all other entries in the
mapreduce.application.classpath property.
b) Ensure that the yarn.application.classpath property in the yarn-site.xml file contains
the following entries.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
Ensure that the above entry is before all other entries in the
yarn.application.classpath property.
c) Restart the Yarn service.
d) Restart the MRv2 service.
e) Ensure that the tez.cluster.additional.classpath.prefix property in the tez-site.xml file
contains the following entries.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
Ensure that the above entry is before all other entries in the
tez.cluster.additional.classpath.prefix property.
f) Restart the Tez services.
13. If HDFSFP is not installed and you need to use the Hive protector, then perform the following
steps.
a) Specify the following value for the hive.exec.pre.hooks property in the hive-site.xml
file.
hive.exec.pre.hooks=com.protegrity.hive.PtyHiveUserPreHook
b) Restart the Hive services to ensure that the updates are propagated to all the nodes
in the cluster.
14. If HDFSFP is installed and you need to use the Hive protector with HDFSFP, then perform the
following steps.
a) Specify the following value for the hive.exec.pre.hooks property in the hive-site.xml
file.
hive.exec.pre.hooks=com.protegrity.hadoop.fileprotector.hive.PtyHivePr
eHook
b) Restart the Hive services to ensure that the updates are propagated to all the nodes
in the cluster.
If you are using Beeline or Hue, then ensure that Protegrity Big Data Protector is installed on the
following machines:
For Beeline: The machines where Hive Metastore, and HiveServer2 are running.
For Hue: The machines where HueServer, Hive Metastore, HiveServer2 are running.
It is recommended to use the Cluster Policy provider to deploy the policies in a multi-node cluster
environment, such as Big Data or Teradata.
If you require the PEP Server service to start automatically after every reboot of the system, then
define the PEP Server service in the startup with the required run levels.
For more information about starting the PEP Server service automatically, refer to Protection
Enforcement Point Servers Installation Guide Release 6.6.5.
3.1.5 Applying Patches
As the functionality of the ESA is extended, it should be updated through patches applied to ESA.
The patches are available as .pty files, which should be loaded with the ESA user interface.
Obtain the ESA_PAP-ALL-64_x86-64_6.6.5.pty patch, or a later patch, from Protegrity. Upload the patch
to the ESA using the Web UI. Then install the patch using the ESA CLI manager.
For more information about applying patches, refer to section 4.4.6.2 Install Patches of Protegrity
Appliances Overview.
3.1.6 Installing the DFSFP Service
Use the Add/Remove Services tool on the ESA to install the DFSFP service.
For more information about installing services, refer to Section 4.4.6 of Protegrity Appliances
Overview.
To install the DFSFP service using the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Administration > Add/Remove Services.
3. Press ENTER.
The root password prompt appears.
4. Enter the root password.
5. Press ENTER.
The Add/Remove Services screen appears.
6. Select Install applications.
7. Press ENTER.
8. Select DFSFP.
9. Press ENTER.
The DFSFP service is installed.
3.1.7 Configuring HDFSFP
If HDFSFP is used, then it should be configured after Big Data Protector is installed. To ensure that
the user is able to access protected data in the Hadoop cluster, HDFSFP is globally configured so that
it can perform checks for access control transparently.
Ensure that you set the value of the
mapreduce.output.fileoutputformat.compress.type property to BLOCK in the mapred-
site.xml file.
3.1.7.1 Configuring HDFSFP for Yarn (MRv2)
To configure Yarn (MRv2) with HDFSFP:
1. Register the Protegrity codec in the Hadoop codec factory configuration. In the
io.compression.codecs property in the core-site.xml file, add the codec
com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec.
2. Modify the value of the mapreduce.output.fileoutputformat.compress property in the
mapred-site.xml file to true.
3. Add the property mapreduce.output.fileoutputformat.compress.codec to the mapred-
site.xml
file and set the value to
com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec.
If the property is already present in the mapred-site.xml file, then ensure that the existing
value of the property is replaced with
com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec.
4. Include the <PROTEGRITY_DIR>/hdfsfp/* path as the first value in the
yarn.application.classpath property in the yarn-site.xml file.
5. Restart the HDFS and Yarn services.
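Expressed directly as configuration, steps 1 through 3 above correspond to entries such as the following in core-site.xml and mapred-site.xml. The comma-separated codec list shown for io.compression.codecs is a placeholder; the Protegrity codec is appended to whatever codecs are already configured.
<!-- core-site.xml -->
<property>
<name>io.compression.codecs</name>
<value>...existing codecs...,com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec</value>
</property>
<!-- mapred-site.xml -->
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec</value>
</property>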
3.1.7.2 Configuring HDFSFP for MapReduce, v1 (MRv1)
A MapReduce job processes large data sets stored in HDFS across the Hadoop cluster. The result of
the MapReduce job is stored in HDFS. The HDFSFP stores protected data in encrypted form in HDFS.
The Map job reads protected data and the Reduce job saves the result in protected form. This is done
by configuring the Protegrity codec at global level for MapReduce jobs.
To configure MRv1 with HDFSFP:
1. Register the Protegrity codec in the Hadoop codec factory configuration. In the
io.compression.codecs property in the core-site.xml file, add the codec
com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec.
2. Modify the value of the mapred.output.compress property in the mapred-site.xml file to true.
3. Modify the value of the mapred.output.compression.codec property in the mapred-site.xml
file to com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec.
4. Restart the HDFS and MapReduce services.
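For MRv1, steps 2 and 3 above correspond to the following mapred-site.xml entries, with the Protegrity codec also registered in io.compression.codecs in core-site.xml as described in step 1.
<!-- mapred-site.xml -->
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.codec</name>
<value>com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec</value>
</property>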
3.1.7.3 Adding a Cluster to the ESA
Before configuring the Cache Refresh Server, ensure that a cluster is added to the ESA.
For more information about adding a cluster to the ESA, refer to section 5.14.1 Adding a Cluster for
Protection.
3.1.7.4 Configuring the Cache Refresh Server
If a cluster is added to the ESA, then the Cache Refresh server periodically validates the cache entries
and takes corrective action, if necessary. This server should always be active.
The Cache Refresh Server periodically validates the ACL entries in Protegrity Cache with the ACL
entries in the ESA.
If a Data store is created using ESA 6.5 SP2 Patch 3 with DFSFPv3 patch installed, then the
Cluster configuration file (clusterconfig.xml), located in the
<PROTEGRITY_DIR>/dfs/dfsadmin/config/ directory, contains the field names RedisPort and
RedisAuth.
If a Data store is created using ESA 6.5 SP2 Patch 4 with DFSFPv8 patch installed, then the
Cluster configuration file (clusterconfig.xml) contains the field names ProtegrityCachePort
and ProtegrityCacheAuth.
If a migration of the ESA 6.5 SP2 Patch 3 with DFSFPv3 patch installed to the ESA 6.5 SP2
Patch 4 with DFSFPv8 patch installed is done, then the Cluster configuration file
(clusterconfig.xml) contains the field name entries RedisPort and RedisAuth for the old Data
stores, and the entries ProtegrityCachePort and ProtegrityCacheAuth for the new Data
stores, created after the migration.
If the ACL entries present in the appliance do not match the ACL entries in Protegrity Cache, then
logs are generated in the ESA. The logs can be viewed from the ESA Web Interface under
Distributed File System File Protector Logs.
The various error codes are explained in Troubleshooting Guide 6.6.5.
To configure the Cache Refresh Server time:
1. Navigate to the path <PROTEGRITY_DIR>/dfs/cacherefresh/data.
2. Open the dfscacherefresh.cfg file.
3. Modify the cacherefreshtime parameter as required based on the following guidelines:
Default value: 30 minutes
Minimum value: 10 minutes
Maximum value: 720 minutes (12 hours)
The Cache Refresh Interval should be entered in minutes.
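As an illustration, assuming the parameter is stored as a simple name=value entry in the dfscacherefresh.cfg file, keeping the default 30-minute interval would look like the following.
cacherefreshtime=30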
To verify if the Cache Refresh Server is running:
1. Login to the ESA Web Interface.
2. Navigate to System Services DFS Cache Refresh.
The Cache Refresh Server should be running.
3. If the Cache Refresh Server is not running, then click the Start button to start the
Cache Refresh Server.
3.1.7.5 Configuring Hive Support in HDFSFP
If Hive is used with HDFSFP, then it should be configured after installing Big Data Protector.
To configure Hive support in HDFSFP:
1. If you are using a Hadoop distribution that has a Management UI, then perform the following
steps.
a) In the hive-site.xml file, set the value of the mapreduce.job.maps property to 1,
using the Management UI.
If the hive-site.xml file does not have any mapreduce.job.maps property, then
perform the following tasks.
a. Add the property with the name mapreduce.job.maps in the hive-site.xml
file.
b. Set the value of the mapreduce.job.maps property to 1.
b) In the hive-site.xml file, add the value
com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook to the
hive.exec.pre.hooks property before any other existing value, using the
Management UI.
If the hive-site.xml file does not have any hive.exec.pre.hooks property, then
perform the following tasks.
a. Add the property with the name hive.exec.pre.hooks in the hive-site.xml
file.
b. Set the value of the hive.exec.pre.hooks property to
com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook.
2. If you are using a Hadoop distribution without a Management UI, then perform the following
steps.
a) Add the following property in the hive-site.xml file on all nodes.
<property>
<name>mapreduce.job.maps</name>
<value>1</value>
</property>
If the property is already present in the hive-site.xml file, then ensure that the value
of the property is set to 1.
b) Add the following property in the hive-site.xml file on all nodes.
<property>
<name>hive.exec.pre.hooks</name>
<value>com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook</value>
</property>
If the property is already present in the hive-site.xml file, then ensure that the value
com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook is before any other
existing value.
For more information about using Hive with HDFSFP, refer to section 13 Appendix: Using Hive with
HDFSFP.
3.1.8 Configuring HBase
If HBase is used, then it should be configured after Big Data Protector is installed.
Ensure that you configure the Protegrity HBase coprocessor on all the region
servers. If the Protegrity HBase coprocessor is not configured in some region
servers, then an inconsistent state might occur, where some records in a table are
protected and some are not protected.
This could potentially lead to data corruption, making it difficult to separate the
protected data from clear text data.
It is recommended to use HBase version 0.98 or above.
If you are using an HBase version lower than 0.98, then you would need a Java
client to perform the protection of data. HBase versions lower than 0.98 do not
support ATTRIBUTES, which controls the MIGRATION and BYPASS_COPROCESSOR
parameters.
To configure HBase:
1. If you are using a Hadoop distribution that has a Management UI, then add the following
value to the HBase coprocessor region classes property in the hbase-site.xml file in all the
respective region server groups, using the Management UI.
com.protegrity.hbase.PTYRegionObserver
If the hbase-site.xml file does not have any HBase coprocessor region classes property, then
perform the following tasks.
a) Add the property with the name hbase.coprocessor.region.classes in the hbase-site.xml file in
all the respective region server groups.
b) Set the following value for the hbase.coprocessor.region.classes property.
com.protegrity.hbase.PTYRegionObserver
If any coprocessors are already defined in the HBase coprocessor region class
property, then ensure that the value of the Protegrity coprocessor is before any
pre-existing coprocessors defined in the hbase-site.xml file.
2. If you are using a Hadoop distribution without a Management UI, then add the following
property in the hbase-site.xml file on all region server nodes.
<property>
<name>hbase.coprocessor.region.classes</name>
<value>com.protegrity.hbase.PTYRegionObserver</value>
</property>
If the property is already present in the hbase-site.xml file, then ensure that the value of the
Protegrity coprocessor region class is before any other coprocessor in the hbase-site.xml file.
3. Restart all HBase services.
3.1.9 Configuring Impala
If Impala is used, then it should be configured after Big Data Protector is installed.
To configure Impala:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Navigate to the <PROTEGRITY_DIR>/pepimpala/sqlscripts/ folder.
This folder contains the Protegrity UDFs for the Impala protector.
3. If you are not using a Kerberos-enabled Hadoop cluster, then execute the createobjects.sql
script to load the Protegrity UDFs for the Impala protector.
impala-shell -i <IP address of any Impala slave node> -f
<PROTEGRITY_DIR>/pepimpala/sqlscripts/createobjects.sql
4. If you are using a Kerberos-enabled Hadoop cluster, then execute the createobjects.sql script
to load the Protegrity UDFs for the Impala protector.
impala-shell -i <IP address of any Impala slave node> -f
<PROTEGRITY_DIR>/pepimpala/sqlscripts/createobjects.sql -k
If the catalogd process is restarted at any point in time, then all the Protegrity
UDFs for the Impala protector should be reloaded using the command in Step 3
or 4, as required.
3.1.10 Configuring HAWQ
If HAWQ is used, then it should be configured after Big Data Protector is installed.
Ensure that you are logged in as the gpadmin user for configuring HAWQ.
To configure HAWQ:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Navigate to the <PROTEGRITY_DIR>/pephawq/sqlscripts/ folder.
This folder contains the Protegrity UDFs for the HAWQ protector.
3. Execute the createobjects.sql script to load the Protegrity UDFs for the HAWQ protector.
psql -h <HAWQ_Master_Hostname> -p 5432 -f
<PROTEGRITY_DIR>/pephawq/sqlscripts/createobjects.sql
where:
HAWQ_Master_Hostname: Hostname or IP Address of the HAWQ Master Node
5432: Port number
3.1.11 Configuring Spark
If Spark is used, then it should be configured after Big Data Protector is installed.
To configure Spark:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to include the following classpath entries, using Hadoop
services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or
Pivotal distributions, depending on the environment.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to include the following
classpath entries.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
If you need to run Hive UDFs with Spark SQL, then perform the following steps; a usage sketch follows the procedure.
To configure Spark SQL:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to include the following classpath entries, using Hadoop
services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or
Pivotal distributions, depending on the environment.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to include the following
classpath entries.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
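After the Spark services are restarted, the Hive UDFs described in section 4.5 can be registered and run from a Spark SQL session in the same way as from Hive. The following is a minimal sketch only, assuming the sample test_data_table and the Token_alpha data element used in the Hive UDF examples in this guide:
create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr';
select ptyProtectStr(val, 'Token_alpha') from test_data_table;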
3.2 Installing or Uninstalling Big Data Protector on Specific
Nodes
This section describes the following procedures:
Installing Big Data Protector on New Nodes added to a Hadoop cluster
Uninstalling Big Data Protector from Selective Nodes in the Hadoop Cluster
3.2.1 Installing Big Data Protector on New Nodes added to a Hadoop
Cluster
If you need to install Big Data Protector on new nodes added to a Hadoop cluster, then use the
BdpInstallx.x.x_Linux_<arch>_6.6.5.x.sh utility in the <PROTEGRITY_DIR>/cluster_utils directory.
Ensure that you install the Big Data Protector as an ADMINISTRATOR user with
full sudoer privileges.
To install Big Data Protector on New Nodes added to a Hadoop Cluster:
1. Login to the Lead Node.
2. Navigate to the <PROTEGRITY_DIR>/cluster_utils directory.
3. Add additional entries for each new node, on which the Big Data Protector needs to be
installed, in the NEW_HOSTS_FILE file.
The new nodes from the NEW_HOSTS_FILE file will be appended to the CLUSTERLIST_FILE.
4. Execute the following command utility to install Big Data Protector on the new nodes.
./BdpInstall1.0.1_Linux_<arch>_6.6.5.X.sh -a <NEW_HOSTS_FILE>
The Protegrity Big Data Protector is installed on the new nodes.
3.2.2 Uninstalling Big Data Protector from Selective Nodes in the
Hadoop Cluster
If you need to uninstall Big Data Protector from selective nodes in the Hadoop cluster, then use the
node_uninstall.sh utility in the <PROTEGRITY_DIR>/cluster_utils directory.
Ensure that you uninstall the Big Data Protector as an ADMINISTRATOR user
with full sudoer privileges.
To uninstall Big Data Protector from Selective Nodes in the Hadoop Cluster:
1. Login to the Lead Node.
2. Navigate to the <PROTEGRITY_DIR>/cluster_utils directory.
3. Create a new hosts file (such as NEW_HOSTS_FILE).
The NEW_HOSTS_FILE file lists the nodes from which the Big Data Protector needs
to be uninstalled.
4. Add the nodes from which the Big Data Protector needs to be uninstalled in the new hosts
file.
5. Execute the following command to remove the Big Data Protector from the nodes that are
listed in the new hosts file.
./node_uninstall.sh -c NEW_HOSTS_FILE
The Big Data Protector is uninstalled from the nodes listed in the new hosts file.
6. Remove the nodes from which the Big Data Protector is uninstalled in Step 5 from the
CLUSTERLIST_FILE file.
3.3 Utilities
This section provides information about the following utilities:
PEP Server Control (cluster_pepsrvctl.sh): Manages the PEP servers across the cluster.
Update Cluster Policy (update_cluster_policy.sh): Updates the configurations of the PEP servers across the cluster.
Protegrity Cache Control (cluster_cachesrvctl.sh): Monitors the status of the Protegrity Cache on all the nodes in the cluster. This utility is available only for HDFSFP.
Recover Utility: Recovers the contents from a protected path. This utility is available only for HDFSFP.
Ensure that you run the utilities with a user (OPERATOR_USER) having sudo
privileges for impersonating the service account (EXECUTOR_USER or
PROTEGRITY_IT_USR, as configured).
3.3.1 PEP Server Control
This utility (cluster_pepsrvctl.sh), in the <PROTEGRITY_DIR>/cluster_utils folder, manages the PEP
server services on all the nodes in the cluster, except the Lead node.
The utility provides the following options:
Start: Starts the PEP servers in the cluster.
Stop: Stops the PEP servers in the cluster.
Restart: Restarts the PEP servers in the cluster.
Status: Reports the status of the PEP servers.
The utility (pepsrvctrl.sh), in the <PROTEGRITY_DIR>/defiance_dps/bin/ folder, manages the PEP
server services on the Lead node.
When you run the PEP Server Control utility, you will be prompted to enter
the OPERATOR_USER password, which is the same across all the nodes in the cluster.
3.3.2 Update Cluster Policy
This utility (update_cluster_policy.sh), in the <PROTEGRITY_DIR>/cluster_utils folder, updates the
configurations of the PEP servers across the cluster.
For example, if you need to make any changes to the PEP server configuration, make the changes
on the Lead node and then propagate the change to all the PEP servers in the cluster using the
update_cluster_policy.sh utility.
Ensure that all the PEP servers in the cluster are stopped before running the
update_cluster_policy.sh utility.
When you run the Update Cluster Policy utility, you will be prompted to
enter the OPERATOR_USER password, which is the same across all the nodes in the
cluster.
3.3.3 Protegrity Cache Control
This utility (cluster_cachesrvctl.sh), in the <PROTEGRITY_DIR>/cluster_utils folder, monitors the
status of the Protegrity Cache on all the nodes in the cluster. This utility prompts for the
OPERATOR_USER password.
The utility provides the following options:
Start: Starts the Protegrity Cache services in the cluster.
Stop: Stops the Protegrity Cache services in the cluster.
Restart: Restarts the Protegrity Cache services in the cluster.
Status: Reports the status of the Protegrity Cache services.
3.3.4 Recover Utility
The Recover utility is available for HDFSFP only. This utility recovers the contents of protected
files of type Text, RC, and Sequence when ACL information is absent or lost. This ensures
that the data is not lost under any circumstances.
Parameters
srcpath: The protected HDFS path containing the data to be unprotected.
destpath: The destination directory to store unprotected data.
Result
If srcpath is a file path, then the Recover utility recovers that file.
If srcpath is a directory path, then the Recover utility recovers all files inside the directory.
Ensure that the user running the Recover utility has unprotect access on the data
element which was used to protect the files in the HDFS path.
Ensure that an ADMINISTRATOR or OPERATOR_USER is running the Recover Utility
and the user has the required read/execute permissions to the
<PROTEGRITY_DIR>/hdfsfp/recover.sh script.
Example
The following two ACLs are created:
1. /user/root/employee
2. /user/ptyitusr/prot/employee
Run the Recover utility on these two paths with /tmp/HDFSFP-recovered/ as the destination local
directory, using commands similar to the sketch below.
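The following invocations are a sketch only, using the recover.sh syntax described later in this section:
<PROTEGRITY_DIR>/hdfsfp/recover.sh srcpath /user/root/employee -destpath /tmp/HDFSFP-recovered/
<PROTEGRITY_DIR>/hdfsfp/recover.sh srcpath /user/ptyitusr/prot/employee -destpath /tmp/HDFSFP-recovered/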
The following would be recovered in the local directory:
1. /tmp/HDFSFP-recovered/user/root/employee - The files and sub-directories present in the
HDFS location /user/root/employee are recovered in cleartext form.
2. /tmp/HDFSFP-recovered/user/ptyitusr/prot/employee - The files and sub-directories present
in the HDFS location /user/ptyitusr/prot/employee are recovered in cleartext form.
To recover the protected data from a Hive warehouse directory to a local file system
directory:
1. Execute the following command to retrieve the protected data from a Hive warehouse
directory.
<PROTEGRITY_DIR>/hdfsfp/recover.sh srcpath <protected HDFS path to be
unprotected> -destpath <destination directory in the local file system>
The cleartext data from the protected HDFS path is stored in the destination directory.
2. If you need to ensure that the existing Hive queries for the table continue to function, then
perform the following steps.
a) Execute the following command to delete the warehouse directory for the table.
hadoop fs -rm -r <hive.metastore.warehouse.dir>/tablename
b) Move the destination directory with the cleartext data to HDFS using the following command.
hadoop fs -put <destination directory in the local file system>/user/hive/warehouse/table_name <hive.metastore.warehouse.dir>/tablename
c) To view the cleartext data in the table, use the following command.
select * from tablename;
3.4 Uninstalling Big Data Protector from a Cluster
This section describes the procedure for uninstalling the Big Data Protector from the cluster.
3.4.1 Verifying the Prerequisites for Uninstalling Big Data Protector
If you are configuring the Big Data Protector with a Kerberos-enabled Hadoop cluster, then ensure
that the HDFS superuser (hdfs) has a valid Kerberos ticket.
3.4.2 Removing the Cluster from the ESA
Before uninstalling Big Data Protector from the cluster, the cluster should be deleted from the ESA.
For more information about deleting the cluster from the ESA, refer to section 5.14.3 Removing a
Cluster.
3.4.3 Uninstalling Big Data Protector from the Cluster
Depending on the requirements, perform the following tasks to uninstall the Big Data Protector from
the cluster.
3.4.3.1 Removing HDFSFP Configuration for Yarn (MRv2)
If HDFSFP is configured for Yarn (MRv2), then the configuration should be removed before
uninstalling Big Data Protector.
To remove the HDFSFP configuration for Yarn (MRv2):
1. Remove the com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec
codec from the io.compression.codecs property in the core-site.xml file.
2. Modify the value of the mapreduce.output.fileoutputformat.compress property in the
mapred-site.xml file to false.
3. Remove the value of the mapreduce.output.fileoutputformat.compress.codec property in
the mapred-site.xml file.
4. Remove the /hdfsfp/* path from the yarn.application.classpath
property in the yarn-site.xml file.
5. Restart the HDFS and Yarn services.
3.4.3.2 Removing HDFSFP Configuration for MapReduce, v1 (MRv1)
If HDFSFP is configured for MapReduce, v1 (MRv1), then the configuration should be removed before
uninstalling Big Data Protector.
To remove the HDFSFP configuration for MRv1:
1. Remove the com.protegrity.hadoop.fileprotector.crypto.codec.PtyCryptoCodec
codec from the io.compression.codecs property in the core-site.xml file.
2. Modify the value of the mapred.output.compress property in the mapred-site.xml file to false.
3. Remove the value of the mapred.output.compression.codec property in the mapred-site.xml
file.
4. Restart the HDFS and MapReduce services.
3.4.3.3 Removing Configuration for Hive Protector if HDFSFP is not Installed
If the Hive protector is used and HDFSFP is not installed, then the configuration should be removed
before uninstalling Big Data Protector.
To remove configuration for Hive protector if HDFSFP is not installed:
1. If you are using a Hadoop distribution with a Management UI, then remove the value
com.protegrity.hive.PtyHiveUserPreHook from the hive.exec.pre.hooks property, from the
hive-site.xml file using the configuration management UI.
2. If you are using a Hadoop distribution without a Management UI, then remove the following
property in the hive-site.xml file from all nodes.
<property>
<name>hive.exec.pre.hooks</name>
<value>com.protegrity.hive.PtyHiveUserPreHook</value>
</property>
3.4.3.4 Removing Configurations for Hive Support in HDFSFP
If Hive is used with HDFSFP, then the configuration should be removed before uninstalling Big Data
Protector.
To remove configurations for Hive support in HDFSFP:
1. If you are using a Hadoop distribution with a Management UI, then perform the following
steps.
a) In the hive-site.xml file, remove the value of the mapreduce.job.maps property,
using the Management UI.
b) In the hive-site.xml file, remove the value
com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook from the
hive.exec.pre.hooks property, using the configuration management UI.
2. If you are using a Hadoop distribution without a Management UI, then perform the following
steps.
a) Remove the following property in the hive-site.xml file on all nodes.
<property>
<name>mapreduce.job.maps</name>
<value>1</value>
</property>
b) Remove the following property in the hive-site.xml file on all nodes.
<property>
<name>hive.exec.pre.hooks</name>
<value>com.protegrity.hadoop.fileprotector.hive.PtyHivePreHook</value>
</property>
3.4.3.5 Removing the Configuration Properties when HDFSFP is not Installed
If you are using HDP, version 2.2 or higher (Hortonworks), or PHD, version 3.0 or higher (Pivotal
Hadoop), and you have not installed HDFSFP, then the configuration should be removed before
uninstalling Big Data Protector.
To remove the configuration properties:
1. Remove the following entries from the mapreduce.application.classpath property in the
mapred-site.xml file.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
2. Remove the following entries from the yarn.application.classpath property in the yarn-
site.xml file.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
3. Restart the Yarn service.
4. Restart the MRv2 service.
5. Remove the following entries from the tez.cluster.additional.classpath.prefix property in the
tez-site.xml file.
<PROTEGRITY_DIR>/pepmapreduce/lib/*
<PROTEGRITY_DIR>/pephive/lib/*
<PROTEGRITY_DIR>/peppig/lib/*
6. Restart the Tez services.
3.4.3.6 Removing HBase Configuration
If HBase is configured, then the configuration should be removed before uninstalling Big Data
Protector.
To remove HBase configuration:
1. If you are using a Hadoop distribution that has a Management UI, then remove the following
HBase coprocessor region classes property value from the hbase-site.xml file in all the
respective region server groups, using the Management UI.
com.protegrity.hbase.PTYRegionObserver
2. If you are using a Hadoop distribution without a Management UI, then remove the following
property in the hbase-site.xml file from all region server nodes.
<property>
<name>hbase.coprocessor.region.classes</name>
<value>com.protegrity.hbase.PTYRegionObserver</value>
</property>
3. Restart all HBase services.
3.4.3.7 Removing the Defined Impala UDFs
If Impala is configured, then the defined Protegrity UDFs for the Impala protector should be
removed before uninstalling Big Data Protector.
To remove the defined Impala UDFs:
If you are not using a Kerberos-enabled Hadoop cluster, then run the following command to remove
the defined Protegrity UDFs for the Impala protector using the dropobjects.sql script.
impala-shell -i <IP address of any Impala slave node> -f
<PROTEGRITY_DIR>/pepimpala/sqlscripts/dropobjects.sql
If you are using a Kerberos-enabled Hadoop cluster, then run the following command to remove the
defined Protegrity UDFs for the Impala protector using the dropobjects.sql script.
impala-shell -i <IP address of any Impala slave node> -f
<PROTEGRITY_DIR>/pepimpala/sqlscripts/dropobjects.sql -k
3.4.3.8 Removing the Defined HAWQ UDFs
If HAWQ is configured, then the defined Protegrity UDFs for the HAWQ protector should be
removed before uninstalling Big Data Protector.
To remove the defined HAWQ UDFs:
Run the following command to remove the defined Protegrity UDFs for the HAWQ protector using the
dropobjects.sql script.
psql -h <HAWQ Master Hostname> -p 5432 -f
<PROTEGRITY_DIR>/pephawq/sqlscripts/dropobjects.sql
3.4.3.9 Removing the Spark Protector Configuration
If the Spark protector is used, then the required configuration settings should be removed before
uninstalling the Big Data Protector.
To remove the Spark protector configuration:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to remove the following classpath entries, using Hadoop
services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or
Pivotal distributions, depending on the environment.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to remove the following
classpath entries.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
If Spark SQL is configured to run Hive UDFs, then the required configuration settings should be
removed before uninstalling the Big Data Protector.
To remove the Spark SQL configuration:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Update the spark-defaults.conf file to remove the following classpath entries, using Hadoop
services, Cloudera Manager for Cloudera distributions, or Ambari Server for Hortonworks or
Pivotal distributions, depending on the environment.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*
3. If HDFSFP is installed, then update the spark-defaults.conf file to remove the following
classpath entries.
spark.driver.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
spark.executor.extraClassPath=<PROTEGRITY_DIR>/pephive/lib/*:<PROTEGRITY_DIR>/pepspark/lib/*:<PROTEGRITY_DIR>/hdfsfp/*
4. Save the spark-defaults.conf file.
5. Deploy the configuration change to all the nodes in the Hadoop cluster.
6. Restart the Spark services.
3.4.3.10 Running the Uninstallation Script
To run the scripts for uninstalling the Big Data Protector on all nodes in the cluster:
1. Login as the sudoer user and navigate to the <PROTEGRITY_DIR>/cluster_utils directory on
the Lead node.
2. Run the following script to stop the PEP servers on all the nodes in the cluster.
./cluster_pepsrvctl.sh
3. Run the uninstall.sh utility.
A prompt to confirm or cancel the Big Data Protector uninstallation appears.
4. Type yes to continue with the uninstallation.
5. When prompted, enter the sudoer password.
The uninstallation script continues with the uninstallation of Big Data Protector.
If you are using a Cloudera or MapR distribution, then the presence of an HDFS connection
and a valid Kerberos ticket is also verified.
The <PROTEGRITY_DIR>/cluster_utils directory continues to exist on the Lead
node.
This directory is retained to perform a cleanup in the event of the uninstallation
failing on some nodes, due to unavoidable reasons, such as host being down.
6. After Big Data Protector is successfully uninstalled from all nodes, manually delete the
<PROTEGRITY_DIR> directory from the Lead node.
7. If the <PROTEGRITY_DIR>/defiance_dps_old directory is present on any of the nodes in the
cluster, then it can be manually deleted from the respective nodes.
8. Restart all Hadoop services.
4 Hadoop Application Protector
4.1 Using the Hadoop Application Protector
Jobs running in the Hadoop cluster store and retrieve data fields, and this data requires
protection when it is at rest. The Hadoop Application Protector gives MapReduce, Hive,
and Pig jobs the ability to protect data while it is being processed and stored. Application
programmers using these tools can include Protegrity software in their jobs to secure data.
For more information about using the protector APIs in various Hadoop applications and samples,
refer to the following sections.
4.2 Prerequisites
Ensure that the following prerequisites are met before using Hadoop Application Protector:
The Big Data Protector is installed and configured in the Hadoop cluster.
The security officer has created the necessary security policy, which defines data elements
and user roles with the appropriate permissions.
For more information about creating security policies, data elements and user roles, refer to
Protection Enforcement Point Servers Installation Guide 6.6.5 and Enterprise Security
Administrator Guide 6.6.5.
The policy is deployed across the cluster.
For more information about the list of all APIs available to Hadoop applications, refer to sections 4.4
MapReduce APIs, 4.5 Hive UDFs, and 4.6 Pig UDFs.
4.3 Samples
To run the samples provided with the Big Data Protector, the pre-packaged policy should be deployed
from the ESA. During installation, specify the INSTALL_DEMO parameter as Yes in the BDP.config
file.
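For example, the corresponding entry in the BDP.config file would resemble the following line (a sketch assuming the key=value format; the remaining installer parameters are not shown):
INSTALL_DEMO=Yes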
The commands in the samples may require Hadoop-super-user permissions.
For more information about the samples, refer to section 11 Appendix: Samples.
4.4 MapReduce APIs
This section describes the MapReduce APIs available for protection and unprotection in the Big Data
Protector to build secure Big Data applications.
The Protegrity MapReduce protector only supports bytes converted from the string data
type.
If int, short, or long format data is directly converted to bytes and passed as input to the
API that supports byte as input and provides byte as output, then data corruption might
occur.
If you are using the Bulk APIs for the MapReduce protector, then the following two modes for error
handling and return codes are available:
Default mode: Starting with the Big Data Protector, version 6.6.4, the Bulk APIs in the
MapReduce protector will return the detailed error and return codes instead of 0 for failure
and 1 for success. In addition, the MapReduce jobs involving Bulk APIs will provide error codes
instead of throwing exceptions.
For more information about the error codes for Big Data Protector, version 6.6.5, refer to
Table 11-2 PEP Log Return Codes and Table 11-3 PEP Result Codes in section 11 Appendix:
Return Codes.
Backward compatibility mode: If you need to continue using the error handling capabilities
provided with Big Data Protector, version 6.6.3 or lower, that is 0 for failure and 1 for success,
then you can set this mode.
4.4.1 openSession()
This method opens a new user session for protect and unprotect operations. It is a good practice to
create one session per user thread.
public synchronized int openSession(String parameter)
Parameters
parameter: An internal API requirement that should be set to 0.
Result
1: If session is successfully created
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
Exception (and Error Codes)
ptyMapRedProtectorException: if session creation fails
4.4.2 closeSession()
This function closes the current open user session. Every instance of ptyMapReduceProtector opens
only one session, and a session ID is not required to close it.
public synchronized int closeSession()
Parameters
None
Result
1: If session is successfully closed
0: If session closure is a failure
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int closeSessionStatus = mapReduceProtector.closeSession();
Exception (and Error Codes)
None
4.4.3 getVersion()
This function returns the current version of the MapReduce protector.
public java.lang.String getVersion()
Parameters
None
Result
This function returns the current version of MapReduce protector.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
String version = mapReduceProtector.getVersion();
int closeSessionStatus = mapReduceProtector.closeSession();
4.4.4 getCurrentKeyId()
This method returns the current Key ID for a data element that was created with the KEY ID
attribute, such as AES-256, AES-128, and so on.
public int getCurrentKeyId(java.lang.String dataElement)
Parameters
dataElement: Name of the data element
Result
This method returns the current Key ID for the data element containing the KEY ID attribute.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int currentKeyId = mapReduceProtector.getCurrentKeyId("ENCRYPTION_DE");
int closeSessionStatus = mapReduceProtector.closeSession();
4.4.5 checkAccess()
This method checks the access of the user for the specified data element.
public boolean checkAccess(java.lang.String dataElement, byte bAccessType)
Parameters
dataElement: Name of the data element
bAccessType: Type of the access of the user for the data element.
The following are the different values for the bAccessType variable:
DELETE 0x01
PROTECT 0x02
REPROTECT 0x04
UNPROTECT 0x08
CREATE 0x10
MANAGE 0x20
Result
1: If the user has access to the data element
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte bAccessType = 0x02;
boolean isAccess = mapReduceProtector.checkAccess("DE_PROTECT" , bAccessType );
int closeSessionStatus = mapReduceProtector.closeSession();
4.4.6 getDefaultDataElement()
This method returns the default data element configured in the security policy.
public String getDefaultDataElement(String policyName)
Parameters
policyName: Name of policy configured using Policy management in ESA.
Result
Default data element name configured in a given policy.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
String defaultDataElement = mapReduceProtector.getDefaultDataElement("my_policy");
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to return default data element name
4.4.7 protect()
Protects the data provided as a byte array. The type of protection applied is defined by dataElement.
public byte[] protect(String dataElement, byte[] data)
Parameters
dataElement: Name of the data element to be protected
data: Byte array of data to be protected
The Protegrity MapReduce protector only supports bytes converted from the string data
type.
If int, short, or long format data is directly converted to bytes and passed as input to the
API that supports byte as input and provides byte as output, then data corruption might
occur.
If you are using the Protect API which accepts byte as input and provides byte as output,
then ensure that when unprotecting the data, the Unprotect API, with byte as input and
byte as output is utilized. In addition, ensure that the byte data being provided as input
to the Protect API has been converted from a string data type only.
Result
Byte array of protected data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] bResult = mapReduceProtector.protect(
"DE_PROTECT","protegrity".getBytes());
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to protect data
4.4.8 protect()
Protects the data provided as int. The type of protection applied is defined by dataElement.
public int protect(String dataElement, int data)
Parameters
dataElement: Name of the data element to be protected
data: int to be protected
Result
Protected int data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int bResult = mapReduceProtector.protect(
"DE_PROTECT",1234);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to protect data
4.4.9 protect()
Protects the data provided as long. The type of protection applied is defined by dataElement.
public long protect(String dataElement, long data)
Parameters
dataElement: Name of the data element to be protected
data: long data to be protected
Result
Protected long data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
long bResult = mapReduceProtector.protect(
"DE_PROTECT",123412341234);
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to protect data
4.4.10 unprotect()
This function returns the data in its original form.
public byte[] unprotect(String dataElement, byte[] data)
Parameters
dataElement: Name of data element to be unprotected
data: array of data to be unprotected
The Protegrity MapReduce protector only supports bytes converted from the string data
type.
If int, short, or long format data is directly converted to bytes and passed as input to the
API that supports byte as input and provides byte as output, then data corruption might
occur.
Result
Byte array of unprotected data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] protectedResult = mapReduceProtector.protect( "DE_PROTECT_UNPROTECT",
"protegrity".getBytes() );
byte[] unprotectedResult = mapReduceProtector.unprotect(
"DE_PROTECT_UNPROTECT", protectedResult );
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to unprotect data
4.4.11 unprotect()
This function returns the data in its original form.
public int unprotect(String dataElement, int data)
Parameters
dataElement: Name of data element to be unprotected
data: int to be unprotected
Result
Unprotected int data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int protectedResult = mapReduceProtector.protect( "DE_PROTECT_UNPROTECT",
1234 );
int unprotectedResult = mapReduceProtector.unprotect(
"DE_PROTECT_UNPROTECT", protectedResult );
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to unprotect data
4.4.12 unprotect()
This function returns the data in its original form.
public long unprotect(String dataElement, long data)
Parameters
dataElement: Name of data element to be unprotected
data: long data to be unprotected
Result
Unprotected long data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
long protectedResult = mapReduceProtector.protect( "DE_PROTECT_UNPROTECT",
123412341234 );
long unprotectedResult = mapReduceProtector.unprotect(
"DE_PROTECT_UNPROTECT", protectedResult );
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If unable to unprotect data
4.4.13 bulkProtect()
This is used when a set of data needs to be protected in a bulk operation. It helps to improve
performance.
public byte[][] bulkProtect(String dataElement, List <Integer> errorIndex,
byte[][] inputDataItems)
Parameters
dataElement: Name of data element to be protected
errorIndex: array used to store all error indices encountered while protecting each data entry
in inputDataItems
inputDataItems: Two-dimensional array to store bulk data for protection
Result
Two-dimensional byte array of protected data.
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more
information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP
Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following
values, per entry in the bulk protect operation:
1: The protect operation for the entry is successful.
0: The protect operation for the entry is unsuccessful.
o For more information about the failed entry, view the logs available in ESA
Forensics.
Any other value or garbage return value: The protect operation for the entry is
unsuccessful. For more information about the failed entry, view the logs available in
ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List<Integer> errorIndex = new ArrayList<Integer>();
byte[][] protectData = {"protegrity".getBytes(), "protegrity".getBytes(),
"protegrity".getBytes(), "protegrity".getBytes()};
byte[][] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT",
errorIndex, protectData );
System.out.print("Protected Data: ");
for(int i = 0; i < protectedData.length; i++)
{
//THIS WILL PRINT THE PROTECTED DATA
System.out.print(protectedData[i] == null ? null : new
String(protectedData[i]));
if(i < protectedData.length - 1)
{
System.out.print(",");
}
}
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
//ABOVE CODE WILL PRINT THE ERROR INDEXES
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error is encountered during bulk protection of data
4.4.14 bulkProtect()
This is used when a set of data needs to be protected in a bulk operation. It helps to improve
performance.
public int[] bulkProtect(String dataElement, List <Integer> errorIndex, int[]
inputDataItems)
Parameters
dataElement: Name of data element to be protected
errorIndex: array used to store all error indices encountered while protecting each data entry
in inputDataItems
inputDataItems: array to store bulk int data for protection
Result
int array of protected data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more
information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP
Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following
values, per entry in the bulk protect operation:
1: The protect operation for the entry is successful.
0: The protect operation for the entry is unsuccessful.
o For more information about the failed entry, view the logs available in ESA
Forensics.
Any other value or garbage return value: The protect operation for the entry is
unsuccessful. For more information about the failed entry, view the logs available in
ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List<Integer> errorIndex = new ArrayList<Integer>();
int[] protectData = {1234, 5678, 9012, 3456};
int[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT",
errorIndex, protectData );
//CHECK THE ERROR INDEXES FOR ERRORS
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
//ABOVE CODE WILL ONLY PRINT THE ERROR INDEXES
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error is encountered during bulk protection of data
4.4.15 bulkProtect()
This is used when a set of data needs to be protected in a bulk operation. It helps to improve
performance.
public long[] bulkProtect(String dataElement, List <Integer> errorIndex, long[]
inputDataItems)
Parameters
dataElement: Name of data element to be protected
errorIndex: array used to store all error indices encountered while protecting each data entry
in inputDataItems
inputDataItems: array to store bulk long data for protection
Result
Long array of protected data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For more
information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3 PEP
Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following
values, per entry in the bulk protect operation:
1: The protect operation for the entry is successful.
0: The protect operation for the entry is unsuccessful.
o For more information about the failed entry, view the logs available in ESA
Forensics.
Any other value or garbage return value: The protect operation for the entry is
unsuccessful. For more information about the failed entry, view the logs available in
ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List<Integer> errorIndex = new ArrayList<Integer>();
long[] protectData = {123412341234, 567856785678, 901290129012,
345634563456};
long[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT",
errorIndex, protectData );
//CHECK THE ERROR INDEXES FOR ERRORS
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
//ABOVE CODE WILL ONLY PRINT THE ERROR INDEXES
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error is encountered during bulk protection of data
4.4.16 bulkUnprotect()
This method unprotects in bulk the inputDataItems with the required data element.
public byte[][] bulkUnprotect(String dataElement, List<Integer> errorIndex,
byte[][] inputDataItems)
Parameters
String dataElement: Name of data element to be unprotected
List<Integer> errorIndex: list used to store the error indices encountered while unprotecting
each data entry in inputDataItems
byte[][] inputDataItems: two-dimensional array storing the bulk data to be unprotected
Result
Two-dimensional byte array of unprotected data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For
more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3
PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following
values, per entry in the bulk unprotect operation:
1: The unprotect operation for the entry is successful.
0: The unprotect operation for the entry is unsuccessful.
o For more information about the failed entry, view the logs available in ESA
Forensics.
Any other value or garbage return value: The unprotect operation for the entry is
unsuccessful. For more information about the failed entry, view the logs available in
ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List<Integer> errorIndex = new ArrayList<Integer>();
byte[][] protectData = {"protegrity".getBytes(), "protegrity".getBytes(),
"protegrity".getBytes(), "protegrity".getBytes()};
byte[][] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT",
errorIndex, protectData );
//THIS WILL PRINT THE PROTECTED DATA
System.out.print("Protected Data: ");
for(int i = 0; i < protectedData.length; i++)
{
System.out.print(protectedData[i] == null ? null : new
String(protectedData[i]));
if(i < protectedData.length - 1)
{
System.out.print(",");
}
}
//THIS WILL PRINT THE ERROR INDEX FOR PROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
byte[][] unprotectedData = mapReduceProtector.bulkUnprotect( "DE_PROTECT",
errorIndex, protectedData );
//THIS WILL PRINT THE UNPROTECTED DATA
System.out.print("UnProtected Data: ");
for(int i = 0; i < unprotectedData.length; i++)
{
System.out.print(unprotectedData[i] == null ? null : new
String(unprotectedData[i]));
if(i < unprotectedData.length - 1)
{
System.out.print(",");
}
}
//THIS WILL PRINT THE ERROR INDEX FOR UNPROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors when unprotecting data
4.4.17 bulkUnprotect()
This method unprotects in bulk the inputDataItems with the required data element.
public int[] bulkUnprotect(String dataElement, List<Integer> errorIndex, int[]
inputDataItems)
Parameters
String dataElement: Name of data element to be unprotected
List<Integer> errorIndex: list used to store the error indices encountered while unprotecting
each data entry in inputDataItems
int[] inputDataItems: int array to be unprotected
Result
unprotected int array data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For
more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3
PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following
values, per entry in the bulk unprotect operation:
1: The unprotect operation for the entry is successful.
0: The unprotect operation for the entry is unsuccessful.
o For more information about the failed entry, view the logs available in ESA
Forensics.
Any other value or garbage return value: The unprotect operation for the entry is
unsuccessful. For more information about the failed entry, view the logs available in
ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List<Integer> errorIndex = new ArrayList<Integer>();
int[] protectData = {1234, 5678,9012,3456 };
int[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT",
errorIndex, protectData );
//THIS WILL PRINT THE ERROR INDEX FOR PROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
int[] unprotectedData = mapReduceProtector.bulkUnprotect( "DE_PROTECT",
errorIndex, protectedData );
//THIS WILL PRINT THE ERROR INDEX FOR UNPROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors when unprotecting data
4.4.18 bulkUnprotect()
This method unprotects in bulk the inputDataItems with the required data element.
public long[] bulkUnprotect(String dataElement, List<Integer> errorIndex, long[]
inputDataItems)
Parameters
String dataElement: Name of data element to be unprotected
List<Integer> errorIndex: list used to store the error indices encountered while unprotecting
each data entry in inputDataItems
long[] inputDataItems: long array to be unprotected
Result
Unprotected long array data
If the Backward Compatibility mode is not set, then the appropriate error code appears. For
more information about the error codes, refer to Table 11-2 PEP Log Return Codes and Table 11-3
PEP Result Codes in section 11 Appendix: Return Codes.
If the Backward Compatibility mode is set, then the Error Index includes one of the following
values, per entry in the bulk unprotect operation:
1: The unprotect operation for the entry is successful.
0: The unprotect operation for the entry is unsuccessful.
o For more information about the failed entry, view the logs available in ESA
Forensics.
Any other value or garbage return value: The unprotect operation for the entry is
unsuccessful. For more information about the failed entry, view the logs available in
ESA Forensics.
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
List<Integer> errorIndex = new ArrayList<Integer>();
long[] protectData = { 123412341234, 567856785678,
901290129012, 345634563456 };
long[] protectedData = mapReduceProtector.bulkProtect( "DE_PROTECT",
errorIndex, protectData );
//THIS WILL PRINT THE ERROR INDEX FOR PROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
long[] unprotectedData = mapReduceProtector.bulkUnprotect( "DE_PROTECT",
errorIndex, protectedData );
//THIS WILL PRINT THE ERROR INDEX FOR UNPROTECT OPERATION
System.out.println("");
System.out.print("Error Index: ");
for(int i = 0; i < errorIndex.size(); i++)
{
System.out.print(errorIndex.get( i ));
if(i < errorIndex.size() - 1)
{
System.out.print(",");
}
}
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors when unprotecting data
4.4.19 reprotect()
Data that has been protected earlier is protected again with a separate data element.
public byte[] reprotect(String oldDataElement, String newDataElement, byte[] data)
Parameters
String oldDataElement: Name of data element with which data was protected earlier
String newDataElement: Name of new data element with which data is reprotected
byte[] data: byte array of data to be reprotected
Result
Byte array of reprotected data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] protectedResult = mapReduceProtector.protect( "DE_PROTECT_1",
"protegrity".getBytes() );
byte[] reprotectedResult = mapReduceProtector.reprotect( "DE_PROTECT_1",
"DE_PROTECT_2", protectedResult );
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors while reprotecting data
4.4.20 reprotect()
Data that has been protected earlier is protected again with a separate data element.
public int reprotect(String oldDataElement, String newDataElement, int data)
Parameters
String oldDataElement: Name of data element with which data was protected earlier
String newDataElement: Name of new data element with which data is reprotected
int data: int data to be reprotected
Result
Reprotected int data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
int protectedResult = mapReduceProtector.protect( "DE_PROTECT_1",
1234 );
int reprotectedResult = mapReduceProtector.reprotect( "DE_PROTECT_1",
"DE_PROTECT_2", protectedResult );
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors while reprotecting data
4.4.21 reprotect()
Data that has been protected earlier is protected again with a separate data element.
public long reprotect(String oldDataElement, String newDataElement, long data)
Parameters
String oldDataElement: Name of data element with which data was protected earlier
String newDataElement: Name of new data element with which data is reprotected
long data: long data to be reprotected
Result
Reprotected long data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
long protectedResult = mapReduceProtector.protect( "DE_PROTECT_1",
123412341234 );
long reprotectedResult = mapReduceProtector.reprotect( "DE_PROTECT_1",
"DE_PROTECT_2", protectedResult );
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: For errors while reprotecting data
4.4.22 hmac()
This method performs data hashing using the HMAC operation on a single data item with a data
element that is associated with HMAC. It returns the HMAC value of the given data with the given
data element.
public byte[] hmac(String dataElement, byte[] data)
Parameters
String dataElement: Name of data element for HMAC
byte[] data: array of data for HMAC
Result
Byte array of HMAC data
Example
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
int openSessionStatus = mapReduceProtector.openSession("0");
byte[] bResult = mapReduceProtector.hmac( "DE_HMAC",
"protegrity".getBytes() );
int closeSessionStatus = mapReduceProtector.closeSession();
Exception
ptyMapRedProtectorException: If an error occurs while doing HMAC
4.5 Hive UDFs
This section describes all Hive User Defined Functions (UDFs) that are available for protection and
unprotection in Big Data Protector to build secure Big Data applications.
If you are using Ranger or Sentry, then ensure that your policy provides create access
permissions to the required UDFs.
4.5.1 ptyGetVersion()
This UDF returns the current version of PEP.
ptyGetVersion()
Parameters
None
Result
This UDF returns the current version of PEP.
Example
create temporary function ptyGetVersion AS 'com.protegrity.hive.udf.ptyGetVersion';
drop table if exists test_data_table;
create table test_data_table(val string) row format delimited fields terminated by ','
stored as textfile;
load data local inpath 'test_data.csv' OVERWRITE INTO TABLE test_data_table;
select ptyGetVersion() from test_data_table;
4.5.2 ptyWhoAmI()
This UDF returns the currently logged-in user.
ptyWhoAmI()
Parameters
None
Result
This UDF returns the currently logged-in user.
Example
create temporary function ptyWhoAmI AS 'com.protegrity.hive.udf.ptyWhoAmI';
drop table if exists test_data_table;
create table test_data_table(val string) row format delimited fields terminated by ','
stored as textfile;
load data local inpath 'test_data.csv' OVERWRITE INTO TABLE test_data_table;
select ptyWhoAmI() from test_data_table;
4.5.3 ptyProtectStr()
This UDF protects string values.
ptyProtectStr(String input, String dataElement)
Parameters
String input: String value to protect
String dataElement: Name of data element to protect string value
Result
This UDF returns protected string value.
Example
create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val string) row format delimited fields terminated by ','
stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
select ptyProtectStr(val, 'Token_alpha') from test_data_table;
4.5.4 ptyUnprotectStr()
This UDF unprotects the existing protected string value.
ptyUnprotectStr(String input, String dataElement)
Parameters
String input: Protected string value to unprotect
String dataElement: Name of data element to unprotect string value
Result
This UDF returns unprotected string value.
Example
create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr';
create temporary function ptyUnprotectStr AS 'com.protegrity.hive.udf.ptyUnprotectStr';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val string) row format delimited fields terminated by ','
stored as textfile;
create table protected_data_table(protectedValue string) row format delimited fields
terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
insert overwrite table protected_data_table select ptyProtectStr(val, 'Token_alpha') from
test_data_table;
select ptyUnprotectStr(protectedValue, 'Token_alpha') from protected_data_table;
4.5.5 ptyReprotect()
This UDF reprotects string format protected data, which was earlier protected using the ptyProtectStr
UDF, with a different data element.
ptyReprotect(String input, String oldDataElement, String newDataElement)
Parameters
String input: String value to reprotect
String oldDataElement: Name of data element used to protect the data earlier
String newDataElement: Name of new data element to reprotect the data
Result
This UDF returns protected string value.
Example
create temporary function ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val string) row format delimited fields terminated by ','
stored as textfile;
create table test_protected_data_table(val string) row format delimited fields terminated
by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectStr(val, 'Token_alpha')
from test_data_table;
create table test_reprotected_data_table(val string) row format delimited fields
terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val,
'Token_alpha', 'new_Token_alpha') from test_protected_data_table;
4.5.6 ptyProtectUnicode()
This UDF protects string (Unicode) values.
ptyProtectUnicode(String input, String dataElement)
Parameters
String input: String (Unicode) value to protect
String dataElement: Name of data element to protect string (Unicode) value
This UDF should be used only if you need to tokenize Unicode data in Hive, and
migrate the tokenized data from Hive to a Teradata database and detokenize the
data using the Protegrity Database Protector.
Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data to a Teradata database, refer to
section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns protected string value.
Example
create temporary function ptyProtectUnicode AS
'com.protegrity.hive.udf.ptyProtectUnicode';
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
select ptyProtectUnicode(val, 'Token_unicode') from temp_table;
4.5.7 ptyUnprotectUnicode()
This UDF unprotects the existing protected string value.
ptyUnprotectUnicode(String input, String dataElement)
Parameters
String input: Protected string value to unprotect
String dataElement: Name of data element to unprotect string value
This UDF should be used only if you need to tokenize Unicode data in Teradata using the
Protegrity Database Protector, migrate the tokenized data from a Teradata database to Hive, and
detokenize the data using the Protegrity Big Data Protector for Hive.
Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data from a Teradata database, refer to
section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns unprotected string (Unicode) value.
Example
create temporary function ptyProtectUnicode AS
'com.protegrity.hive.udf.ptyProtectUnicode';
create temporary function ptyUnprotectUnicode AS
'com.protegrity.hive.udf.ptyUnprotectUnicode';
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table protected_data_table(protectedValue string) row format delimited fields
terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table protected_data_table select ptyProtectUnicode(val,
'Token_unicode') from temp_table;
select ptyUnprotectUnicode(protectedValue, 'Token_unicode') from protected_data_table;
4.5.8 ptyReprotectUnicode()
This UDF reprotects string format protected data, which was protected earlier using the
ptyProtectUnicode UDF, with a different data element.
ptyReprotectUnicode(String input, String oldDataElement, String newDataElement)
Parameters
String input: String (Unicode) value to reprotect
String oldDataElement: Name of data element used to protect the data earlier
String newDataElement: Name of new data element to reprotect the data
This UDF should be used only if you need to tokenize Unicode data in Hive, and
migrate the tokenized data from Hive to a Teradata database and detokenize the
data using the Protegrity Database Protector.
Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data to a Teradata database, refer to
section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns protected string value.
Example
create temporary function ptyProtectUnicode AS
'com.protegrity.hive.udf.ptyProtectUnicode';
create temporary function ptyReprotectUnicode AS
'com.protegrity.hive.udf.ptyReprotectUnicode';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val string) row format delimited fields terminated by ','
stored as textfile;
create table test_protected_data_table(val string) row format delimited fields terminated
by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select trim(val) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectUnicode(val,
'Unicode_Token') from test_data_table;
create table test_reprotected_data_table(val string) row format delimited fields
terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotectUnicode(val,
'Unicode_Token', 'new_Unicode_Token') from test_protected_data_table;
4.5.9 ptyProtectInt()
This UDF protects integer values.
ptyProtectInt(int input, String dataElement)
Parameters
int input: Integer value to protect
String dataElement: Name of data element to protect integer value
Result
This UDF returns protected integer value.
Example
create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val int) row format delimited fields terminated by ','
stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as int) from temp_table;
select ptyProtectInt(val, 'Token_numeric') from test_data_table;
4.5.10 ptyUnprotectInt()
This UDF unprotects the existing protected integer value.
ptyUnprotectInt(int input, String dataElement)
Parameters
int input: Protected integer value to unprotect
String dataElement: Name of data element to unprotect integer value
Result
This UDF returns unprotected integer value.
Example
create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt';
create temporary function ptyUnprotectInt AS 'com.protegrity.hive.udf.ptyUnprotectInt';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val int) row format delimited fields terminated by ','
stored as textfile;
create table protected_data_table(protectedValue int) row format delimited fields
terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as int) from temp_table;
insert overwrite table protected_data_table select ptyProtectInt(val, 'Token_numeric')
from test_data_table;
select ptyUnprotectInt(protectedValue, 'Token_numeric') from protected_data_table;
4.5.11 ptyReprotect()
This UDF reprotects integer format protected data with a different data element.
ptyReprotect(int input, String oldDataElement, String newDataElement)
Parameters
int input: Integer value to reprotect
String oldDataElement: Name of data element used to protect the data earlier
String newDataElement: Name of new data element to reprotect the data
Result
This UDF returns protected integer value.
Example
create temporary function ptyProtectInt AS 'com.protegrity.hive.udf.ptyProtectInt';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val int) row format delimited fields terminated by ',' stored as
textfile;
create table test_data_table(val int) row format delimited fields terminated by ','
stored as textfile;
create table test_protected_data_table(val int) row format delimited fields terminated by
',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as int) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectInt(val,
'Token_Integer') from test_data_table;
create table test_reprotected_data_table(val int) row format delimited fields terminated
by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val,
'Token_Integer', 'new_Token_Integer') from test_protected_data_table;
4.5.12 ptyProtectFloat()
This UDF protects float value.
ptyProtectFloat(Float input, String dataElement)
Parameters
Float input: Float value to protect
String dataElement: Name of data element to protect float value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns protected float value.
Example
create temporary function ptyProtectFloat as 'com.protegrity.hive.udf.ptyProtectFloat';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val float) row format delimited fields terminated by ','
stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
select ptyProtectFloat(val, 'FLOAT_DE') from test_data_table;
4.5.13 ptyUnprotectFloat()
This UDF unprotects protected float value.
ptyUnprotectFloat(Float input, String dataElement)
Parameters
Float input: Protected float value to unprotect
String dataElement: Name of data element to unprotect float value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns unprotected float value.
Example
create temporary function ptyProtectFloat as 'com.protegrity.hive.udf.ptyProtectFloat';
create temporary function ptyUnprotectFloat as
'com.protegrity.hive.udf.ptyUnprotectFloat';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val float) row format delimited fields terminated by ','
stored as textfile;
create table protected_data_table(protectedValue float) row format delimited fields
terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
insert overwrite table protected_data_table select ptyProtectFloat(val, 'FLOAT_DE') from
test_data_table;
select ptyUnprotectFloat(protectedValue, 'FLOAT_DE') from protected_data_table;
4.5.14 ptyReprotect()
This UDF reprotects float format protected data with a different data element.
ptyReprotect(Float input, String oldDataElement, String newDataElement)
Parameters
Float input: Float value to reprotect
String oldDataElement: Name of data element used to protect the data earlier
String newDataElement: Name of new data element to reprotect the data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns protected float value.
Example
create temporary function ptyProtectFloat AS 'com.protegrity.hive.udf.ptyProtectFloat';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val float) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val float) row format delimited fields terminated by ','
stored as textfile;
create table test_protected_data_table(val float) row format delimited fields terminated
by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as float) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectFloat(val,
'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val float) row format delimited fields
terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val,
'NoEncryption', 'NoEncryption') from test_protected_data_table;
4.5.15 ptyProtectDouble()
This UDF protects double value.
ptyProtectDouble(Double input, String dataElement)
Parameters
Double input: Double value to protect
String dataElement: Name of data element to protect double value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns protected double value.
Example
create temporary function ptyProtectDouble as
'com.protegrity.hive.udf.ptyProtectDouble';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val double) row format delimited fields terminated by ','
stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
select ptyProtectDouble(val, 'DOUBLE_DE') from test_data_table;
4.5.16 ptyUnprotectDouble()
This UDF unprotects protected double value.
ptyUnprotectDouble(Double input, String dataElement)
Parameters
Double input: Double value to unprotect
String dataElement: Name of data element to unprotect double value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns unprotected double value.
Example
create temporary function ptyProtectDouble as
'com.protegrity.hive.udf.ptyProtectDouble';
create temporary function ptyUnprotectDouble as
'com.protegrity.hive.udf.ptyUnprotectDouble';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val double) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val double) row format delimited fields terminated by ','
stored as textfile;
create table protected_data_table(protectedValue double) row format delimited fields
terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
insert overwrite table protected_data_table select ptyProtectDouble(val, 'DOUBLE_DE')
from test_data_table;
select ptyUnprotectDouble(protectedValue, 'DOUBLE_DE') from protected_data_table;
4.5.17 ptyReprotect()
This UDF reprotects double format protected data with a different data element.
ptyReprotect(Double input, String oldDataElement, String newDataElement)
Parameters
Double input: Double value to reprotect
String oldDataElement: Name of data element used to protect the data earlier
String newDataElement: Name of new data element to reprotect the data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns protected double value.
Example
create temporary function ptyProtectDouble AS 'com.protegrity.hive.udf.ptyProtectDouble';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val double) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val double) row format delimited fields terminated by ','
stored as textfile;
create table test_protected_data_table(val double) row format delimited fields terminated
by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as double) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectDouble(val,
'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val double) row format delimited fields
terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val,
'NoEncryption', 'NoEncryption') from test_protected_data_table;
4.5.18 ptyProtectBigInt()
This UDF protects BigInt value.
ptyProtectBigInt(BigInt input, String dataElement)
Parameters
BigInt input: Value to protect
String dataElement: Name of data element to protect value
Result
This UDF returns protected BigInteger value.
Example
create temporary function ptyProtectBigInt as 'com.protegrity.hive.udf.ptyProtectBigInt';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ','
stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
select ptyProtectBigInt(val, 'BIGINT_DE') from test_data_table;
4.5.19 ptyUnprotectBigInt()
This UDF unprotects protected BigInt value.
ptyUnprotectBigInt(BigInt input, String dataElement)
Parameters
BigInt input: Protected value to unprotect
String dataElement: Name of data element to unprotect value
Result
This UDF returns unprotected BigInteger value.
Example
create temporary function ptyProtectBigInt as 'com.protegrity.hive.udf.ptyProtectBigInt';
create temporary function ptyUnprotectBigInt as
'com.protegrity.hive.udf.ptyUnprotectBigInt';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ','
stored as textfile;
create table protected_data_table(protectedValue bigint) row format delimited fields
terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
insert overwrite table protected_data_table select ptyProtectBigInt(val, 'BIGINT_DE')
from test_data_table;
select ptyUnprotectBigInt(protectedValue, 'BIGINT_DE') from protected_data_table;
4.5.20 ptyReprotect()
This UDF reprotects BigInt format protected data with a different data element.
ptyReprotect(Bigint input, String oldDataElement, String newDataElement)
Parameters
Bigint input: Bigint value to reprotect
String oldDataElement: Name of data element used to protect the data earlier
String newDataElement: Name of new data element to reprotect the data
Result
This UDF returns protected bigint value.
Example
create temporary function ptyProtectBigInt AS 'com.protegrity.hive.udf.ptyProtectBigInt';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val bigint) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val bigint) row format delimited fields terminated by ','
stored as textfile;
create table test_protected_data_table(val bigint) row format delimited fields terminated
by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as bigint) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectBigInt(val,
'Token_BigInteger') from test_data_table;
create table test_reprotected_data_table(val bigint) row format delimited fields
terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val,
'Token_BigInteger', 'new_Token_BigInteger') from test_protected_data_table;
4.5.21 ptyProtectDec()
This UDF protects decimal value.
This API works only with the CDH 4.3 distribution.
ptyProtectDec(Decimal input, String dataElement)
Parameters
Decimal input: Decimal value to protect
String dataElement: Name of data element to protect decimal value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns protected decimal value.
Example
create temporary function ptyProtectDec as 'com.protegrity.hive.udf.ptyProtectDec';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val decimal) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ','
stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
select ptyProtectDec(val, 'BIGDECIMAL_DE') from test_data_table;
4.5.22 ptyUnprotectDec()
This UDF unprotects protected decimal value.
This API works only with the CDH 4.3 distribution.
ptyUnprotectDec(Decimal input, String dataElement)
Parameters
Decimal input: Protected decimal value to unprotect
String dataElement: Name of data element to unprotect decimal value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns unprotected decimal value.
Example
create temporary function ptyProtectDec as 'com.protegrity.hive.udf.ptyProtectDec';
create temporary function ptyUnprotectDec as 'com.protegrity.hive.udf.ptyUnprotectDec';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ','
stored as textfile;
create table protected_data_table(protectedValue decimal) row format delimited fields
terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table protected_data_table select ptyProtectDec(val, 'BIGDECIMAL_DE')
from test_data_table;
select ptyUnprotectDec(protectedValue, 'BIGDECIMAL_DE') from protected_data_table;
4.5.23 ptyProtectHiveDecimal()
This UDF protects decimal value.
This API works only for distributions which include Hive, Version 0.11 and later.
ptyProtectHiveDecimal(Decimal input, String dataElement)
Parameters
Decimal input: Decimal value to protect
String dataElement: Name of data element to protect decimal value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Before the ptyProtectHiveDecimal() UDF is called, Hive rounds off the decimal value
in the table to 18 digits in scale, irrespective of the length of the data.
Result
This UDF returns protected decimal value.
Example
create temporary function ptyProtectHiveDecimal as
'com.protegrity.hive.udf.ptyProtectHiveDecimal';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ','
stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
select ptyProtectHiveDecimal(val, 'BIGDECIMAL_DE') from test_data_table;
4.5.24 ptyUnprotectHiveDecimal()
This UDF unprotects Decimal value.
This API works only for distributions which include Hive, Version 0.11 and later.
ptyUnprotectHiveDecimal(Decimal input, String dataElement)
Parameters
Decimal input: Decimal value to protect
String dataElement: Name of data element to unprotect decimal value
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Before the ptyUnprotectHiveDecimal() UDF is called, Hive rounds off the decimal
value in the table to 18 digits in scale, irrespective of the length of the data.
Result
This UDF returns unprotected decimal value.
Example
create temporary function ptyProtectHiveDecimal as
'com.protegrity.hive.udf.ptyProtectHiveDecimal';
create temporary function ptyUnprotectHiveDecimal as
'com.protegrity.hive.udf.ptyUnprotectHiveDecimal';
drop table if exists test_data_table;
drop table if exists temp_table;
drop table if exists protected_data_table;
create table temp_table(val string) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ','
stored as textfile;
create table protected_data_table(protectedValue decimal) row format delimited fields
terminated by ',' stored as textfile;
load data local inpath 'test_data.csv' overwrite into table temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table protected_data_table select ptyProtectHiveDecimal(val,
'BIGDECIMAL_DE') from test_data_table;
select ptyUnprotectHiveDecimal(protectedValue, 'BIGDECIMAL_DE') from
protected_data_table;
4.5.25 ptyReprotect()
This UDF reprotects decimal format protected data with a different data element.
This API works only for distributions which include Hive, Version 0.11 and later.
ptyReprotect(Decimal input, String oldDataElement, String newDataElement)
Parameters
Decimal input: Decimal value to reprotect
String oldDataElement: Name of data element used to protect the data earlier
String newDataElement: Name of new data element to reprotect the data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
This UDF returns protected decimal value.
Example
create temporary function ptyProtectHiveDecimal AS
'com.protegrity.hive.udf.ptyProtectHiveDecimal';
create temporary function ptyReprotect AS 'com.protegrity.hive.udf.ptyReprotect';
drop table if exists test_data_table;
drop table if exists temp_table;
create table temp_table(val decimal) row format delimited fields terminated by ',' stored
as textfile;
create table test_data_table(val decimal) row format delimited fields terminated by ','
stored as textfile;
create table test_protected_data_table(val decimal) row format delimited fields
terminated by ',' stored as textfile;
LOAD DATA LOCAL INPATH 'test_data.csv' OVERWRITE INTO TABLE temp_table;
insert overwrite table test_data_table select cast(trim(val) as decimal) from temp_table;
insert overwrite table test_protected_data_table select ptyProtectHiveDecimal(val,
'NoEncryption') from test_data_table;
create table test_reprotected_data_table(val decimal) row format delimited fields
terminated by ',' stored as textfile;
insert overwrite table test_reprotected_data_table select ptyReprotect(val,
'NoEncryption', 'NoEncryption') from test_protected_data_table;
4.6 Pig UDFs
This section describes all Pig UDFs that are available for protection and unprotection in Big Data
Protector to build secure Big Data applications.
4.6.1 ptyGetVersion()
This UDF returns the current version of PEP.
ptyGetVersion()
Parameters
None
Result
chararray: Version number
Example
REGISTER /opt/protegrity/Hadoop-protector/lib/peppig-0.10.0.jar;
-- register the PEP Pig jar
DEFINE ptyGetVersion com.protegrity.pig.udf.ptyGetVersion;
-- define the UDF
employees = LOAD 'employee.csv' using PigStorage(',')
AS (eid:chararray, name:chararray, ssn:chararray);
-- load employee.csv from the HDFS path
version = FOREACH employees GENERATE ptyGetVersion();
DUMP version;
4.6.2 ptyWhoAmI()
This UDF returns the current logged in user name.
ptyWhoAmI()
Parameters
None
Result
chararray: User name
Example
REGISTER /opt/protegrity/Hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyWhoAmI com.protegrity.pig.udf.ptyWhoAmI;
employees = LOAD 'employee.csv' using PigStorage(',')
AS (eid:chararray, name:chararray, ssn:chararray);
username = FOREACH employees GENERATE ptyWhoAmI();
DUMP username;
4.6.3 ptyProtectInt()
This UDF returns protected value for integer data.
ptyProtectInt (int data, chararray dataElement)
Parameters
int data: Data to protect
chararray dataElement: Name of data element to use for protection
Result
Protected value for given numeric data
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectInt com.protegrity.pig.udf.ptyProtectInt;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:int, name:chararray,
ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectInt(eid, 'token_integer');
DUMP data_p;
4.6.4 ptyUnprotectInt()
This UDF returns unprotected value for protected integer data.
ptyUnprotectInt (int data, chararray dataElement)
Parameters
int data: Protected data
chararray dataElement: Name of data element to use for unprotection
Result
Unprotected value for given protected integer data
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectInt com.protegrity.pig.udf.ptyProtectInt;
DEFINE ptyUnprotectInt com.protegrity.pig.udf.ptyUnProtectInt;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:int, name:chararray,
ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectInt(eid, 'token_integer') AS eid:int;
data_u = FOREACH data_p GENERATE ptyUnprotectInt(eid, 'token_integer');
DUMP data_u;
4.6.5 ptyProtectStr()
This UDF protects string value.
ptyProtectStr(chararray input, chararray dataElement)
Parameters
chararray input: String value to protect
chararray dataElement: Name of data element to protect string value
Result
chararray
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray,
ssn:chararray);
data_p = FOREACH employees GENERATE ptyProtectStr(name, 'token_alphanumeric');
DUMP data_p;
4.6.6 ptyUnprotectStr()
This UDF unprotects protected string value.
ptyUnprotectStr (chararray input, chararray dataElement)
Parameters
chararray input: Protected string value to unprotect
chararray dataElement: Name of data element to unprotect string value
Result
chararray: Unprotected value
Example
REGISTER /opt/protegrity/hadoop_protector/lib/peppig-0.10.0.jar;
DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
DEFINE ptyUnprotectStr com.protegrity.pig.udf.ptyUnProtectStr;
employees = LOAD 'employee.csv' using PigStorage(',') AS (eid:chararray, name:chararray,
ssn:chararray);
data_p = FOREACH employees
GENERATE ptyProtectStr(name, 'token_alphanumeric') AS name:chararray;
DUMP data_p;
data_u = FOREACH data_p GENERATE ptyUnprotectStr(name, 'token_alphanumeric');
DUMP data_u;
5 HDFS File Protector (HDFSFP)
5.1 Overview of HDFSFP
The files stored in HDFS are plain text files that are governed only by POSIX-style file system
access controls. These files may contain sensitive data, which is vulnerable if exposed to
unauthorized users.
The HDFS File Protector (HDFSFP) helps to transparently protect these files as they are stored in
HDFS and allows only authorized users to access the content of the files.
5.2 Features of HDFSFP
The following are the features of HDFSFP:
Protects and stores files in HDFS and retrieves the protected files in the clear from HDFS, as
per centrally defined security policy and access control.
Stores and retrieves from HDFS transparently for the user, depending upon their access
control rights.
Preserves Hadoop distributed data processing ensuring that protected content is processed
on data nodes independently.
Passes through files that are not addressed by the defined access control transparently,
without any protection or unprotection.
Protects temporary data, such as intermediate files generated by the MapReduce job.
Provides recursive access control for HDFS directories and files. Protects directories, their
subdirectories, and files, as per the defined security policy and access control.
Protects files at rest so that unauthorized users can view only the protected content.
Adds minimum overhead for data processing in HDFS.
Can be accessed using the command shell and Java API.
5.3 Protector Usage
Files stored in HDFS are plain text files. Access controls for HDFS are implemented by using
file-based permissions that follow the UNIX permissions model. These files may contain sensitive
data, making them vulnerable when exposed to unauthorized users. These files should be
transparently protected as they are stored into HDFS, and the content should be exposed only to
authorized users.
The files are stored and retrieved from HDFS using Hadoop ecosystem products, such as file shell
commands, MapReduce jobs, and so on.
Any user or application with write access to protected data at rest in HDFS can delete, update, or
move the protected data. Therefore, although the protected data can be lost, the data is not
compromised, because the user or application cannot access the original data in the clear. Ensure
that the Hadoop administrator assigns file permissions in HDFS cautiously.
5.4 File Recover Utility
The File Recover utility recovers the contents from a protected file.
For more information about the File Recover Utility, refer to section 3.4.3 Recover Utility.
5.5 HDFSFP Commands
Hadoop provides shell commands for modifying and administering HDFS. HDFSFP extends the
modification commands to control access to files and directories in HDFS.
This section describes the commands supported in HDFSFP.
5.5.1 copyFromLocal
This command ingests local data into HDFS.
hadoop ptyfs -copyFromLocal <local path of file to copy> <destination HDFS
directory path>
Result
If the destination directory path is protected and the user executing the command has
permissions to create and protect, then the data is ingested in encrypted form.
If the destination directory path is protected and the user does not have permissions to create
and protect, then the copy operation fails.
If the destination HDFS directory path is not protected, then the data is ingested in clear form.
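For illustration, assuming a hypothetical local file /tmp/customers.csv and a hypothetical protected
HDFS directory /protected/customers with an active ACL entry, the command could be run as follows.
hadoop ptyfs -copyFromLocal /tmp/customers.csv /protected/customers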
5.5.2 put
This command ingests local data into HDFS.
hadoop ptyfs -put <local path of file to copy> <destination HDFS directory
path>
Result
If the destination HDFS directory path is protected and the user executing the command has
permissions to create and protect, then the data is ingested in encrypted form.
If the destination HDFS directory path is protected and the user does not have permissions
to create and protect, then the copy operation fails.
If the destination HDFS directory path is not protected, then the data is ingested in clear form.
5.5.3 copyToLocal
This command is used to copy an HDFS file to a local directory.
hadoop ptyfs -copyToLocal <HDFS file path to copy> <destination local directory>
Result
If the source HDFS file is protected and the user has unprotect permissions, then the file is
copied to the destination directory in clear form.
If the source HDFS file is not protected, then the file is copied to the destination directory.
If the HDFS file is protected and the user does not have unprotect permissions, then the copy
operation fails.
5.5.4 get
This command copies an HDFS file to a local directory.
hadoop ptyfs -get <HDFS file path to copy> <destination local directory>
Result
If the source HDFS file is protected and the user has unprotect permissions, then the file is
copied to the destination directory in clear form.
If the source HDFS file is not protected, then the file is copied to the destination directory.
If the HDFS file is protected and the user does not have unprotect permissions, then the copy
operation fails.
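For illustration, assuming a hypothetical protected HDFS file /protected/customers/customers.csv and
a hypothetical local directory /tmp/extract, the command could be run as follows.
hadoop ptyfs -get /protected/customers/customers.csv /tmp/extract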
5.5.5 cp
This command copies a file from one HDFS directory to another HDFS directory.
hadoop ptyfs -cp <source HDFS file path> <destination HDFS directory path>
Result
If the source HDFS file is protected and the user has unprotect permissions for the source
HDFS file, the destination directory is protected, and the user has permissions to protect and
create on the destination HDFS directory path, then the file gets copied in encrypted form.
If the source HDFS file is protected and the user does not have permissions to unprotect, then
the copy operation fails.
If the destination directory is protected and the user does not have permissions to protect
and create, then the copy operation fails.
If the source HDFS file is unprotected and destination directory is protected and the user has
permissions to protect or create on the destination HDFS directory path, then the file is copied
in encrypted form.
If the source HDFS file is protected and the user has permissions to unprotect for the source
HDFS file and destination HDFS directory path is not protected, then the file is copied in clear
form.
If the source HDFS file and destination HDFS directory path are unprotected, then the
command works similar to the default Hadoop file shell -cp command.
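For illustration, assuming a hypothetical protected source file /protected/customers/customers.csv
and a hypothetical protected destination directory /protected/archive, the command could be run as
follows.
hadoop ptyfs -cp /protected/customers/customers.csv /protected/archive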
5.5.6 mkdir
This command creates a new directory in HDFS.
hadoop ptyfs -mkdir <new HDFS directory path>
Result
If the new directory is protected and the user has permissions to create, then the new
directory is created.
If the new directory is not protected, then this command runs similar to the default HDFS file
shell -mkdir command.
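For illustration, assuming a hypothetical HDFS directory path /protected/reports, the command could
be run as follows.
hadoop ptyfs -mkdir /protected/reports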
5.5.7 mv
This command moves an HDFS file from one HDFS directory to another HDFS directory.
hadoop ptyfs -mv <source HDFS file path> <destination HDFS directory path>
Result
If the source HDFS file is protected and the user has unprotect and delete permissions and
the destination directory is also protected with the user having permissions to protect and
create on the destination HDFS directory path, then the file is moved to the destination
directory in encrypted form.
If the HDFS file is protected and the user does not have unprotect and delete permissions or
the destination directory is protected and the user does not have permissions to protect and
create, then the move operation fails.
If the source HDFS file is unprotected, the destination directory is protected and the user has
permissions to protect and create on the destination HDFS directory path, then the file is
copied in encrypted form.
If the source HDFS file is protected and the user has permissions to unprotect and the
destination HDFS directory path is not protected, then the file is copied in clear form.
If the source HDFS file and destination HDFS directory path are unprotected, then the
command works similar to the default Hadoop file shell -mv command.
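For illustration, assuming a hypothetical protected source file /protected/staging/customers.csv and
a hypothetical protected destination directory /protected/archive, the command could be run as
follows.
hadoop ptyfs -mv /protected/staging/customers.csv /protected/archive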
5.5.8 rm
This command deletes HDFS files.
hadoop ptyfs -rm <HDFS file paths to delete>
Result
If the HDFS file is protected and the user has permissions to delete on the HDFS file path,
then the file is deleted.
If the HDFS file is protected and the user does not have permissions to delete on the HDFS
file path, then the delete operation fails.
If the HDFS file is not protected, then the command works similar to the default Hadoop file
shell -rm command.
5.5.9 rmr
This command deletes an HDFS directory, its subdirectories and files.
hadoop ptyfs -rmr <HDFS directory path to delete>
Result
If the HDFS directory path is protected and the user has permissions to delete on the HDFS
directory path, then the directory and its contents are deleted.
If the HDFS directory path is protected and the user does not have permissions to delete on
the HDFS directory path, then the delete operation fails.
If the HDFS directory path is not protected, then the command works as the default Hadoop
rm recursive (hadoop fs -rmr or hadoop fs -rm -r) command.
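For illustration, assuming a hypothetical protected HDFS directory /protected/archive, the command
could be run as follows.
hadoop ptyfs -rmr /protected/archive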
5.6 Ingesting Files Securely
To ingest files into HDFS securely, use the put and copyFromLocal commands.
For more information, refer to sections 5.5.2 put and 5.5.1 copyFromLocal.
If you need to ingest data into an ACL-protected directory in HDFS using Sqoop, then use the
-D target.output.dir parameter before any tool-specific arguments, as shown in the following command.
sqoop import -D target.output.dir="/tmp/src" --driver com.mysql.jdbc.Driver
--connect "jdbc:mysql://master.localdomain/test" --username root --table test --
target-dir /tmp/src -m 1
In addition, if you need to append data to existing data, then use the --append parameter, as shown
in the following command.
sqoop import -D target.output.dir="/tmp/src" --driver com.mysql.jdbc.Driver
--connect "jdbc:mysql://master.localdomain/test" --username root --table test --
target-dir /tmp/src -m 1 --append
5.7 Extracting Files Securely
To extract files from HDFS securely, use the get and copyToLocal commands.
For more information, refer to sections 5.5.4 get and 5.5.3 copyToLocal.
5.8 HDFSFP Java API
Protegrity provides a Java API for working with files and directories using HDFSFP. The Java API for
HDFSFP provides an alternate means of working with HDFSFP besides the HDFSFP shell commands,
hadoop ptyfs, and enables you to integrate HDFSFP with Java applications.
This section describes the Java API commands supported in HDFSFP.
5.8.1 copy
This command copies a file from one HDFS directory to another HDFS directory.
copy(java.lang.String srcs, java.lang.String dst)
Parameters
srcs: HDFS file path
dst: HDFS file or directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API returns an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException)
if any of the following conditions are met:
Input is null.
The path does not exist.
The user does not have protect and write permissions on the destination path (if the
destination path is protected), does not have unprotect permission on the source path,
or both.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The
Javadoc can be found in <protegrity_base_directory>/protegrity/hdfsfp/doc on the Data Ingestion
Node.
Result
If the source HDFS file is protected and the user has unprotect permission for the source
HDFS file, the destination directory is protected, the ACL entry for the directory is activated,
and the user has permissions to protect and create on the destination HDFS directory path,
then the file gets copied in encrypted form.
If the source HDFS file is protected and the user does not have permission to unprotect, then
the copy operation fails.
If the destination directory is protected and the user does not have permissions to protect
and create, then the copy operation fails.
If the source HDFS file is unprotected and destination directory is protected, the ACL entry
for the directory is activated, and the user has permissions to protect or create on the
destination HDFS directory path, then the file is copied in encrypted form.
If the source HDFS file is protected and the user has permissions to unprotect for the source
HDFS file and destination HDFS directory path is not protected, then the file is copied in clear
form.
5.8.2 copyFromLocal
This command ingests local data into HDFS.
copyFromLocal(java.lang.String[] srcs, java.lang.String dst)
Parameters
srcs: Array of local file paths
dst: HDFS directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API returns an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException)
if any of the following conditions are met:
Input is null.
The path does not exist.
The user does not have protect and write permissions on the destination path if it is protected.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The
Javadoc can be found in <protegrity_base_directory>/protegrity/hdfsfp/doc on the Data Ingestion
Node.
Result
If the destination directory path is protected, the ACL entry for the directory is activated, and
the user executing the command has permissions to create and protect, then the data is
ingested in encrypted form.
If the destination directory path is protected and the user does not have permissions to create
and protect, then the copy operation fails.
If the destination HDFS directory path is not protected, then the data is ingested in clear form.
5.8.3 copyToLocal
This command is used to copy an HDFS file or directory to a local directory.
copyToLocal(java.lang.String srcs, java.lang.String dst)
Parameters
srcs: HDFS file or directory path
dst: Local directory or file path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API returns an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException)
if any of the following conditions are met:
Input is null.
The path does not exist.
The user does not have unprotect and read permissions on the source path if it is protected.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The
Javadoc can be found in <protegrity_base_directory>/protegrity/hdfsfp/doc on the Data Ingestion
Node.
Result
If the source HDFS file is protected, the ACL entry for the directory is activated, and the user
has unprotect permission, then the file is copied to the destination directory in clear form.
If the source HDFS file is not protected, then the file is copied to the destination directory.
If the HDFS file is protected and the user does not have unprotect permissions, then the copy
operation fails.
5.8.4 deleteFile
This command deletes files from HDFS.
deleteFile(java.lang.String srcf, boolean skipTrash)
Parameters
srcf: HDFS file path
skipTrash: Boolean value that determines whether the file should be moved to trash. If true,
then the file is not moved to trash; if false, then the file is moved to trash.
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API returns an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException)
if any of the following conditions are met:
Input is null.
The path does not exist.
The user does not have delete permission to the path.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The
Javadoc can be found in <protegrity_base_directory>/protegrity/hdfsfp/doc on the Data Ingestion
Node.
Result
If the HDFS file is protected and the user has permission to delete on the HDFS file path, then
the file is deleted.
If the HDFS file is protected and the user does not have permission to delete on the HDFS file
path, then the delete operation fails.
5.8.5 deleteDir
This command deletes recursively an HDFS directory, its subdirectories, and files.
deleteDir(java.lang.String srcdir, boolean skipTrash)
Parameters
srcdir: HDFS directory path
skipTrash: Boolean value that determines whether the directory should be moved to trash. If true,
then the directory and its contents are not moved to trash; if false, then the directory is
recursively moved to trash.
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API returns an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException)
if any of the following conditions are met:
Input is null.
The path does not exist.
The user does not have delete permission to the path.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The
Javadoc can be found in <protegrity_base_directory>/protegrity/hdfsfp/doc on the Data Ingestion
Node.
Result
If the HDFS directory path is protected and the user has permission to delete on the HDFS
directory path, then the directory and its contents are deleted.
If the HDFS directory path is protected and the user does not have permission to delete on
the HDFS directory path, then the delete operation fails.
5.8.6 mkdir
This command creates a new directory in HDFS.
mkdir(java.lang.String dir)
Parameters
dir: HDFS directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API returns an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException)
if any of the following conditions are met:
Input is null.
The user does not have write permissions to the path.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The
Javadoc can be found in <protegrity_base_directory>/protegrity/hdfsfp/doc on the Data Ingestion
Node.
Result
If the new directory path exists in ACL or the ACL path for the parent directory path is
activated recursively, and the user has permissions to create, then the new directory with an
activated ACL path is created.
If the new directory path or its parent directory path is not present in ACL recursively, then
the new directory is created without HDFSFP protection.
5.8.7 move
This command moves an HDFS file from one HDFS directory to another HDFS directory.
move(java.lang.String src, java.lang.String dst)
Parameters
src: HDFS file path
dst: HDFS file or directory path
Returns
True: If the operation is successful
Exception: If the operation fails
Exception (and Error Codes)
The API returns an exception (com.protegrity.hadoop.fileprotector.fs.ProtectorException)
if any of the following conditions are met:
Input is null.
The path does not exist.
The user does not have unprotect and read, or protect and write, or create permissions to the
path.
The user does not have protect and write permissions on the destination path (if the
destination path is protected), does not have unprotect permission on the source path,
or both.
For more information on exceptions, refer to the Javadoc provided with the HDFSFP Java API. The
Javadoc can be found in <protegrity_base_directory>/protegrity/hdfsfp/doc on the Data Ingestion
Node.
Result
If the source HDFS file is protected, the ACL entry for the directory is activated, and the user
has unprotect and delete permissions and the destination directory is also protected with the
user having permissions to protect and create on the destination HDFS directory path, then
the file is moved to the destination directory in encrypted form.
If the HDFS file is protected and the user does not have unprotect and delete permissions or
the destination directory is protected and the user does not have permissions to protect and
create, then the move operation fails.
If the source HDFS file is unprotected, the destination directory is protected, the ACL entry
for the directory is activated, and the user has permissions to protect and create on the
destination HDFS directory path, then the file is copied in encrypted form.
If the source HDFS file is protected and the user has permission to unprotect and the
destination HDFS directory path is not protected, then the file is copied in clear form.
5.9 Developing Applications using HDFSFP Java API
This section describes the guidelines to follow when developing applications using the HDFSFP Java
API.
The guidelines described in this section are a sample and assume that
/opt/protegrity is the base installation directory of Big Data Protector. These
guidelines might need to be modified based on your requirements.
5.9.1 Setting up the Development Environment
Ensure that the following steps are completed before you begin to develop applications using the
HDFSFP Java API:
Add the required HDFSFP Java API jar, hdfsfp-x.x.x.jar, to the classpath.
Instantiate the HDFSFP Java API protector using the following statement:
PtyHdfsProtector protector = new PtyHdfsProtector();
After successful instantiation, you are ready to call the HDFSFP Java API functions.
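For illustration only, a minimal sketch of a program using this setup is shown below. It assumes
that PtyHdfsProtector and ProtectorException are imported from the HDFSFP Java API jar
(ProtectorException is documented as com.protegrity.hadoop.fileprotector.fs.ProtectorException;
refer to the Javadoc for the exact package of PtyHdfsProtector), and that /protected/reports is a
hypothetical HDFS directory path.
import com.protegrity.hadoop.fileprotector.fs.ProtectorException;
public class CreateProtectedDir {
    public static void main(String[] args) {
        // Instantiate the HDFSFP Java API protector
        PtyHdfsProtector protector = new PtyHdfsProtector();
        try {
            // Create a hypothetical HDFS directory; succeeds if the user has create permission
            boolean created = protector.mkdir("/protected/reports");
            System.out.println("Directory created: " + created);
        } catch (ProtectorException pe) {
            pe.printStackTrace();
        }
    }
}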
5.9.2 Protecting Data using the Class file
To protect data using the Class file:
1. Compile the Java file to create a Class file with the following command.
javac -cp .:<PROTEGRITY_DIR>/hdfsfp/hdfsfp-x.x.x.jar ProtectData.java -d .
2. Protect data using the Class file with the following command.
hadoop ProtectData
5.9.3 Protecting Data using the JAR file
To protect data using the JAR file:
1. Compile the Java file to create a Class file with the following command.
javac -cp .:<PROTEGRITY_DIR>/hdfsfp/hdfsfp-x.x.x.jar ProtectData.java -d .
2. Create the JAR file from the Class file with the following command.
jar -cvf protectData.jar ProtectData
3. Protect data using the JAR file with the following command.
hadoop jar protectData.jar ProtectData
5.9.4 Sample Program for the HDFSFP Java API
import com.protegrity.hadoop.fileprotector.fs.ProtectorException;
// Note: also import PtyHdfsProtector from the HDFSFP Java API jar; refer to the Javadoc for its package name.
public class ProtectData {
public static PtyHdfsProtector protector = new PtyHdfsProtector();
public void copyFromLocalTest(String[] srcs, String dstf)
{
boolean result;
try {
result = protector.copyFromLocal(srcs, dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void copyToLocalTest(String srcs, String dstf)
{
boolean result;
try {
result = protector.copyToLocal(srcs, dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void copyTest(String srcs, String dstf)
{
boolean result;
try {
result = protector.copy(srcs, dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void mkdirTest(String dir)
{
boolean result;
try {
result = protector.mkdir(dir);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void moveTest(String srcs, String dstf)
{
boolean result;
try {
result = protector.move(srcs,dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void deleteFileTest(String file,boolean skipTrash)
{
boolean result;
try {
result = protector.deleteFile(file, skipTrash);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void deleteDirTest(String dir,boolean skipTrash)
{
boolean result;
try {
result = protector.deleteDir(dir, skipTrash);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public static void main(String[] args) {
ProtectData protect = new ProtectData();
// Ingest Local Data into HDFS
String srcsCFL[] = new String[2];
srcsCFL[0] ="<Local source file location1>";
srcsCFL[1] ="<Local Source file location2>";
String dstfCFL ="<ACL activated HDFS destination directory location>";
protect.copyFromLocalTest(srcsCFL, dstfCFL);
// Extract HDFS file to Local
String srcsCTL= "<ACL activated HDFS source file location>";
String dstfCTL = "<Local destination directory location >";
protect.copyToLocalTest(srcsCTL, dstfCTL);
// Copy File from HDFS to HDFS
String srcsCopy="<ACL activated HDFS source file location>";
String dstfCopy ="<ACL activated HDFS destination directory location>";
protect.copyTest(srcsCopy, dstfCopy);
// Create HDFS Sub-Directory
String dir = "<HDFS directory location>";
protect.mkdirTest(dir);
// Move from HDFS to HDFS
String srcsMove = "<ACL activated HDFS source file location>";
String dstfMove = "<ACL activated HDFS destination directory location>";
protect.moveTest(srcsMove, dstfMove);
// Delete File from HDFS
String fileDelete = "<HDFS file location>";
boolean skipTrashFile = false;
protect.deleteFileTest(fileDelete,skipTrashFile);
// Delete Sub-Directory and Children from HDFS
String dirDelete = "<HDFS directory location>";
boolean skipTrashDir = false;
protect.deleteDirTest(dirDelete,skipTrashDir);
}
}
5.10 Quick Reference Tasks
This section provides a quick reference for the tasks that can be performed by users.
5.10.1 Protecting Existing Data
The dfsadmin utility protects existing data after an ACL is created for the HDFS path. This is a
two-step process: the user first creates new ACL entries and then activates them. After activation,
the paths in the ACL entries are protected automatically.
ACL entries can be activated individually or in bulk. After the ACL entries are activated, the
HDFSFP infrastructure protects the HDFS path in each ACL entry.
While installing HDFSFP, you need to configure the ingestion user in the BDP.config file. The HDFS
administrator must ensure that the ingestion user has full access to the directories that are to be
protected with HDFSFP. This user is the authorized user for protection. Permissions to protect or
create are configured in the security policy. After the dfsadmin utility activates an ACL entry using
the preconfigured ingestion user, the HDFS File Protector protects the ACL path.
For more information about adding and activating an ACL entry, refer to sections 5.15.1 Adding an
ACL Entry for Protecting Files or Folders and 5.15.5 Activating Inactive ACL Entries.
5.10.2 Reprotecting Files
For more information about reprotecting files or folders, refer to section 5.15.3 Reprotecting Files or Folders.
5.11 Sample Demo Use Case
For information about the sample demo use cases, refer to 12 Appendix: HDFSFP Demo.
HDFSFP can audit policy and file or folder activity. Auditing can be configured on a per-policy basis.
An audit event can be generated for the following actions:
Create or update an ACL entry to protect or reprotect a file or folder
Read or write an ACL-encrypted file or folder
Update or delete an ACL entry
The auditing qualifiers include success, failure, and auditing only when the user is audited for the
same action.
5.12 Appliance Components of HDFSFP
This section describes the active components that are shipped with the ESA and required to run HDFSFP.
5.12.1 Dfsdatastore Utility
This utility adds a Hadoop cluster under HDFSFP protection.
5.12.2 Dfsadmin Utility
This utility manages the access control entries for files and folders.
5.13 Access Control Rules for Files and Folders
Rules for files and folders stored or accessed in HDFS are managed by Access Control Lists (ACLs).
The protection of HDFS files and folders is done after the ACL entry has been created. ACLs for
multiple Hadoop clusters can be managed only from the ESA. Protegrity Cache is used to store or
propagate secured ACLs across the clusters.
If you need to add, delete, search, update, or list a cluster, then use the DFS Cluster Management
Utility (dfsdatastore).
If you need to protect, unprotect, reprotect, activate, search, or update ACLs, or get information
about a job, then use the ACL Management Utility (dfsadmin).
For more information about managing access control entries across clusters, refer to sections 5.14
Using the DFS Cluster Management Utility (dfsdatastore) and 5.15 Using the ACL Management Utility
(dfsadmin).
5.14 Using the DFS Cluster Management Utility
(dfsdatastore)
The dfsdatastore utility enables you to manage the configuration of clusters on the ESA. The details
of all options supported by this utility are described in this section.
5.14.1 Adding a Cluster for Protection
To add a cluster for protection using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsdatastore UI appears.
6. Select the option Add.
7. Select Next.
8. Press ENTER.
The dfsdatastore credentials screen appears.
9. Enter the following parameters:
Datastore name - The name for the datastore or cluster.
This name will be used for managing ACLs for the cluster.
Hostname/IP of the Lead node within the cluster - The hostname or IP address of the Lead node of the cluster.
Port number - The Protegrity Cache port which was specified in the BDP.config file during installation.
10. Select OK.
11. Press ENTER.
The cluster with the specified parameters is added.
12. If the DfsCacheRefresh service is already running, then the datastore is added in an
activated state.
If the DfsCacheRefresh service is not running, then the datastore is added in an inactive
state. The datastore can be activated by starting the DfsCacheRefresh service.
If you are using a Big Data Protector version lower than 6.6.3, then on the
dfsdatastore credentials screen, a prompt for the Protegrity Cache password appears.
You need to specify the Protegrity Cache password that was provided during the
installation of the Big Data Protector.
To start the DfsCacheRefresh Service:
1. Login to the ESA Web UI.
2. Navigate to System > Services.
3. Start the DfsCacheRefresh service.
5.14.2 Updating a Cluster
Ensure that you utilize the Update option in the dfsdatastore UI to modify the
parameters of an existing datastore only.
To update a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsdatastore UI appears.
6. Select the option Update.
7. Select Next.
8. Press ENTER.
The dfsdatastore update screen appears.
9. Update the following parameters as required:
Hostname/IP of the Lead node within the cluster - The hostname or IP address of the Lead node of the cluster.
Port number - The Protegrity Cache port which was specified in the BDP.config file during installation.
10. Select OK.
11. Press ENTER.
If you are using a Big Data Protector version lower than 6.6.3, then on the
dfsdatastore credentials screen, a prompt for the Protegrity Cache password appears.
You need to specify the Protegrity Cache password that was provided during the
installation of the Big Data Protector.
The cluster is modified with the required updates.
5.14.3 Removing a Cluster
Ensure that the Cache Refresh Service is running in the ESA Web UI before
removing a cluster.
To remove a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsdatastore UI appears.
6. Select the option Remove.
7. Select Next.
8. Press ENTER.
The dfsdatastore remove screen appears.
9. Enter the following parameter:
Datastore name - The name for the datastore or cluster.
This name will be used for managing ACLs for the cluster.
10. Select OK.
11. Press ENTER.
The required cluster is removed.
5.14.4 Monitoring a Cluster
To monitor a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsdatastore UI appears.
6. Select the option Execute Command.
7. Select Next.
8. Press ENTER.
The dfsdatastore execute command screen appears.
9. Enter the following parameters:
Datastore name - The name of the datastore or cluster.
This name is used for managing ACLs for the cluster.
Command - The command to execute on the datastore. In this release, the only
command supported is TEST. The TEST command is executed on the cluster and
retrieves the statuses of the following servers:
o Cache Refresh Server, running on the ESA
o Cache Monitor Server, running on the Lead node of the cluster
o Distributed Cache Server, running on the Lead and slave nodes of the cluster
10. Select OK.
11. Press ENTER.
The dfsdatastore UI executes the TEST command on the cluster.
If you are using a Big Data Protector version lower than 6.6.3, then the Cluster
Monitoring feature is not supported.
5.14.5 Searching a Cluster
To search for a cluster using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsdatastore UI appears.
6. Select the option Search.
7. Select Next.
8. Press ENTER.
The dfsdatastore search screen appears.
9. Enter the following parameter:
Datastore name - The name of the datastore or cluster.
This name is used for managing ACLs for the cluster.
10. Select OK.
11. Press ENTER.
The dfsdatastore UI searches for the required cluster.
5.14.6 Listing all Clusters
To list all clusters using the dfsdatastore UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS Cluster Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsdatastore UI appears.
6. Select the option List.
7. Select Next.
8. Press ENTER.
A list of all the clusters appears. Each cluster description contains one of the following cluster
statuses:
1: Cluster is in active state
0: Cluster is in inactive state
5.15 Using the ACL Management Utility (dfsadmin)
The dfsadmin utility enables you to manage ACLs for a cluster. Managing ACLs is a two-step process:
creating or modifying ACL entries and then activating them. The protection of file or folder paths
does not take effect until the ACL entries are verified, confirmed, and activated.
Ensure that an unstructured policy is created in the ESA, which is to be linked with
the ACL.
The details of all options supported by this utility are described in this section.
5.15.1 Adding an ACL Entry for Protecting Directories in HDFS
It is recommended to not create ACLs for file paths.
If the ACL for a directory containing a file that already has its own ACL is being unprotected,
then a decryption failure might occur if there is a mismatch between the data elements used to
protect the directory and the file contained in the directory.
To add an ACL entry for protecting files or folders using the dfsadmin UI from the ESA
CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option Protect.
7. Select Next.
8. Press ENTER.
The dfsadmin protection screen appears.
9. Enter the following parameters:
File Path - The directory path to protect.
Data Element Name - The unstructured data element name to protect the HDFS directory path with.
Datastore name - The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
Recursive (yes or no) - Select one of the following options:
o Yes - Protect all files, sub-directories and their files, in the directory path.
o No - Protect the files in the directory path.
10. Select OK.
11. Press ENTER.
The ACL entry required for protecting the directory path is added to the Inactive list. The
ACL entries can be activated by selecting the Activate option.
After the ACL entries are activated, the following actions occur, as required:
If the recursive flag is not set, then all the files inside the directory path are protected.
If the recursive flag is set, then all the files, sub-directories and their files, in the directory path
are protected.
If any MapReduce jobs or HDFS file shell commands are initiated on the ACL paths before the ACLs
are activated, then the jobs or commands will fail. After the ACLs are activated, any new files that
are ingested in the respective ACL directory path are protected.
5.15.2 Updating an ACL Entry
To update an ACL entry using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option Update.
7. Select Next.
8. Press ENTER.
The dfsadmin update screen appears.
9. Update the following parameters as required:
File Path - The directory path to protect.
Datastore name - The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
Recursive (yes or no) - Select one of the following options:
o Yes - Protect all files, sub-directories and their files, in the directory path.
o No - Protect the files in the directory path.
10. Select OK.
11. Press ENTER.
The ACL entry is updated as required.
5.15.3 Reprotecting Files or Folders
To reprotect files or folders using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option Reprotect.
7. Select Next.
8. Press ENTER.
The dfsadmin reprotection screen appears.
9. Enter the following parameters:
File Path - The directory path to protect.
Datastore name - The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
Data Element Name - The data element name to protect the directory path with.
If the user has rotated the data element key and needs to reprotect the data, then
this field is optional.
10. Select OK.
11. Press ENTER.
The files inside the ACL entry are reprotected.
5.15.4 Deleting an ACL Entry to Unprotect Files or Directories
To delete an ACL entry to unprotect files or directories using the dfsadmin UI from the
ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option Unprotect.
7. Select Next.
8. Press ENTER.
The dfsadmin unprotection screen appears.
9. Enter the following parameters as required:
File Path - The directory path which is protected.
Datastore Name - The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
The files inside the ACL entry are unprotected.
5.15.5 Activating Inactive ACL Entries
If you are using a Kerberos-enabled Hadoop cluster, then ensure that the user ptyitusr
has a valid Kerberos ticket and write access permissions on the HDFS path for which the
ACL is being created.
To activate inactive ACL entries using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option Activate.
7. Select Next.
8. Press ENTER.
The dfsadmin activation screen appears.
9. Enter the following parameter as required:
Datastore name - The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
The inactive ACL entries in the Datastore are activated.
If an ACL entry for a directory containing files is activated (for
Protect/Unprotect/Reprotect), then the ownerships and permissions for only the
files contained in the directory are changed.
To avoid this issue, ensure that the user configured in the PROTEGRITY_IT_USR
property in the BDP.config file during the Big Data Protector installation is
added to the HDFS superuser group by running the following command on the
Lead node:
usermod -a -G hdfs <User_configured_in_the_"PROTEGRITY_IT_USR" property>
If the protect or unprotect operation fails on the files or folders that are a part of the ACL entry
being activated, and the message ACL is locked appears, then monitor the beuler.log file for any
exceptions and take the required corrective action.
To monitor the beuler.log file:
1. Login to the Lead node with root permissions.
2. Switch the user to PROTEGRITY_IT_USR, as configured in the BDP.config file.
3. Navigate to the hdfsfp/ptyitusr directory.
4. Monitor the beuler.log file for any exceptions.
5. If any exceptions appear in the beuler.log file, then resolve the exceptions as required.
6. Login to the Lead node and run the beuler.sh script.
The following is a sample beuler.sh script command.
sh beuler.sh -path <ACL_directory_path> -datastore <datastore_name> -activationid <activation_ID> -beulerjobid <beuler_job_ID>
Alternatively, you can restart the DfsCacheRefresh service.
To restart the DfsCacheRefresh Service:
1. Login to the ESA Web UI.
2. Navigate to System > Services.
3. Restart the DfsCacheRefresh service.
5.15.6 Viewing the ACL Activation Job Progress Information in the
Interactive Mode
To view the ACL Activation Job Progress Information in the Interactive mode using the
dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option JobProgressInfo.
7. Select Next.
8. Press ENTER.
The Activation ID screen appears.
9. Enter the Activation ID.
10. Press ENTER.
The filter search criteria screen appears.
11. If you need to specify the filtering criteria, then perform the following steps.
a) Type Y or y.
b) Select one of the following filtering criteria:
o Start Time
o Status
o ACL Path
12. Select Next.
13. Press ENTER.
14. If you do not need to specify the search criteria, then type N or n.
The dfsadmin job progress information screen appears listing all the jobs against the required
Activation ID with the following information:
State: One of the following states of the job:
o Started
o Failed
o In progress
o Completed
o Yet to start
o Failed as Path Does not Exist
Percentage Complete: The percentage completion for the directory encryption
Job Start Time: The time when the directory encryption started
Job End Time: The time when the directory encryption ended
Processed Data: The amount of data that is encrypted
Total Data: The total directory size being encrypted
ACL Path: The directory path being encrypted
5.15.7 Viewing the ACL Activation Job Progress Information in the
Non Interactive Mode
To view the ACL Activation Job Progress Information in the Non Interactive mode using
the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option JobProgressInfo against ActivationId.
The JobProgressInfo Activation ID screen appears.
7. Enter the Activation ID.
8. Select OK.
9. Press ENTER.
The dfsadmin job progress information screen for the required Activation ID appears.
5.15.8 Searching ACL Entries
To search for ACL entries using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option Search.
7. Select Next.
8. Press ENTER.
The dfsadmin search screen appears.
9. Enter the following parameters as required:
File Path - The directory path to protect.
Datastore name - The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
The dfsadmin UI searches for the required ACL entry.
5.15.9 Listing all ACL Entries
To list ACL entries using the dfsadmin UI from the ESA CLI Manager:
1. Login to the ESA CLI Manager.
2. Navigate to Tools > DFS ACL Management Utility.
3. Press ENTER.
The root password screen appears.
4. Enter the root password.
5. Press ENTER.
The dfsadmin UI appears.
6. Select the option List.
7. Select OK.
8. Press ENTER.
The dfsadmin list screen appears.
9. Enter the following parameter as required:
Datastore name - The datastore or cluster name specified while adding it on the ESA using the DFS Cluster Management Utility.
10. Select OK.
11. Press ENTER.
A list of all the ACL entries appears.
If you are using a Big Data Protector version lower than 6.6.3, then the List option
does not show the Activation ID and the Beuler Job ID for the respective ACLs.
5.16 HDFS Codec for Encryption and Decryption
A codec is an algorithm which provides compression and decompression. Hadoop provides a codec
framework to compress blocks of data before storage. The codec compresses data while writing the
blocks and decompresses data while reading the blocks.
A splittable codec is an algorithm that is applied after splitting a file, making it possible to recover
the original data from any part of the split.
The Protegrity HDFS codec is a splittable cryptographic codec. It uses encryption algorithms such as
AES-128, AES-256, and DES. It utilizes the infrastructure of the Protegrity Application Protector for
applying cryptographic support. The protection is governed by the policy deployed by the ESA, as
defined by the Security Officer.
6 HBase
HBase is a database that provides random, real-time read and write access to tables consisting of
rows and columns. HBase is designed to run on commodity servers, to scale automatically as more
servers are added, and to be fault tolerant, as data is divided across the servers in the cluster. HBase
tables are partitioned into multiple regions, and each region stores a range of rows in the table.
Regions contain a datastore in memory and a persistent datastore (HFile). The Name node assigns
multiple regions to a region server. The Name node manages the cluster, and the region servers store
portions of the HBase tables and perform the work on the data.
6.1 Overview of the HBase Protector
The Protegrity HBase protector extends the functionality of the data storage framework and provides
transparent data protection and unprotection using coprocessors, which provide the functionality to
run code directly on region servers. The Protegrity coprocessor for HBase runs on the region servers
and protects the data stored in the servers. All clients which work with HBase are supported.
The data is transparently protected or unprotected, as required, utilizing the coprocessor framework.
6.2 HBase Protector Usage
The Protegrity HBase protector utilizes the get, put, and scan commands and calls the Protegrity
coprocessor for the HBase protector. The coprocessor locates the metadata associated with the
requested column qualifier and the currently logged in user. If a data element is associated with the
column qualifier and the currently logged in user, then the HBase protector processes the data in a
row based on the data elements defined by the security policy deployed in the Big Data Protector.
The Protegrity HBase coprocessor only supports bytes converted from the string data type.
If any other data type is directly converted to bytes and inserted in an HBase table, which
is configured with the Protegrity HBase coprocessor, then data corruption might occur.
6.3 Adding Data Elements and Column Qualifier Mappings
to a New Table
In an HBase table, every column family stores metadata for that family, which contains the column
qualifier and data element mappings.
When a new HBase table is created, users need to add metadata to the column families to define the
mappings between the data elements and column qualifiers.
The following command creates a new HBase table with one column family.
create 'table', { NAME => 'column_family_1', METADATA => {
'DATA_ELEMENT:credit_card'=>'CC_NUMBER','DATA_ELEMENT:name'=>'TOK_CUSTOMER_NAME'
} }
Parameters
table: Name of the table.
column_family_1: Name of the column family.
METADATA: Data associated with the column family.
DATA_ELEMENT: Maps a column qualifier to a data element. In the example, the column qualifiers
credit_card and name correspond to the data elements CC_NUMBER and TOK_CUSTOMER_NAME,
respectively.
6.4 Adding Data Elements and Column Qualifier Mappings
to an Existing Table
Users can add data elements and column qualifiers to an existing HBase table. Users need to alter
the table to add metadata to the column families for defining mappings between the data element
and column qualifier.
The following command adds data elements and column qualifier mappings to a column in an existing
HBase table.
alter 'table', { NAME => 'column_family_1', METADATA =>
{'DATA_ELEMENT:credit_card'=>'CC_NUMBER',
'DATA_ELEMENT:name'=>'TOK_CUSTOMER_NAME' } }
Parameters
table: Name of the table.
column_family_1: Name of the column family.
METADATA: Data associated with the column family.
DATA_ELEMENT: Maps a column qualifier to a data element. In the example, the column qualifiers
credit_card and name correspond to the data elements CC_NUMBER and TOK_CUSTOMER_NAME,
respectively.
6.5 Inserting Protected Data into a Protected Table
Users can ingest protected data into a protected table in HBase using the BYPASS_COPROCESSOR
flag. If the BYPASS_COPROCESSOR flag is set while inserting data in the HBase table, then the
Protegrity coprocessor for HBase is bypassed.
The following command bypasses the Protegrity coprocessor for HBase and ingests protected data
into an HBase table.
put 'table', 'row_2', 'column_family:credit_card', '3603144224586181', {
ATTRIBUTES => {'BYPASS_COPROCESSOR'=>'1'}}
Parameters
table: Name of the table.
column_family:credit_card: The column family and column qualifier in which the protected data is inserted.
ATTRIBUTES: Additional parameters to consider when ingesting the protected data. In the
example, the flag to bypass the Protegrity coprocessor for HBase is set.
6.6 Retrieving Protected Data from a Table
If users need to retrieve protected data from an HBase table, then they need to set the
BYPASS_COPROCESSOR flag when retrieving the data. This is necessary to retain the protected data
as is, since HBase protects and unprotects the data transparently.
The following command bypasses the Protegrity coprocessor for HBase and retrieves protected data
from an HBase table.
scan 'table', { ATTRIBUTES => {'BYPASS_COPROCESSOR'=>'1'}}
Parameters
table: Name of the table.
ATTRIBUTES: Additional parameters to consider when retrieving the protected data. In the
example, the flag to bypass the Protegrity coprocessor for HBase is set.
6.7 Protecting Existing Data
Users should define the mappings between the data elements and column qualifiers in the respective
column families, which are used by the coprocessor to protect or unprotect the data.
The following command protects the existing data in an HBase table by setting the MIGRATION flag.
Data from the table is read, protected, and inserted back into the table.
scan 'table', { ATTRIBUTES => {'MIGRATION'=>'1'}}
Parameters
table: Name of the table.
ATTRIBUTES: Additional parameters to consider when protecting the existing data. In the
example, the MIGRATION flag is set to protect the existing data in the HBase table.
6.8 HBase Commands
HBase provides shell commands to ingest, extract, and display the data in an HBase table.
This section describes the commands supported by HBase.
6.8.1 put
This command ingests the data provided by the user in protected form, using the configured data
elements, into the required row and column of an HBase table. You can use this command to ingest
data into all the columns for the required row of the HBase table.
put '<table_name>','<row_number>', 'column_family:<column_name>', '<data>'
Parameters
table_name: Name of the table.
row_number: Number of the row in the HBase table.
column_family:column_name: Name of the column family and the column in which the data is inserted.
data: The data to ingest; it is stored in protected form.
6.8.2 get
This command displays the protected data from the required row and column of an HBase table in
cleartext form. You can use this command to display the data contained in all the columns of the
required row of the HBase table.
get '<table_name>','<row_number>', 'column_family:<column_name>'
Parameters
table_name: Name of the table.
row_number: Number of the row in the HBase table.
column_family: Name of the column family.
Ensure that the logged in user has the permissions to view the protected data in cleartext
form. If the user does not have the permissions to view the protected data, then only the
protected data appears.
6.8.3 scan
This command displays the data from the HBase table in protected or unprotected form.
View the protected data using the following command.
scan '<table_name>', { ATTRIBUTES => {'BYPASS_COPROCESSOR'=>'1'}}
View the unprotected data using the following command.
scan '<table_name>'
Parameters
table_name: Name of the table.
ATTRIBUTES: Additional parameters to consider when displaying the protected or unprotected
data.
Ensure that the logged in user has the permissions to unprotect the protected data. If the
user does not have the permissions to unprotect the protected data, then only the
protected data appears.
6.9 Ingesting Files Securely
To ingest data into HBase securely, use the put command.
For more information, refer to section 6.8.1 put.
6.10 Extracting Files Securely
To extract data from HBase securely, use the get command.
For more information, refer to section 6.8.2 get.
6.11 Sample Use Cases
For information about the HBase protector sample use cases, refer to section 12.8 Protecting Data
using HBase.
7 Impala
Impala is an MPP SQL query engine for querying the data stored in a cluster. It provides the flexibility
of the SQL format and is capable of running queries on data stored in HDFS and HBase.
This section provides information about the Impala protector, the UDFs provided, and the commands
for protecting and unprotecting data in an Impala table.
7.1 Overview of the Impala Protector
Impala is an MPP SQL query engine for querying the data stored in a cluster. The Protegrity Impala
protector extends the functionality of the Impala query engine and provides UDFs which protect or
unprotect the data as it is stored or retrieved.
7.2 Impala Protector Usage
The Protegrity Impala protector provides UDFs for protecting data using encryption or tokenization,
and unprotecting data by using decryption or detokenization.
Ensure that the /user/impala path exists in HDFS with the Impala supergroup
permissions.
You can verify this with the following command:
# hadoop fs -ls /user
To create the /user/impala path in HDFS with supergroup permissions:
If the /user/impala path does not exist or does not have supergroup permissions, then perform the
following steps.
1. Create the /user/impala directory in HDFS using the following command.
# sudo -u hdfs hadoop fs -mkdir /user/impala
2. Assign Impala supergroup permissions to the /user/impala path using the following
command.
# sudo -u hdfs hadoop fs -chown -R impala:supergroup /user/impala
7.3 Impala UDFs
This section describes all Impala UDFs that are available for protection and unprotection in Big Data
Protector to build secure Big Data applications.
7.3.1 pty_GetVersion()
This UDF returns the PEP version number.
pty_GetVersion()
Parameters
None
Result
This UDF returns the current version of the PEP.
Example
select pty_GetVersion ();
7.3.2 pty_WhoAmI()
This UDF returns the logged in user name.
pty_WhoAmI()
Parameters
None
Result
Text: Returns the logged in user name
Example
select pty_WhoAmI();
7.3.3 pty_GetCurrentKeyId()
This UDF returns the current active key identification number of the encryption type data element.
pty_GetCurrentKeyId(dataElement string)
Parameters
dataElement: Variable specifying the protection method
Result
integer: Returns the current key identification number
Example
select pty_GetCurrentKeyId('enc_3des_kid');
7.3.4 pty_GetKeyId()
This UDF returns the key ID used for each row in a table.
pty_GetKeyId(dataElement string, col string)
Parameters
dataElement: Variable specifying the protection method
col: Column name of the data in the table
Result
integer: Returns the key identification number for the row
Example
select pty_GetKeyId('enc_3des_kid',column_name) from table_name;
7.3.5 pty_StringEnc()
This UDF returns the encrypted value for a column containing String format data.
pty_StringEnc(data string, dataElement string)
Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method
Result
string: Returns a string value
Example
select pty_StringEnc(column_name,'enc_3des') from table_name;
7.3.6 pty_StringDec()
This UDF returns the decrypted value for a column containing String format data.
pty_StringDec(data string, dataElement string)
Parameters
data: Column name of the data to decode in the table
dataElement: Variable specifying the unprotection method
Result
string: Returns a string value
Example
select pty_StringDec(column_name,'enc_3des') from table_name;
7.3.7 pty_StringIns()
This UDF returns the tokenized value for a column containing String format data.
pty_StringIns(data string, dataElement string)
Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method
Result
string: Returns the tokenized string value
Example
select pty_StringIns(column_name, 'TOK_NAME') from table_name;
7.3.8 pty_StringSel()
This UDF returns the detokenized value for a column containing String format data.
pty_StringSel(data string, dataElement string)
Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method
Result
string: Returns the detokenized string value
Example
select pty_StringSel(column_name, 'TOK_NAME') from table_name;
7.3.9 pty_UnicodeStringIns()
This UDF returns the tokenized value for a column containing String (Unicode) format data.
pty_UnicodeStringIns(data string, dataElement string)
Parameters
data: Column name of the string (Unicode) format data to tokenize in the table
dataElement: Name of data element to protect string (Unicode) value
This UDF should be used only if you need to tokenize Unicode data in Impala, and
migrate the tokenized data from Impala to a Teradata database and detokenize the
data using the Protegrity Database Protector.
Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data to a Teradata database, refer to
section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns protected string value.
Example
select pty_UnicodeStringIns(val, 'Token_unicode') from temp_table;
7.3.10 pty_UnicodeStringSel()
This UDF unprotects the existing protected String value.
pty_UnicodeStringSel(data string, dataElement string)
Parameters
data: Column name of the string format data to detokenize in the table
dataElement: Name of the data element to unprotect the string value
This UDF should be used only if you need to tokenize Unicode data in Teradata
using the Protegrity Database Protector, and migrate the tokenized data from a
Teradata database to Impala and detokenize the data using the Protegrity Big Data
Protector for Impala.
Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data from a Teradata database, refer to
section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns detokenized string (Unicode) value.
Example
select pty_UnicodeStringSel(val, 'Token_unicode') from temp_table;
7.3.11 pty_IntegerEnc()
This UDF returns the encrypted value for a column containing Integer format data.
pty_IntegerEnc(data integer, dataElement string)
Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method
Result
string: Returns a string value
Example
select pty_IntegerEnc(column_name,'enc_3des') from table_name;
7.3.12 pty_IntegerDec()
This UDF returns the decrypted value for a column containing Integer format data.
pty_IntegerDec(data string, dataElement string)
Parameters
data: Column name of the data to decode in the table
dataElement: Variable specifying the unprotection method
Result
integer: Returns an integer value
Example
select pty_IntegerDec(column_name,'enc_3des') from table_name;
7.3.13 pty_IntegerIns()
This UDF returns the tokenized value for a column containing Integer format data.
pty_IntegerIns(data integer, dataElement string)
Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method
Result
integer: Returns the tokenized integer value
Example
select pty_IntegerIns(column_name,'integer_de') from table_name;
7.3.14 pty_IntegerSel()
This UDF returns the detokenized value for a column containing Integer format data.
pty_IntegerSel(data integer, dataElement string)
Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method
Result
integer: Returns the detokenized integer value
Example
select pty_IntegerSel(column_name,'integer_de') from table_name;
7.3.15 pty_FloatEnc()
This UDF returns the encrypted value for a column containing Float format data.
pty_FloatEnc(data float, dataElement string)
Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method
Result
string: Returns a string value
Example
select pty_FloatEnc(column_name,'enc_3des') from table_name;
7.3.16 pty_FloatDec()
This UDF returns the decrypted value for a column containing Float format data.
pty_FloatDec(data string, dataElement string)
Parameters
data: Column name of the data to decode in the table
dataElement: Variable specifying the unprotection method
Result
float: Returns a float value
Example
select pty_FloatDec(column_name,'enc_3des') from table_name;
7.3.17 pty_FloatIns()
This UDF returns the tokenized value for a column containing Float format data.
pty_FloatIns(data float, dataElement string)
Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method
Result
float: Returns the tokenized float value
Example
select pty_FloatIns(cast(12.3 as float), 'no_enc');
Ensure that you use the data element with the No Encryption method only. Using
any other data element would return an error mentioning that the operation is not
supported for that data type.
If you need to tokenize the Float column, then load the Float column into a String
column and use the pty_StringIns UDF to tokenize the column.
For more information about the pty_StringIns UDF, refer to section 7.3.7
pty_StringIns().
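One way to apply this approach is to cast the float values to strings and tokenize them with
pty_StringIns, as in the following sketch; the table name table_name, the column float_column,
and the data element TOK_NAME are placeholders that must match your schema and security policy.
select pty_StringIns(cast(float_column as string), 'TOK_NAME') from table_name;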
7.3.18 pty_FloatSel()
This UDF returns the detokenized value for a column containing Float format data.
pty_FloatSel(data float, dataElement string)
Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method
Result
float: Returns the detokenized float value
Example
select pty_FloatSel(tokenized_value, 'no_enc');
Ensure that you use the data element with the No Encryption method only. Using
any other data element would return an error mentioning that the operation is not
supported for that data type.
7.3.19 pty_DoubleEnc()
This UDF returns the encrypted value for a column containing Double format data.
pty_DoubleEnc(data double, dataElement string)
Parameters
data: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method
Result
string: Returns a string
Example
select pty_DoubleEnc(column_name,'enc_3des') from table_name;
7.3.20 pty_DoubleDec()
This UDF returns the decrypted value for a column containing Double format data.
pty_DoubleDec(data string, dataElement string)
Parameters
data: Column name of the data to decode in the table
dataElement: Variable specifying the unprotection method
Result
double: Returns a double value
Example
select pty_DoubleDec(column_name,'enc_3des') from table_name;
7.3.21 pty_DoubleIns()
This UDF returns the tokenized value for a column containing Double format data.
pty_DoubleIns(data double, dataElement string)
Parameters
data: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method
Result
double: Returns a double value
Example
select pty_DoubleIns(cast(1.2 as double), 'no_enc');
Ensure that you use the data element with the No Encryption method only. Using
any other data element would return an error mentioning that the operation is not
supported for that data type.
If you need to tokenize the Double column, then load the Double column into a
String column and use the pty_StringIns UDF to tokenize the column.
For more information about the pty_StringIns UDF, refer to section 7.3.7
pty_StringIns().
7.3.22 pty_DoubleSel()
This UDF returns the detokenized value for a column containing Double format data.
pty_DoubleSel(data double, dataElement string)
Parameters
data: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method
Result
double: Returns the detokenized double value
Example
select pty_DoubleSel(tokenized_value, 'no_enc');
Ensure that you use the data element with the No Encryption method only. Using
any other data element would return an error mentioning that the operation is not
supported for that data type.
7.4 Inserting Data from a File into a Table
To insert data from a file into an Impala table, ensure that the required user permissions for the
directory path in HDFS are assigned for the Impala table.
To prepare the environment for the basic_sample.csv file:
1. Assign permissions to the path where data from the basic_sample.csv file needs to be copied
using the following command:
sudo -u hdfs hadoop fs -chown root:root /tmp/basic_sample/sample/
2. Copy the data from the basic_sample.csv file into HDFS using the following command:
hdfs dfs -put basic_sample.csv /tmp/basic_sample/sample/
3. Verify the presence of the basic_sample.csv file in the HDFS path using the following
command:
hdfs dfs -ls /tmp/basic_sample/sample/
4. Assign permissions for Impala to the path where the basic_sample.csv file is located using
the following command:
sudo -u hdfs hadoop fs -chown impala:supergroup /path/
To populate the table sample_table from the basic_sample.csv file:
The following commands populate the table sample_table with the data from the basic_sample.csv
file.
create table sample_table(colname1 colname1_format, colname2 colname2_format,
colname3 colname3_format)
row format delimited fields terminated by ',';
LOAD DATA INPATH '/tmp/basic_sample/sample/' INTO TABLE sample_table;
Parameters
sample_table: Name of the Impala table created to load the data from the input CSV file from
the required path.
colname1, colname2, colname3: Name of the columns.
colname1_format, colname2_format, colname3_format: The data types contained in the
respective columns. The data types can only be of types STRING, INT, DOUBLE, or FLOAT.
In the example, the row format is delimited using the ',' character, as the input file is comma
separated. If the input file is tab separated, then the row format is delimited using '\t'.
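For example, the following sketch creates the table with three assumed columns (the column names
and types are illustrative only and must match the layout of your basic_sample.csv file) and then
loads the file from the HDFS path used above.
create table sample_table (name STRING, credit_card STRING, amount DOUBLE)
row format delimited fields terminated by ',';
LOAD DATA INPATH '/tmp/basic_sample/sample/' INTO TABLE sample_table;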
7.5 Protecting Existing Data
To protect existing data, users should define the mappings between the columns and their respective
data elements in the data security policy.
The following commands ingest cleartext data from the basic_sample table to the
basic_sample_protected table in protected form using Impala UDFs.
create table basic_sample_protected (colname1 colname1_format, colname2 colname2_format,
colname3 colname3_format);
insert into basic_sample_protected (colname1, colname2, colname3) select
pty_stringins(colname1, dataElement1), pty_stringins(colname2, dataElement2),
pty_stringins(colname3, dataElement3) from basic_sample;
Parameters
basic_sample_protected: Table to store protected data.
colname1, colname2, colname3: Name of the columns.
dataElement1, dataElement2, dataElement3: The data elements corresponding to the columns.
basic_sample: Table containing the original data in cleartext form.
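As a concrete sketch, assume that the basic_sample table has the string columns name, credit_card,
and city, and that the security policy defines data elements named TOK_NAME and CC_NUMBER (all
of these names are assumptions and must be replaced with names from your own schema and policy).
The statements would then look as follows.
create table basic_sample_protected (name STRING, credit_card STRING, city STRING);
insert into basic_sample_protected (name, credit_card, city)
select pty_StringIns(name, 'TOK_NAME'), pty_StringIns(credit_card, 'CC_NUMBER'),
pty_StringIns(city, 'TOK_NAME') from basic_sample;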
7.6 Unprotecting Protected Data
To unprotect protected data, you need to specify the name of the table which contains the protected
data, the table which would store the unprotected data, and the columns and their respective data
elements.
Ensure that the user performing the task has permissions to unprotect the data as required in the
data security policy.
The following commands unprotect the protected data in a table and store the data in cleartext form
in a different table, if the user has the required permissions.
create table table_unprotected (colname1 colname1_format, colname2 colname2_format,
colname3 colname3_format);
insert into table_unprotected (colname1, colname2, colname3) select
pty_stringsel(colname1, dataElement1), pty_stringsel(colname2, dataElement2),
pty_stringsel(colname3, dataElement3) from table_protected;
Parameters
table_unprotected: Table to store unprotected data.
colname1, colname2, colname3: Name of the columns.
dataElement1, dataElement2, dataElement3: The data elements corresponding to the columns.
table_protected: Table containing protected data.
7.7 Retrieving Data from a Table
To retrieve data from a table, the user needs to have access to the table.
The following command displays the data contained in the table.
select * from table;
Parameters
table: Name of the table.
7.8 Sample Use Cases
For information about the Impala protector sample use cases, refer to section 11.9 Protecting Data
using Impala.
8 HAWQ
HAWQ is an MPP SQL processing engine for querying the data stored in a Hadoop cluster. It breaks
complex queries into smaller tasks and distributes their execution to the query processing units.
HAWQ is an MPP database, which uses HDFS to store data. It has the following components:
HAWQ Master Server: Enables users to interact with HAWQ using client programs, such as
PSQL or APIs, such as JDBC or ODBC.
The HAWQ Master Server performs the following functions:
o Authenticates client connections
o Processes incoming SQL commands
o Distributes workload among HAWQ segments
o Coordinates the results returned by each segment
o Presents the final results to the client application
Name Node: Enables client applications to locate a file.
HAWQ Segments: Are the units which process the individual data modules simultaneously
HAWQ Storage: Is HDFS, which stores all the table data
Interconnect Switch: Is the networking layer of HAWQ, which handles the communication
between the segments
This section provides information about the HAWQ protector, the UDFs provided, and the commands
for protecting and unprotecting data in a HAWQ table.
8.1 Overview of the HAWQ Protector
The Protegrity HAWQ protector extends the functionality of the HAWQ processing engine and provides
UDFs which protect or unprotect the data as it is stored or retrieved.
8.2 HAWQ Protector Usage
The Protegrity HAWQ protector provides UDFs for protecting data using encryption or tokenization,
and unprotecting data by using decryption or detokenization. Ensure that the format of the data is
either Varchar, Integer, Date, or Real.
Ensure that HAWQ is configured after the Big Data Protector is installed.
For more information about configuring HAWQ, refer to section 3.1.11 Configuring
HAWQ.
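As a minimal usage sketch, the following statement protects a varchar column while copying a table;
the customers table, its id and name columns, and the alpha_num_tk_de data element (used in the
UDF examples below) are assumptions that must be adapted to your own schema and security policy.
create table customers_protected as
select id, pty_VarcharIns(name, 'alpha_num_tk_de') as name_protected from customers;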
8.3 HAWQ UDFs
This section describes all HAWQ UDFs that are available for protection and unprotection in Big Data
Protector to build secure Big Data applications.
8.3.1 pty_GetVersion()
This UDF returns the PEP version number.
pty_GetVersion()
Parameters
None
Returns
This UDF returns the current PEP server version
Example
select pty_GetVersion();
8.3.2 pty_WhoAmI()
This UDF returns the logged in user name.
pty_WhoAmI()
Parameters
None
Returns
This UDF returns the current logged in user name
Example
select pty_WhoAmI();
8.3.3 pty_GetCurrentKeyId()
This UDF returns the current active key identification number of the encryption type data element.
pty_GetCurrentKeyId (dataElement varchar)
Parameters
dataElement: Variable specifying the protection method
Returns
This UDF returns the current key identification number of the encryption type data element,
which is passed as the parameter.
Example
select pty_GetCurrentKeyId('enc_de');
8.3.4 pty_GetKeyId()
This UDF returns the key ID for the encryption data element, used for protecting each row in a table.
pty_GetKeyId(dataElement string, col byte[])
Parameters
dataElement: Variable specifying the protection method
col: Byte array of the column in the table
Returns
This UDF returns the key ID for the encryption data element, used for protecting each row in the
table
Example
select pty_GetKeyId('enc_de',table_name.c) from table_name;
8.3.5 pty_VarcharEnc()
This UDF returns the encrypted value for a column containing varchar format data.
pty_VarcharEnc(col varchar, dataElement varchar)
Parameters
col: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the encrypted value as a byte array
Example
select pty_VarcharEnc(column_name,'enc_de') from table_name;
8.3.6 pty_VarcharDec()
This UDF returns the decrypted value for a column containing varchar format protected data.
pty_VarcharDec(col byte[], dataElement varchar)
Parameters
col: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the decrypted value
Example
select pty_VarcharDec(column_name,'enc_de') from table_name;
8.3.7 pty_VarcharHash()
This UDF returns the hashed value for a column containing varchar format data.
pty_VarcharHash(col varchar, dataElement varchar)
Parameters
col: Column name of the data to hash in the table
dataElement: Variable specifying the protection method
Returns
The protected value as byte array
Example
select pty_VarcharHash(column_name,'hash_de') from table_name;
8.3.8 pty_VarcharIns()
This UDF returns the tokenized value for a column containing varchar format data.
pty_VarcharIns(col varchar, dataElement varchar)
Parameters
col: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the tokenized value as byte array
Example
select pty_VarcharIns(column_name,'alpha_num_tk_de') from table_name;
8.3.9 pty_VarcharSel()
This UDF returns the detokenized value for a column containing varchar format tokenized data.
pty_VarcharSel(col varchar, dataElement varchar)
Parameters
col: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the detokenized value
Example
select pty_VarcharSel(column_name,'alpha_num_tk_de') from table_name;
8.3.10 pty_UnicodeVarcharIns()
This UDF protects varchar (Unicode) values.
pty_UnicodeVarcharIns(col varchar, dataElement varchar)
Parameters
col: Column name of the varchar (Unicode) data to protect
dataElement: Name of data element to protect varchar (Unicode) data.
This UDF should be used only if you need to tokenize Unicode data in HAWQ, and
migrate the tokenized data from HAWQ to a Teradata database and detokenize the
data using the Protegrity Database Protector.
Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data to a Teradata database, refer to
section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns protected varchar value.
Example
select pty_UnicodeVarcharIns(column_name, 'Token_unicode') from temp_table;
8.3.11 pty_UnicodeVarcharSel()
This UDF unprotects varchar values.
pty_unicodevarcharsel(col varchar, dataElement varchar)
Parameters
col: Column name of the varchar data to unprotect
dataElement: Name of the data element to unprotect the varchar data
This UDF should be used only if you need to tokenize Unicode data in Teradata
using the Protegrity Database Protector, and migrate the tokenized data from a
Teradata database to HAWQ and detokenize the data using the Protegrity Big Data
Protector for HAWQ.
Ensure that you use this UDF with a Unicode tokenization data element only.
For more information about migrating tokenized Unicode data to a Teradata database, refer to
section 15 Appendix: Migrating Tokenized Unicode Data from and to a Teradata Database.
Result
This UDF returns unprotected varchar (Unicode) value.
Example
select pty_unicodevarcharsel(column_name, 'Token_unicode') from temp_table;
8.3.12 pty_IntegerEnc()
This UDF returns the encrypted value for a column containing integer format data.
pty_IntegerEnc(col integer, dataElement varchar)
Parameters
col: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the encrypted value as byte array
Example
select pty_IntegerEnc(column_name,'enc_de') from table_name;
8.3.13 pty_IntegerDec()
This UDF returns the decrypted value for a column containing encrypted data in byte array format.
pty_IntegerDec(col byte[], dataElement varchar)
Parameters
col: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the decrypted value
Example
select pty_IntegerDec(column_name,'enc_de') from table_name;
8.3.14 pty_IntegerHash()
This UDF returns the hashed value for a column, containing integer format data, as a byte array.
pty_IntegerHash(col integer, dataElement varchar)
Parameters
col: Column name of the data to hash in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the hash value as byte array
Example
select pty_IntegerHash(column_name,'hash_de') from table_name;
8.3.15 pty_IntegerIns()
This UDF returns the tokenized value for a column containing integer format data.
pty_IntegerIns(col integer, dataElement varchar)
Parameters
col: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the tokenized value
Example
select pty_IntegerIns(column_name,'int_tk_de') from table_name;
8.3.16 pty_IntegerSel()
This UDF returns the detokenized value for a column containing integer format data.
pty_IntegerSel(col integer, dataElement varchar)
Parameters
col: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the detokenized value
Example
select pty_IntegerSel(column_name,'int_tk_de') from table_name;
8.3.17 pty_DateEnc()
This UDF returns the encrypted value for a column containing date format data.
pty_DateEnc(col date, dataElement varchar)
Parameters
col: Date column to encrypt in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the encrypted value as byte array
Example
select pty_DateEnc(column_name,'enc_de') from table_name;
8.3.18 pty_DateDec()
This UDF returns the decrypted value for a column containing encrypted data in byte array format.
pty_DateDec(col byte[], dataElement varchar)
Parameters
col: Date column to decrypt in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the decrypted value
Example
select pty_DateDec(column_name,'enc_de') from table_name;
8.3.19 pty_DateHash()
This UDF returns the hashed value for a column containing data in date format.
pty_DateHash(col date, dataElement varchar)
Parameters
col: Date column to hash in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the hashed value as byte array
Example
select pty_DateHash(column_name,'hash_de') from table_name;
8.3.20 pty_DateIns()
This UDF returns the tokenized value for a column containing data in date format.
pty_DateIns(col date, dataElement varchar)
Parameters
col: Date column to tokenize in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the tokenized value as date
If the date provided is outside the range described in Protection Methods Reference
6.5.2, then an error message appears in the psql shell and the transaction is
aborted. An Audit log entry is not generated for this issue.
Example
select pty_DateIns(column_name,'date_tk_de') from table_name;
8.3.21 pty_DateSel()
This UDF returns the detokenized value for a column containing data in date format.
pty_DateSel(col date, dataElement varchar)
Parameters
col: Date column to detokenize in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the detokenized value as date
Example
select pty_DateSel(column_name,'date_tk_de') from table_name;
8.3.22 pty_RealEnc()
This UDF returns the encrypted value for a column containing data in decimal format.
pty_RealEnc(col real, dataElement varchar)
Parameters
col: Column name of the data to encrypt in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the encrypted value in byte array format
Example
select pty_RealEnc(column_name,'enc_de') from table_name;
8.3.23 pty_RealDec()
This UDF returns the decrypted value for a column containing encrypted data in byte array format.
pty_RealDec(col byte[], dataElement varchar)
Parameters
col: Column name of the data to decrypt in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the decrypted value in real format
Example
select pty_RealDec(column_name,'enc_de') from table_name;
8.3.24 pty_RealHash()
This UDF returns the hashed value for a column containing data in real format.
pty_RealHash(col real, dataElement varchar)
Parameters
col: Column name of the data to hash in the table
dataElement: Variable specifying the protection method
Returns
This UDF returns the hashed value as byte array
Example
select pty_RealHash(column_name,'hash_de') from table_name;
8.3.25 pty_RealIns()
This UDF returns the tokenized value for a column containing data in real format.
If a decimal value is used, then it is tokenized by first loading the decimal type
column into a varchar type column and then using pty_VarcharIns() to tokenize
this column.
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
pty_RealIns(col real, dataElement varchar)
Parameters
col: Column name of the data to tokenize in the table
dataElement: Variable specifying the protection method
Result
This UDF returns the tokenized value in real format
Example
select pty_RealIns(column_name,'noenc_de') from table_name;
8.3.26 pty_RealSel()
This UDF returns the detokenized value for a column containing data in real format.
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
If this UDF is used with any other type of data, then an error explaining that the
datatype is unsupported is reported.
pty_RealSel(col real, dataElement varchar)
Parameters
col: Column name of the data to detokenize in the table
dataElement: Variable specifying the unprotection method
Returns
This UDF returns the detokenized value
Example
select pty_RealSel(column_name,'noenc_de') from table_name;
8.4 Inserting Data from a File into a Table
To populate the table sample_table from the basic_sample_data.csv file:
The following command creates the table sample_table, with the required number of columns.
create table sample_table (colname1 colname1_format, colname2 colname2_format,
colname3 colname3_format) distributed randomly;
The following command grants permissions for the table sample_table to the required user, who
will perform the protect or unprotect operations.
grant all on sample_table to <username>;
The following command enables you to populate the table sample_table with the data from the
basic_sample_data.csv file from the <PROTEGRITY_DIR>/samples/data directory.
\copy sample_table from '/opt/protegrity/samples/data/basic_sample_data.csv'
with delimiter ','
Parameters
sample_table: Name of the HAWQ table created to load the data from the input CSV file from
the required path.
colname1, colname2, colname3: Name of the columns.
colname1_format, colname2_format, colname3_format: The data types contained in the
respective columns. The data types can only be of types VARCHAR, INTEGER, DATE or REAL.
ATTRIBUTES: Additional parameters to consider when ingesting the data.
In the example, the delimiter is the ',' character because the input file is comma separated.
If the input file is tab separated, then the delimiter is '\t'.
8.5 Protecting Existing Data
To protect existing data, users should define the mappings between the columns and their respective
data elements in the data security policy.
The following commands create the table basic_sample_protected to store the protected data.
drop table if exists basic_sample_protected;
create table basic_sample_protected (colname1 colname1_format, colname2
colname2_format, colname3 colname3_format) distributed randomly;
Ensure that the user performing the task has the permissions to protect the data,
as required, in the data security policy.
The following command ingests cleartext data from the basic_sample table to the
basic_sample_protected table in protected form using HAWQ UDFs.
insert into basic_sample_protected(colname1, colname2, colname3) select colname1,
pty_varcharins(colname2,dataElement2), pty_varcharins(colname3,dataElement3) from
basic_sample;
Parameters
basic_sample_protected: Table to store protected data.
colname1, colname2, colname3: Name of the columns.
dataElement1, dataElement2, dataElement3: The data elements corresponding to the columns.
basic_sample: Table containing the original data in cleartext form.
8.6 Unprotecting Protected Data
To unprotect protected data, you need to specify the name of the table which contains the protected
data, the table which would store the unprotected data, and the columns and their respective data
elements.
Ensure that the user performing the task has permissions to unprotect the data as required in the
data security policy.
The following commands create the table table_unprotected to store the unprotected data.
drop table if exists table_unprotected;
create table table_unprotected (colname1 colname1_format, colname2
colname2_format, colname3 colname3_format) distributed randomly;
The following command retrieves the unprotected data and saves it in the table_unprotected
table.
insert into table_unprotected (colname1, colname2, colname3) select colname1,
pty_varcharsel(colname2,dataElement2), pty_varcharsel(colname3,dataElement3) from
table_protected;
Parameters
table_unprotected: Table to store unprotected data.
colname1, colname2, colname3: Name of the columns.
dataElement1, dataElement2, dataElement3: The data elements corresponding to the columns.
table_protected: Table containing protected data.
8.7 Retrieving Data from a Table
To retrieve data from a table, the user needs to have access to the table.
The following command displays the data contained in the table.
select * from table;
Parameters
table: Name of the table.
8.8 Sample Use Cases
For information about the HAWQ protector sample use cases, refer to section 11.10 Protecting Data
using HAWQ.
9 Spark
Spark is an execution engine that carries out batch processing of jobs in-memory and handles a
wider range of computational workloads. In addition to processing a batch of stored data, Spark is
capable of manipulating data in real time.
Spark leverages the physical memory of the Hadoop system and utilizes Resilient Distributed
Datasets (RDDs) to store the data in-memory, which lowers latency if the data fits in memory.
The data is saved on the hard drive only if required. As RDDs are the basic units of abstraction and
computation in Spark, you can use the protection and unprotection APIs, provided by the Spark
protector, when performing the transformation operations on an RDD.
If you need to use the Spark Protector API in a Spark Java job, then implement the function
interface as per the Spark Java programming specifications and use it in the required
transformation of an RDD to tokenize the data (see the sketch at the end of this introduction).
This section provides information about the Spark protector, the APIs provided, and the commands
for protecting and unprotecting data in a file by using the respective Spark APIs for protection or
unprotection. In addition, it provides information about Spark SQL, which is a module that adds
relational data processing capabilities to the Spark APIs, and a sample program for Spark Scala.
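The following Java sketch illustrates the pattern described above: a function interface implemented per the Spark Java programming specifications and applied in a map() transformation over a JavaRDD<String>, calling the String-array protect() API described in section 9.3.17. It is a minimal, illustrative sketch rather than the shipped demo job; the com.protegrity.spark import path, the TokenizeRddSketch class name, and the "AlphaNum" data element are assumptions that must be adjusted to match your protector JAR and policy.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import com.protegrity.spark.Protector;           // assumed package path
import com.protegrity.spark.PtySparkProtector;   // assumed package path

public class TokenizeRddSketch {

    public static JavaRDD<String> tokenize(JavaSparkContext sparkContext,
                                           JavaRDD<String> cleartext) {
        // Captured on the driver and shipped to the executors with the closure.
        final String applicationId = sparkContext.getConf().getAppId();
        final String dataElement = "AlphaNum";   // assumed tokenization data element

        return cleartext.map(new Function<String, String>() {
            public String call(String value) throws Exception {
                // One protector per record keeps the sketch short; a production job
                // would reuse one protector per partition (for example, mapPartitions()).
                Protector protector = new PtySparkProtector(applicationId);
                String[] input = new String[] { value };
                String[] output = new String[input.length];
                List<Integer> errorIndexList = new ArrayList<Integer>();
                protector.protect(dataElement, errorIndexList, input, output);
                return output[0];
            }
        });
    }
}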
9.1 Overview of the Spark Protector
The Protegrity Spark protector extends the functionality of the Spark engine and provides APIs that
protect or unprotect the data as it is stored or retrieved.
9.2 Spark Protector Usage
The Protegrity Spark protector provides APIs for protecting and reprotecting the data using
encryption or tokenization, and unprotecting data by using decryption or detokenization.
Ensure that Spark is configured after the Big Data Protector is installed.
For more information about configuring Spark, refer to section 3.1.12 Configuring
Spark.
9.3 Spark APIs
This section describes the Spark APIs (Java) available for protection and unprotection in the Big Data
Protector to build secure Big Data applications.
The Protegrity Spark protector only supports bytes converted from the string data type.
If int, short, or long format data is directly converted to bytes and passed as input to the
API that supports byte as input and provides byte as output, then data corruption might
occur.
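The following sketch makes the note above concrete by showing two safe ways to handle numeric data: passing it through the matching typed overload (for example, the int-array protect() in section 9.3.9), or converting it to a string before using the byte-based API. It is illustrative only; the sparkContext variable, the "int" and "Binary" data element names, and the accountNumber value follow the style of the examples later in this section and are assumptions about your environment and policy.
// A minimal sketch (not from the shipped samples) of the safe patterns implied by the note above.
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);

int accountNumber = 123456;   // illustrative numeric value

// Safe: numeric data goes through the matching typed overload (see section 9.3.9).
int[] intInput = new int[] { accountNumber };
int[] intOutput = new int[intInput.length];
protector.protect("int", new ArrayList<Integer>(), intInput, intOutput);

// Also safe for the byte-based API: convert the number to a String first, because the
// byte-based API only supports bytes converted from the string data type.
byte[][] byteInput = new byte[][] { String.valueOf(accountNumber).getBytes() };
byte[][] byteOutput = new byte[byteInput.length][];
protector.protect("Binary", new ArrayList<Integer>(), byteInput, byteOutput);

// Unsafe (per the note above): bytes produced directly from an int, for example with
// ByteBuffer.allocate(4).putInt(accountNumber).array(), might corrupt the data.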
9.3.1 getVersion()
This function returns the current version of the Spark protector.
public string getVersion()
Parameters
None
Result
This function returns the current version of Spark protector.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String version = protector.getVersion();
Exception
PtySparkProtectorException: If unable to return the current version of the Spark protector
9.3.2 getCurrentKeyId()
This method returns the current Key ID for a data element that contains the KEY ID attribute,
which is set while creating the data element, such as AES-256, AES-128, and so on.
public int getCurrentKeyId(String dataElement)
Parameters
dataElement: Name of the data element
Result
This method returns the current Key ID for the data element containing the KEY ID attribute.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
int keyId = protector.getCurrentKeyId("AES-256");
Exception
PtySparkProtectorException: If unable to return the current Key ID for the data element
9.3.3 checkAccess()
This method checks the access of the user for the specified data element.
public boolean checkAccess(String dataElement, Permission permission)
Parameters
dataElement: Name of the data element
permission: Type of access that the user has for the data element
Result
true: If the user has access to the data element
false: If the user does not have access to the data element
Example
String applicationId = sparkContext.getConf().getAppId()
Protector protector = new PtySparkProtector(applicationId);
boolean accessType = protector.checkAccess(dataElement, Permission.PROTECT);
Exception
PtySparkProtectorException: If unable to verify the access of the user for the data element
9.3.4 getDefaultDataElement()
This method returns the default data element configured in the security policy.
public String getDefaultDataElement(String policyName)
Parameters
policyName: Name of policy configured using Policy management in ESA
Result
Default data element name configured in the security policy.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = protector.getDefaultDataElement("sample_policy");
Exception
PtySparkProtectorException: If unable to return the default data element name
9.3.5 hmac()
This method performs hashing of the data using the HMAC operation on a single data item with a
data element, which is associated with HMAC. It returns the hmac value of the data with the data
element.
public byte[] hmac(String dataElement, byte[] input)
Parameters
dataElement: Name of the data element for HMAC
input: Byte array of data for HMAC
Result
Byte array of HMAC data
Example
String applicationId = sparkContext.getConf().getAppId()
Protector protector = new PtySparkProtector(applicationId);
byte[] output = protector.hmac("HMAC-SHA1", "test1".getBytes());
Exception
PtySparkProtectorException: If unable to protect data
9.3.6 protect()
Protects the data provided as a byte array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, byte[][] input,
byte[][] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the Error Index
input: Array of a byte array of data to be protected
output: Array of a byte array containing protected data
The Protegrity Spark protector only supports bytes converted from the string data type.
If int, short, or long format data is directly converted to bytes and passed as input to the
API that supports byte as input and provides byte as output, then data corruption might
occur.
If you are using the Protect API which accepts byte as input and provides byte as output,
then ensure that when unprotecting the data, the Unprotect API, with byte as input and
byte as output is utilized. In addition, ensure that the byte data being provided as input
to the Protect API has been converted from a string data type only.
Result
The output variable in the method signature contains protected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "Binary";
byte[][] input = new byte[][] {"test1".getBytes(), "test2".getBytes()};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
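The note above implies a byte-level round trip: data protected with the byte-in/byte-out protect() must be unprotected with the byte-in/byte-out unprotect() described in section 9.3.19, and the input bytes must originate from strings. The following minimal sketch, which reuses the assumptions of the example above (the sparkContext variable and a "Binary" data element in the policy), shows the round trip and the conversion of the unprotected bytes back to strings.
// A minimal round-trip sketch under the same assumptions as the example above.
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector(applicationId);
String dataElement = "Binary";

// The byte input originates from strings, per the note above.
byte[][] cleartext = new byte[][] { "test1".getBytes(), "test2".getBytes() };
byte[][] protectedBytes = new byte[cleartext.length][];
List<Integer> protectErrors = new ArrayList<Integer>();
protector.protect(dataElement, protectErrors, cleartext, protectedBytes);

// Unprotect with the byte-in/byte-out overload (see section 9.3.19) and convert the
// unprotected bytes back to strings.
byte[][] unprotectedBytes = new byte[protectedBytes.length][];
List<Integer> unprotectErrors = new ArrayList<Integer>();
protector.unprotect(dataElement, unprotectErrors, protectedBytes, unprotectedBytes);

String restored = new String(unprotectedBytes[0]);   // expected to equal "test1"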
9.3.7 protect()
Protects the short format data provided as a short array. The type of protection applied is defined by
dataElement.
public void protect(String dataElement, List<Integer> errorIndex, short[] input,
short[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the Error Index
input: Short array of data to be protected
output: Short array containing protected data
Result
The output variable in the method signature contains protected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "short";
short[] input = new short[] {1234, 4545};
short[] output = new short[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
9.3.8 protect()
Encrypts the short format data provided as a short array. The type of encryption applied is defined
by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, short[] input,
byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the Error Index
input: Short array of data to be encrypted
output: Array of an encrypted byte array
Result
The output variable in the method signature contains protected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
short[] input = new short[]{1234, 4545};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data
9.3.9 protect()
Protects the data provided as int array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, int[] input,
int[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the Error Index
input: Int array of data to be protected
output: Int array containing protected data
Result
The output variable in the method signature contains protected int data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "int";
int[] input = new int[]{1234, 4545};
int[] output = new int[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
9.3.10 protect()
Encrypts the data provided as int array. The type of encryption applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, int[] input,
byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the Error Index
input: Int array of data to be encrypted
output: Array of an encrypted byte array
Result
The output variable in the method signature contains encrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
int[] input = new int[]{1234, 4545};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data
9.3.11 protect()
Protects the data provided as a long array. The type of protection applied is defined by
dataElement.
public void protect(String dataElement, List<Integer> errorIndex, long[] input,
long[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the Error Index
input: Long array of data to be protected
output: Long array containing protected data
Result
The output variable in the method signature contains protected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "long";
long[] input = new long[] {1234, 4545};
long[] output = new long[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
9.3.12 protect()
Encrypts the data provided as a long array. The type of encryption applied is defined by
dataElement.
public void protect(String dataElement, List<Integer> errorIndex, long[] input,
byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the Error Index
input: Long array of data to be encrypted
output: Array of a byte array containing encrypted data
Result
The output variable in the method signature contains encrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
long[] input = new long[] {1234, 4545};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data
9.3.13 protect()
Protects the data provided as float array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, float[] input,
float[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the Error Index
input: Float array of data to be protected
output: Float array containing protected data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
The output variable in the method signature contains protected float data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "float";
float[] input = new float[] {123.4f, 454.5f};
float[] output = new float[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
9.3.14 protect()
Encrypts the data provided as float array. The type of encryption applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, float[] input,
byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the Error Index
input: Float array of data to be encrypted
output: Array of a byte array containing encrypted data
Ensure that you use a data element with either the No Encryption method or the
Encryption method only. Using any other data element might cause corruption
of data.
Result
The output variable in the method signature contains encrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
float[] input = new float[] {123.4f, 454.5f};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data
9.3.15 protect()
Protects the data provided as double array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, double[] input,
double[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the Error Index
input: Double array of data to be protected
output: Double array containing protected data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
The output variable in the method signature contains protected double data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "double";
double[] input = new double[] {123.4, 454.5};
double[] output = new double[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
9.3.16 protect()
Encrypts the data provided as double array. The type of encryption applied is defined by
dataElement.
public void protect(String dataElement, List<Integer> errorIndex, double[] input,
byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the Error Index
input: Double array of data to be encrypted
output: Array of a byte array containing encrypted data
Ensure that you use a data element with either the No Encryption method or the
Encryption method only. Using any other data element might cause corruption
of data.
Result
The output variable in the method signature contains encrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
double[] input = new double[] {123.4, 454.5};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data
9.3.17 protect()
Protects the data provided as string array. The type of protection applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, String[] input,
String[] output)
Parameters
dataElement: Name of the data element used for protection
errorIndex: List of the Error Index
input: String array of data to be protected
output: String array containing protected data
Result
The output variable in the method signature contains protected string data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AlphaNum";
String[] input = new String[] {"test1", "test2"};
String[] output = new String[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to protect data
9.3.18 protect()
Encrypts the data provided as string array. The type of encryption applied is defined by dataElement.
public void protect(String dataElement, List<Integer> errorIndex, String[] input,
byte[][] output)
Parameters
dataElement: Name of the data element used for encryption
errorIndex: List of the Error Index
input: String array of data to be encrypted
output: Array of a byte array containing encrypted data
Result
The output variable in the method signature contains encrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
String[] input = new String[] {"test1", "test2"};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.protect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to encrypt data
9.3.19 unprotect()
Unprotects the protected data provided as a byte array. The type of unprotection applied is defined
by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, byte[][] output)
Parameters
dataElement: Name of the data element used for unprotection
errorIndex: List of the Error Index
input: Array of a byte array of data to be unprotected
output: Array of a byte array containing unprotected data
The Protegrity Spark protector only supports bytes converted from the string data type.
If int, short, or long format data is directly converted to bytes and passed as input to the
API that supports byte as input and provides byte as output, then data corruption might
occur.
Result
The output variable in the method signature contains unprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "Binary";
byte[][] input = new byte[][] {"test1".getBytes(), "test2".getBytes()};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.20 unprotect()
Unprotects the protected short format data provided as a short array. The type of unprotection
applied is defined by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, short[]
input, short[] output)
Parameters
dataElement: Name of the data element used for unprotection
errorIndex: List of the Error Index
input: Short array of data to be unprotected
output: Short array containing unprotected data
Result
The output variable in the method signature contains unprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "short";
short[] input = new short[]{1234, 4545};
short[] output = new short[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.21 unprotect()
Decrypts the encrypted short format data provided as a byte array. The type of decryption applied
is defined by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, short[] output)
Parameters
dataElement: Name of the data element used for decryption
errorIndex: List of the Error Index
input: Array of a byte array containing encrypted data
output: Short array containing decrypted data
Result
The output variable in the method signature contains decrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
// Here, input is an encrypted short array created using the protect() API in section 9.3.8:
// public void protect(String dataElement, List<Integer> errorIndex, short[] input, byte[][] output) throws PtySparkProtectorException;
byte[][] input = { <encrypted short array> };
short[] output = new short[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to decrypt data
9.3.22 unprotect()
Unprotects the protected data provided as int array. The type of unprotection applied is defined by
dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, int[] input,
int[] output)
Parameters
dataElement: Name of the data element used for unprotection
errorIndex: List of the Error Index
input: Int array of data to be unprotected
output: Int array containing unprotected data
Result
The output variable in the method signature contains unprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "int";
int[] input = new int[]{1234, 4545};
int[] output = new int[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.23 unprotect()
Decrypts the encrypted int format data provided as byte array. The type of decryption applied is
defined by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, int[] output)
Parameters
dataElement: Name of the data element used for decryption
errorIndex: List of the Error Index
input: Array of a byte array containing encrypted data
output: Int array containing decrypted data
Result
The output variable in the method signature contains decrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
// Here, input is an encrypted int array created using the protect() API in section 9.3.10:
// public void protect(String dataElement, List<Integer> errorIndex, int[] input, byte[][] output) throws PtySparkProtectorException;
byte[][] input = {<encrypted int array>};
int[] output = new int[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to decrypt data
9.3.24 unprotect()
Unprotects the protected data provided as long array. The type of unprotection applied is defined by
dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, long[] input,
long[] output)
Parameters
dataElement: Name of the data element used for unprotection
errorIndex: List of the Error Index
input: Long array of data to be unprotected
output: Long array containing unprotected data
Result
The output variable in the method signature contains unprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "long";
long[] input = new long[] {1234, 4545};
long[] output = new long[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.25 unprotect()
Decrypts the encrypted long format data provided as byte array. The type of decryption applied is
defined by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, long[] output)
Parameters
dataElement: Name of the data element used for decryption
errorIndex: List of the Error Index
input: Array of a byte array containing encrypted data
output: Long array containing decrypted data
Result
The output variable in the method signature contains decrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
// Here, input is an encrypted long array created using the protect() API in section 9.3.12:
// public void protect(String dataElement, List<Integer> errorIndex, long[] input, byte[][] output) throws PtySparkProtectorException;
byte[][] input = { <encrypted long array> };
long[] output = new long[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.26 unprotect()
Unprotects the protected data provided as float array. The type of unprotection applied is defined by
dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, float[] input,
float[] output)
Parameters
dataElement: Name of the data element used for unprotection
errorIndex: List of the Error Index
input: Float array of data to be unprotected
output: Float array containing unprotected data
Result
The output variable in the method signature contains unprotected data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "float";
float[] input = new float[] {123.4f, 454.5f};
float[] output = new float[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.27 unprotect()
Decrypts the encrypted float format data provided as byte array. The type of decryption applied is
defined by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, float[] output)
Parameters
dataElement: Name of the data element used for decryption
errorIndex: List of the Error Index
input: Array of a byte array containing encrypted data
output: Float array containing decrypted data
Ensure that you use a data element with either the No Encryption method or the
Encryption method only. Using any other data element might cause corruption
of data.
Result
The output variable in the method signature contains decrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
// Here, input is an encrypted float array created using the protect() API in section 9.3.14:
// public void protect(String dataElement, List<Integer> errorIndex, float[] input, byte[][] output) throws PtySparkProtectorException;
byte[][] input = { <encrypted float array> };
float[] output = new float[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to decrypt data
9.3.28 unprotect()
Unprotects the protected data provided as double array. The type of unprotection applied is defined
by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, double[]
input, double[] output)
Parameters
dataElement: Name of the data element used for unprotection
errorIndex: List of the Error Index
input: Double array of data to be unprotected
output: Double array containing unprotected data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
The output variable in the method signature contains unprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "double";
double[] input = new double[] {123.4, 454.5};
double[] output = new double[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.29 unprotect()
Decrypts the encrypted double format data provided as byte array. The type of decryption applied is
defined by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, double[] output)
Parameters
dataElement: Name of the data element used for decryption
errorIndex: List of the Error Index
input: Array of a byte array containing encrypted data
output: Double array containing decrypted data
Ensure that you use a data element with either the No Encryption method or the
Encryption method only. Using any other data element might cause corruption
of data.
Result
The output variable in the method signature contains decrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
// Here, input is an encrypted double array created using the protect() API in section 9.3.16:
// public void protect(String dataElement, List<Integer> errorIndex, double[] input, byte[][] output) throws PtySparkProtectorException;
byte[][] input = { <encrypted double array> };
double[] output = new double[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.30 unprotect()
Unprotects the protected data provided as string array. The type of unprotection applied is defined
by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, String[]
input, String[] output)
Parameters
dataElement: Name of the data element used for unprotection
errorIndex: List of the Error Index
input: String array of data to be unprotected
output: String array containing unprotected data
Result
The output variable in the method signature contains unprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AlphaNum";
String[] input = new String[] {"test1", "test2"};
String[] output = new String[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to unprotect data
9.3.31 unprotect()
Decrypts the encrypted string format data provided as byte array. The type of decryption applied is
defined by dataElement.
public void unprotect(String dataElement, List<Integer> errorIndex, byte[][]
input, String[] output)
Parameters
dataElement: Name of the data element used for decryption
errorIndex: List of the Error Index
input: Array of a byte array containing encrypted data
output: String array containing decrypted data
Result
The output variable in the method signature contains decrypted data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String dataElement = "AES-256";
// Here, input is an encrypted String array created using the protect() API in section 9.3.18:
// public void protect(String dataElement, List<Integer> errorIndex, String[] input, byte[][] output) throws PtySparkProtectorException;
byte[][] input = { <encrypted string array> };
String[] output = new String[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.unprotect(dataElement, errorIndexList, input, output);
Exception
PtySparkProtectorException: If unable to decrypt data
9.3.32 reprotect()
Reprotects the byte array data, which was protected earlier, with a different data element.
public void reprotect(String oldDataElement, String newDataElement,
List<Integer> errorIndex, byte[][] input, byte[][] output)
Parameters
oldDataElement: Name of the data element with which data was protected earlier
newDataElement: Name of the new data element with which data is reprotected
errorIndex: List of the Error Index
input: Array of a byte array of data to be reprotected
output: Array of a byte array containing reprotected data
Result
The output variable in the method signature contains reprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String oldDataElement = "Binary";
String newDataElement = "Binary_1";
byte[][] input = new byte[][] {"test1".getBytes(), "test2".getBytes()};
byte[][] output = new byte[input.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.reprotect(oldDataElement, newDataElement, errorIndexList, input,
output);
Exception
PtySparkProtectorException: If errors occur while reprotecting data
9.3.33 reprotect()
Reprotects the short array data, which was protected earlier, with a different data element.
public void reprotect(String oldDataElement, String newDataElement,
List<Integer> errorIndex, short[] input, short[] output)
Parameters
oldDataElement: Name of the data element with which data was protected earlier
newDataElement: Name of the new data element with which data is reprotected
errorIndex: List of the Error Index
input: Short array of data to be reprotected
output: Short array containing reprotected data
Result
The output variable in the method signature contains reprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String oldDataElement = "short";
String newDataElement = "short_1";
short[] input = new short[] {135, 136};
short[] output = new short[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.reprotect(oldDataElement, newDataElement, errorIndexList, input,
output);
Exception
PtySparkProtectorException: If errors occur while reprotecting data
9.3.34 reprotect()
Reprotects the int array data, which was protected earlier, with a different data element.
public void reprotect(String oldDataElement, String newDataElement,
List<Integer> errorIndex, int[] input, int[] output)
Parameters
oldDataElement: Name of the data element with which data was protected earlier
newDataElement: Name of the new data element with which data is reprotected
errorIndex: List of the Error Index
input: Int array of data to be reprotected
output: Int array containing reprotected data
Result
The output variable in the method signature contains reprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String oldDataElement = "int";
String newDataElement = "int_1";
int[] input = new int[] {234,351};
int[] output = new int[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.reprotect(oldDataElement, newDataElement, errorIndexList, input,
output);
Exception
PtySparkProtectorException: If errors occur while reprotecting data
9.3.35 reprotect()
Reprotects the long array data, which was protected earlier, with a different data element.
public void reprotect(String oldDataElement, String newDataElement,
List<Integer> errorIndex, long[] input, long[] output)
Parameters
oldDataElement: Name of the data element with which data was protected earlier
newDataElement: Name of the new data element with which data is reprotected
errorIndex: List of the Error Index
input: Long array of data to be reprotected
output: Long array containing reprotected data
Result
The output variable in the method signature contains reprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String oldDataElement = "long";
String newDataElement = "long_1";
long[] input = new long[] {1234, 135};
long[] output = new long[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.reprotect(oldDataElement, newDataElement, errorIndexList, input,
output);
Exception
PtySparkProtectorException: If errors occur while reprotecting data
9.3.36 reprotect()
Reprotects the float array data, which was protected earlier, with a different data element.
public void reprotect(String oldDataElement, String newDataElement,
List<Integer> errorIndex, float[] input, float[] output)
Parameters
oldDataElement: Name of the data element with which data was protected earlier
newDataElement: Name of the new data element with which data is reprotected
errorIndex: List of the Error Index
input: Float array of data to be reprotected
output: Float array containing reprotected data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
The output variable in the method signature contains reprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String oldDataElement = "NoEnc";
String newDataElement = "NoEnc_1";
float[] input = new float[] {23.56f, 26.43f};
float[] output = new float[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.reprotect(oldDataElement, newDataElement, errorIndexList, input,
output);
Exception
PtySparkProtectorException: If errors occur while reprotecting data
9.3.37 reprotect()
Reprotects the double array data, which was protected earlier, with a different data element.
public void reprotect(String oldDataElement, String newDataElement,
List<Integer> errorIndex, double[] input, double[] output)
Parameters
oldDataElement: Name of the data element with which data was protected earlier
newDataElement: Name of the new data element with which data is reprotected
errorIndex: List of the Error Index
input: Double array of data to be reprotected
output: Double array containing reprotected data
Ensure that you use the data element with the No Encryption method only. Using
any other data element might cause corruption of data.
Result
The output variable in the method signature contains reprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String oldDataElement = "NoEnc";
String newDataElement = "NoEnc_1";
double[] input = new double[] {235.5, 1235.66};
double[] output = new double[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.reprotect(oldDataElement, newDataElement, errorIndexList, input,
output);
Exception
PtySparkProtectorException: If errors occur while reprotecting data
9.3.38 reprotect()
Reprotects the string array data, which was protected earlier, with a different data element.
public void reprotect(String oldDataElement, String newDataElement,
List<Integer> errorIndex, String[] input, String[] output)
Parameters
oldDataElement: Name of the data element with which data was protected earlier
newDataElement: Name of the new data element with which data is reprotected
errorIndex: List of the Error Index
input: String array of data to be reprotected
output: String array containing reprotected data
Result
The output variable in the method signature contains reprotected data
Example
String applicationId = sparkContext.getConf().getAppId();
Protector protector = new PtySparkProtector (applicationId);
String oldDataElement = "AlphaNum";
String newDataElement = "AlphaNum_1";
String[] input = new String[] {"test1", "test2"};
String[] output = new String[input.length];
List<Integer> errorIndexList = new ArrayList<Integer>();
protector.reprotect(oldDataElement, newDataElement, errorIndexList, input,
output);
Exception
PtySparkProtectorException: If errors occur while reprotecting data
9.4 Displaying the Cleartext Data from a File
To display the cleartext data from the basic_sample_data.csv file:
The following command enables you to display the cleartext data from the basic_sample_data.csv
file from the /tmp/basic_sample/sample/ directory.
hadoop fs -cat /tmp/basic_sample/sample/basic_sample_data.csv
Parameters
basic_sample_data.csv: Name of the file containing cleartext data.
9.5 Protecting Existing Data
To protect cleartext data, you need to specify the name of the file which contains the cleartext data
and the name of the file which would store the protected data.
The following command reads the cleartext data from the basic_sample_data.csv file and stores it in
the basic_sample_protected.csv file in protected form using Spark UDFs.
Ensure that the user performing the task has the permissions to protect the data,
as required, in the data security policy.
./spark-submit --master yarn --class com.protegrity.spark.ProtectData
<PROTEGRITY_DIR>/samples/spark/lib/spark_protector_demo.jar
<Path_of_Cleartext_data_file>/basic_sample_data.csv
<Path_of_Protected_data_file>/basic_sample_protected.csv
Parameters
com.protegrity.spark.ProtectData: The Spark protector class for protecting the data.
spark_protector_demo.jar: The Jar file utilizing the Spark protector API for protecting data in the
.csv file.
<Path_of_Cleartext_data_file>: The HDFS directory path for the file with cleartext data.
<Path_of_Protected_data_file>: The HDFS directory path for the file with protected data.
basic_sample_data: File to read cleartext data.
basic_sample_protected_data: File to store protected data.
9.6 Unprotecting Protected Data
To unprotect the protected data, you need to specify the name of the file which contains the protected
data and the name of the file which would store the unprotected data.
The following command retrieves the protected data from the basic_sample_protected.csv file and
saves it in basic_sample_unprotected.csv file in unprotected form.
Ensure that the user performing the task has the permissions to unprotect the data,
as required, in the data security policy.
./spark-submit --master yarn --class com.protegrity.spark.UnProtectData
<PROTEGRITY_DIR>/samples/spark/lib/spark_protector_demo.jar
<Path_of_Protected_data_file>/basic_sample_protected_data.csv
<Path_of_Unprotected_data_file>/basic_sample_unprotected_data.csv
Parameters
com.protegrity.spark.UnProtectData: The Spark protector class for unprotecting the protected
data.
spark_protector_demo.jar: The Jar file utilizing the Spark protector API for unprotecting the
protected data in the .csv file.
<Path_of_Protected_data_file>: The HDFS directory path for the file with protected data.
<Path_of_Unprotected_data_file>: The HDFS directory path for the file to store the unprotected
data.
basic_sample_protected_data: File to read the protected data.
basic_sample_unprotected_data: File to save the unprotected data.
9.7 Retrieving the Unprotected Data from a File
To retrieve data from a file containing protected data, the user needs to have access to the file.
The following command displays the unprotected data contained in the file.
hadoop fs -cat /tmp/basic_sample/sample/basic_sample_unprotected_data.csv/part*
Parameters
basic_sample_unprotected_data.csv: Name of the file containing unprotected data.
9.8 Spark APIs and Supported Protection Methods
The following table lists the Spark APIs, the input and output data types, and the supported Protection
Methods.
Table 9-1: Spark APIs and Supported Protection Methods
Operation Input Output Protection Method Supported
Protect Byte Byte Tokenization, Encryption, No Encryption, DTP2, CUSP
Protect Short Short Tokenization, No Encryption
Protect Short Byte Encryption, CUSP
Protect Int Int Tokenization, No Encryption
Protect Int Byte Encryption, CUSP
Protect Long Long Tokenization, No Encryption
Protect Long Byte Encryption, CUSP
Protect Float Float Tokenization, No Encryption
Protect Float Byte Encryption, CUSP
Protect Double Double Tokenization, No Encryption
Protect Double Byte Encryption, CUSP
Protect String String Tokenization, No Encryption, DTP2
Protect String Byte Encryption, CUSP
Unprotect Byte Byte Tokenization, Encryption, No Encryption, DTP2, CUSP
Unprotect Short Short Tokenization, No Encryption
Unprotect Byte Short Encryption, CUSP
Unprotect Int Int Tokenization, No Encryption
Unprotect Byte Int Encryption, CUSP
Unprotect Long Long Tokenization, No Encryption
Unprotect Byte Long Encryption, CUSP
Unprotect Float Float Tokenization, No Encryption
Unprotect Byte Float Encryption, CUSP
Unprotect Double Double Tokenization, No Encryption
Unprotect Byte Double Encryption, CUSP
Unprotect String String Tokenization, No Encryption, DTP2
Unprotect Byte String Encryption, CUSP
Reprotect Byte Byte Tokenization, Encryption, DTP2, CUSP (NOTE: If a protected value is
generated using Byte as both Input and Output, then only Encryption/CUSP is supported.)
Reprotect Short Short Tokenization
Reprotect Int Int Tokenization
Reprotect Long Long Tokenization
Reprotect Float Float Tokenization
Reprotect Double Double Tokenization
Reprotect String String Tokenization, DTP2
9.9 Sample Use Cases
For information about the Spark protector sample use cases, refer to section 12.11 Protecting Data
using Spark.
9.10 Spark SQL
The Spark SQL module provides relational data processing capabilities to Spark. The module allows
you to run SQL queries within Spark programs. It provides DataFrames, which are RDDs with an
associated schema, and supports processing structured data stored in Hive tables.
Spark SQL enables structured data processing and programming of RDDs, providing relational and
procedural processing through a DataFrame API that integrates with Spark.
Starting from the 6.6.3 release, you can invoke Protegrity Hive UDFs using Spark SQL and protect
or unprotect data.
Starting from the Big Data Protector 6.6.3 release, only the Spark SQL CLI is supported.
9.10.1 DataFrames
A DataFrame is a distributed collection of data, similar to an RDD, with an associated schema.
DataFrames can be created from a wide array of sources, such as Hive tables, external databases,
structured data files, or existing RDDs.
A DataFrame is equivalent to a table in a relational database and can be manipulated in much the
same way as an RDD. Because DataFrames track their schema and support relational operations,
Spark SQL can optimize their execution and act as a distributed SQL query engine.
9.10.2 SQLContext
A SQLContext is the class that is used to initialize Spark SQL. It enables applications to run SQL
queries and SQL functions, and returns the result as a DataFrame.
HiveContext extends the functionality of SQLContext and provides capabilities to use Hive UDFs,
create Hive queries, and access and modify the data in Hive tables.
The Spark SQL CLI is used to run the Hive metastore service in local mode and execute queries.
When you run Spark SQL (spark-sql), which is the client for running queries in Spark, it creates a
SparkContext defined as sc and a HiveContext defined as sqlContext.
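To make the relationship between these objects concrete, the following minimal Java sketch (not part
of the product samples) assumes Spark 1.x with Hive support on the classpath and an existing Hive
table named test_table; it only illustrates how a HiveContext runs a query and returns the result as
a DataFrame.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class DataFrameExample {
    public static void main(String[] args) {
        // Create the SparkContext and wrap it in a HiveContext (the sc and sqlContext of spark-sql).
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("DataFrameExample"));
        HiveContext sqlContext = new HiveContext(sc.sc());

        // Run a SQL query against a Hive table; the result is a DataFrame (an RDD with a schema).
        DataFrame df = sqlContext.sql("SELECT * FROM test_table");
        df.printSchema();
        df.show();

        sc.stop();
    }
}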
9.10.3 Accessing the Hive Protector UDFs
You can access the Hive UDFs in Spark SQL by adding the Hive UDF JAR path to the Spark
classpath.
For more information about configuring Spark SQL, refer to section 3.1.12 Configuring Spark.
If you are using the Hive UDFs ptyWhoAmI() and ptyProtectStr() with Spark SQL, then
the user retrieval process differs, as described by the following list:
ptyWhoAmI(): The user is detected from the SessionState object, similar to Hive
ptyProtectStr(): The user is detected from the HiveConf object as the SessionState
object does not return the user.
If you are using the UDF ptyProtectStr() with Hive, then the user is detected from
the SessionState object.
9.10.3.1 Accessing Hive UDFs when Hive Services are Running
If the hive-site.xml file is placed in the Spark configuration directory, which is typically
/etc/spark/conf, then Spark SQL can access the Hive UDFs and Hive tables, and read and write data
that is stored in Hive.
For more information about the Hive UDFs, refer to section 4.5 Hive UDFs.
To access Hive UDFs when Hive Services are running:
1. Ensure that Spark SQL is configured, as required.
For more information about configuring Spark SQL, refer to section 3.1.12 Configuring Spark.
2. Start the Spark SQL (spark-sql) client.
3. Create a table test_table with the following command.
CREATE TABLE IF NOT EXISTS test_table (val STRING);
4. Load data from the sample.txt file into the table test_table with the following command.
LOAD DATA LOCAL INPATH 'sample.txt' INTO TABLE test_table;
5. Protect the data in the table test_table using the Hive UDFs with the following commands.
create temporary function ptyProtectStr
AS 'com.protegrity.hive.udf.ptyProtectStr';
SELECT ptyProtectStr(val,'AN_TOKEN') from test_table;
The data is protected using the specified data element.
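The same flow can also be driven from a Spark application instead of the spark-sql client. The
following minimal Java sketch (not part of the product samples) assumes Spark 1.x, the Hive UDF JAR
on the Spark classpath as described in section 3.1.12 Configuring Spark, and the test_table table and
AN_TOKEN data element used in the steps above.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class ProtectWithHiveUdf {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ProtectWithHiveUdf"));
        HiveContext sqlContext = new HiveContext(sc.sc());

        // Register the Protegrity Hive UDF, as in step 5 above.
        sqlContext.sql("CREATE TEMPORARY FUNCTION ptyProtectStr AS 'com.protegrity.hive.udf.ptyProtectStr'");

        // Protect the val column with the AN_TOKEN data element; the result is returned as a DataFrame.
        DataFrame protectedData = sqlContext.sql("SELECT ptyProtectStr(val, 'AN_TOKEN') FROM test_table");
        protectedData.show();

        sc.stop();
    }
}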
9.10.3.2 Accessing Hive UDFs when Hive Services are Stopped
Spark SQL can access the Hive UDFs without Hive support when the Hive services are stopped.
When you run Spark SQL (spark-sql), it creates its own metastore (metastore_db) and warehouse
in the Spark installation directory. Internally, Spark SQL is compiled with Hive and uses the Hive
classes to execute the Hive UDFs.
For more information about the Hive UDFs, refer to section 4.5 Hive UDFs.
To access Hive UDFs when Hive Services are stopped:
1. Ensure that the Hadoop cluster is installed, configured, and running.
2. Start the Spark SQL (spark-sql) client.
3. Create a table test_table with the following command.
CREATE TABLE IF NOT EXISTS test_table (val STRING);
4. Load data from the sample.txt file into the table test_table with the following command.
LOAD DATA LOCAL INPATH 'sample.txt' INTO TABLE test_table;
5. Protect the data in the table test_table using the Hive UDFs with the following commands.
create temporary function ptyProtectStr
AS 'com.protegrity.hive.udf.ptyProtectStr';
SELECT ptyProtectStr(val,'AN_TOKEN') from test_table;
The data is protected using the specified data element.
9.10.4 Sample Use Cases
If you need to use the Hive UDFs with Spark SQL, then start the Spark SQL (spark-sql) client. You can
then run commands similar to the ones used with the Hive UDFs in Hive.
For information about the sample use cases for Hive UDFs using Spark SQL, refer to section 12.6
Protecting Data using Hive.
9.11 Spark Scala
The Protegrity Spark protector (Java) can be used with Scala to protect and reprotect the data by
using encryption or tokenization, and unprotect the data by using decryption or detokenization.
In this Big Data Protector release, a sample code snippet for Spark Scala is provided.
9.11.1 Sample Use Cases
For information about the sample code snippets for using Spark with Scala, refer to section 12.11.4
Sample Code Usage for Spark (Scala).
10 Data Node and Name Node Security with File
Protector
The HDFS file system stores data in a distributed manner, which is accessible by any node that is a
part of the Hadoop system.
The Hadoop utility uses commands to display and manipulate data in the HDFS file system. Hadoop
stores information, such as blocks, metadata, jobs and so on in the OS file system. The Hadoop local
directories are accessible to the OS user.
Protegrity File Protector over HDFS provides a way to protect the Hadoop local directories, which are
stored in the operating system, so that users cannot manipulate the data or corrupt the folders
directly. Users with access to the directories can perform operations, such as read, write, and delete,
on the HDFS file system only through binaries, such as hadoop, mapred, hive, pig,
and so on.
For managing access, Protegrity provides Policy management in ESA, which creates and deploys
policies on Hadoop nodes.
For more information about Protegrity File Protector, contact Protegrity Professional Services.
10.1 Features of the Protegrity File Protector
You can use Protegrity File Protector to protect files and folders on the local file system.
The Protegrity File Protector provides a highly transparent and easy to administer solution for
securing sensitive files and ensures the safety of your data, as explained in the following sections.
10.1.1 Protegrity File Encryption
Protegrity File Encryption protects file content in a transparent manner by encryption and decryption.
This is accomplished by employing policies defined using Policy management in ESA and volume
encryption modules.
10.1.2 Protegrity Volume Encryption
Protegrity Volume Encryption protects the contents of files and folders stored in a protected or
encrypted volume, employing policies defined using Policy management in ESA. The Administrator
must mount the volume for use, and unmount and close it when it is no longer needed.
10.1.3 Protegrity Access Control
Protegrity Access Control protects directories, their child files, and subdirectories in real time from
unauthorized file deletion, modification, or reading. This is accomplished by employing policies defined
using Policy management in ESA and binding individual files and folders with their respective
policies.
11 Appendix: Return Codes
If you are using the HDFSFP protector and any failures occur, then the protector throws an exception.
The exception consists of an error code and an error message. The following table lists all possible
error codes and error descriptions.
Table 11-1: Error Codes for HDFSFP Protector

Code   Error                     Error Description
0      XC_FAILED                 The requested operation or service failed.
3      XC_LOG_CHECK_ACCESS       The access was denied as the user does not have the required privileges to perform the requested operation.
4      XC_LOG_TIME_ACCESS        The access was denied as the user does not have the required privileges to perform the requested operation at this point in time.
6      XC_LOG_ENCRYPT_SUCCESS    The data was successfully encrypted.
7      XC_LOG_ENCRYPT_FAILED     The encryption of data failed.
8      XC_LOG_DECRYPT_SUCCESS    The data was successfully decrypted.
9      XC_LOG_DECRYPT_FAILED     The decryption of data failed.
100    XC_INVALID_PARAMETER      The parameter specified in the function call is invalid.
101    XC_TIMEOUT                The operation timed out before a result was returned.
102    XC_ACCESS_DENIED          Permission to access an object or file on the filesystem is denied.
103    XC_NOT_SUPPORTED          The requested operation is not supported.
104    XC_SESSION_REFUSED        The remote peer client did not accept the session request.
105    XC_DISCONNECTED           The session was terminated.
106    XC_UNREACHABLE            The host could not be reached.
107    XC_SESSION_IN_USE         The session is already in use.
108    XC_EOF                    The end of file is reached.
109    XC_NOT_FOUND              Not found.
110    XC_BUFFER_TOO_SMALL       Supplied input or output buffer is too small.
If you are using MapReduce, Hive, Pig, HBase, or Spark, and any failures occur, then the protector
throws an exception. The exception consists of an error code and an error message.
The following table lists all possible return codes provided to the PEP log files.
Table 11-2: PEP Log Return Codes

Code   Error                        Error Description
0      NONE
1      USER_NOT_FOUND               The user name could not be found in the policy residing in the shared memory.
2      DATA_ELEMENT_NOT_FOUND       The data element could not be found in the policy residing in the shared memory.
3      PERMISSION_DENIED            The user does not have the required permissions to perform the requested operation.
4      TIME_PERMISSION_DENIED       The user does not have the appropriate permissions to perform the requested operation at this point in time.
5      INTEGRITY_CHECK_FAILED       Integrity check failed.
6      PROTECT_SUCCESS              The operation to protect the data was successful.
7      PROTECT_FAILED               The operation to protect the data failed.
8      UNPROTECT_SUCCESS            The operation to unprotect the data was successful.
9      UNPROTECT_FAILED             The operation to unprotect the data failed.
10     OK_ACCESS                    The user has the required permissions to perform the requested operation. This return code ensures a verification and no data is protected or unprotected.
11     INACTIVE_KEYID_USED          The operation to unprotect the data was successful using an inactive Key ID.
12     INVALID_PARAM                The input is null or not within allowed limits.
13     INTERNAL_ERROR               An internal error occurred in a function call after the PEP provider is started.
14     LOAD_KEY_FAILED              Failed to load the data encryption key.
17     INIT_FAILED                  The PEP server failed to initialize, which is a fatal error.
20     OUT_OF_MEMORY                Failed to allocate memory.
21     BUFFER_TOO_SMALL             The input or output buffer is too small.
22     INPUT_TOO_SHORT              The data is too short to be protected or unprotected.
23     INPUT_TOO_LONG               The data is too long to be protected or unprotected.
25     USERNAME_TOO_LONG            The user name is longer than the maximum supported length of the user name that can be used for protect or unprotect operations.
26     UNSUPPORTED                  The algorithm or action for the specific data element is unsupported.
27     APPLICATION_AUTHORIZED       The application is authorized.
28     APPLICATION_NOT_AUTHORIZED   The application is not authorized.
29     JSON_NOTSERIALIZABLE         The JSON type is not serializable.
30     JSON_MALLOC_FAILED           The memory allocation for the JSON type failed.
31     EMPTY_POLICY                 The policy residing in the shared memory is empty.
32     DELETE_SUCCESS               The operation to delete the data was successful.
33     DELETE_FAILED                The operation to delete the data failed.
34     CREATE_SUCCESS               The operation to create or add the data was successful.
35     CREATE_FAILED                The operation to create or add the data failed.
36     MNGPROT_SUCCESS              The management of the protection operation was successful.
37     MNGPROT_FAILED               The management of the protection operation failed.
39     POLICY_LOCKED                The policy residing in the shared memory is locked. This error can be caused by a Disk Full alert.
40     LICENSE_EXPIRED              The license is not valid or the current date is beyond the license expiration date.
41     METHOD_RESTRICTED            The use of the Protection method is restricted by license.
42     LICENSE_INVALID              The license is invalid or the time is prior to the start of the license tenure.
44     INVALID_FORMAT               The content of the input data is invalid.
45     PROCESSING_SUCCESS           It is used for audit entries used for collecting Access Counter records.
46     INVALID_POLICY               It is used for a z/OS Query regarding the default data element when the policy name is not found.
The following table lists all possible result codes returned by operations performed on the PEP.
Table 11-3: PEP Result Codes

Code   Error                         Error Description
1      SUCCESS                       The operation was successful.
0      FAILED                        The operation failed.
-1     INVALID_PARAMETER             The parameter is invalid.
-2     EOF                           The end of file was reached.
-3     BUSY                          The operation is already in progress or the PEP server is busy with some other operation.
-4     TIMEOUT                       The time-out threshold was reached as the PEP server was waiting for a response.
-5     ALREADY_EXISTS                The object, such as a file, already exists.
-6     ACCESS_DENIED                 The permission to access the object was denied.
-7     PARSE_ERROR                   An error occurred when the contents were parsed.
-8     NOT_FOUND                     The search operation was not successful.
-9     NOT_SUPPORTED                 The operation is not supported.
-10    CONNECTION_REFUSED            The connection was refused.
-11    DISCONNECTED                  The connection was terminated.
-12    UNREACHABLE                   The Internet link is down or the host is not reachable.
-13    ADDRESS_IN_USE                The IP address or port is already in use.
-14    OUT_OF_MEMORY                 The operation to allocate memory failed.
-15    CRC_ERROR                     The CRC check failed.
-16    BUFFER_TOO_SMALL              The buffer size is too small.
-17    BAD_REQUEST                   The message received was not in a standard format.
-18    INVALID_STRING_LENGTH         The input string is too long.
-19    INVALID_TYPE                  An incorrect type was used.
-20    READONLY_OBJECT               The object is set with read-only access.
-21    SERVICE_FAILED                The service failed.
-22    ALREADY_CONNECTED             The Administrator is already connected to the server.
-23    INVALID_KEY                   The key is invalid.
-24    INTEGRITY_ERROR               The integrity check failed.
-25    LOGIN_FAILED                  The attempt to log in failed.
-26    NOT_AVAILABLE                 The object is not available.
-27    NOT_EXIST                     The object does not exist.
-28    SET_FAILED                    The Set operation failed.
-29    GET_FAILED                    The Get operation failed.
-30    READ_FAILED                   The Read operation failed.
-31    WRITE_FAILED                  The Write operation failed.
-33    REWRITE_FAILED                The Rewrite operation failed.
-34    DELETE_FAILED                 The Delete operation failed.
-35    UPDATE_FAILED                 The Update operation failed.
-36    SIGN_FAILED                   The Sign operation failed.
-37    VERIFY_FAILED                 The verification failed.
-38    ENCRYPT_FAILED                The Encrypt operation failed.
-39    DECRYPT_FAILED                The Decrypt operation failed.
-40    REENCRYPT_FAILED              The Reencrypt operation failed.
-41    EXPIRED                       The object has expired.
-42    REVOKED                       The object has been revoked.
-43    INVALID_FORMAT                The format is invalid.
-44    HASH_FAILED                   The Hash operation failed.
-45    NOT_DEFINED                   The property or setting is not defined.
-46    NOT_INITIALIZED               The service or function requested is performed on an object that is not initialized.
-47    POLICY_LOCKED                 The policy is locked.
-48    THROW_EXCEPTION               The error message is used to convey that an exception should be thrown during decryption.
-49    USER_AUTHENTICATION_FAILED    The authentication operation failed.
-54    INVALID_CARD_TYPE             The credit card number provided does not conform to the required credit card format.
-55    LICENSE_AUDITONLY             The license provided is for the audit functionality and only No Encryption data elements are allowed.
-56    NO_VALID_CIPHERS              No valid ciphers were found.
-57    NO_VALID_PROTOCOLS            No valid protocols were found.
-201   CRYPT_KEY_DATA_ILLEGAL        The key data specified is invalid.
-202   CRYPT_INTEGRITY_ERROR         The integrity check for the data failed.
-203   CRYPT_DATA_LEN_ILLEGAL        The data length specified is invalid.
-204   CRYPT_LOGIN_FAILURE           The Crypto login failed.
-205   CRYPT_CONTEXT_IN_USE          An attempt was made to close a key that is in use.
-206   CRYPT_NO_TOKEN                The hardware token is not available.
-207   CRYPT_OBJECT_EXISTS           The object to be created already exists.
-208   CRYPT_OBJECT_MISSING          A request was made for an object that does not exist.
-221   X509_SET_DATA                 The operation to set data in the object failed.
-222   X509_GET_DATA                 The operation to get data from the object failed.
-223   X509_SIGN_OBJECT              The operation to sign the object failed.
-224   X509_VERIFY_OBJECT            The verification operation for the object failed.
-231   SSL_CERT_EXPIRED              The certificate has expired.
-232   SSL_CERT_REVOKED              The certificate has been revoked.
-233   SSL_CERT_UNKNOWN              The trusted certificate was not found.
-234   SSL_CERT_VERIFY_FAILED        The certificate could not be verified.
-235   SSL_FAILED                    A general SSL error occurred.
-241   KEY_ID_FORMAT_ERROR           The format of the Key ID is invalid.
-242   KEY_CLASS_FORMAT_ERROR        The format of the KeyClass is invalid.
-243   KEY_EXPIRED                   The key has expired.
-250   FIPS_MODE_FAILED              The FIPS mode failed.
12 Appendix: Samples
Many organizations are adopting Hadoop due to its ability to process and analyze large volumes of
unstructured data in a distributed manner. However, this can leave sensitive data exposed to
unauthorized users.
The Big Data Protector provides Hadoop users with top-to-bottom data protection from the
application level to the file level. Sensitive data can be protected from internal and external threats
and unauthorized users. The protected data can be utilized by users, business processes, and
applications. In addition, the data is viewed in cleartext form by authorized users and in protected
form by unauthorized users.
In the samples provided, data protection is done by tokenization, where sensitive data is converted
to similar looking inert data known as tokens and the data format and type can be preserved. These
tokens can be detokenized back to the original values when required.
The sample outputs provided in the documentation are for reference only. The values
on your systems might differ from the ones listed in this document.
For ease of illustration, the following conventions are used in the documentation:
Protected data in the samples is identified in dark grey color.
Header rows in the output are retained to list the type of information contained in
them. The actual output would not contain these header rows.
Spaces are inserted in the output to demarcate data in the columns.
This section provides documentation for sample data protection for MapReduce, Hive, Pig, HBase,
Impala, HAWQ, and Spark using Big Data Protector.
Ensure that you log in as the user root before performing the following tasks.
To copy the sample data:
These commands remove any existing sample directories, recreate them, and copy the sample data
from the installation directory to the /tmp/basic_sample/sample directory in HDFS.
#> hadoop fs -rm -r /tmp/basic_sample/sample/
#> hadoop fs -rm -r /tmp/basic_sample
#> hadoop fs -mkdir /tmp/basic_sample/
#> hadoop fs -mkdir /tmp/basic_sample/sample/
#> hadoop fs -copyFromLocal /opt/protegrity/samples/data/basic_sample_data.csv
/tmp/basic_sample/sample
To assign user permissions for the sample data in the Hadoop directories:
These commands assign permissions for the sample data in the Hadoop directories to all users.
#> sudo -u hdfs hadoop fs -chmod -R 777 /tmp/basic_sample
#> sudo -u hdfs hadoop fs -chmod -R 777 /apps/hive/warehouse
To create the user John:
This command creates the user John.
#> useradd John
Ensure that you set a password for the user John using the command passwd John.
To create the user Fred:
This command creates the user Fred.
#> useradd Fred
Ensure that you set a password for the user Fred using the command passwd Fred.
To create directories for the users John and Fred:
These commands create the required directories for the users John and Fred in HDFS and assign their ownership.
# sudo -u hdfs hadoop fs -mkdir /user/John
# sudo -u hdfs hadoop fs -chown -R John /user/John
# sudo -u hdfs hadoop fs -mkdir /user/Fred
# sudo -u hdfs hadoop fs -chown -R Fred /user/Fred
12.1 Roles in the Samples
The roles available in the samples are described in the following table.
Table 11-1: List of Roles

Roles                   User   Role Description
SAMPLE_ADMIN            root   This user is able to protect and unprotect the data, and access the data in the cleartext form.
SAMPLE_INGESTION_USER   John   This user is allowed to ingest the data and protect the sensitive data.
SAMPLE_ANALYST          Fred   This user is able to unprotect the Name and Amount fields and is able to access the other fields in protected form.
12.2 Data Elements in the Security Policy
The data elements used in the samples are described in the following table.
Table 11-2: List of Data Elements

TOK_NAME
Data Securing Method: Tokenization
Data Element Type: Alpha-Numeric
Input Accepted: Alphabetic symbols, including lowercase (a-z) and uppercase (A-Z) letters, and digits from 0 through 9. Min length: 1. Max length: 4080.
Description: Data element for the Customer Name.

TOK_CREDIT_CARD
Data Securing Method: Tokenization
Data Element Type: Creditcard
Input Accepted: Digits 0 through 9 with no separators. Min length: 6. Max length: 256.
Description: Data element for the Credit Card number.

TOK_AMOUNT
Data Securing Method: Tokenization
Data Element Type: Decimal
Input Accepted: Digits 0 through 9. The sign (+ or -) and decimal point (. or ,) can be used as separators. Min length: 1. Max length: 36.
Description: Data element for the Amount spent by the customer using the credit card.

TOK_PHONE
Data Securing Method: Tokenization
Data Element Type: Numeric
Input Accepted: Digits 0 through 9. Min length: 1. Max length: 3933.
Description: Data element for the Phone number.
12.3 Role-based Permissions for Data Elements in the
Sample
The role-based permissions for the data elements used in the samples are described in the following
table.
Table 11-3: Role-based Permissions for Data Elements

Roles              TOK_AMOUNT           TOK_NAME             TOK_CREDIT_CARD      TOK_PHONE
SAMPLE_ADMIN       Protect, Unprotect   Protect, Unprotect   Protect, Unprotect   Protect, Unprotect
SAMPLE_ANALYST_1   Protect              Protect              Protect              Protect
SAMPLE_ANALYST_2   Unprotect            Unprotect            -                    -
12.4 Data Used by the Samples
Table 11-4: Fields and Values in the Sample

ID       Name              Phone        Credit Card        Amount
928724   Hultgren Caylor   9823750987   376235139103947    6959123
928725   Bourne Jose       9823350487   6226600538383292   42964354
928726   Sorce Hatti       9824757883   6226540862865375   7257656
928727   Lorie Garvey      9913730982   5464987835837424   85447788
928728   Belva Beeson      9948752198   5539455602750205   59040774
928729   Hultgren Caylor   9823750987   376235139103947    3245234
928730   Bourne Jose       9823350487   6226600538383292   2300567
928731   Lorie Garvey      9913730982   5464987835837424   85447788
928732   Bourne Jose       9823350487   6226600538383292   3096233
928733   Hultgren Caylor   9823750987   376235139103947    5167763
928734   Lorie Garvey      9913730982   5464987835837424   85447788
12.5 Protecting Data using MapReduce
A MapReduce job in a Hadoop cluster can involve sensitive data. You can use the Protegrity MapReduce
APIs to protect data when it is saved to or retrieved from a protected source.
The Protegrity MapReduce APIs can protect and unprotect the data as defined by the Data security
policy.
For more information on the list of available Protegrity MapReduce APIs, refer to section 4.4
MapReduce APIs.
The following sections describe two sample use cases.
1. A basic use case that demonstrates how basic protection and unprotection work using the
Protegrity MapReduce APIs.
2. A role-based use case that demonstrates the different data access permissions when two users
belonging to different roles view the same data.
For ease of illustration, the use cases describe the following two users:
A user with the ability to protect the data, who therefore accesses the data in protected
form.
A user with access to only a few fields from the protected data in cleartext form.
12.5.1 Basic Use Case
This section describes the commands to perform the following functions:
Display the original data as is.
Protect the original data.
Display the data protected using the Protegrity MapReduce API.
Unprotect the protected data.
Display the unprotected data.
Ensure that you log in as the user root before performing the following tasks.
To view the original data:
This command displays the sample data as is.
hadoop fs -cat /tmp/basic_sample/sample/basic_sample_data.csv
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To protect the data:
This command protects the sample data. The data in the Name, Phone, Credit card, and Amount
fields is protected.
hadoop jar /opt/protegrity/samples/mapreduce/lib/basic*.jar
com.protegrity.samples.mapreduce.ProtectData
/tmp/basic_sample/sample/basic_sample_data.csv
/tmp/basic_sample/protected_mapred_data
To view the protected data:
This command displays the protected data.
hadoop fs -cat /tmp/basic_sample/protected_mapred_data/part*
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To unprotect the data:
This command unprotects the protected data.
hadoop jar /opt/protegrity/samples/mapreduce/lib/basic*.jar
com.protegrity.samples.mapreduce.UnprotectData
/tmp/basic_sample/protected_mapred_data/part*
/tmp/basic_sample/unprotected_mapred_data
To view the unprotected data:
This command displays the unprotected data.
hadoop fs -cat /tmp/basic_sample/unprotected_mapred_data/part*
Result: (Unprotected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
12.5.2 Role-based Use Cases
This section describes the following two use cases:
Ingestion User (John) ingesting and protecting the data related to credit card transactions.
Sales Analyst (Fred) analyzing credit card transactions to detect the following:
o Amounts spent by users
o Fraudulent transactions
o Repeat users
12.5.2.1 Protect the Credit Card Transactions
John, the Ingestion User, is able to ingest credit card transaction data into HDFS, and protect the
sensitive data.
Ensure that you log in as the user John before performing the following tasks.
To login with the user John:
This command logs in the user John.
#> su John
Enter the required password for the user John, when prompted.
To view the original data (as John):
This command displays the sample data as is.
hadoop fs -cat /tmp/basic_sample/sample/basic_sample_data.csv
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To protect the data (as John):
This command protects the sample data. The sensitive data is protected.
hadoop jar /opt/protegrity/samples/mapreduce/lib/basic*.jar
com.protegrity.samples.mapreduce.ProtectData
/tmp/basic_sample/sample/basic_sample_data.csv
/tmp/basic_sample/ingestion_user_protected_mapred_data
To view the protected data (as John):
This command displays the protected data. The sensitive data appears in tokenized form.
hadoop fs -cat /tmp/basic_sample/ingestion_user_protected_mapred_data/part*
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To attempt to unprotect the data (as John):
This command attempts to unprotect the protected data.
hadoop jar /opt/protegrity/samples/mapreduce/lib/basic*.jar
com.protegrity.samples.mapreduce.UnprotectData
/tmp/basic_sample/ingestion_user_protected_mapred_data/part*
/tmp/basic_sample/ingestion_user_unprotected_mapred_data
To attempt to view the unprotected data (as John):
This command attempts to display the unprotected data. The user John will not be able to view the
cleartext data as the user does not have permissions to unprotect the data.
hadoop fs -cat /tmp/basic_sample/ingestion_user_unprotected_mapred_data/part*
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To log out the user John:
This command logs out the user John.
#> exit
12.5.2.2 Perform Analysis on Credit Card Transactions
Fred, the Sales Analyst, analyzes the amounts spent by users, the number of fraudulent transactions,
and the number of repeat users, using the cleartext data in the Name and Amount fields.
Ensure that you log in as the user Fred before performing the following tasks.
To login with the user Fred:
This command logs in the user Fred.
#> su Fred
Enter the required password for the user Fred, when prompted.
To unprotect the data (as Fred):
This command unprotects the protected data.
hadoop jar /opt/protegrity/samples/mapreduce/lib/basic*.jar
com.protegrity.samples.mapreduce.UnprotectData
/tmp/basic_sample/ingestion_user_protected_mapred_data/part*
/tmp/basic_sample/analyst_user_unprotected_mapred_data
To view the unprotected data (as Fred):
This command displays the unprotected data. The user Fred will be able to view the cleartext data
for the Name and Amount fields only.
hadoop fs -cat /tmp/basic_sample/analyst_user_unprotected_mapred_data/part*
Result: (Partially unprotected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 27995164409 , 173483871743706 , 6959123
928725, Bourne Jose , 87122238232 , 5730496842473502 , 42964354
928726, Sorce Hatti , 31934151773 , 6472961686603834 , 7257656
928727, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928728, Belva Beeson , 21190182420 , 3411370995179337 , 59040774
928729, Hultgren Caylor, 27995164409 , 173483871743706 , 3245234
928730, Bourne Jose , 87122238232 , 5730496842473502 , 2300567
928731, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928732, Bourne Jose , 87122238232 , 5730496842473502 , 3096233
928733, Hultgren Caylor, 27995164409 , 173483871743706 , 5167763
928734, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
Based on the unprotected data that appears, Fred can analyze the following:
Amounts spent by users.
Fraudulent credit card transactions: In the sample, the user Lorie Garvey might be a
fraudulent user, as the transaction amount of 85447788 is repeated thrice across the
transactions.
Number of repeat users: In the sample, there are three repeat users.
The MapReduce sample can be extended to perform actual analysis and isolate
the amounts spent, fraudulent transactions, and repeat users.
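As an illustration of how such an analysis could be written, the following standalone Java sketch (not
part of the product samples) aggregates the unprotected output by customer name and flags repeat
users. It assumes the part files have been merged into a local file, for which the file name
basic_sample_unprotected.csv below is only a placeholder, and that the records keep the
comma-separated format shown above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AnalyzeTransactions {
    public static void main(String[] args) throws IOException {
        // Local copy of the unprotected MapReduce output (the path is illustrative only).
        List<String> lines = Files.readAllLines(Paths.get("basic_sample_unprotected.csv"));

        Map<String, Long> totalByName = new LinkedHashMap<>();
        Map<String, Integer> countByName = new LinkedHashMap<>();

        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length < 5 || fields[0].trim().equals("ID")) {
                continue; // Skip header rows and malformed lines.
            }
            String name = fields[1].trim();
            long amount = Long.parseLong(fields[4].trim());
            totalByName.merge(name, amount, Long::sum);
            countByName.merge(name, 1, Integer::sum);
        }

        // Report the amount spent per user and flag repeat users (more than one transaction).
        for (Map.Entry<String, Long> entry : totalByName.entrySet()) {
            String name = entry.getKey();
            int transactions = countByName.get(name);
            System.out.println(name + ": total amount " + entry.getValue()
                    + ", transactions " + transactions
                    + (transactions > 1 ? " (repeat user)" : ""));
        }
    }
}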
To log out the user Fred:
This command logs out the user Fred.
#> exit
12.5.3 Sample Code Usage
The MapReduce sample program, described in this section, is an example of how to use the Protegrity
MapReduce protector APIs. The sample program utilizes the following two Java classes:
ProtectData.java: This main class configures and calls the Mapper job.
ProtectDataMapper.java: This Mapper class contains the logic to fetch the input
data and store the protected content as output.
12.5.3.1 Main Job Class ProtectData.java
package com.protegrity.samples.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ProtectData extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Create the job
        Job job = new Job(getConf(), "ProtectData");

        // Set the output key and value classes
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Set the map output key and value classes
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Set the Mapper class which will perform the protect job
        job.setMapperClass(ProtectDataMapper.class);

        // Set the number of reducer tasks
        job.setNumReduceTasks(0);

        // Set the input and output format classes
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Set the jar class
        job.setJarByClass(ProtectData.class);

        // Store the input path and print the input path
        Path input = new Path(args[0]);
        System.out.println(input.getName());

        // Store the output path and print the output path
        Path output = new Path(args[1]);
        System.out.println(output.getName());

        // Add the input path and set the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Call the job
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String args[]) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ProtectData(), args));
    }
}
12.5.3.2 Mapper Class ProtectDataMapper.java
package com.protegrity.samples.mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Import the ptyMapReduceProtector class to use the Protegrity MapReduce protector
import com.protegrity.hadoop.mapreduce.ptyMapReduceProtector;

// Create the Mapper class, i.e. ProtectDataMapper, which extends the Mapper class
public class ProtectDataMapper extends Mapper<Object, Text, NullWritable, Text> {

    // Declare the member variable for the ptyMapReduceProtector class
    private ptyMapReduceProtector mapReduceProtector;

    // Declare the array of data elements which will be required to do the protection/unprotection
    private final String[] data_element_names = { "TOK_NAME", "TOK_PHONE", "TOK_CREDIT_CARD", "TOK_AMOUNT" };

    // Initialize the MapReduce protector, i.e. ptyMapReduceProtector, in the default constructor
    public ProtectDataMapper() throws Exception {
        // Create a new object of the class ptyMapReduceProtector
        mapReduceProtector = new ptyMapReduceProtector();
        // Open the session using the method openSession("0")
        int openSessionStatus = mapReduceProtector.openSession("0");
    }

    // Override the map method to parse the text and process it line by line:
    // - Split the inputs separated by the delimiter "," in the line
    // - Apply the protect/unprotect operation
    // - Create the output text which will have protected/unprotected outputs separated by the delimiter ","
    // - Write the output text to the context
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Store the line in the variable strOneLine
        String strOneLine = value.toString();
        // Split the inputs separated by the delimiter "," in the line
        StringTokenizer st = new StringTokenizer(strOneLine, ",");
        // Create an instance of StringBuilder to store the output
        StringBuilder sb = new StringBuilder();
        // Store the number of inputs in the line
        int noOfTokens = st.countTokens();

        if (mapReduceProtector != null) {
            // Iterate through the string tokens and apply the protect/unprotect operation
            for (int i = 0; st.hasMoreElements(); i++) {
                String data = (String) st.nextElement();
                if (i == 0) {
                    sb.append(new String(data));
                } else {
                    // To protect data, call the protect method with parameters data element and input data in bytes:
                    //   mapReduceProtector.protect( <Data Element> , <Data in bytes> )
                    // The output is returned in bytes.
                    // To unprotect data, call the unprotect method with parameters data element and input data in bytes:
                    //   mapReduceProtector.unprotect( <Data Element> , <Data in bytes> )
                    // The output is returned in bytes.
                    byte[] bResult = mapReduceProtector.protect(data_element_names[i - 1], data.trim().getBytes());
                    if (bResult != null) {
                        // Store the result in a string and append it to the output sb
                        sb.append(new String(bResult));
                    } else {
                        // If the output is null, then store the result as "cryptoError" and append it to the output sb
                        sb.append("cryptoError");
                    }
                }
                if (i < noOfTokens - 1) {
                    // Append the delimiter "," at the end of the processed result
                    sb.append(",");
                }
            }
        }
        // Write the output text to the context
        context.write(NullWritable.get(), new Text(sb.toString()));
    }

    // Clean up the session and objects
    @Override
    protected void finalize() throws Throwable {
        // Close the session
        int closeSessionStatus = mapReduceProtector.closeSession();
        mapReduceProtector = null;
        super.finalize();
    }
}
12.6 Protecting Data using Hive
You can utilize the Protegrity Hive UDFs to secure data for Hive. The Protegrity Hive UDFs are loaded
into Hive during installation. While inserting data into Hive tables, or retrieving data from protected
Hive table columns, you can call the Protegrity Hive UDFs.
The Protegrity Hive UDFs can protect and unprotect the data as defined by the Data security policy.
For more information on the list of available Protegrity Hive UDFs, refer to section 4.5 Hive UDFs.
The following sections describe two sample use cases.
1. A basic use case that demonstrates how basic protection and unprotection work using the
Protegrity Hive UDFs.
2. A role-based use case that demonstrates the different data access permissions when two users
belonging to different roles view the same data.
For ease of illustration, the use cases describe the following two users:
A user with the ability to protect the data, who therefore accesses the data in protected
form.
A user with access to only a few fields from the protected data in cleartext form.
12.6.1 Basic Use Case
This section describes the commands to perform the following functions:
Initiate the Hive shell.
Create a table in Hive using the sample data.
Define the Protegrity UDFs in Hive.
Create a Hive table containing the data protected using the Protegrity Hive UDF.
Display the Hive table which contains the protected data.
Unprotect the protected data in the Hive table.
Display the Hive table which contains the unprotected data.
Ensure that you log in as the user root before performing the following tasks.
To start the Hive shell:
This command starts the Hive shell.
#> hive
To create the basic_sample table using the basic_sample_data.csv file:
The following commands populate the table named basic_sample with the data from the
basic_sample_data.csv file.
#> CREATE TABLE basic_sample (ID STRING, NAME STRING, PHONE STRING, CREDIT_CARD
STRING, AMOUNT STRING) row format delimited fields terminated by ',' stored as
textfile;
#> LOAD DATA LOCAL INPATH '/opt/protegrity/samples/data/basic_sample_data.csv'
OVERWRITE INTO TABLE basic_sample;
To define the Protegrity UDFs in Hive:
These commands define the Protegrity UDFs to perform protection and unprotection in Hive.
#> create temporary function ptyProtectStr as
'com.protegrity.hive.udf.ptyProtectStr';
#> create temporary function ptyUnprotectStr as
'com.protegrity.hive.udf.ptyUnprotectStr';
The UDFs listed here are a subset of the UDFs in the actual product.
For the entire list of UDFs provided by the Protegrity Big Data Protector for Hive,
refer to section 4.5 Hive UDFs.
To create a table with protected data:
This command creates a Hive table with the protected data.
#> CREATE TABLE basic_sample_protected as SELECT ID, ptyProtectStr(NAME,
'TOK_NAME') as NAME, ptyProtectStr(PHONE, 'TOK_PHONE') as PHONE,
ptyProtectStr(CREDIT_CARD, 'TOK_CREDIT_CARD') as CREDIT_CARD, ptyProtectStr(AMOUNT,
'TOK_AMOUNT') as AMOUNT FROM basic_sample;
To view the protected data table:
This command displays the protected data table. The sensitive data appears in tokenized form.
#> SELECT * FROM basic_sample_protected;
Result: (Protected data)
ID NAME PHONE CREDIT_CARD AMOUNT
928724 EnYEwVg3 MOQxQw 27995164409 173483871743706 85924227
928725 4h6NlN FJi9 87122238232 5730496842473502 83764821
928726 Lecwe 48zhNF 31934151773 6472961686603834 49177868
928727 X9lLP BAA8vN 70201301198 7277097339102446 945396991
928728 AYEmh 2CwyvX 21190182420 3411370995179337 976189279
928729 EnYEwVg3 MOQxQw 27995164409 173483871743706 4781777
928730 4h6NlN FJi9 87122238232 5730496842473502 3285956
928731 X9lLP BAA8vN 70201301198 7277097339102446 945396991
928732 4h6NlN FJi9 87122238232 5730496842473502 4112197
928733 EnYEwVg3 MOQxQw 27995164409 173483871743706 63953943
928734 X9lLP BAA8vN 70201301198 7277097339102446 945396991
To unprotect and view the clear data:
This command unprotects and displays the cleartext data. The sensitive data appears in detokenized
form.
#> SELECT ID, ptyUnprotectStr(NAME, 'TOK_NAME'), ptyUnprotectStr(PHONE,
'TOK_PHONE'), ptyUnprotectStr(CREDIT_CARD, 'TOK_CREDIT_CARD'),
ptyUnprotectStr(AMOUNT, 'TOK_AMOUNT') FROM basic_sample_protected;
Result: (Unprotected data)
ID NAME PHONE CREDIT_CARD AMOUNT
928724 Hultgren Caylor 9823750987 376235139103947 6959123
928725 Bourne Jose 9823350487 6226600538383292 42964354
928726 Sorce Hatti 9824757883 6226540862865375 7257656
928727 Lorie Garvey 9913730982 5464987835837424 85447788
928728 Belva Beeson 9948752198 5539455602750205 59040774
928729 Hultgren Caylor 9823750987 376235139103947 3245234
928730 Bourne Jose 9823350487 6226600538383292 2300567
928731 Lorie Garvey 9913730982 5464987835837424 85447788
928732 Bourne Jose 9823350487 6226600538383292 3096233
928733 Hultgren Caylor 9823750987 376235139103947 5167763
928734 Lorie Garvey 9913730982 5464987835837424 85447788
12.6.2 Role-based Use Cases
This section describes the following two use cases:
Ingestion User (John) ingesting and protecting the data related to credit card transactions.
Sales Analyst (Fred) analyzing credit card transactions to detect the following:
o Amounts spent by users
o Fraudulent transactions
o Repeat users
12.6.2.1 Protect the Credit Card Transactions
John, the Ingestion User, is able to ingest credit card transaction data into HDFS and protect the
sensitive data.
Ensure that you log in as the user John before performing the following tasks.
To login with the user John:
This command logs in the user John.
#> su John
Enter the required password for the user John, when prompted.
To start the Hive shell (as John):
This command starts the Hive shell.
#> hive
To create the basic_sample_ingestion_user table using the
basic_sample_data.csv file (as John):
These commands populate the table named basic_sample_ingestion_user with the data from the
basic_sample_data.csv file.
#> CREATE TABLE basic_sample_ingestion_user (ID STRING, NAME STRING, PHONE STRING,
CREDIT_CARD STRING, AMOUNT STRING) row format delimited fields terminated by ','
stored as textfile;
#> LOAD DATA LOCAL INPATH '/opt/protegrity/samples/data/basic_sample_data.csv'
OVERWRITE INTO TABLE basic_sample_ingestion_user;
To define the Protegrity UDFs in Hive (as John):
These commands define the Protegrity UDFs to perform protection and unprotection in Hive.
#> create temporary function ptyProtectStr as
'com.protegrity.hive.udf.ptyProtectStr';
#> create temporary function ptyUnprotectStr as
'com.protegrity.hive.udf.ptyUnprotectStr';
To create a table with protected data (as John):
This command creates a table with the protected data.
#> CREATE TABLE basic_sample_protected_ingestion_user as SELECT ID,
ptyProtectStr(NAME, 'TOK_NAME') as NAME, ptyProtectStr(PHONE, 'TOK_PHONE') as
PHONE, ptyProtectStr(CREDIT_CARD, 'TOK_CREDIT_CARD') as CREDIT_CARD,
ptyProtectStr(AMOUNT, 'TOK_AMOUNT') as AMOUNT FROM basic_sample_ingestion_user;
To view the protected data (as John):
This command displays the protected data.
#> SELECT * FROM basic_sample_protected_ingestion_user;
Result: (Protected data)
ID NAME PHONE CREDIT_CARD AMOUNT
928724 EnYEwVg3 MOQxQw 27995164409 173483871743706 85924227
928725 4h6NlN FJi9 87122238232 5730496842473502 83764821
928726 Lecwe 48zhNF 31934151773 6472961686603834 49177868
928727 X9lLP BAA8vN 70201301198 7277097339102446 945396991
928728 AYEmh 2CwyvX 21190182420 3411370995179337 976189279
928729 EnYEwVg3 MOQxQw 27995164409 173483871743706 4781777
928730 4h6NlN FJi9 87122238232 5730496842473502 3285956
928731 X9lLP BAA8vN 70201301198 7277097339102446 945396991
928732 4h6NlN FJi9 87122238232 5730496842473502 4112197
928733 EnYEwVg3 MOQxQw 27995164409 173483871743706 63953943
928734 X9lLP BAA8vN 70201301198 7277097339102446 945396991
To attempt to unprotect and view the clear data (as John):
This command attempts to unprotect and display the unprotected data. The user John will not be
able to view the cleartext data as the user does not have permissions to unprotect the data.
#> SELECT ID, ptyUnprotectStr(NAME, 'TOK_NAME'), ptyUnprotectStr(PHONE,
'TOK_PHONE'), ptyUnprotectStr(CREDIT_CARD, 'TOK_CREDIT_CARD'),
ptyUnprotectStr(AMOUNT, 'TOK_AMOUNT') FROM basic_sample_protected_ingestion_user;
Result: (Protected data)
ID NAME PHONE CREDIT_CARD AMOUNT
928724 EnYEwVg3 MOQxQw 27995164409 173483871743706 85924227
928725 4h6NlN FJi9 87122238232 5730496842473502 83764821
928726 Lecwe 48zhNF 31934151773 6472961686603834 49177868
928727 X9lLP BAA8vN 70201301198 7277097339102446 945396991
928728 AYEmh 2CwyvX 21190182420 3411370995179337 976189279
928729 EnYEwVg3 MOQxQw 27995164409 173483871743706 4781777
928730 4h6NlN FJi9 87122238232 5730496842473502 3285956
928731 X9lLP BAA8vN 70201301198 7277097339102446 945396991
928732 4h6NlN FJi9 87122238232 5730496842473502 4112197
928733 EnYEwVg3 MOQxQw 27995164409 173483871743706 63953943
928734 X9lLP BAA8vN 70201301198 7277097339102446 945396991
To log out the user John:
This command logs out the user John.
#> exit
12.6.2.2 Perform Analysis on Credit Card Transactions
Fred, the Sales Analyst, analyzes the amounts spent by users, the number of fraudulent transactions,
and the number of repeat users, using the cleartext data in the Name and Amount fields.
Ensure that you log in as the user Fred before performing the following tasks.
To login with the user Fred:
This command logs in the user Fred.
#> su Fred
Enter the required password for the user Fred, when prompted.
To start the Hive shell (as Fred):
This command starts the Hive shell.
#> hive
To define the Protegrity UDFs in Hive (as Fred):
The following commands define the Protegrity UDFs to perform protection and unprotection in Hive.
#> create temporary function ptyProtectStr as
'com.protegrity.hive.udf.ptyProtectStr';
#> create temporary function ptyUnprotectStr as
'com.protegrity.hive.udf.ptyUnprotectStr';
To unprotect and view the clear data (as Fred):
This command unprotects the protected data and displays the unprotected data. The user Fred will
be able to view the cleartext data for the Name and Amount fields only.
#> SELECT ID, ptyUnprotectStr(NAME, 'TOK_NAME'), ptyUnprotectStr(PHONE,
'TOK_PHONE'), ptyUnprotectStr(CREDIT_CARD, 'TOK_CREDIT_CARD'),
ptyUnprotectStr(AMOUNT, 'TOK_AMOUNT') FROM basic_sample_protected_ingestion_user;
Result: (Partially unprotected data)
ID NAME PHONE CREDIT_CARD AMOUNT
928724 Hultgren Caylor 27995164409 173483871743706 6959123
928725 Bourne Jose 87122238232 5730496842473502 42964354
928726 Sorce Hatti 31934151773 6472961686603834 7257656
928727 Lorie Garvey 70201301198 7277097339102446 85447788
928728 Belva Beeson 21190182420 3411370995179337 59040774
928729 Hultgren Caylor 27995164409 173483871743706 3245234
928730 Bourne Jose 87122238232 5730496842473502 2300567
928731 Lorie Garvey 70201301198 7277097339102446 85447788
928732 Bourne Jose 87122238232 5730496842473502 3096233
928733 Hultgren Caylor 27995164409 173483871743706 5167763
928734 Lorie Garvey 70201301198 7277097339102446 85447788
Based on the unprotected data that appears, Fred can analyze the following:
Amounts spent by users.
Fraudulent credit card transactions: In the sample, the user Lorie Garvey might be a
fraudulent user, as the transaction amount of 85447788 is repeated thrice across the
transactions.
Number of repeat users: In the sample, there are three repeat users.
The Hive sample can be extended to perform actual analysis and isolate the
amounts spent, fraudulent transactions, and repeat users.
To log out the user Fred:
This command logs out the user Fred.
#> exit
12.7 Protecting Data using Pig
You can utilize the Protegrity Pig UDFs to secure data, while running Pig jobs. While inserting or
retrieving data using Pig, you can call Protegrity Pig UDFs.
The Protegrity Pig UDFs can protect and unprotect the data as defined by the Data security policy.
For more information on the list of available Protegrity Pig UDFs, refer to section 4.6 Pig UDFs.
The following sections describe two sample use cases.
1. A basic use case that demonstrates how basic protection and unprotection work using the
Protegrity Pig UDFs.
2. A role-based use case that demonstrates the different data access permissions when two users
belonging to different roles view the same data.
For ease of illustration, the use cases describe the following two users:
A user with the ability to protect the data, who therefore accesses the data in protected
form.
A user with access to only a few fields from the protected data in cleartext form.
12.7.1 Basic Use Case
This section describes the commands to perform the following functions:
Initiate the Pig shell.
Define the Protegrity UDFs in Pig.
Define a variable in Pig using the sample data.
Display the original data as is.
Protect the data in the defined variable using the Protegrity Pig UDF.
Display the protected data.
Unprotect the protected data.
Display the unprotected data.
Ensure that you log in as the user root before performing the following tasks.
To start the Pig shell:
This command starts the Pig shell.
#> pig
To define the UDFs:
These commands define the Protegrity UDFs to perform protection and unprotection in Pig.
grunt> DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
grunt> DEFINE ptyUnprotectStr com.protegrity.pig.udf.ptyUnprotectStr;
To create a variable with the original data:
This command creates a variable with the sample data.
grunt> basic_sample = LOAD '/tmp/basic_sample/sample/basic_sample_data.csv' using
PigStorage(',') AS (ID:chararray, NAME:chararray, PHONE:chararray,
CREDIT_CARD:chararray, AMOUNT:chararray);
To view the original data:
This command displays the sample data.
grunt> dump basic_sample;
To protect the data:
This command protects the data.
grunt> basic_sample_protected = FOREACH basic_sample GENERATE ID,
ptyProtectStr(NAME, 'TOK_NAME') as NAME:chararray, ptyProtectStr(PHONE,
'TOK_PHONE') as PHONE:chararray, ptyProtectStr(CREDIT_CARD, 'TOK_CREDIT_CARD') as
CREDIT_CARD:chararray, ptyProtectStr(AMOUNT, 'TOK_AMOUNT') as AMOUNT:chararray;
To view the protected data:
This command displays the protected data.
grunt> dump basic_sample_protected;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To unprotect the data:
This command unprotects the protected data.
grunt> basic_sample_unprotected = FOREACH basic_sample_protected GENERATE ID,
ptyUnprotectStr(NAME, 'TOK_NAME') as NAME:chararray, ptyUnprotectStr(PHONE,
'TOK_PHONE') as PHONE:chararray, ptyUnprotectStr(CREDIT_CARD, 'TOK_CREDIT_CARD') as
CREDIT_CARD:chararray, ptyUnprotectStr(AMOUNT, 'TOK_AMOUNT') as AMOUNT:chararray;
To view the unprotected data:
This command displays the unprotected data.
grunt> dump basic_sample_unprotected;
Result: (Unprotected data)
(ID , NAME , PHONE , CREDIT_CARD , AMOUNT)
(928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123)
(928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354)
(928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656)
(928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788)
(928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774)
(928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234)
(928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567)
(928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788)
(928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233)
(928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763)
(928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788)
12.7.2 Role-based Use Cases
This section describes the following two use cases:
Ingestion User (John) ingesting and protecting the data related to credit card transactions.
Sales Analyst (Fred) analyzing credit card transactions to detect the following:
o Amounts spent by users
o Fraudulent transactions
o Repeat users
12.7.2.1 Protect the Credit Card Transactions
John, the Ingestion User, is able to ingest credit card transaction data into HDFS and protect the
sensitive data.
Ensure that you login as the user John before performing the following tasks.
To login with the user John:
This command logs in the user John.
#> su John
Enter the required password for the user John, when prompted.
To start the Pig shell (as John):
This command starts the Pig shell.
#> pig
To define the UDFs (as John):
These commands define the Protegrity UDFs to perform protection and unprotection in Pig.
grunt> DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
grunt> DEFINE ptyUnprotectStr com.protegrity.pig.udf.ptyUnprotectStr;
To create a variable with the original data (as John):
This command creates a variable with the sample data.
grunt> basic_sample = LOAD '/tmp/basic_sample/sample/basic_sample_data.csv' using
PigStorage(',') AS (ID:chararray, NAME:chararray, PHONE:chararray,
CREDIT_CARD:chararray, AMOUNT:chararray);
To view the original data (as John):
This command displays the sample data.
grunt> dump basic_sample;
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To protect the data (as John):
This command protects the sample data.
grunt> basic_sample_protected = FOREACH basic_sample GENERATE ID,
ptyProtectStr(NAME, 'TOK_NAME') as NAME:chararray, ptyProtectStr(PHONE,
'TOK_PHONE') as PHONE:chararray, ptyProtectStr(CREDIT_CARD, 'TOK_CREDIT_CARD') as
CREDIT_CARD:chararray, ptyProtectStr(AMOUNT, 'TOK_AMOUNT') as AMOUNT:chararray;
To view the protected data (as John):
This command displays the protected data.
grunt> dump basic_sample_protected;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To store the protected data (as John):
This command saves the protected data.
grunt> STORE basic_sample_protected INTO '/tmp/basic_sample/basic_sample_protected'
using PigStorage(',');
To attempt to unprotect the data (as John):
This command attempts to unprotect the protected data.
grunt> basic_sample_unprotected = FOREACH basic_sample_protected GENERATE ID,
ptyUnprotectStr(NAME, 'TOK_NAME') as NAME:chararray, ptyUnprotectStr(PHONE,
'TOK_PHONE') as PHONE:chararray, ptyUnprotectStr(CREDIT_CARD, 'TOK_CREDIT_CARD') as
CREDIT_CARD:chararray, ptyUnprotectStr(AMOUNT, 'TOK_AMOUNT') as AMOUNT:chararray;
To attempt to view the unprotected data (as John):
This command attempts to display the unprotected data. The user John will not be able to view the
cleartext data as the user does not have permissions to unprotect the data.
grunt> dump basic_sample_unprotected;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To logout the user John:
This command logs out the user John.
#> exit
12.7.2.2 Perform Analysis on Credit Card Transactions
Fred, the Sales Analyst, analyzes the amounts spent by users, the number of fraudulent transactions,
and the number of repeat users, using the cleartext data in the Name and Amount fields.
Ensure that you login as the user Fred before performing the following tasks.
To login with the user Fred:
This command logs in the user Fred.
#> su Fred
Enter the required password for the user Fred, when prompted.
To start the Pig shell (as Fred):
This command starts the Pig shell.
#> pig
To define the Protegrity UDFs in Pig (as Fred):
These commands define the Protegrity UDFs to perform protection and unprotection in Pig.
grunt> DEFINE ptyProtectStr com.protegrity.pig.udf.ptyProtectStr;
grunt> DEFINE ptyUnprotectStr com.protegrity.pig.udf.ptyUnprotectStr;
To unprotect the data (as Fred):
The following commands unprotect the protected data.
grunt> basic_sample_protected = LOAD '/tmp/basic_sample/basic_sample_protected' using
PigStorage(',') AS (ID:chararray, NAME:chararray, PHONE:chararray,
CREDIT_CARD:chararray, AMOUNT:chararray);
grunt> basic_sample_unprotected = FOREACH basic_sample_protected GENERATE ID,
ptyUnprotectStr(NAME, 'TOK_NAME') as NAME:chararray, ptyUnprotectStr(PHONE, 'TOK_PHONE')
as PHONE:chararray, ptyUnprotectStr(CREDIT_CARD, 'TOK_CREDIT_CARD') as
CREDIT_CARD:chararray, ptyUnprotectStr(AMOUNT, 'TOK_AMOUNT') as AMOUNT:chararray;
To view the unprotected data (as Fred):
This command displays the unprotected data. The user Fred will be able to view the cleartext data
for the Name and Amount fields only.
grunt> dump basic_sample_unprotected;
Result: (Partially unprotected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 27995164409 , 173483871743706 , 6959123
928725, Bourne Jose , 87122238232 , 5730496842473502 , 42964354
928726, Sorce Hatti , 31934151773 , 6472961686603834 , 7257656
928727, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928728, Belva Beeson , 21190182420 , 3411370995179337 , 59040774
928729, Hultgren Caylor, 27995164409 , 173483871743706 , 3245234
928730, Bourne Jose , 87122238232 , 5730496842473502 , 2300567
928731, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928732, Bourne Jose , 87122238232 , 5730496842473502 , 3096233
928733, Hultgren Caylor, 27995164409 , 173483871743706 , 5167763
928734, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
Based on the unprotected data that appears, Fred can analyze the following:
Amounts spent by users
Fraudulent credit card transactions: In the sample, the user Lorie Garvey might be a
fraudulent user as the transaction amount of 85447788 is repeated thrice across the
transactions.
Number of repeat users: In the sample, there are three repeat users.
The Pig sample can be extended to perform actual analysis and isolate the
amounts spent, fraudulent transactions, and repeat users.
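As an illustration only, the following is a minimal sketch of such an extension in the Pig shell, assuming the basic_sample_unprotected relation created above; the relation names by_name, user_counts, and repeat_users are illustrative.
grunt> by_name = GROUP basic_sample_unprotected BY NAME;
grunt> user_counts = FOREACH by_name GENERATE group AS NAME, COUNT(basic_sample_unprotected) AS transactions;
grunt> repeat_users = FILTER user_counts BY transactions > 1;
grunt> dump repeat_users;
The output lists the users that appear in more than one transaction, together with their transaction counts.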
To logout the user Fred:
This command logs out the user Fred.
#> exit
12.8 Protecting Data using HBase
The data stored in each row of an HBase table can be protected through tokenization. Based
on the access privileges of the users, the data stored in each row of the HBase table can be made
available to the users in cleartext form or tokenized form.
The following sections describe two sample use cases.
1. A basic use case to demonstrate how basic protection and unprotection work using HBase.
2. A role-based use case to demonstrate the different data access permissions when two users
belonging to different roles are viewing the same data.
For ease of illustration, the use cases describe the following two users:
User with ability to protect the data, thereby accessing the data in protected form.
User with only access to a few fields from the protected data in cleartext form.
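In the samples below, the mapping between a column qualifier and the data element that protects it is declared in the METADATA of the column family when the table is created. Generalized from those samples, the pattern is as follows; the values in angle brackets are placeholders to be replaced with your own names.
hbase#> create '<table_name>', {
NAME => '<column_family>', METADATA => {
'DATA_ELEMENT:<COLUMN_QUALIFIER>'=>'<data_element_name>'}}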
12.8.1 Basic Use Case
This section describes the commands to perform the following functions:
Create an HBase table.
Insert the sample data in one row of the HBase table.
Display the protected data from the row in the HBase table.
Display the cleartext data from the row in the HBase table.
Exit the HBase shell.
Ensure that you login as the user root before performing the following tasks.
To create a table:
This command creates a table for the sample data, with each column mapped to its data element.
hbase#> create 'basic_sample', {
NAME => 'column_family', METADATA => {
'DATA_ELEMENT:NAME'=>'TOK_NAME',
'DATA_ELEMENT:PHONE'=>'TOK_PHONE',
'DATA_ELEMENT:CREDIT_CARD'=>'TOK_CREDIT_CARD',
'DATA_ELEMENT:AMOUNT'=>'TOK_AMOUNT'}}
To insert the data in one row:
This command inserts data in one row. The data inserted in the row is protected transparently.
hbase#> put 'basic_sample','928724', 'column_family:NAME', 'Hultgren Caylor'
hbase#> put 'basic_sample','928724', 'column_family:PHONE', '9823750987'
hbase#> put 'basic_sample','928724', 'column_family:CREDIT_CARD', '376235139103947'
hbase#> put 'basic_sample','928724', 'column_family:AMOUNT', '6959123'
To view the protected data:
This command displays the protected data.
hbase#> scan 'basic_sample', { ATTRIBUTES => {'BYPASS_COPROCESSOR'=>'1'}}
Result: (Protected data)
ROW COLUMN+CELL
928724 column=column_family:AMOUNT, timestamp=1415647620729, value=85924227
928724 column=column_family:CREDIT_CARD, timestamp=1415647617993, value=173483871743706
928724 column=column_family:NAME, timestamp=1415647617914, value=EnYEwVg3 MOQxQw
928724 column=column_family:PHONE, timestamp=1415647617971, value=27995164409
To view the unprotected data:
This command displays the unprotected data.
hbase#> scan 'basic_sample'
Result: (Unprotected data)
ROW COLUMN+CELL
928724 column=column_family:CREDIT_CARD, timestamp=1413848942769, value=376235139103947
928724 column=column_family:NAME, timestamp=1413848942769, value=Hultgren Caylor
928724 column=column_family:PHONE, timestamp=1413848942769, value=9823750987
928724 column=column_family:AMOUNT, timestamp=1413848942769, value=6959123
To exit the HBase shell:
This command exits the HBase shell.
hbase#> exit
12.8.2 Role-based Use Cases
This section describes the following two use cases:
Ingestion User (John) ingesting and protecting the data related to credit card transactions.
Sales Analyst (Fred) analyzing credit card transactions to detect the following:
o Amounts spent by users
o Fraudulent transactions
o Repeat users
12.8.2.1 Protect the Credit Card Transactions
John, the Ingestion User, is able to ingest credit card transaction data into HDFS and protect the
sensitive data.
Ensure that you login as the user John before performing the following tasks.
To login with the user John:
This command logs in the user John.
#> su John
Enter the required password for the user John, when prompted.
To start the HBase shell (as John):
This command starts the HBase shell.
#> hbase shell
To create the table (as John):
This command creates the table.
hbase#> create 'basic_sample_ingestion_user', {
NAME => 'column_family', METADATA => {
'DATA_ELEMENT:NAME'=>'TOK_NAME',
'DATA_ELEMENT:PHONE'=>'TOK_PHONE',
'DATA_ELEMENT:CREDIT_CARD'=>'TOK_CREDIT_CARD',
'DATA_ELEMENT:AMOUNT'=>'TOK_AMOUNT'}}
To insert the data in one row (as John):
This command inserts the data in one row. The data inserted in the row is protected transparently.
hbase#> put 'basic_sample_ingestion_user','928724', 'column_family:NAME', 'Hultgren
Caylor'
hbase#> put 'basic_sample_ingestion_user','928724', 'column_family:PHONE', '9823750987'
hbase#> put 'basic_sample_ingestion_user','928724', 'column_family:CREDIT_CARD',
'376235139103947'
hbase#> put 'basic_sample_ingestion_user','928724', 'column_family:AMOUNT', '6959123'
To insert data in the other rows of the HBase table, execute the following commands.
hbase#> put 'basic_sample_ingestion_user','928725', 'column_family:NAME', 'Bourne Jose'
hbase#> put 'basic_sample_ingestion_user','928725', 'column_family:PHONE', '9823350487'
hbase#> put 'basic_sample_ingestion_user','928725', 'column_family:CREDIT_CARD',
'6226600538383292'
hbase#> put 'basic_sample_ingestion_user','928725', 'column_family:AMOUNT', '42964352'
hbase#> put 'basic_sample_ingestion_user','928726', 'column_family:NAME', 'Sorce Hatti'
hbase#> put 'basic_sample_ingestion_user','928726', 'column_family:PHONE', '9824757883'
hbase#> put 'basic_sample_ingestion_user','928726', 'column_family:CREDIT_CARD',
'6226540862865375'
hbase#> put 'basic_sample_ingestion_user','928726', 'column_family:AMOUNT', '7257656'
hbase#> put 'basic_sample_ingestion_user','928727', 'column_family:NAME', 'Lorie Garvey'
hbase#> put 'basic_sample_ingestion_user','928727', 'column_family:PHONE', '9913730982'
hbase#> put 'basic_sample_ingestion_user','928727', 'column_family:CREDIT_CARD',
'5464987835837424'
hbase#> put 'basic_sample_ingestion_user','928727', 'column_family:AMOUNT', '85447788'
hbase#> put 'basic_sample_ingestion_user','928728', 'column_family:NAME', 'Belva Beeson'
hbase#> put 'basic_sample_ingestion_user','928728', 'column_family:PHONE', '9948752198'
hbase#> put 'basic_sample_ingestion_user','928728', 'column_family:CREDIT_CARD',
'5539455602750205'
hbase#> put 'basic_sample_ingestion_user','928728', 'column_family:AMOUNT', '59040774'
hbase#> put 'basic_sample_ingestion_user','928729', 'column_family:NAME', 'Hultgren
Caylor'
hbase#> put 'basic_sample_ingestion_user','928729', 'column_family:PHONE', '9823750987'
hbase#> put 'basic_sample_ingestion_user','928729', 'column_family:CREDIT_CARD',
'376235139103947'
hbase#> put 'basic_sample_ingestion_user','928729', 'column_family:AMOUNT', '3245234'
hbase#> put 'basic_sample_ingestion_user','928730', 'column_family:NAME', 'Lorie Garvey'
hbase#> put 'basic_sample_ingestion_user','928730', 'column_family:PHONE', '9913730982'
hbase#> put 'basic_sample_ingestion_user','928730', 'column_family:CREDIT_CARD',
'5464987835837424'
hbase#> put 'basic_sample_ingestion_user','928730', 'column_family:AMOUNT', '85447788'
hbase#> put 'basic_sample_ingestion_user','928731', 'column_family:NAME', 'Bourne Jose'
hbase#> put 'basic_sample_ingestion_user','928731', 'column_family:PHONE', '9823350487'
hbase#> put 'basic_sample_ingestion_user','928731', 'column_family:CREDIT_CARD',
'6226600538383292'
hbase#> put 'basic_sample_ingestion_user','928731', 'column_family:AMOUNT', '2300567'
hbase#> put 'basic_sample_ingestion_user','928732', 'column_family:NAME', 'Lorie Garvey'
hbase#> put 'basic_sample_ingestion_user','928732', 'column_family:PHONE', '9913730982'
hbase#> put 'basic_sample_ingestion_user','928732', 'column_family:CREDIT_CARD',
'5464987835837424'
hbase#> put 'basic_sample_ingestion_user','928732', 'column_family:AMOUNT', '85447788'
To view the protected data (as John):
This command displays the protected data.
hbase#> scan 'basic_sample_ingestion_user', { ATTRIBUTES =>
{'BYPASS_COPROCESSOR'=>'1'}}
Result: (Protected data)
ROW COLUMN+CELL
928724 column=column_family:AMOUNT, timestamp=1415685268300, value=07329754
928724 column=column_family:CREDIT_CARD, timestamp=1415685268257, value=173483871743706
928724 column=column_family:NAME, timestamp=1415685268073, value=EnYEwVg3 MOQxQw
928724 column=column_family:PHONE, timestamp=1415685268209, value=27995164409
928725 column=column_family:AMOUNT, timestamp=1415685268475, value=772750955
928725 column=column_family:CREDIT_CARD, timestamp=1415685268435, value=5730496842473502
928725 column=column_family:NAME, timestamp=1415685268338, value=4h6NlN FJi9
928725 column=column_family:PHONE, timestamp=1415685268387, value=87122238232
928726 column=column_family:AMOUNT, timestamp=1415685268778, value=12551106
928726 column=column_family:CREDIT_CARD, timestamp=1415685268743, value=6472961686603834
928726 column=column_family:NAME, timestamp=1415685268527, value=Lecwe 48zhNF
928726 column=column_family:PHONE, timestamp=1415685268596, value=31934151773
928727 column=column_family:AMOUNT, timestamp=1415685269001, value=768063717
928727 column=column_family:CREDIT_CARD, timestamp=1415685268943, value=7277097339102446
928727 column=column_family:NAME, timestamp=1415685268830, value=X9lLP BAA8vN
928727 column=column_family:PHONE, timestamp=1415685268898, value=70201301198
928728 column=column_family:AMOUNT, timestamp=1415685269195, value=943884472
928728 column=column_family:CREDIT_CARD, timestamp=1415685269153, value=3411370995179337
928728 column=column_family:NAME, timestamp=1415685269049, value=AYEmh 2CwyvX
928728 column=column_family:PHONE, timestamp=1415685269116, value=21190182420
928729 column=column_family:AMOUNT, timestamp=1415685269353, value=7918658
928729 column=column_family:CREDIT_CARD, timestamp=1415685269315, value=173483871743706
928729 column=column_family:NAME, timestamp=1415685269235, value=EnYEwVg3 MOQxQw
928729 column=column_family:PHONE, timestamp=1415685269279, value=27995164409
928730 column=column_family:AMOUNT, timestamp=1415685269499, value=768063717
928730 column=column_family:CREDIT_CARD, timestamp=1415685269464, value=7277097339102446
928730 column=column_family:NAME, timestamp=1415685269399, value=X9lLP BAA8vN
928730 column=column_family:PHONE, timestamp=1415685269431, value=70201301198
928731 column=column_family:AMOUNT, timestamp=1415685269645, value=9617663
928731 column=column_family:CREDIT_CARD, timestamp=1415685269613, value=5730496842473502
928731 column=column_family:NAME, timestamp=1415685269541, value=4h6NlN FJi9
928731 column=column_family:PHONE, timestamp=1415685269580, value=87122238232
928732 column=column_family:AMOUNT, timestamp=1415685271454, value=768063717
928732 column=column_family:CREDIT_CARD, timestamp=1415685269748, value=7277097339102446
928732 column=column_family:NAME, timestamp=1415685269684, value=X9lLP BAA8vN
928732 column=column_family:PHONE, timestamp=1415685269716, value=70201301198
To attempt to view the unprotected data (as John):
This command attempts to display the unprotected data. The user John will not be able to view the
cleartext data as the user does not have permissions to unprotect the data.
hbase#> scan 'basic_sample_ingestion_user'
Result: (Protected data)
ROW COLUMN+CELL
928724 column=column_family:AMOUNT, timestamp=1415685268300, value=07329754
928724 column=column_family:CREDIT_CARD, timestamp=1415685268257, value=173483871743706
928724 column=column_family:NAME, timestamp=1415685268073, value=EnYEwVg3 MOQxQw
928724 column=column_family:PHONE, timestamp=1415685268209, value=27995164409
928725 column=column_family:AMOUNT, timestamp=1415685268475, value=772750955
928725 column=column_family:CREDIT_CARD, timestamp=1415685268435, value=5730496842473502
928725 column=column_family:NAME, timestamp=1415685268338, value=4h6NlN FJi9
928725 column=column_family:PHONE, timestamp=1415685268387, value=87122238232
928726 column=column_family:AMOUNT, timestamp=1415685268778, value=12551106
928726 column=column_family:CREDIT_CARD, timestamp=1415685268743, value=6472961686603834
928726 column=column_family:NAME, timestamp=1415685268527, value=Lecwe 48zhNF
928726 column=column_family:PHONE, timestamp=1415685268596, value=31934151773
928727 column=column_family:AMOUNT, timestamp=1415685269001, value=768063717
928727 column=column_family:CREDIT_CARD, timestamp=1415685268943, value=7277097339102446
928727 column=column_family:NAME, timestamp=1415685268830, value=X9lLP BAA8vN
928727 column=column_family:PHONE, timestamp=1415685268898, value=70201301198
928728 column=column_family:AMOUNT, timestamp=1415685269195, value=943884472
928728 column=column_family:CREDIT_CARD, timestamp=1415685269153, value=3411370995179337
928728 column=column_family:NAME, timestamp=1415685269049, value=AYEmh 2CwyvX
928728 column=column_family:PHONE, timestamp=1415685269116, value=21190182420
928729 column=column_family:AMOUNT, timestamp=1415685269353, value=7918658
928729 column=column_family:CREDIT_CARD, timestamp=1415685269315, value=173483871743706
928729 column=column_family:NAME, timestamp=1415685269235, value=EnYEwVg3 MOQxQw
928729 column=column_family:PHONE, timestamp=1415685269279, value=27995164409
928730 column=column_family:AMOUNT, timestamp=1415685269499, value=768063717
928730 column=column_family:CREDIT_CARD, timestamp=1415685269464, value=7277097339102446
928730 column=column_family:NAME, timestamp=1415685269399, value=X9lLP BAA8vN
928730 column=column_family:PHONE, timestamp=1415685269431, value=70201301198
928731 column=column_family:AMOUNT, timestamp=1415685269645, value=9617663
928731 column=column_family:CREDIT_CARD, timestamp=1415685269613, value=5730496842473502
928731 column=column_family:NAME, timestamp=1415685269541, value=4h6NlN FJi9
928731 column=column_family:PHONE, timestamp=1415685269580, value=87122238232
928732 column=column_family:AMOUNT, timestamp=1415685271454, value=768063717
928732 column=column_family:CREDIT_CARD, timestamp=1415685269748, value=7277097339102446
928732 column=column_family:NAME, timestamp=1415685269684, value=X9lLP BAA8vN
928732 column=column_family:PHONE, timestamp=1415685269716, value=70201301198
To exit the HBase shell (as John):
This command exits the HBase shell.
hbase#> exit
To logout the user John:
This command logs out the user John.
#> exit
12.8.2.2 Perform Analysis on Credit Card Transactions
Fred, the Sales Analyst, analyzes the amounts spent by users, the number of fraudulent transactions,
and the number of repeat users, using the cleartext data in the Name and Amount fields.
Ensure that you login as the user Fred before performing the following tasks.
To login with the user Fred:
This command logs in the user Fred.
#> su Fred
Enter the required password for the user Fred, when prompted.
To start the HBase shell (as Fred):
This command starts the HBase shell.
#> hbase shell
To view the unprotected data (as Fred):
This command displays the unprotected data. The user Fred will be able to view the cleartext data
for the Name and Amount fields only.
hbase#> scan 'basic_sample_ingestion_user'
Result: (Partially unprotected data)
ROW COLUMN+CELL
928724 column=column_family:AMOUNT, timestamp=9223372036854775807, value=6959123
928724 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=173483871743706
928724 column=column_family:NAME, timestamp=9223372036854775807, value=Hultgren Caylor
928724 column=column_family:PHONE, timestamp=9223372036854775807, value=27995164409
928725 column=column_family:AMOUNT, timestamp=9223372036854775807, value=42964352
928725 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=5730496842473502
928725 column=column_family:NAME, timestamp=9223372036854775807, value=Bourne Jose
928725 column=column_family:PHONE, timestamp=9223372036854775807, value=87122238232
928726 column=column_family:AMOUNT, timestamp=9223372036854775807, value=7257656
928726 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=6472961686603834
928726 column=column_family:NAME, timestamp=9223372036854775807, value=Sorce Hatti
928726 column=column_family:PHONE, timestamp=9223372036854775807, value=31934151773
928727 column=column_family:AMOUNT, timestamp=9223372036854775807, value=85447788
928727 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=7277097339102446
928727 column=column_family:NAME, timestamp=9223372036854775807, value=Lorie Garvey
928727 column=column_family:PHONE, timestamp=9223372036854775807, value=70201301198
928728 column=column_family:AMOUNT, timestamp=9223372036854775807, value=59040774
928728 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=3411370995179337
928728 column=column_family:NAME, timestamp=9223372036854775807, value=Belva Beeson
928728 column=column_family:PHONE, timestamp=9223372036854775807, value=21190182420
928729 column=column_family:AMOUNT, timestamp=9223372036854775807, value=3245234
928729 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=173483871743706
928729 column=column_family:NAME, timestamp=9223372036854775807, value=Hultgren Caylor
928729 column=column_family:PHONE, timestamp=9223372036854775807, value=27995164409
928730 column=column_family:AMOUNT, timestamp=9223372036854775807, value=85447788
928730 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=7277097339102446
928730 column=column_family:NAME, timestamp=9223372036854775807, value=Lorie Garvey
928730 column=column_family:PHONE, timestamp=9223372036854775807, value=70201301198
928731 column=column_family:AMOUNT, timestamp=9223372036854775807, value=2300567
928731 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=5730496842473502
928731 column=column_family:NAME, timestamp=9223372036854775807, value=Bourne Jose
928731 column=column_family:PHONE, timestamp=9223372036854775807, value=87122238232
928732 column=column_family:AMOUNT, timestamp=9223372036854775807, value=85447788
928732 column=column_family:CREDIT_CARD, timestamp=9223372036854775807, value=7277097339102446
928732 column=column_family:NAME, timestamp=9223372036854775807, value=Lorie Garvey
928732 column=column_family:PHONE, timestamp=9223372036854775807, value=70201301198
Based on the unprotected data that appears, Fred can analyze the following:
Amounts spent by users
Fraudulent credit card transactions: In the sample, the user Lorie Garvey might be a
fraudulent user as the transaction amount of 85447788 is repeated thrice across the
transactions.
Number of repeat users: In the sample, there are three repeat users.
The HBase sample can be extended to perform actual analysis and isolate the
amounts spent, fraudulent transactions, and repeat users.
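As an illustration only, a scan restricted to the Name and Amount columns (the fields that Fred can view in cleartext) narrows the output to the data needed for this analysis:
hbase#> scan 'basic_sample_ingestion_user', {COLUMNS => ['column_family:NAME', 'column_family:AMOUNT']}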
To exit the HBase shell (as Fred):
This command exits the HBase shell.
hbase#> exit
To logout the user Fred:
This command logs out the user Fred.
#> exit
12.9 Protecting Data using Impala
You can utilize the Protegrity Impala UDFs to secure data for Impala. The Protegrity Impala UDFs
are loaded during installation. While inserting data into Impala tables, or retrieving data from protected
Impala table columns, you can call Protegrity Impala UDFs.
The Protegrity Impala UDFs can protect and unprotect the data as defined by the Data security policy.
For more information on the list of available Protegrity Impala UDFs, refer to section 7.3 Impala
UDFs.
The following sections describe two sample use cases.
1. A basic use case to demonstrate how basic protection and unprotection work using
Protegrity Impala UDFs.
2. A role-based use case to demonstrate the different data access permissions when two users
belonging to different roles are viewing the same data.
For ease of illustration, the use cases describe the following two users:
User with ability to protect the data, thereby accessing the data in protected
form.
User with only access to a few fields from the protected data in cleartext form.
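Generalized from the samples below, the call pattern is to protect with pty_stringins while inserting and to unprotect with pty_stringsel while selecting; the values in angle brackets are placeholders to be replaced with your own names.
insert into <protected_table> select pty_stringins(<column>, '<data_element>') from <source_table>;
select pty_stringsel(<column>, '<data_element>') from <protected_table>;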
12.9.1 Basic Use Case
This section describes the commands to perform the following functions:
Prepare the Impala environment for the sample data.
Initiate the Impala shell.
Create a table in Impala using the sample data.
Display the data in the Impala table.
Create an Impala table containing the data protected using the Protegrity Impala UDF.
Display the Impala table which contains the protected data.
Unprotect the protected data in the Impala table.
Display the Impala table which contains the unprotected data.
Ensure that you login as the user root before performing the following tasks.
To prepare the environment for the basic_sample_data.csv file:
1. Assign ownership of the path where data from the basic_sample_data.csv file needs to be
copied using the following command:
sudo -u hdfs hadoop fs -chown root:root /tmp/basic_sample/sample/
2. Copy the data from the basic_sample_data.csv file into HDFS using the following command:
sudo -u hdfs hadoop fs -put /opt/protegrity/samples/data/basic_sample_data.csv
/tmp/basic_sample/sample/
3. Verify the presence of the basic_sample_data.csv file in the HDFS path using the following
command:
sudo -u hdfs hadoop fs -ls /tmp/basic_sample/sample/
4. Assign ownership of the path where the basic_sample_data.csv file is located to the Impala
user using the following command:
sudo -u hdfs hadoop fs -chown impala:supergroup /tmp/basic_sample/sample/
To start the Impala shell:
This command starts the Impala shell.
#> impala-shell
To create basic_sample table using the basic_sample_data.csv file:
The following commands populate the table basic_sample with the data from the
basic_sample_data.csv file.
drop table if exists basic_sample;
create table basic_sample(ID string, NAME string, PHONE string, CREDIT_CARD string,
AMOUNT string)
row format delimited fields terminated by ',';
LOAD DATA INPATH '/tmp/basic_sample/sample/' INTO TABLE basic_sample;
To view the data stored in the table:
This command displays the data stored in the table.
select * from basic_sample;
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To create basic_sample_protected table to store the protected data:
The following commands create the table basic_sample_protected to store the protected data.
drop table if exists basic_sample_protected;
create table basic_sample_protected (ID string, NAME string, PHONE string,
CREDIT_CARD string, AMOUNT string);
To protect the data in the table using Impala UDFs:
The following command ingests cleartext data from the basic_sample table to the
basic_sample_protected table in protected form using Impala UDFs.
insert into basic_sample_protected(ID, NAME, PHONE,CREDIT_CARD,AMOUNT) select
ID,pty_stringins(NAME,'TOK_NAME'),pty_stringins(PHONE,'TOK_PHONE'),pty_stringins(CR
EDIT_CARD,'TOK_CREDIT_CARD'),pty_stringins(AMOUNT,'TOK_AMOUNT') from basic_sample;
To view the protected data stored in the table:
This command displays the protected data stored in the table.
select * from basic_sample_protected;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To unprotect the data in the table using Impala UDFs:
The following command retrieves the unprotected data using a view.
create view IF NOT EXISTS view1 as select
ID,pty_stringsel(NAME,'TOK_NAME'),pty_stringsel(PHONE,'TOK_PHONE'),pty_stringsel(CR
EDIT_CARD,'TOK_CREDIT_CARD'),pty_stringsel(AMOUNT,'TOK_AMOUNT') from
basic_sample_protected;
To view the unprotected data stored in the table:
This command displays the unprotected data stored in the table.
select id as id,_c1 as name,_c2 as phone,_c3 as credit_card,_c4 as amount from
view1;
Result: (Unprotected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
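The view created above inherits auto-generated column names (_c1 through _c4) for the UDF expressions, which is why the final query aliases them. As an alternative sketch, the aliases can be declared when the view is created, as the role-based sample that follows does; the view name view1_named is illustrative.
create view IF NOT EXISTS view1_named as select ID, pty_stringsel(NAME,'TOK_NAME') as name,
pty_stringsel(PHONE,'TOK_PHONE') as phone, pty_stringsel(CREDIT_CARD,'TOK_CREDIT_CARD') as
credit_card, pty_stringsel(AMOUNT,'TOK_AMOUNT') as amount from basic_sample_protected;
select * from view1_named;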
12.9.2 Role-based Use Cases
This section describes the following two use cases:
Ingestion User (John) ingesting and protecting the data related to credit card transactions.
Sales Analyst (Fred) analyzing credit card transactions to detect the following:
o Amounts spent by users
o Fraudulent transactions
o Repeat users
12.9.2.1 Protect the Credit Card Transactions
John, the Ingestion User, is able to ingest credit card transaction data into HDFS and protect the
sensitive data.
Ensure that you login as the user John before performing the following tasks.
To login with the user John:
This command logs in the user John.
#> su - John
Enter the required password for the user John, when prompted.
To prepare the environment for the basic_sample_data.csv file:
Ensure that the basic_sample_data.csv file is present before you execute the role-based queries.
Perform the following steps to prepare the environment for the basic_sample_data.csv file.
1. Create an ingestion_sample directory for the user John using the following command:
hadoop fs -mkdir /tmp/basic_sample/ingestion_sample/
2. Assign the ownership of the ingestion_sample directory to the Impala super users group
using the following command:
sudo -u hdfs hadoop fs -chown impala:supergroup
/tmp/basic_sample/ingestion_sample/
3. Copy the data from the basic_sample_data.csv file into the ingestion_sample directory using
the following command:
hadoop fs -put /opt/protegrity/samples/data/basic_sample_data.csv
/tmp/basic_sample/ingestion_sample/
4. Assign permissions for the sample data in the ingestion_sample directory for all users using
the following command:
hadoop fs -chmod -R 777
/tmp/basic_sample/ingestion_sample/basic_sample_data.csv
To start the Impala shell (as John):
This command starts the Impala shell.
#> impala-shell
To create basic_sample_ingestion_user table using the basic_sample_data.csv file (as John):
The following commands populate the table basic_sample_ingestion_user with the data from the
basic_sample_data.csv file.
drop table if exists basic_sample_ingestion_user;
create table basic_sample_ingestion_user(ID string, NAME string, PHONE string,
CREDIT_CARD string, AMOUNT string)
row format delimited fields terminated by ',';
LOAD DATA INPATH '/tmp/basic_sample/ingestion_sample/' INTO TABLE
basic_sample_ingestion_user;
To create basic_sample_protected_ingestion_user table to store the protected data (as John):
The following commands create the table basic_sample_protected_ingestion_user to store the
protected data.
drop table if exists basic_sample_protected_ingestion_user;
create table basic_sample_protected_ingestion_user (ID string, NAME string, PHONE
string, CREDIT_CARD string, AMOUNT string);
To protect data in the basic_sample_protected_ingestion_user table using Impala
UDFs (as John):
The following command protects the data from the basic_sample_ingestion_user table and inserts it
into the basic_sample_protected_ingestion_user table using Impala UDFs.
insert into basic_sample_protected_ingestion_user(ID, NAME,
PHONE,CREDIT_CARD,AMOUNT) select
ID,pty_stringins(NAME,'TOK_NAME'),pty_stringins(PHONE,'TOK_PHONE'),pty_stringins(CR
EDIT_CARD,'TOK_CREDIT_CARD'),pty_stringins(AMOUNT,'TOK_AMOUNT') from
basic_sample_ingestion_user;
To view the protected data stored in the table (as John):
This command displays the protected data stored in the table.
select * from basic_sample_protected_ingestion_user;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To attempt to unprotect the data stored in the basic_sample_protected_ingestion_user
table (as John):
These commands attempt to unprotect and display the data. The user John will not be
able to view the cleartext data as the user does not have permissions to unprotect the data.
create view IF NOT EXISTS view1 as select ID, pty_stringsel(NAME, 'TOK_NAME') as name,
pty_stringsel(PHONE, 'TOK_PHONE') as phone, pty_stringsel(CREDIT_CARD,
'TOK_CREDIT_CARD') as credit_card, pty_stringsel(AMOUNT, 'TOK_AMOUNT') as amount FROM
basic_sample_protected_ingestion_user;
select * from view1;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To logout the user John:
This command logs out the user John.
#> exit
12.9.2.2 Perform Analysis on Credit Card Transactions
Fred, the Sales Analyst, analyzes the amounts spent by users, the number of fraudulent transactions,
and the number of repeat users, using the cleartext data in the Name and Amount fields.
Ensure that you login as the user Fred before performing the following tasks.
To login with the user Fred:
This command logs in the user Fred.
#> su - Fred
Enter the required password for the user Fred, when prompted.
To start the Impala shell (as Fred):
This command starts the Impala shell.
#> impala-shell
To unprotect and view the cleartext data (as Fred):
These commands unprotect the protected data and display the unprotected data using a view. The
user Fred will be able to view the cleartext data for the Name and Amount fields only.
create view IF NOT EXISTS view1 as select ID, pty_stringsel(NAME, 'TOK_NAME') as name,
pty_stringsel(PHONE, 'TOK_PHONE') as phone, pty_stringsel(CREDIT_CARD,
'TOK_CREDIT_CARD') as credit_card, pty_stringsel(AMOUNT, 'TOK_AMOUNT') as amount FROM
basic_sample_protected_ingestion_user;
select * from view1;
Result: (Partially unprotected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 27995164409 , 173483871743706 , 6959123
928725, Bourne Jose , 87122238232 , 5730496842473502 , 42964354
928726, Sorce Hatti , 31934151773 , 6472961686603834 , 7257656
928727, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928728, Belva Beeson , 21190182420 , 3411370995179337 , 59040774
928729, Hultgren Caylor, 27995164409 , 173483871743706 , 3245234
928730, Bourne Jose , 87122238232 , 5730496842473502 , 2300567
928731, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928732, Bourne Jose , 87122238232 , 5730496842473502 , 3096233
928733, Hultgren Caylor, 27995164409 , 173483871743706 , 5167763
928734, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
Based on the unprotected data that appears, Fred can analyze the following:
Amounts spent by users
Fraudulent credit card transactions: In the sample, the user Lorie Garvey might be a
fraudulent user as the transaction amount of 85447788 is repeated thrice across the
transactions.
Number of repeat users: In the sample, there are three repeat users.
The Impala sample can be extended to perform actual analysis and isolate the
amounts spent, fraudulent transactions, and repeat users.
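As an illustration only, the following is a minimal sketch of such an extension, using the view1 created above; the name and amount columns come from the view definition, while the BIGINT cast and the output aliases are illustrative. It lists the users with more than one transaction, together with their transaction counts and totals.
select name, count(*) as transactions, sum(cast(amount as bigint)) as total_amount
from view1
group by name
having count(*) > 1;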
To logout the user Fred:
This command logs out the user Fred.
#> exit
12.10 Protecting Data using HAWQ
You can utilize the Protegrity HAWQ UDFs to secure data. The Protegrity HAWQ UDFs need to be
defined after the Big Data Protector is installed. While inserting data into HAWQ tables, or retrieving
data from protected HAWQ table columns, you can call Protegrity HAWQ UDFs.
The Protegrity HAWQ UDFs can protect and unprotect the data as defined by the Data security policy.
For more information on the list of available Protegrity HAWQ UDFs, refer to section 8.3 HAWQ UDFs.
The following sections describe two sample use cases.
1. A basic use case to demonstrate how basic protection and unprotection work using
Protegrity HAWQ UDFs.
2. A role-based use case to demonstrate the different data access permissions when two users
belonging to different roles are viewing the same data.
For ease of illustration, the use cases describe the following two users:
User with ability to protect the data, thereby accessing the data in protected
form.
User with only access to a few fields from the protected data in cleartext form.
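Generalized from the samples below, the call pattern is to protect with pty_varcharins while inserting and to unprotect with pty_varcharsel while selecting; the values in angle brackets are placeholders to be replaced with your own names.
insert into <protected_table> select pty_varcharins(<column>, '<data_element>') from <source_table>;
select pty_varcharsel(<column>, '<data_element>') from <protected_table>;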
12.10.1 Basic Use Case
This section describes the commands to perform the following functions:
Prepare the HAWQ environment for the sample data.
Initiate the HAWQ shell, which is a postgres shell.
Create a table in HAWQ using the sample data.
Display the data in the HAWQ table.
Create a HAWQ table containing the data protected using the Protegrity HAWQ Protect UDFs.
Display the HAWQ table which contains the protected data.
Unprotect the protected data in the HAWQ table using the Protegrity HAWQ Unprotect UDFs.
Display the HAWQ table which contains the unprotected data.
Ensure that you login as the user gpadmin before performing the following tasks.
To start the PostgreSQL shell for HAWQ:
The following command starts the PostgreSQL shell.
#> psql -h <HAWQ_Master_Hostname> -p 5432
The following command connects to the gpadmin database.
\c gpadmin
To create the table basic_sample with the required number of columns:
The following command creates the table basic_sample with five columns, as required by the
basic_sample_data.csv file.
create table basic_sample (ID varchar, NAME varchar, PHONE varchar, CREDIT_CARD
varchar, AMOUNT varchar) distributed randomly;
To populate the basic_sample table using the basic_sample_data.csv file:
The following command populates the table basic_sample with the data from the
basic_sample_data.csv file.
\copy basic_sample from '/opt/protegrity/samples/data/basic_sample_data.csv' with
delimiter ','
To view the data stored in the table:
This command displays the data stored in the table.
select * from basic_sample order by ID;
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To create basic_sample_protected table to store the protected data:
The following commands create the table basic_sample_protected to store the protected data.
drop table if exists basic_sample_protected;
create table basic_sample_protected (ID varchar, NAME varchar, PHONE varchar,
CREDIT_CARD varchar, AMOUNT varchar) distributed randomly;
To protect the data in the table using HAWQ UDFs:
The following command ingests cleartext data from the basic_sample table to the
basic_sample_protected table in protected form using HAWQ UDFs.
insert into basic_sample_protected(ID, NAME, PHONE, CREDIT_CARD, AMOUNT) select ID,
pty_varcharins(NAME,'TOK_NAME'), pty_varcharins(PHONE,'TOK_PHONE'),
pty_varcharins(CREDIT_CARD,'TOK_CREDIT_CARD'), pty_varcharins(AMOUNT,'TOK_AMOUNT')
from basic_sample;
To view the protected data stored in the table:
This command displays the protected data stored in the table.
select * from basic_sample_protected order by ID;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To unprotect the data in the table using HAWQ UDFs:
The following command creates a view that retrieves the unprotected data.
create view v1 as select ID as ID, pty_varcharsel(NAME,'TOK_NAME') as NAME,
pty_varcharsel(PHONE,'TOK_PHONE') as PHONE,
pty_varcharsel(CREDIT_CARD,'TOK_CREDIT_CARD') as CREDIT_CARD,
pty_varcharsel(AMOUNT,'TOK_AMOUNT') as AMOUNT from basic_sample_protected;
To view the unprotected data stored in the table:
This command displays the unprotected data stored in the table.
select * from v1 order by ID;
Result: (Unprotected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
12.10.2 Role-based Use Cases
This section describes the following two use cases:
Ingestion User (John) ingesting and protecting the data related to credit card transactions.
Sales Analyst (Fred) analyzing credit card transactions to detect the following:
o Amounts spent by users
o Fraudulent transactions
o Repeat users
Ensure that you login as the user gpadmin before performing the following tasks.
To prepare the HAWQ environment for the users John and Fred:
1. Update the pg_hba.conf file to include the users John and Fred (example entries are shown after this list).
2. Restart the HAWQ services.
3. Start the Psql shell.
4. Login to the database as gpadmin using the following command.
\c gpadmin
5. Create the roles and databases for the user John by performing the following steps.
a) Create the user John with the required password using the following command.
create user "John" password 'protegrity';
b) Create a database named John using the following command.
create database "John" with owner "John";
6. Create the roles and databases for the user Fred by performing the following steps.
a) Create the user Fred with the required password using the following command.
create user "Fred" password 'protegrity';
b) Create a database named Fred using the following command.
create database "Fred" with owner "Fred";
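The following is a minimal sketch of pg_hba.conf entries for the two users, as referenced in step 1; the client address and authentication method shown are illustrative and must match your environment.
host    all    John    0.0.0.0/0    md5
host    all    Fred    0.0.0.0/0    md5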
12.10.2.1 Protect the Credit Card Transactions
John, the Ingestion User, is able to ingest credit card transaction data into HDFS and protect the
sensitive data.
Ensure that you login as the user John before performing the following tasks.
To login with the user John:
This command logs in the user John.
#> su - John
Enter the required password for the user John, when prompted.
To start the PostgreSQL shell for HAWQ (as John):
The following command starts the PostgreSQL shell.
#> psql -h <HAWQ_Master_Hostname> -p 5432
The following command connects to the gpadmin database.
\c gpadmin
To create basic_sample_ingestion_user table using the basic_sample_data.csv file (as
John):
The following commands populate the table basic_sample_ingestion_user with data from the
basic_sample_data.csv file.
drop table if exists basic_sample_ingestion_user;
create table basic_sample_ingestion_user (ID varchar, NAME varchar, PHONE varchar,
CREDIT_CARD varchar, AMOUNT varchar) distributed randomly;
\copy basic_sample_ingestion_user from
'/opt/protegrity/samples/data/basic_sample_data.csv' with delimiter ','
To view the data stored in the table:
This command displays the data stored in the table.
select * from basic_sample_ingestion_user order by ID;
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To create basic_sample_protected_ingestion_user table to store the protected data (as
John):
The following commands create the table basic_sample_protected_ingestion_user to store the
protected data.
drop table if exists basic_sample_protected_ingestion_user;
create table basic_sample_protected_ingestion_user (ID varchar, NAME varchar,
PHONE varchar, CREDIT_CARD varchar, AMOUNT varchar) distributed randomly;
To grant permissions to the basic_sample_protected_ingestion_user table for the user
Fred (as John):
The following command grants the permissions for the table basic_sample_protected_ingestion_user
to the user Fred.
grant all on basic_sample_protected_ingestion_user to "Fred";
To protect data in the basic_sample_protected_ingestion_user table using HAWQ
UDFs (as John):
The following command protects the data from the basic_sample_ingestion_user table and inserts it
into the basic_sample_protected_ingestion_user table using HAWQ UDFs.
insert into basic_sample_protected_ingestion_user(ID, NAME, PHONE, CREDIT_CARD,
AMOUNT) select ID, pty_varcharins(NAME,'TOK_NAME'),
pty_varcharins(PHONE,'TOK_PHONE'), pty_varcharins(CREDIT_CARD,'TOK_CREDIT_CARD'),
pty_varcharins(AMOUNT,'TOK_AMOUNT') from basic_sample_ingestion_user;
To view the protected data stored in the table (as John):
This command displays the protected data stored in the table.
select * from basic_sample_protected_ingestion_user order by ID;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To attempt to unprotect the data stored in the basic_sample_protected_ingestion_user
table (as John):
This command attempts to unprotect and display the unprotected data. The user John will not be
able to view the cleartext data as the user does not have permissions to unprotect the data.
select ID, pty_varcharsel(NAME,'TOK_NAME'), pty_varcharsel(PHONE,'TOK_PHONE'),
pty_varcharsel(CREDIT_CARD,'TOK_CREDIT_CARD'), pty_varcharsel(AMOUNT,'TOK_AMOUNT')
from basic_sample_protected_ingestion_user order by ID;
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To logout the user John:
This command logs out the user John.
#> exit
12.10.2.2 Perform Analysis on Credit Card Transactions
Fred, the Sales Analyst, analyzes the amounts spent by users, the number of fraudulent transactions,
and the number of repeat users, using the cleartext data in the Name and Amount fields.
Ensure that you login as the user Fred before performing the following tasks.
To login with the user Fred:
This command logs in the user Fred.
#> su - Fred
Enter the required password for the user Fred, when prompted.
To start the PostgreSQL shell for HAWQ (as Fred):
The following command starts the PostgreSQL shell.
#> psql -h <HAWQ_Master_Hostname> -p 5432
The following command connects to the gpadmin database.
\c gpadmin
To unprotect and view the cleartext data (as Fred):
These commands create a view that unprotects the protected data and then display the data from the view. The user Fred can view the cleartext data for the Name and Amount fields only.
create view basic_sample_unprotected as select ID as ID,
pty_varcharsel(NAME,'TOK_NAME') as NAME, pty_varcharsel(PHONE,'TOK_PHONE') as
PHONE, pty_varcharsel(CREDIT_CARD,'TOK_CREDIT_CARD') as CREDIT_CARD,
pty_varcharsel(AMOUNT,'TOK_AMOUNT') as AMOUNT from
basic_sample_protected_ingestion_user order by ID;
select * from basic_sample_unprotected order by ID;
Result: (Name and Amount unprotected; Phone and Credit Card remain protected)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 27995164409 , 173483871743706 , 6959123
928725, Bourne Jose , 87122238232 , 5730496842473502 , 42964354
928726, Sorce Hatti , 31934151773 , 6472961686603834 , 7257656
928727, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928728, Belva Beeson , 21190182420 , 3411370995179337 , 59040774
928729, Hultgren Caylor, 27995164409 , 173483871743706 , 3245234
928730, Bourne Jose , 87122238232 , 5730496842473502 , 2300567
928731, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928732, Bourne Jose , 87122238232 , 5730496842473502 , 3096233
928733, Hultgren Caylor, 27995164409 , 173483871743706 , 5167763
928734, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
Based on the unprotected data that appears, Fred can analyze the following:
Amounts spent by users.
Fraudulent credit card transactions. In the sample, the user Lorie Garvey might be a fraudulent user, as the transaction amount of 85447788 is repeated thrice across the transactions.
Number of repeat users. In the sample, there are three repeat users.
The HAWQ sample can be extended to perform actual analysis and isolate the
amounts spent, fraudulent transactions, and repeat users.
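As a hedged illustration of such an extension (it is not part of the shipped samples, and the class name, credentials, and connection details are placeholders), the following minimal Java sketch runs the aggregation over JDBC against the basic_sample_unprotected view created above, reporting the total amount and number of transactions per customer; a transaction count greater than one marks a repeat user. It assumes the standard PostgreSQL JDBC driver is on the classpath and that the connection details match the psql session used earlier.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SalesAnalysis {
    public static void main(String[] args) throws Exception {
        // Connect to the gpadmin database on the HAWQ master, as in the psql session above
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://<HAWQ_Master_Hostname>:5432/gpadmin", "Fred", "<password>");
             Statement stmt = conn.createStatement();
             // Aggregate over the view that unprotects only the Name and Amount fields
             ResultSet rs = stmt.executeQuery(
                 "select NAME, count(*) as transactions, sum(cast(AMOUNT as bigint)) as total_spent "
                     + "from basic_sample_unprotected group by NAME order by total_spent desc")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(3)
                    + " total across " + rs.getInt(2) + " transactions");
            }
        }
    }
}

The same aggregation can equally be run directly in the psql session as a plain SQL query against the view.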
To logout the user Fred:
This command logs out the user Fred.
#> exit
12.11 Protecting Data using Spark
A Spark job in a Hadoop cluster often processes sensitive data. You can use the Protegrity Spark protector APIs to protect data when it is saved to, or retrieved from, a protected source.
The Protegrity Spark protector APIs can protect and unprotect the data as defined by the data security policy.
For more information on the list of available Protegrity Spark protector APIs, refer to section 9.3
Spark APIs.
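Before walking through the use cases, the following minimal Java sketch (it is not one of the shipped samples; the class name and literal values are illustrative) summarizes the call pattern that all of the sample code in this section relies on: construct a PtySparkProtector from the Spark application ID, then pass a data element name, an error list, an input array, and an output array to protect() or unprotect(). It assumes it runs on a node where Big Data Protector is installed, that a policy containing TOK_NAME is deployed, and that a valid Spark application ID is supplied.

import java.util.ArrayList;
import java.util.List;

import com.protegrity.spark.Protector;
import com.protegrity.spark.PtySparkProtector;

public class ProtectOneField {
    public static void main(String[] args) throws Exception {
        // args[0] is a valid Spark application ID, for example sparkContext.getConf().getAppId()
        Protector protector = new PtySparkProtector(args[0]);
        String[] input = { "Hultgren Caylor" };        // cleartext value to protect
        String[] output = new String[input.length];    // tokenized value is written here
        String[] restored = new String[input.length];  // cleartext value is written here on unprotect
        List<Integer> errorList = new ArrayList<Integer>();

        // Protect the value with the TOK_NAME data element defined in the policy
        protector.protect("TOK_NAME", errorList, input, output);
        System.out.println(output[0]);

        // Unprotect it again (subject to the caller's policy permissions)
        protector.unprotect("TOK_NAME", errorList, output, restored);
        System.out.println(restored[0]);
    }
}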
The following sections describe two sample use cases.
1. A basic use case to demonstrate how basic protection and unprotection work using the Protegrity Spark protector APIs.
2. A role-based use case to demonstrate the different data access permissions when two users
belonging to different roles are viewing the same data.
For ease of illustration, the use cases describe the following two users:
A user with the ability to protect the data, thereby accessing the data in protected form.
A user with access to only a few fields from the protected data in cleartext form.
12.11.1 Basic Use Case
This section describes the commands to perform the following functions:
Display the original data as is.
Protect the original data.
Display the data protected using the Protegrity Spark protector API.
Unprotect the protected data.
Display the unprotected data.
Ensure that you login as the user root before performing the following tasks.
To view the original data:
This command displays the sample data as is.
hadoop fs -cat /tmp/basic_sample/sample/basic_sample_data.csv
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To protect the data:
This command protects the sample data. The data in the Name, Phone, Credit card, and Amount
fields is protected.
./spark-submit --master yarn --class com.protegrity.samples.spark.ProtectData
/opt/protegrity/samples/spark/lib/spark_protector_demo.jar
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/basic_sample_data.csv
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/protected_data
To view the protected data:
This command displays the protected data.
hadoop fs -cat /tmp/basic_sample/sample/protected_data/part*
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To unprotect the data:
This command unprotects the protected data.
./spark-submit --master yarn --class com.protegrity.samples.spark.UnProtectData
/opt/protegrity/samples/spark/lib/spark_protector_demo.jar
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/protected_data
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/unprotected_data
To view the unprotected data:
This command displays the unprotected data.
hadoop fs -cat /tmp/basic_sample/sample/unprotected_data/part*
Result: (Unprotected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
12.11.2 Role-based Use Cases
This section describes the following two use cases:
Ingestion User (John) ingesting and protecting the data related to credit card transactions.
Sales Analyst (Fred) analyzing credit card transactions to detect the following:
o Amounts spent by users
o Fraudulent transactions
o Repeat users
12.11.2.1 Protect the Credit Card Transactions
John, the Ingestion User, is able to ingest credit card transaction data into HDFS, and protect the
sensitive data.
Ensure that you login as the user John before performing the following tasks.
To login with the user John:
This command logs in the user John.
#> su John
Enter the required password for the user John, when prompted.
To view the original data (as John):
This command displays the sample data as is.
hadoop fs -cat /tmp/basic_sample/sample/basic_sample_data.csv
Result: (Original data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 9823750987 , 376235139103947 , 6959123
928725, Bourne Jose , 9823350487 , 6226600538383292 , 42964354
928726, Sorce Hatti , 9824757883 , 6226540862865375 , 7257656
928727, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928728, Belva Beeson , 9948752198 , 5539455602750205 , 59040774
928729, Hultgren Caylor, 9823750987 , 376235139103947 , 3245234
928730, Bourne Jose , 9823350487 , 6226600538383292 , 2300567
928731, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
928732, Bourne Jose , 9823350487 , 6226600538383292 , 3096233
928733, Hultgren Caylor, 9823750987 , 376235139103947 , 5167763
928734, Lorie Garvey , 9913730982 , 5464987835837424 , 85447788
To protect the data (as John):
The following command deletes the existing protected_data directory.
hadoop fs -rm -R /tmp/basic_sample/sample/protected_data
This command protects the sample data. The sensitive data is protected.
./spark-submit --master yarn --class com.protegrity.samples.spark.ProtectData
/opt/protegrity/samples/spark/lib/spark_protector_demo.jar
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/basic_sample_data.csv
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/ingestion_user_protected_mapred_data
To view the protected data (as John):
This command displays the protected data. The sensitive data appears in tokenized form.
hadoop fs -cat /tmp/basic_sample/sample/ingestion_user_protected_mapred_data/part*
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To attempt to unprotect the data (as John):
This command attempts to unprotect the protected data.
spark-submit --master yarn --class com.protegrity.samples.spark.UnProtectData
/opt/protegrity/samples/spark/lib/spark_protector_demo.jar
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/ingestion_user_protected_mapred_data
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/ingestion_user_unprotected_mapred_data
To attempt to view the unprotected data (as John):
This command attempts to display the unprotected data. The user John will not be able to view the
cleartext data as the user does not have permissions to unprotect the data.
hadoop fs -cat
/tmp/basic_sample/sample/ingestion_user_unprotected_mapred_data/part*
Result: (Protected data)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 85924227
928725, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 83764821
928726, Lecwe 48zhNF , 31934151773 , 6472961686603834 , 49177868
928727, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928728, AYEmh 2CwyvX , 21190182420 , 3411370995179337 , 976189279
928729, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 4781777
928730, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 3285956
928731, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
928732, 4h6NlN FJi9 , 87122238232 , 5730496842473502 , 4112197
928733, EnYEwVg3 MOQxQw , 27995164409 , 173483871743706 , 63953943
928734, X9lLP BAA8vN , 70201301198 , 7277097339102446 , 945396991
To logout the user John:
This command logs out the user John.
#> exit
12.11.2.2 Perform Analysis on Credit Card Transactions
Fred, the Sales Analyst, analyzes the amounts spent by users, the number of fraudulent transactions,
and the number of repeat users, using the cleartext data in the Name and Amount fields.
Ensure that you login as the user Fred before performing the following tasks.
To login with the user Fred:
This command logs in the user Fred.
#> su Fred
Enter the required password for the user Fred, when prompted.
To unprotect the data (as Fred):
This command unprotects the protected data.
./spark-submit --master yarn --class com.protegrity.samples.spark.UnProtectData
/opt/protegrity/samples/spark/lib/spark_protector_demo.jar
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/ingestion_user_protected_mapred_data
hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/ingestion_user_unprotected_mapred_data
To view the unprotected data (as Fred):
This command displays the unprotected data. The user Fred will be able to view the cleartext data
for the Name and Amount fields only.
hadoop fs -cat
/tmp/basic_sample/sample/ingestion_user_unprotected_mapred_data/part*
Result: (Name and Amount unprotected; Phone and Credit Card remain protected)
ID , NAME , PHONE , CREDIT_CARD , AMOUNT
928724, Hultgren Caylor, 27995164409 , 173483871743706 , 6959123
928725, Bourne Jose , 87122238232 , 5730496842473502 , 42964354
928726, Sorce Hatti , 31934151773 , 6472961686603834 , 7257656
928727, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928728, Belva Beeson , 21190182420 , 3411370995179337 , 59040774
928729, Hultgren Caylor, 27995164409 , 173483871743706 , 3245234
928730, Bourne Jose , 87122238232 , 5730496842473502 , 2300567
928731, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
928732, Bourne Jose , 87122238232 , 5730496842473502 , 3096233
928733, Hultgren Caylor, 27995164409 , 173483871743706 , 5167763
928734, Lorie Garvey , 70201301198 , 7277097339102446 , 85447788
Based on the unprotected data that appears, Fred can analyze the following:
Amounts spent by users.
Fraudulent credit card transactions. In the sample, the user Lorie Garvey might be a fraudulent user, as the transaction amount of 85447788 is repeated thrice across the transactions.
Number of repeat users. In the sample, there are three repeat users.
The Spark protector sample can be extended to perform actual analysis and
isolate the amounts spent, fraudulent transactions, and repeat users.
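As a hedged illustration of such an extension (it is not part of the shipped samples; the class name, Java 8 lambda syntax, and aggregation logic are illustrative), the following sketch reads the unprotected output written above with plain Spark RDD operations and reports the total amount and transaction count per customer; no Protegrity API call is needed, because the Name and Amount fields are already in cleartext for Fred.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AnalyzeUnprotectedData {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf());
        // Read the output produced by UnProtectData; Name and Amount are cleartext for Fred
        JavaRDD<String> lines = sc.textFile(
            "hdfs://NAMENODE_HOST:8020/tmp/basic_sample/sample/ingestion_user_unprotected_mapred_data");
        // Skip the header line and map every record to (NAME, AMOUNT)
        JavaPairRDD<String, Long> amounts = lines
            .filter(line -> !line.startsWith("ID"))
            .mapToPair(line -> {
                String[] fields = line.split(",");
                return new Tuple2<>(fields[1].trim(), Long.parseLong(fields[4].trim()));
            });
        // Total amount and number of transactions per customer; a count greater than one marks a repeat user
        amounts.mapValues(amount -> new Tuple2<>(amount, 1L))
            .reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()))
            .collect()
            .forEach(entry -> System.out.println(entry._1() + ": total=" + entry._2()._1()
                + ", transactions=" + entry._2()._2()));
        sc.stop();
    }
}

If packaged into a jar, such a class could be submitted in the same way as the samples above, for example with spark-submit --master yarn.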
To logout the user Fred:
This command logs out the user Fred.
#> exit
12.11.3 Sample Code Usage for Spark (Java)
The Spark protector sample program described in this section is an example of how to use the Protegrity Spark protector APIs.
The sample program utilizes the following three Java classes for protecting and unprotecting data:
ProtectData.java - This main class creates the Spark context object and calls the
DataLoader class for reading cleartext data.
UnProtectData.java - This main class creates the Spark Context object and calls the
DataLoader class for reading protected data.
DataLoader.java - This loader class fetches the input from the input path, calls the
ProtectFunction to protect the data, and stores the protected data as output in the
output path. In addition, it fetches the input from the protected path, calls the
UnProtectFunction to unprotect the data, and stores the cleartext content as output.
The following function classes perform protection or unprotection for each line of their input:
ProtectFunction - This class calls the Spark protector for every new line specified in
the input to protect data.
UnProtectFunction - This class calls the Spark protector for every new line specified
in the input to unprotect data.
12.11.3.1 Main Job Class for Protect Operation ProtectData.java
package com.protegrity.samples.spark;
import java.io.IOException;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
public class ProtectData {
public static void main(String[] args) throws IOException {
// create a SparkContext object, which tells Spark how to access a cluster.
JavaSparkContext sparkContext =
new JavaSparkContext(new SparkConf());
// create the new object for class DataLoader
DataLoader protector = new DataLoader(sparkContext);
// Call the writeProtectedData method, which reads clear data from the input path (args[0]) and
// writes protected data to the output path (args[1])
protector.writeProtectedData(args[0], args[1], ",");
}
}
12.11.3.2 Main Job Class for Unprotect Operation UnProtectData.java
package com.protegrity.samples.spark;
import java.io.IOException;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
public class UnProtectData {
public static void main(String[] args) throws IOException {
// create a SparkContext object, which tells Spark how to access a cluster.
JavaSparkContext sparkContext =
new JavaSparkContext(new SparkConf());
// create the new object for class DataLoader
DataLoader protector = new DataLoader(sparkContext);
// Call the unprotectData method, which reads protected data from the input path (args[0]) and
// writes clear data to the output path (args[1])
protector.unprotectData(args[0], args[1], ",");
}
}
12.11.3.3 Utility to call Protect or Unprotect Function DataLoader.java
package com.protegrity.samples.spark;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.log4j.Logger;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
/**
* A Data loader utility for reading & writing protected and un-protected data
*/
public class DataLoader {
private JavaSparkContext sparkContext = null;
// Declare the array of data elements required for protection/unprotection
private String[] data_element_names = {"TOK_NAME", "TOK_PHONE", "TOK_CREDIT_CARD",
"TOK_AMOUNT"};
private String appid = null;
private static final Logger logger = Logger.getLogger(DataLoader.class);
public DataLoader(JavaSparkContext sparkContext) {
this.sparkContext = sparkContext;
appid = sparkContext.getConf().getAppId();
}
/**
* Writes protected data to the output path delimited by the input delimiter
*
* @param inputPath - path of the input employee info file
* @param outputPath - path where the output should be saved
* @param delim - denotes the delimiter between the fields in the file
*/
public void writeProtectedData(String inputPath, String outputPath, String delim)
throws IOException {
// cleans up the output path if already present
cleanup(outputPath);
// read lines from the input path & create RDD
JavaRDD<String> rdd = sparkContext.textFile(inputPath);
// Apply Protect function on rdd and get protectedRDD
JavaRDD<String> protectedRdd = rdd.map(new ProtectFunction(appid, delim,
data_element_names));
// Save protectedRDD into output path
protectedRdd.saveAsTextFile(outputPath);
}
/**
* Reads protected data from the input path delimited by the input delimiter
*
* @param inputPath - path of the protected employee data
* @param outputPath - output path where unprotected data should be stored.
* @param delim
*/
public void unprotectData(String inputPath, String outputPath, String delim) throws
IOException {
cleanup(outputPath);
// read lines from the input path & create RDD
JavaRDD<String> rdd = sparkContext.textFile(inputPath);
// Apply Unprotect function on rdd and get original data
JavaRDD<String> unprotectedRdd =
rdd.map(new UnProtectFunction(appid, delim, data_element_names));
// Save unprotectedRDD into output path
unprotectedRdd.saveAsTextFile(outputPath);
}
/**
* Deletes the output if present before spark writes the output to the given path
*
* @param output
*/
private void cleanup(String output) throws IOException {
DistributedFileSystem dfs = new DistributedFileSystem();
try {
dfs.initialize(new URI(output), sparkContext.hadoopConfiguration());
} catch (URISyntaxException e) {
logger.warn(e);
}
if (dfs.delete(new Path(output), true)) {
dfs.close();
logger.info("Cleanup: Deleted HDFS output file - " + output);
} else
logger.warn("Failed to delete output file!");
}
}
12.11.3.4 ProtectFunction.java
package com.protegrity.samples.spark;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.Function;
import com.protegrity.spark.Protector;
import com.protegrity.spark.PtySparkProtector;
import com.protegrity.spark.PtySparkProtectorException;
public class ProtectFunction implements Function<String, String> {
private String delim = null;
private Protector protector = null;
private String[] dataElement = null;
private static final long serialVersionUID = 5693013187892705446L;
// Initialize the Spark protector (PtySparkProtector) in the constructor
public ProtectFunction(String appid, String delim, String[] dataElement) throws IOException
{
this.delim = delim;
this.dataElement = dataElement;
// create the new object for class ptySparkProtector
protector = new PtySparkProtector(appid);
}
@Override
public String call(String line) throws PtySparkProtectorException {
// split the input line on the delimiter
String[] splits = line.split(delim);
// store first split in protectedString as we are not going to protect first split.
String protectedString = splits[0];
// Initialize input size
String[] input = new String[splits.length];
// Initialize output size
String[] output = new String[splits.length];
// Initialize errorList
List<Integer> errorList = new ArrayList<Integer>();
// Iterate through the splits and call protect operation
for (int i = 1; i < splits.length; i++) {
input[i] = splits[i];
// To protect data, call the protect method with the data element, errorList, input array,
// and output array; the result is stored in output[]
protector.protect(dataElement[i - 1], errorList, input, output);
// protector.protect(dataElement[i - 1], errorList, splits, output);
protectedString += delim + output[i];
}
return protectedString;
}
}
12.11.3.5 UnprotectFunction.java
package com.protegrity.samples.spark;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.function.Function;
import com.protegrity.spark.Protector;
import com.protegrity.spark.PtySparkProtector;
import com.protegrity.spark.PtySparkProtectorException;
public class UnProtectFunction implements Function<String, String> {
private String delim = null;
private Protector protector = null;
private String[] dataElement = null;
private static final long serialVersionUID = 5693013187892705446L;
// Initialize the Spark protector (PtySparkProtector) in the constructor
public UnProtectFunction(String appid, String delim, String[] dataElement) throws
IOException {
this.delim = delim;
this.dataElement = dataElement;
// create the new object for class ptySparkProtector
protector = new PtySparkProtector(appid);
}
@Override
public String call(String line) throws PtySparkProtectorException {
// split the input line on the delimiter
String[] splits = line.split(delim);
// store first split in unprotectedString
String unprotectedString = splits[0];
// Initialize input size
String[] input = new String[splits.length];
// Initialize output size
String[] output = new String[splits.length];
// Initialize errorList
List<Integer> errorList = new ArrayList<Integer>();
// Iterate through the splits and call unprotect operation
for (int i = 1; i < splits.length; i++) {
input[i] = splits[i];
// To unprotect data, call the unprotect method with the data element, errorList, input array,
// and output array; the result is stored in output[]
protector.unprotect(dataElement[i - 1], errorList, input, output);
unprotectedString += delim + output[i];
}
return unprotectedString;
}
}
12.11.4 Sample Code Usage for Spark (Scala)
The Spark protector sample program described in this section is an example of how to use the Protegrity Spark protector APIs with Scala.
The sample program utilizes the following three Scala classes for protecting and unprotecting data:
ProtectData.scala - This main class creates the Spark context object and calls the
DataLoader class for reading cleartext data.
UnProtectData.scala - This main class creates the Spark Context object and calls the
DataLoader class for reading protected data.
DataLoader.scala - This loader class fetches the input from the input path, calls the
ProtectFunction to protect the data, and stores the protected data as output in the
output path. In addition, it fetches the input from the protected path, calls the
UnProtectFunction to unprotect the data, and stores the cleartext content as output.
The following function classes perform protection or unprotection for each line of their input:
ProtectFunction - This class calls the Spark protector for every new line specified in
the input to protect data.
UnProtectFunction - This class calls the Spark protector for every new line specified
in the input to unprotect data.
12.11.4.1 Main Job Class for Protect Operation ProtectData.scala
package com.protegrity.samples.spark.scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object ProtectData {
def main(args: Array[String]) {
// create a SparkContext object, which tells Spark how to access a cluster.
val sparkContext = new SparkContext(new SparkConf())
// create the new object for class DataLoader
val protector = new DataLoader(sparkContext)
// Call the writeProtectedData method, which reads clear data from the input path (args(0)) and
// writes protected data to the output path (args(1)) after the protect operation
protector.writeProtectedData(args(0), args(1), ",")
}
}
12.11.4.2 Main Job Class for Unprotect Operation UnProtectData.scala
package com.protegrity.samples.spark.scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object UnProtectData {
def main(args: Array[String]) {
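// create a SparkContext and a DataLoader, then call unprotectData to read protected
// data from the input path (args(0)) and write cleartext data to the output path (args(1))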
val sparkContext = new SparkContext(new SparkConf())
val protector = new DataLoader(sparkContext)
protector.unprotectData(args(0), args(1), ",")
}
}
12.11.4.3 Utility to call Protect or Unprotect Function DataLoader.scala
package com.protegrity.samples.spark.scala
import org.apache.log4j.Logger
import org.apache.spark.SparkContext
object DataLoader {
private val logger = Logger.getLogger(classOf[DataLoader])
}
/**
* A Data loader utility for reading & writing protected and un-protected data
*/
class DataLoader(private var sparkContext: SparkContext) {
private var data_element_names: Array[String] = Array("TOK_NAME", "TOK_PHONE",
"TOK_CREDIT_CARD", "TOK_AMOUNT")
private var appid: String = sparkContext.getConf.getAppId
/**
* Writes protected data to the output path delimited by the input delimiter
*
* @param inputPath - path of the input employee info file
* @param outputPath - path where the output should be saved
* @param delim - denotes the delimiter between the fields in the file
*/
def writeProtectedData(inputPath: String, outputPath: String, delim: String) {
// read lines from the input path & create RDD
val rdd = sparkContext.textFile(inputPath)
//import ProtectFunction
import com.protegrity.samples.spark.scala.ProtectFunction._
//call ProtectFunction on rdd
rdd.ProtectFunction(delim, appid, data_element_names, outputPath)
}
/**
* Reads protected data from the input path delimited by the input delimiter
*
* @param protectedInputPath - path of the protected employee data
* @param unprotectedOutputPath - output path where unprotected data should be stored.
* @param delim
*/
def unprotectData(protectedInputPath: String, unprotectedOutputPath: String, delim: String)
{
// read lines from the protectedInputPath & create RDD
val protectedRdd = sparkContext.textFile(protectedInputPath)
//import UnProtectFunction
import com.protegrity.samples.spark.scala.UnProtectFunction._
//call UnprotectFunction on rdd
protectedRdd.UnprotectFunction(delim, appid, data_element_names, unprotectedOutputPath)
}
}
12.11.4.4 ProtectFunction.scala
package com.protegrity.samples.spark.scala
import java.util.ArrayList
import org.apache.spark.rdd.RDD
import com.protegrity.spark.Protector
import com.protegrity.spark.PtySparkProtector
object ProtectFunction {
/* This class is declared implicit so that we can add new functionality to an RDD on the fly.
   Implicits are lexically scoped: the added functions are available only where this object is imported. */
implicit class Protect(rdd: RDD[String]) {
def ProtectFunction(delim: String, appid: String, dataElement: Array[String],
protectoutputpath: String) =
{
val protectedRDD = rdd.map { line =>
// split the input line on the delimiter
val splits = line.split(delim)
// store first split in protectedString as we are not going to protect first split.
var protectedString = splits(0)
// Initialize input size
val input = Array.ofDim[String](splits.length)
// Initialize output size
val output = Array.ofDim[String](splits.length)
// Initialize errorList
val errorList = new ArrayList[Integer]()
// create the new object for class ptySparkProtector
var protector: Protector = new PtySparkProtector(appid)
// Iterate through the splits and call protect operation
for (i <- 1 until splits.length) {
input(i) = splits(i)
// To protect data, call the protect method with the data element, errorList, input array,
// and output array; the result is stored in output
protector.protect(dataElement(i - 1), errorList, input, output)
// Append the output to protectedString
protectedString += delim + output(i)
}
protectedString
}
// Save protectedRDD into output path
protectedRDD.saveAsTextFile(protectoutputpath)
}
}
}
12.11.4.5 UnprotectFunction.scala
package com.protegrity.samples.spark.scala
import java.util.ArrayList
import org.apache.spark.rdd.RDD
import com.protegrity.spark.Protector
import com.protegrity.spark.PtySparkProtector
object UnProtectFunction {
/* This class is declared implicit so that we can add new functionality to an RDD on the fly.
   Implicits are lexically scoped: the added functions are available only where this object is imported. */
implicit class Unprotect(protectedRDD: RDD[String]) {
def UnprotectFunction(delim: String, appid: String, dataElement: Array[String],
unprotectoutputpath: String) =
{
val unprotectedRDD = protectedRDD.map { line =>
// split the input line on the delimiter
val splits = line.split(delim)
// store first split in unprotectedString
var unprotectedString = splits(0)
// Initialize input size
val input = Array.ofDim[String](splits.length)
// Initialize output size
val output = Array.ofDim[String](splits.length)
// Initialize errorList
val errorList = new ArrayList[Integer]()
// create the object for class ptySparkProtector
var protector: Protector = new PtySparkProtector(appid)
// Iterate through the splits and call unprotect operation
for (i <- 1 until splits.length) {
input(i) = splits(i)
// To unprotect data, call the unprotect method with the data element, errorList, input array,
// and output array; the result is stored in output
protector.unprotect(dataElement(i - 1), errorList, input, output)
// Append the output to unprotectedString
unprotectedString += delim + output(i)
}
unprotectedString
}
// Save unprotectedRDD into output path
unprotectedRDD.saveAsTextFile(unprotectoutputpath)
}
}
}
13 Appendix: HDFSFP Demo
In this demo we perform the following tasks:
Secure sensitive PCI and HR data that is stored in HDFS.
Restrict access to data, using security policy specific to sensitive data.
Track and monitor all access to sensitive data.
13.1 Roles in the Demo
The following table describes the predefined user roles in the demo.
Table 12-1 Predefined User Roles

Role/Process   | Member       | Description
Security Team  | SamSecurity  | This role sets up the security policy, creates data elements and the access control list, and determines who will have access to which directory.
IT Privileged  | IgorIT       | This role is responsible for protecting data and keeps the technology up.
Sales Team     | SallySales   | This role has full access to sales data.
HR Team        | TomHR        | This role has full access to employee data.
Executive Team | JohnCEO      | This role has read access to all data.
Data Scientist | KerryAnalyst | This role performs analysis on sales and employee data.
13.2 HDFS Directories used in Demo
The following HDFS directories are used in the Demo:
/company/employees
/company/sales
/company/public
/company/analysis
/company/analysis/sales
/company/analysis/employees
13.3 User Permissions for HDFS Directories
The following table lists the user permissions for the HDFS directories in the demo.
Table 12-2 User Permissions for HDFS Directories

Directory Path              | SamSecurity | IgorIT        | SallySales                  | TomHR                       | JohnCEO   | KerryAnalyst
/company/employees          | No Access   | Write, Create | No Access                   | Read, Write, Create, Delete | Read      | Read
/company/sales              | No Access   | Write, Create | Read, Write, Create, Delete | No Access                   | Read      | Read
/company/public             | No Access   | Write, Create | Read                        | Read                        | Read      | Read
/company/analysis           | No Access   | No Access     | No Access                   | No Access                   | No Access | Read, Write, Create, Delete
/company/analysis/sales     | No Access   | No Access     | Read                        | No Access                   | Read      | Read, Write, Create, Delete
/company/analysis/employees | No Access   | No Access     | No Access                   | Read                        | Read      | Read, Write, Create, Delete
13.4 Prerequisites for the Demo
Download the Big Data Protector.
Install the ESA.
Install the Big Data Protector.
Create the following Linux OS users:
o SamSecurity
o IgorIT
o SallySales
o TomHR
o JohnCEO
o KerryAnalyst
Create an unstructured policy using Policy management in ESA with the following user roles:
o SecurityTeam - Add the user SamSecurity as the policy user.
o ITPrivileged - Add the user IgorIT as the policy user.
o SalesTeam - Add the user SallySales as the policy user.
o HRTeam - Add the user TomHR as the policy user.
o ExecutiveTeam - Add the user JohnCEO as the policy user.
o DataScientist - Add the user KerryAnalyst as the policy user.
Create the following data elements with AES-256 encryption:
o EmployeeDE
o SalesDE
o AnalysisDE
o SalesAnalysisDE
o EmpAnalysisDE
Create a structured policy using Policy management in ESA with the following parameters:
o Create a data element TRN_CUSTOMER_NAME with alpha numeric tokenization.
o Create a data element CC_NUMBER with credit card tokenization.
o Assign the protect permission to the DataScientist user role for the data elements
TRN_CUSTOMER_NAME and CC_NUMBER.
Add the permissions, roles, and data element mappings in the policy, as defined in the following table.
Table 12-3 Permissions, Roles and Data Elements

Role          | Data Element    | Unprotect | Protect | Reprotect | Delete | Create | Manage protection
ITPrivileged  | EmployeeDE      | No        | Yes     | No        | No     | Yes    | No
ITPrivileged  | SalesDE         | No        | Yes     | No        | No     | Yes    | No
ITPrivileged  | AnalysisDE      | No        | No      | No        | No     | No     | No
ITPrivileged  | SalesAnalysisDE | No        | No      | No        | No     | No     | No
ITPrivileged  | EmpAnalysisDE   | No        | No      | No        | No     | No     | No
SalesTeam     | EmployeeDE      | No        | No      | No        | No     | No     | No
SalesTeam     | SalesDE         | Yes       | Yes     | Yes       | Yes    | Yes    | No
SalesTeam     | AnalysisDE      | No        | No      | No        | No     | No     | No
SalesTeam     | SalesAnalysisDE | Yes       | No      | No        | No     | No     | No
SalesTeam     | EmpAnalysisDE   | No        | No      | No        | No     | No     | No
HRTeam        | EmployeeDE      | Yes       | Yes     | Yes       | Yes    | Yes    | No
HRTeam        | SalesDE         | No        | No      | No        | No     | No     | No
HRTeam        | AnalysisDE      | No        | No      | No        | No     | No     | No
HRTeam        | SalesAnalysisDE | No        | No      | No        | No     | No     | No
HRTeam        | EmpAnalysisDE   | Yes       | No      | No        | No     | No     | No
ExecutiveTeam | EmployeeDE      | Yes       | No      | No        | No     | No     | No
ExecutiveTeam | SalesDE         | Yes       | No      | No        | No     | No     | No
ExecutiveTeam | AnalysisDE      | Yes       | No      | No        | No     | No     | No
ExecutiveTeam | SalesAnalysisDE | Yes       | No      | No        | No     | No     | No
ExecutiveTeam | EmpAnalysisDE   | Yes       | No      | No        | No     | No     | No
DataScientist | EmployeeDE      | Yes       | No      | No        | No     | No     | No
DataScientist | SalesDE         | Yes       | No      | No        | No     | No     | No
DataScientist | AnalysisDE      | Yes       | Yes     | Yes       | Yes    | Yes    | No
DataScientist | SalesAnalysisDE | Yes       | Yes     | Yes       | Yes    | Yes    | No
DataScientist | EmpAnalysisDE   | Yes       | Yes     | Yes       | Yes    | Yes    | No
Deploy the policy on all nodes in the Hadoop cluster.
Configure the Protegrity Crypto codec.
For more information about configuring the Protegrity Crypto codec, refer to section 3.1.8.1
Configuring HDFSFP for MapReduce, v1 (MRv1).
The ACL entries and their permission data elements used in the demo are listed in the following table.
Table 12-4 ACL Entries and Permission Data Elements

HDFS Path                   | Permission Data Element | Data Store
/company/employees          | EmployeeDE              | mystore
/company/sales              | SalesDE                 | mystore
/company/analysis           | AnalysisDE              | mystore
/company/analysis/sales     | SalesAnalysisDE         | mystore
/company/analysis/employees | EmpAnalysisDE           | mystore
13.5 Running the Demo
13.5.1 Protecting Existing Data in HDFS
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy polices.
Create the employees directory and copy employee data into it using the following commands:
hadoop fs -put hdfsfp/demo/employee-data.csv
/company/employees
hadoop fs -put employee-travel-policy.doc /company/employees
Start the Cache Monitor daemon as the IT user (IgorIT).
Create the datastore on the ESA using the following command and start the Cache Refresh
daemon:
dfsdatastore -add mydatastore -host <external ip of name node> -port
<Protegrity Cache port> -auth <Protegrity Cache password>
Postrequisites:
All data inside the /company/employees directory should get protected.
To protect existing employee data in HDFS:
1. Add a cluster using the dfsdatastore utility.
2. Start the dfscachemon daemon on the ESA Web Interface.
3. Ensure that the Security Officer creates the ACL entry for the HDFS path
/company/employees.
4. Run the following dfsadmin command in the ESA CLI Manager.
dfsadmin -protect /company/employees -permissionde EmployeeDE -datastore <Datastore name>
5. Activate the protection using the following dfsadmin command:
dfsadmin -activate -datastore <Datastore name>
6. Accept the Confirm option.
7. Verify if the data is protected using the following command:
hadoop ptyfs -cat /company/employees/employee-data.csv | less
13.5.2 Ingesting Data into a Protected Directory
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
The ITUser user creates the /company/sales directory in HDFS.
The Security Officer creates the ACL entry for the sales directory using the dfsadmin utility.
Postrequisites:
The data ingested in the sales directory should get protected.
To ingest data into the protected sales directory:
1. Login as the IgorIT user by using the following command.
su - IgorIT
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the ITUser user to ingest data into the sales directory:
hadoop ptyfs -copyFromLocal sales-data.csv /company/sales
4. Verify if the path is protected using the following command:
hadoop ptyfs -cat /company/sales/sales-data.csv | less
13.5.3 Ingesting Data into an Unprotected Public Directory
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
The ITUser user creates the /company/public directory in HDFS.
Postrequisites:
Data ingested inside the public directory should not be protected.
To ingest data into an unprotected public directory:
1. Login as IgorIT using the following command:
su - IgorIT
2. Navigate to the directory /opt/protegrity/hdfsfp.
3. Run the following command as the ITUser user to ingest data into the public directory:
hadoop ptyfs -copyFromLocal public-data.csv /company/public
4. Verify that the data is not protected using the following command:
hadoop ptyfs -cat /company/public/public-data.csv | less
13.5.4 Reading the Data by Authorized Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
Execute section 13.5.2 Ingesting Data into a Protected Directory.
Create the local directory /opt/protegrity/data/employees.
Postrequisites:
The HR user should be able to read employee data in clear form.
To read employee data:
1. Login as the TomHR user using the following command:
su - TomHR
2. Navigate to the directory /opt/protegrity/hdfsfp.
3. Run the following command as the TomHR user to read employee data:
hadoop ptyfs -copyToLocal /company/employees/employee-data.csv
/opt/protegrity/data/employees
4. Verify if the data is unprotected using the following command:
cat /opt/protegrity/data/employees/employee-data.csv | less
13.5.5 Reading the Data by Unauthorized Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
Execute section 13.5.2 Ingesting Data into a Protected Directory.
Create the local directory /opt/protegrity/data/employees.
Postrequisites:
The HR user receives the Permission Denied error.
To read sales data:
1. Login as the TomHR user using the following command:
su - TomHR
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the TomHR user to read sales data:
hadoop ptyfs -copyToLocal /company/sales/sales-data.csv
/opt/protegrity/data/employees
13.5.6 Copying Data from One Directory to Another by Authorized
Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
Execute section 13.5.2 Ingesting Data into a Protected Directory.
Postrequisites:
The sales user is able to copy sales data in clear form.
To copy data from one directory to another:
1. Login as the SallySales user using the following command:
su - SallySales
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the SallySales user to read sales data:
hadoop ptyfs -cp /company/sales/sales-data.csv /company/analysis/sales
4. Verify if the data is unprotected using the following command:
hadoop ptyfs -cat /company/analysis/sales/sales-data.csv | less
13.5.7 Copying Data from One Directory to Another by Unauthorized
Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
Execute section 13.5.1 Protecting Existing Data in HDFS.
Postrequisites:
The sales user should receive the Write Permission Denied exception for the directory
/company/employees.
To copy data from one directory to another:
1. Login as the SallySales user using the following command:
su - SallySales
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the SallySales user to copy data to the employees directory:
hadoop ptyfs -put employee-list.csv /company/employees
13.5.8 Deleting Data by Authorized Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on Hadoop cluster.
Create and deploy policies.
The Security Officer creates ACL entry for the /company/analysis/sales folder.
Execute section 13.5.2 Ingesting Data into a Protected Directory.
Postrequisites:
The sales user should be able to delete sales data.
To delete data:
1. Login as the SallySales user using the following command:
su - SallySales
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the SallySales user:
hadoop ptyfs -rm /company/analysis/sales/sales-data.csv
4. Verify if the data is deleted using the following command:
hadoop ptyfs -ls /company/analysis/sales
13.5.9 Deleting Data by Unauthorized Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
The Security Officer creates ACL entry for the /company/analysis/sales folder.
Execute section 13.5.2 Ingesting Data into a Protected Directory.
Postrequisites:
The HR user should get the Delete Permission Denied exception.
To delete data:
1. Login as the TomHR user using the following command:
su - TomHR
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the TomHR user:
hadoop ptyfs -rm /company/analysis/sales/sales-data.csv
13.5.10 Copying Data to a Public Directory by Authorized Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
Execute section 13.5.1 Protecting Existing Data in HDFS.
Postrequisites:
The HR user should be able to copy employee data in clear form into the public directory.
To copy data to a public directory:
1. Login as the TomHR user using the following command:
su - TomHR
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the TomHR user:
hadoop ptyfs -cp /company/employees/employee-travel-policy.doc
/company/public
4. Verify if the data is copied using the following command:
hadoop ptyfs -cat /company/public/employee-travel-policy.doc | less
13.5.11 Running MapReduce Job by Authorized Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
Execute section 13.5.2 Ingesting Data into a Protected Directory.
Postrequisites:
The MapReduce job should be successfully completed. The output of the MapReduce job
should be saved in protected form in the /company/analysis/sales directory.
To run a MapReduce Job:
1. Login as the SallySales user using the following command:
su - SallySales
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the SallySales user:
hadoop jar /hdfsfp/hdfsfp-x.x.x.jar
com.protegrity.hadoop.fileprotector.PtyCryptoWriter /company/sales
/company/analysis/sales/result
In the MapReduce job for sales analysis using the Big Data Protector API, we tokenize customer
name and credit card data.
13.5.12 Reading Data for Analysis by Authorized Users
Prerequisites:
Install HDFSFP.
Configure HDFSFP on the Hadoop cluster.
Create and deploy policies.
Execute section 13.5.11 Running MapReduce Job by Authorized Users.
Create the local directory /opt/protegrity/data/sales/analysis.
Postrequisites:
The data scientist should be able to read the sales analysis data in clear form, with the PCI data tokenized.
To read data for analysis:
1. Login as the KerryAnalyst user using the following command:
su - KerryAnalyst
2. Navigate to the directory /hdfsfp/demo.
3. Run the following command as the KerryAnalyst user to read the sales analysis data:
hadoop ptyfs -copyToLocal /company/analysis/sales/result/part-r-00000
/opt/protegrity/data/sales/analysis
4. Verify if the data is unprotected with the PCI data tokenized using the following command:
cat /opt/protegrity/data/sales/analysis/part-r-00000 | less
14 Appendix: Using Hive with HDFSFP
This section describes how to use Hive with HDFSFP.
The Hive native Load command for loading data from any file in the local file
system or HDFS to a protected Hive table is not supported.
This release supports TEXTFILE, RCFile, and SEQUENCEFILE formats only.
14.1 Data Used by the Samples
The employee.csv file used in the examples demonstrating usage of Hive with HDFSFP contains the
following sample data.
928724,Hultgren Caylor
928725,Bourne Jose
928726,Sorce Hatti
928727,Lorie Garvey
928728,Belva Beeson
928729,Hultgren Caylor
928730,Bourne Jose
928731,Lorie Garvey
928732,Bourne Jose
928733,Hultgren Caylor
928734,Lorie Garvey
14.2 Ingesting Data to Hive Table
If you need to ingest data from a relational database to a Hive protected table in HDFS, then ensure
that you load the data through Sqoop and use the -D target.output.dir parameter, as described in
the following command.
sqoop import -D target.output.dir="/tmp/src" --fields-terminated-by ',' --driver
com.mysql.jdbc.Driver --connect jdbc:mysql://master.localdomain:3306/db --username
user1 --password protegrity --table emp --hive-import --hive-table=emp_sqp -m 1;
14.2.1 Ingesting Data from HDFSFP Protected External Hive Table to
HDFSFP Protected Internal Hive Table
To ingest data from an HDFSFP protected external Hive table to an HDFSFP protected
internal Hive table:
1. Create a Protect ACL for the following path.
/user/ptyitusr/ext1
2. Copy the cleartext file (employee.csv) into the protected staging directory
(/user/ptyitusr/ext1) using the following commands:
hadoop ptyfs -put employee.csv /user/ptyitusr/ext1
hadoop fs -ls /user/ptyitusr/ext1
hadoop fs -cat /user/ptyitusr/ext1/employee.csv
3. Create an external Hive table (ext1) with the data contained in the protected staging
directory (/user/ptyitusr/ext1) using the following commands.
CREATE EXTERNAL TABLE ext1 (empid string, name string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/ptyitusr/ext1';
select * from ext1;
4. Create a Protect ACL for the following path.
<hive.metastore.warehouse.dir>/int1
5. Create an internal Hive table (int1) using the following command.
CREATE TABLE int1 (empid string, name string) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
6. Insert the cleartext data from the external Hive table (ext1) to the internal Hive table (int1)
in protected form using the following commands.
SET hive.exec.compress.output=true;
insert overwrite table int1 select empid, name from ext1;
The data is stored in the internal Hive table int1 in protected form.
7. Revert the value of the hive.exec.compress.output parameter using the following command.
SET hive.exec.compress.output=false;
8. To view the data in the table int1, execute the following command.
select * from int1;
14.2.2 Ingesting Protected Data from HDFSFP Protected Hive Table to
another HDFSFP Protected Hive Table
To ingest protected data from an HDFSFP protected external Hive table to another
HDFSFP protected Hive table:
1. Create Protect ACLs for the following paths.
/user/ptyitusr/ext1
/user/ptyitusr/ext2
2. Copy the cleartext file (employee.csv) into the protected staging directory
(/user/ptyitusr/ext1) using the following commands:
hadoop ptyfs -put employee.csv /user/ptyitusr/ext1
hadoop fs -ls /user/ptyitusr/ext1
hadoop fs -cat /user/ptyitusr/ext1/employee.csv
3. Create an external table (ext1) with the data contained in the protected staging directory
using the following commands:
CREATE EXTERNAL TABLE ext1 (empid string, name string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/ptyitusr/ext1';
select * from ext1;
4. Create an external Hive table (ext2) using the following command:
CREATE EXTERNAL TABLE ext2 (empid string, name string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/ptyitusr/ext2';
5. Insert the cleartext data from the external Hive table ext1 to the Hive table ext2 in protected
form using the following commands:
SET hive.exec.compress.output=true;
insert overwrite table ext2 select empid, name from ext1;
The data is stored in the Hive table ext2 in protected form.
6. Revert the value of the hive.exec.compress.output parameter using the following command.
SET hive.exec.compress.output=false;
7. To view the data in the table ext2, execute the following command.
select * from ext2;
14.3 Tokenization and Detokenization with HDFSFP
14.3.1 Verifying Prerequisites for Using Hadoop Application Protector
Ensure that the following data elements are present in data security policy and the policy is deployed
on all the Big Data Protector nodes:
TOK_NUM_1_3_LP Length preserving Numeric tokenization data element with SLT_1_3
TOK_ALPHA_2_3_LP Length preserving Alpha tokenization data element with SLT_2_3
14.3.2 Ingesting Data from HDFSFP Protected External Hive Table to
HDFSFP Protected Internal Hive Table in Tokenized Form
To ingest data from an HDFSFP protected external Hive table to an HDFSFP protected
internal Hive table in tokenized form:
1. Create Protect ACLs for the following paths.
/user/ptyitusr/ext1
<hive.metastore.warehouse.dir>/int1
2. Copy the cleartext file (employee.csv) into the protected staging directory
(/user/ptyitusr/ext1) using the following commands:
hadoop ptyfs -put employee.csv /user/ptyitusr/ext1
hadoop fs -ls /user/ptyitusr/ext1
hadoop fs -cat /user/ptyitusr/ext1/employee.csv
3. Create an external Hive table (ext1) with the data contained in the protected staging
directory (/user/ptyitusr/ext1) using the following commands.
CREATE EXTERNAL TABLE ext1 (empid string, name string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/ptyitusr/ext1';
select * from ext1;
4. Create an internal Hive table (int1) using the following command.
CREATE TABLE int1 (empid string, name string) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
5. Insert the cleartext data from the external Hive table (ext1) to the internal Hive table (int1)
in tokenized form using the following commands.
SET hive.exec.compress.output=true;
create temporary function ptyProtectStr AS
'com.protegrity.hive.udf.ptyProtectStr';
insert overwrite table int1 select ptyProtectStr(empid, 'TOK_NUM_1_3_LP'),
ptyProtectStr(name, 'TOK_ALPHA_2_3_LP') from ext1;
The data is stored in the internal Hive table int1 in tokenized form.
6. Revert the value of the hive.exec.compress.output parameter using the following command.
SET hive.exec.compress.output=false;
7. To view the data in the table int1, execute the following command.
select * from int1;
14.3.3 Ingesting Detokenized Data from HDFSFP Protected Internal
Hive Table to HDFSFP Protected External Hive Table
To ingest detokenized data from an HDFSFP protected internal Hive table with tokenized
data to an HDFSFP protected external Hive table:
1. Create Protect ACLs for the following paths.
/user/ptyitusr/ext1
<hive.metastore.warehouse.dir>/int1
2. Create an external Hive table (ext1) with the data contained in the protected staging
directory (/user/ptyitusr/ext1) using the following commands.
CREATE EXTERNAL TABLE ext1 (empid string, name string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/ptyitusr/ext1';
3. Insert the tokenized data from the internal Hive table (int1) to the external Hive table (ext1)
in detokenized form using the following commands.
SET hive.exec.compress.output=true;
create temporary function ptyUnprotectStr AS
'com.protegrity.hive.udf.ptyUnprotectStr';
insert overwrite table ext1 select ptyUnprotectStr(empid, 'TOK_NUM_1_3_LP'),
ptyUnprotectStr(name, 'TOK_ALPHA_2_3_LP') from int1;
The detokenized data is stored in the external Hive table ext1.
4. Revert the value of the hive.exec.compress.output parameter using the following command.
SET hive.exec.compress.output=false;
5. To view the data in the table ext1, execute the following command.
select * from ext1;
14.3.4 Ingesting Data from HDFSFP Protected External Hive Table to
Internal Hive Table not protected by HDFSFP in Tokenized Form
To ingest data from an HDFSFP protected external Hive table to an internal Hive table
not protected by HDFSFP in tokenized form:
1. Create a Protect ACL for the following path.
/user/ptyitusr/ext1
2. Copy the cleartext file (employee.csv) into the protected staging directory
(/user/ptyitusr/ext1) using the following commands:
hadoop ptyfs -put employee.csv /user/ptyitusr/ext1
hadoop fs -ls /user/ptyitusr/ext1
hadoop fs -cat /user/ptyitusr/ext1/employee.csv
3. Create an external Hive table (ext1) with the data contained in the protected staging
directory (/user/ptyitusr/ext1) using the following commands.
CREATE EXTERNAL TABLE ext1 (empid string, name string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/ptyitusr/ext1';
select * from ext1;
4. Create an internal Hive table (int1) using the following command.
CREATE TABLE int1 (empid string, name string) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
5. Insert the cleartext data from the external Hive table (ext1) to the internal Hive table (int1)
in tokenized form using the following commands. Set the hive.exec.compress.output parameter to
false to prevent encrypted data from being inserted into the unprotected table.
SET hive.exec.compress.output=false;
create temporary function ptyProtectStr AS
'com.protegrity.hive.udf.ptyProtectStr';
insert overwrite table int1 select ptyProtectStr(empid, 'TOK_NUM_1_3_LP'),
ptyProtectStr(name, 'TOK_ALPHA_2_3_LP') from ext1;
The data is stored in the internal Hive table int1 in tokenized form.
6. To view the tokenized data in the table int1, execute the following command.
select * from int1;
14.3.5 Ingesting Detokenized Data from Internal Hive Table not
protected by HDFSFP to HDFSFP Protected External Hive Table
To ingest detokenized data from an internal Hive table with tokenized data not protected
by HDFSFP to an HDFSFP protected external Hive table:
1. Create a Protect ACL for the following path.
/user/ptyitusr/ext1
2. Create an external Hive table (ext1) with the data contained in the protected staging
directory (/user/ptyitusr/ext1) using the following commands.
CREATE EXTERNAL TABLE ext1 (empid string, name string) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION
'/user/ptyitusr/ext1';
3. Insert the tokenized data from the internal Hive table (int1) to the external Hive table (ext1)
in detokenized form using the following commands.
SET hive.exec.compress.output=true;
create temporary function ptyUnprotectStr AS
'com.protegrity.hive.udf.ptyUnprotectStr';
insert overwrite table ext1 select ptyUnprotectStr(empid, 'TOK_NUM_1_3_LP'),
ptyUnprotectStr(name, 'TOK_ALPHA_2_3_LP') from int1;
The detokenized data is stored in the external Hive table ext1.
4. Revert the value of the hive.exec.compress.output parameter using the following command.
SET hive.exec.compress.output=false;
5. To view the detokenized data in the table ext1, execute the following command.
select * from ext1;
15 Appendix: Configuring Talend with HDFSFP
This section describes the procedures for configuring Talend with HDFSFP.
This section uses Talend version 5.6 for reference.
15.1 Verifying Prerequisites before Configuring Talend with
HDFSFP
Ensure that the following prerequisites are met before configuring Talend with HDFSFP:
Install the Big Data Protector with the HDFSFP option set to Yes in the BDP.config file.
Deploy the unstructured policy on the cluster.
15.2 Verifying the Talend Packages
Verify that the folder <PROTEGRITY_DIR>/etl/talend is populated with the following folders:
tptyHDFSInput This folder contains the custom Protegrity HDFS input component.
tptyHDFSOutput This folder contains the custom Protegrity HDFS output component.
jars This folder contains the third-party JARs.
docs This folder contains the user guide for Talend.
15.3 Configuring Talend with HDFSFP
To configure Talend with HDFSFP:
1. Login to the ESA.
2. Create the datastore.
3. Create the ACL entry for the datastore.
4. Activate the ACL entry for the datastore.
5. Create a directory for Talend using the following command.
mkdir <PROTEGRITY_DIR>/talend/
6. Copy the tptyHDFSInput and tptyHDFSOutput folders from the
<PROTEGRITY_DIR>/etl/talend/ directory to the <PROTEGRITY_DIR>/talend/ directory.
7. Create the hdfsfp directory for Talend using the following command.
mkdir <TALEND_Installation_DIR>/hdfsfp/
8. Copy the following files from the <PROTEGRITY_DIR>/hdfsfp/ directory to the
<TALEND_Installation_DIR>/hdfsfp directory.
hdfsfp.jar
jedis-2.1.0.jar
beuler.properties
hdfsfp-log4j.properties
9. Copy the following files from the <PROTEGRITY_DIR>/defiance_xc/java/lib/ directory to the
<TALEND_Installation_DIR>/hdfsfp/ directory.
xcpep2jni.jar
xcpep2jni.plm
xcpep2jni.properties
10. Copy the following third-party JAR files from the <PROTEGRITY_DIR>/etl/talend/jars
directory to the <TALEND_Installation_DIR>/hdfsfp/ directory.
commons-codec-1.4.jar
commons-collections-3.2.1.jar
15.4 Starting a Project in Talend
To start a project in Talend:
1. Login to the Edgenode machine with Talend installed.
2. Execute the TOS_BD-linux-gtk-x86.sh script to start Talend Open Studio for Big Data.
The Talend Open Studio for Big Data window appears.
3. In the Create a New Project box, type the following:
HDFS_INGEST_DATA
4. Click Create to create a new project.
The New Project window appears.
5. Click Finish to create the new project.
The Talend Open Studio for Big Data window appears listing the new project.
6. Click Open.
The Talend Open Studio for Big Data workspace appears.
15.5 Configuring the Preferences for Talend
To configure the preferences for Talend:
1. On the Talend Open Studio for Big Data workspace, click Window.
The Window menu appears.
2. Click Preferences.
The Preferences window appears.
3. Click on Talend in the preferences pane.
The general preferences for Talend appear.
4. Click on Components in the preferences pane.
The component preferences for Talend appear.
5. In the User component folder box, type the following path:
<PROTEGRITY_DIR>/talend/
6. Click Apply.
7. Click OK.
The preferences for Talend are updated.
15.6 Ingesting Data in the Target HDFS Directory in
Protected Form
To ingest cleartext data into the target HDFS directory in protected form:
1. Click the Job link under the Create a new section.
The New Job window appears.
2. Type the following in the Name box.
HDFS_INGEST_DATA
3. Enter a description in the Description box.
4. Click Finish.
The new job in Talend is created.
5. Verify that the custom components, tptyHDFSInput and tptyHDFSOutput, are loaded into the
palette by searching for the name tptyHDFS.
The two components, tptyHDFSInput and tptyHDFSOutput, appear.
6. Double-click the tHDFSConnection component to create the HDFS connection.
Enter the following properties for the connection, as required:
Distribution The distribution name
Hadoop version The version required for the Hadoop cluster
NameNode URI The domain name and port of the Name node to connect to HDFS
User name The user name used to perform the HDFSFP operations. For instance, the user is
ptyitusr.
target.output.dir The target HDFS directory in which to store the protected data. For
instance, the HDFS directory is /user/ptyitusr/talend.
7. Create the tHDFSDelete component.
a) If the directory exists, then delete the contents of the directory.
b) Select the Use an existing connection box to reuse the existing connection.
8. Create the tFileInputDelimited component.
a) Create a sample.csv file in the local file system directory /home/ptyitusr/input/ with the
following entries.
101,Adam,Wiley,Sales
102,Brian,Chester,Service
103,Julian,Cross,Sales
104,Dylan,Moore,Marketing
105,Chris,Murphy,Service
106,Brian,Collingwood,Service
107,Michael,Muster,Marketing
108,Miley,Rhodes,Sales
109,Chris,Coughlan,Sales
110,Aaron,King,Marketing
111,Adam,Young,Service
112,Tyler,White,Sales
113,Martin,Reeves,Service
114,Michael,Morton,Sales
b) Enter the input directory location, as required.
c) Enter the Row Separator option, as required.
d) Enter the Field Separator option, as required.
e) Edit the schema for the supplied input, which is required when writing data to the HDFS
path.
9. Create the tptyHDFSOutput component.
a) Click the Use an existing connection box to utilize the existing connection.
b) Enter the file location in HDFS, where the data needs to be protected and stored.
c) Enter the Row Separator option, as required.
d) Enter the Field Separator option, as required.
e) Select the Compress the data box.
f) Select PROTEGRITY_CODEC from the drop-down list.
g) Connect the tFileInputDelimited_1 component to the tptyHDFSOutput_1 component as
illustrated in the following figure.
h) Click Edit schema to map the input supplied and output file, which will be stored in the HDFS
path.
i) Click OK.
10. Connect all the components in the job diagram, as illustrated by the following figure.
11. Click on the Run tab.
The Run tab appears.
12. Click Run to run the job.
Check the job execution status in the console.
13. Login to the system with the user ptyitusr using the following command.
>> su - ptyitusr
14. List the file present in the HDFS directory /user/ptyitusr/talend/ using the following
command.
>> hadoop fs -ls /user/ptyitusr/talend/
15. Read the protected_data.csv file from the HDFS directory /user/ptyitusr/talend/ using the
following command.
>> hadoop fs -cat /user/ptyitusr/talend/protected_data.csv
The protected data appears.
15.7 Accessing the Data from the Protected Directory in
HDFS
To access data from the protected directory in HDFS:
1. Click the Job link under the Create a new section.
The New Job window appears.
2. Type the following in the Name box.
HDFS_READ_DATA
3. Enter a description in the Description box.
4. Click Finish.
The new job in Talend is created.
5. Double-click the tHDFSConnection component to create the HDFS connection.
Enter the following properties for the connection, as required:
Distribution The distribution name
Hadoop version The version required for the Hadoop cluster
NameNode URI The domain name and port of the Name node to connect to HDFS
User name The user name used to perform the HDFSFP operations. For instance, the user is
ptyitusr.
target.input.dir The target HDFS directory from which to read the protected data. For instance,
the HDFS directory is /user/ptyitusr/talend.
6. Create the tptyHDFSInput component.
a) Select the Use an existing connection box to reuse the existing connection.
b) Enter the HDFS location where the protected data is to be read, as required.
c) Enter the Row Separator option, as required.
d) Enter the Field Separator option, as required.
e) Select the Uncompress the data box.
f) Select PROTEGRITY_CODEC from the drop-down list.
g) Create the tFileOutputDelimited component.
h) Enter the local file system location, where the cleartext data needs to be stored.
i) Enter the Row Separator option, as required.
j) Enter the Field Separator option, as required.
k) Connect the tptyHDFSInput_1 component to the tFileOutputDelimited_1 component as
illustrated in the following figure.
l) Click Edit schema to map the fields in the protected data in the HDFS directory to the local
file location, which will store the cleartext data.
7. Connect all the components in the job diagram, as illustrated by the following figure.
8. Click on the Run tab.
The Run tab appears.
9. Click Run to run the job.
Check the job execution status in the console.
10. Login to the system with the user ptyitusr using the following command.
>> su - ptyitusr
11. Navigate to the /home/ptyitusr/output directory using the following command.
>> cd /home/ptyitusr/output
12. Read the unprotected_data.csv file from the local directory /home/ptyitusr/output/ using
the following command.
>> cat unprotected_data.csv
The cleartext data appears.
15.8 Configuring Talend Jobs to run with HDFSFP with
Target Exec as Remote
To configure Talend to run jobs with HDFSFP remotely:
1. Connect the tLibraryLoad component in the job diagram, as illustrated by the following figure.
2. In the Basic settings tab for the tLibraryLoad component, load the xcpep2jni.plm file, as
illustrated by the following figure.
3. In the Advanced settings tab for the tLibraryLoad component, set the dynamic library path
to /opt/protegrity/defiance_xc/java/lib, as illustrated by the following figure.
4. Copy the beuler.properties file from the <TALEND_Installation_DIR>/hdfsfp directory to the
home directory of the user that is used to run the Talend job server.
5. Ensure that the setting for Target Exec is set to Remote server, as illustrated by the following
figure.
6. If you are performing a protect operation, then connect all the components in the job
diagram, as illustrated by the following figure.
7. If you are performing an unprotect operation, then connect all the components in the job
diagram, as illustrated by the following figure.
15.9 Using Talend with HDFSFP and MapReduce
If you need to use Talend with HDFSFP and MapReduce, then perform the following actions.
1. Create a Routine in Talend Studio, which contains the Java code for invoking the MapReduce
Protector APIs for Protect or Unprotect.
2. Utilize the tMap component from the palette to perform the operations for processing the
data.
15.9.1 Protecting Data Using Talend with HDFSFP and MapReduce
To protect data using Talend with HDFSFP and MapReduce:
1. Access the .csv file from the local directory which contains cleartext data.
2. Protect the fields in the .csv file using the required token elements.
3. To ingest the protected data into the protected HDFS directory, where HDFSFP encryption
takes place, connect all components in the job diagram, as illustrated by the following figure.
The routine and the required properties to protect the data reside in the tMap component,
before the data is loaded into the HDFSFP protected directory.
4. To view the contents of the tMap component, double-click on tMap.
The following screen listing the contents of the tMap component appears.
The left pane lists the table with four columns and the right pane lists the routines called for
protecting the data.
5. To protect the data in the first two rows of the table, click the button beside the rows in the
right pane and, as illustrated in the following figure, call the class and function from the
routine using the following syntax:
new class_name().function_name("<Data Element>", <row_no>.<row_name>)
In the example, the first and second rows are protected using the function protectData()
and the token elements TE_N_S13_L1R2_Y and TE_LASCII_L2R1_Y respectively.
6. After the routine is called for the rows which need to be protected, the right pane illustrates
how the protect function is called from the Routine.
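For reference, the following minimal sketch shows how the protectData() routine from section 15.9.3.1 can be invoked with these token elements. It only illustrates the call syntax: the wrapper class TMapProtectSketch and the sample column values are hypothetical, and in Talend only the expressions themselves are typed into the tMap output columns. Executing the sketch would also require the Protegrity MapReduce protector libraries on the classpath and a deployed policy.
// Hypothetical standalone sketch; in Talend, only the expressions assigned to
// protectedEmpId and protectedName are entered in the tMap output columns.
import routines.MyRoutineDemo;

public class TMapProtectSketch {
    public static void main(String[] args) {
        String empId = "101";  // hypothetical value of the first input column
        String name = "Adam";  // hypothetical value of the second input column
        // Follows the syntax: new class_name().function_name("<Data Element>", <row_no>.<row_name>)
        String protectedEmpId = new MyRoutineDemo().protectData("TE_N_S13_L1R2_Y", empId.getBytes());
        String protectedName = new MyRoutineDemo().protectData("TE_LASCII_L2R1_Y", name.getBytes());
        System.out.println(protectedEmpId + "," + protectedName);
    }
}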
15.9.2 Unprotecting Data Using Talend with HDFSFP and MapReduce
To unprotect data using Talend with HDFSFP and MapReduce:
1. Access the protected data from the protected HDFS directory, where HDFSFP decryption
takes place.
2. Unprotect the fields in the .csv file using the required token elements.
3. To ingest the unprotected data into an output .csv file in the local directory, connect all
components in the job diagram, as illustrated by the following figure.
The routine and the required properties to unprotect the data reside in the tMap component,
before the data is written to the output .csv file in the local directory.
4. To view the contents of the tMap component, double-click on tMap.
The following screen listing the contents of the tMap component appears.
The left pane lists the table with four columns and the right pane lists the routines called for
unprotecting the data.
5. To unprotect the data in the first two rows of the table, click the button beside the rows in
the right pane and, as illustrated in the following figure, call the class and function from the
routine using the following syntax:
new class_name().function_name("<Data Element>", <row_no>.<row_name>)
In the example, the first and second rows are unprotected using the function
unprotectData() and the token elements TE_N_S13_L1R2_Y and TE_LASCII_L2R1_Y
respectively.
6. After the routine is called for the rows which need to be unprotected, the right pane illustrates
how the unprotect function is called from the Routine.
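The unprotect call follows the same pattern. The sketch below mirrors the protect sketch in section 15.9.1, again with a hypothetical wrapper class and hypothetical tokenized input values; in Talend, only the expressions are entered in the tMap output columns.
// Hypothetical standalone sketch of the unprotect call syntax.
import routines.MyRoutineDemo;

public class TMapUnprotectSketch {
    public static void main(String[] args) {
        String tokenizedEmpId = "874";  // hypothetical tokenized value of the first column
        String tokenizedName = "Kxel";  // hypothetical tokenized value of the second column
        // Follows the syntax: new class_name().function_name("<Data Element>", <row_no>.<row_name>)
        String empId = new MyRoutineDemo().unprotectData("TE_N_S13_L1R2_Y", tokenizedEmpId.getBytes());
        String name = new MyRoutineDemo().unprotectData("TE_LASCII_L2R1_Y", tokenizedName.getBytes());
        System.out.println(empId + "," + name);
    }
}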
15.9.3 Sample Code Usage
The MapReduce sample routine, described in this section, is an example of how to use the Protegrity
MapReduce protector APIs. The sample program utilizes the following two routine files:
Routine Item The main routine, which calls the Mapper job.
Routine Properties The properties related to the main routine.
15.9.3.1 Routine Item
package routines;
import com.protegrity.hadoop.fileprotector.fs.ProtectorException;
import com.protegrity.hadoop.fileprotector.fs.PtyHdfsProtector;
import com.protegrity.hadoop.mapreduce.ptyMapReduceProtector;
public class MyRoutineDemo {
public static void helloExample(String message) {
if (message == null) {
message = "World"; //$NON-NLS-1$
}
System.out.println("Hello " + message + " !"); //$NON-NLS-1$ //$NON-NLS-2$
}
public static PtyHdfsProtector protector = new PtyHdfsProtector();
public void copyFromLocalTest(String[] srcs, String dstf)
{
boolean result;
try {
result = protector.copyFromLocal(srcs, dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void copyToLocalTest(String srcs, String dstf)
{
boolean result;
try {
result = protector.copyToLocal(srcs, dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void copyTest(String srcs, String dstf)
{
boolean result;
try {
result = protector.copy(srcs, dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void mkdirTest(String dir)
{
boolean result;
try {
result = protector.mkdir(dir);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void moveTest(String srcs, String dstf)
{
boolean result;
try {
result = protector.move(srcs,dstf);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void deleteFileTest(String file,boolean skipTrash)
{
boolean result;
try {
result = protector.deleteFile(file, skipTrash);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
public void deleteDirTest(String dir,boolean skipTrash)
{
boolean result;
try {
result = protector.deleteDir(dir, skipTrash);
} catch (ProtectorException pe) {
pe.printStackTrace();
}
}
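// Tokenizes the input bytes with the specified data element using the MapReduce protector
// and returns the protected value as a String.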
public String protectData(String dataElement,byte[] data){
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
mapReduceProtector.openSession("0");
System.out.println(dataElement);
System.out.println(new String(data));
byte[] output = mapReduceProtector.protect(dataElement, data);
return new String(output);
}
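// Detokenizes the input bytes with the specified data element using the MapReduce protector
// and returns the unprotected value as a String.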
public String unprotectData(String dataElement,byte[] data){
ptyMapReduceProtector mapReduceProtector = new ptyMapReduceProtector();
mapReduceProtector.openSession("0");
System.out.println(dataElement);
System.out.println(new String(data));
byte[] output = mapReduceProtector.unprotect(dataElement, data);
return new String(output);
}
public static void main(String[] args) {
MyRoutineDemo protect = new MyRoutineDemo();
// Ingest Local Data into HDFS
String srcsCFL[] = new String[2];
srcsCFL[0] ="/home/ptyitusr/input/sample.csv";
srcsCFL[1] ="/home/ptyitusr/input/test.csv";
String dstfCFL ="/user/ptyitusr/talend";
protect.copyFromLocalTest(srcsCFL, dstfCFL);
// Extract HDFS file to Local
String srcsCTL= "/user/ptyitusr/talend/prot.csv";
String dstfCTL = "/home/ptyitusr/input";
protect.copyToLocalTest(srcsCTL, dstfCTL);
// Copy File from HDFS to HDFS
String srcsCopy="/user/ptyitusr/talend/prot.csv";
String dstfCopy ="/user/ptyitusr/talend";
protect.copyTest(srcsCopy, dstfCopy);
// Create HDFS Sub-Directory
String dir = "/user/ptyitusr/talend/sub";
protect.mkdirTest(dir);
// Move from HDFS to HDFS
String srcsMove = "/user/ptyitusr/talend/prot.csv";
String dstfMove = "/user/ptyitusr/talend";
protect.moveTest(srcsMove, dstfMove);
// Delete File from HDFS
String fileDelete = "/user/ptyitusr/talend/prot.csv";
boolean skipTrashFile = false;
protect.deleteFileTest(fileDelete,skipTrashFile);
// Delete Sub-Directory and Children from HDFS
String dirDelete = "/user/ptyitusr/talend";
boolean skipTrashDir = false;
protect.deleteDirTest(dirDelete,skipTrashDir);
}
}
15.9.3.2 Routine Properties
<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmi:version="2.0" xmlns:xmi="http://www.omg.org/XMI"
xmlns:TalendProperties="http://www.talend.org/properties">
<TalendProperties:Property xmi:id="_DTUMYDIkEeW6nfU7n79eDQ" id="_DTS-QDIkEeW6nfU7n79eDQ"
label="MyRoutineDemo" creationDate="2015-07-24T11:50:12.646-0500" modificationDate="2016-01-
19T02:06:24.583-0500" version="0.1" statusCode="" item="_DTUMYjIkEeW6nfU7n79eDQ"
maxInformationLevel="WARN" displayName="MyRoutineDemo">
<author href="../../talend.project#_5-CdsDHkEeWZEZzbK6p_uA"/>
<informations xmi:id="_-Y8DoDJQEeW6nfU7n79eDQ" level="WARN" text="The local variable result is
never read"/>
<informations xmi:id="_-Y8DoTJQEeW6nfU7n79eDQ" level="WARN" text="The local variable result is
never read"/>
<informations xmi:id="_-Y8DojJQEeW6nfU7n79eDQ" level="WARN" text="The local variable result is
never read"/>
<informations xmi:id="_-Y8DozJQEeW6nfU7n79eDQ" level="WARN" text="The local variable result is
never read"/>
<informations xmi:id="_-Y9RwDJQEeW6nfU7n79eDQ" level="WARN" text="The local variable result is
never read"/>
<informations xmi:id="_-Y9RwTJQEeW6nfU7n79eDQ" level="WARN" text="The local variable result is
never read"/>
<informations xmi:id="_-Y9RwjJQEeW6nfU7n79eDQ" level="WARN" text="The local variable result is
never read"/>
</TalendProperties:Property>
<TalendProperties:ItemState xmi:id="_DTUMYTIkEeW6nfU7n79eDQ" path=""/>
<TalendProperties:RoutineItem xmi:id="_DTUMYjIkEeW6nfU7n79eDQ" property="_DTUMYDIkEeW6nfU7n79eDQ"
state="_DTUMYTIkEeW6nfU7n79eDQ">
<content href="MyRoutineDemo_0.1.item#/"/>
<imports xmi:id="_uWRN4DInEeW6nfU7n79eDQ" mESSAGE="" mODULE="hdfsfp.jar" nAME="MyRoutineDemo"
rEQUIRED="true" urlPath="/TalendInstall/Talend-5.6.1/hdfsfp/hdfsfp.jar"/>
<imports xmi:id="_NMUX8L30EeWz-4GwVQX7Ig" mESSAGE="" mODULE="pepmapreduce-2.7.1.jar"
nAME="MyRoutineDemo" rEQUIRED="true" urlPath="/opt/protegrity/hadoop_protector/lib/pepmapreduce-
2.7.1.jar"/>
</TalendProperties:RoutineItem>
</xmi:XMI>
16 Appendix: Migrating Tokenized Unicode Data
from and to a Teradata Database
This section describes the procedures for migrating tokenized Unicode data from and to a Teradata
database.
16.1 Migrating Tokenized Unicode Data from a Teradata
Database
This section describes how to unprotect Unicode data in Hive, HAWQ, Impala, MapReduce, or Spark,
after the data was tokenized in the Teradata database using the Protegrity Database Protector and
then migrated to the Hadoop cluster.
Ensure that the data elements used in the data security policy deployed on the Teradata
Database Protector and Big Data Protector machines are uniform.
To migrate Tokenized Unicode data from a Teradata database to Hive, HAWQ, or Impala
and unprotect it using the Hive, HAWQ, or Impala protector:
1. Tokenize the Unicode data in the Teradata database using Protegrity Database Protector.
2. Migrate the tokenized Unicode data from the Teradata database to Hive, HAWQ, or Impala.
3. To unprotect the tokenized Unicode data on Hive, HAWQ, or Impala, ensure that the following
UDFs are used, as required:
Hive: ptyUnprotectUnicode()
HAWQ: pty_UnicodeVarcharSel()
Impala: pty_UnicodeStringSel()
To migrate Tokenized Unicode data from a Teradata database to Hadoop and unprotect
it using MapReduce or Spark protector:
1. Migrate the tokenized Unicode data to the Hadoop ecosystem using any data migration
utilities.
2. To unprotect the tokenized Unicode data using MapReduce or Spark, ensure that the
following APIs are used, as required:
MapReduce: public byte[] unprotect(String dataElement, byte[] data)
Spark: void unprotect(String dataElement, List<Integer> errorIndex, byte[][] input,
byte[][] output)
3. Convert the protected tokens to bytes using UTF-8 encoding.
4. Send the data as input to the Unprotect API in the MapReduce or Spark protector, as
required.
5. Convert the unprotected output in bytes to String using UTF-16LE encoding.
The string data will display the data in cleartext format.
The following sample code snippet describes how to unprotect the tokenized Unicode data that is
migrated from a Teradata database to Hadoop, using the MapReduce or Spark protector.
private Protector protector = null; // initialized elsewhere, as required
String[] unprotectInput = new String[SIZE];
byte[][] inputValueByte = new byte[unprotectInput.length][];
byte[][] outputValueByte = new byte[unprotectInput.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
StringBuilder unprotectedString = new StringBuilder();
// Point a implementation: convert the protected tokens to bytes using UTF-8 encoding
for (int x = 0; x < unprotectInput.length; x++) {
    inputValueByte[x] = unprotectInput[x].getBytes(StandardCharsets.UTF_8);
}
// Point b implementation: send the data as input to the Unprotect API
protector.unprotect(DATAELEMENT_NAME, errorIndexList, inputValueByte, outputValueByte);
// Point c implementation: convert the unprotected output to String using UTF-16LE encoding
for (int x = 0; x < outputValueByte.length; x++) {
    unprotectedString.append(new String(outputValueByte[x], StandardCharsets.UTF_16LE));
}
16.2 Migrating Tokenized Unicode Data to a Teradata
Database
This section describes how to protect Unicode data in Hive, HAWQ, Impala, MapReduce, or Spark,
migrate it to a Teradata database, and then unprotect the tokenized Unicode data using the Protegrity
Database Protector.
Ensure that the data elements used in the data security policy deployed on the Teradata
Database Protector and Big Data Protector machines are uniform.
To migrate Tokenized Unicode data using Hive, HAWQ, or Impala protector to Teradata
database:
1. To protect the Unicode data on Hive, HAWQ, or Impala, ensure that the following UDFs are
used, as required:
Hive: ptyProtectUnicode()
HAWQ: pty_UnicodeVarcharIns()
Impala: pty_UnicodeStringIns()
2. Migrate the tokenized Unicode data from Hive, HAWQ, or Impala to the Teradata database.
3. To unprotect the tokenized Unicode data in the Teradata database, use the Protegrity
Database Protector.
To protect Unicode data using MapReduce or Spark protector and migrate it to a Teradata
database:
1. Convert the cleartext format Unicode data to bytes using UTF-16LE encoding.
2. To migrate the tokenized Unicode data using MapReduce or Spark to the Teradata database,
ensure that the following APIs are used, as required:
MapReduce: public byte[] protect(String dataElement, byte[] data)
Spark: void protect(String dataElement, List<Integer> errorIndex, byte[][] input,
byte[][] output)
3. Send the data as input to the Protect API in the MapReduce or Spark protector, as required.
4. Convert the protected output in bytes to String using UTF-8 encoding.
The output is protected tokenized data.
5. Migrate the protected data to the Teradata database using any data migration utilities.
The following sample code snippet describes how to protect Unicode data using the MapReduce or
Spark protector and migrate it to a Teradata database.
private Protector protector = null; // initialized elsewhere, as required
String[] clearData = new String[SIZE];
byte[][] inputValueByte = new byte[clearData.length][];
byte[][] outputValueByte = new byte[clearData.length][];
List<Integer> errorIndexList = new ArrayList<Integer>();
StringBuilder protectedString = new StringBuilder();
// Point a implementation: convert the cleartext Unicode data to bytes using UTF-16LE encoding
for (int x = 0; x < clearData.length; x++) {
    inputValueByte[x] = clearData[x].getBytes(StandardCharsets.UTF_16LE);
}
// Point b implementation: send the data as input to the Protect API
protector.protect(DATAELEMENT_NAME, errorIndexList, inputValueByte, outputValueByte);
// Point c implementation: convert the protected output to String using UTF-8 encoding
for (int x = 0; x < outputValueByte.length; x++) {
    protectedString.append(new String(outputValueByte[x], StandardCharsets.UTF_8));
}