Connecting Java Applications to Big Data Targets Using BDGlue
User Guide
Oracle Data Integration Solutions
Version 1.2.0

Contents
Change Log
Introduction
    Source Code
    Licensing
    Disclaimer
Architectural Approach
Installing BDGlue
Encoders
    The “Null” Encoder
    The Delimited Text Encoder
    The JSON Encoder
    The Avro Encoder
Publishers
    The Console Publisher
BDGlue Targets
    Connecting to Targets via the Flume Publisher
        Configuring Flume
        The Flume RPC Client
        Flume Events
        Standard Flume Agent Configuration
        Multiplexing Flume Agent Configuration
        Running Flume
        Using Flume to Deliver Data into HDFS Files
        Making Data Stored in HDFS Accessible to Hive
    Delivering Data to Kafka
        Configuring the Kafka Publisher
        Using Flume to Deliver Data to Kafka
        Validating Delivery to Kafka
    Delivering Data to HBase
        Connecting to HBase via the Asynchronous HBase Publisher
        Using Flume to Deliver Data to HBase
        Basic HBase Administration
    Delivering Data to Oracle NoSQL
        KV API Support
        Table API Support
        NoSQL Transactional Durability
        Connecting Directly to Oracle NoSQL via the NoSQL Publisher
        Using Flume to Deliver Data into the Oracle NoSQL Database
        Basic Oracle NoSQL Administration
    Delivering Data to Cassandra
        Connecting to Cassandra via the Cassandra Publisher
        Basic Cassandra Administration
    Other Potential Targets
Source Configuration
    GoldenGate as a Source for BDGlue
        Configuring GoldenGate for BDGlue
        Configure the GoldenGate EXTRACT
        Configure the GoldenGate PUMP
The “SchemaDef” Utility
    Running SchemaDef
    Generating Avro Schemas with SchemaDef
    Generating Hive Table Definitions for Use with Avro Schemas
    Generating Cassandra Table Definitions
BDGlue Developer’s Guide
    Building a Custom Encoder
    Building a Custom Publisher
Prerequisite Requirements
Appendix
    bdglue.properties
    schemadef.properties
    Helpful Reference Sources
    License and Notice Files


Change Log
Version  Date        Comments
1.0      8/3/2015    Initial Release
1.0.1    9/24/2015   Added code to support replacing newline characters in string fields.
         10/07/2015  Added support for before images of data in JSON encoding.
         10/23/2015  Fixed issue with negative hash value in ParallelPublisher.
         10/27/2015  Changes to make Kafka topics and message keys customizable.
         12/07/2015  Added prompt for password in schemadef.
         12/09/2015  Added support for including table name in encoded data.
         12/10/2015  Added support for pass through of properties from bdglue.properties to Kafka.
         12/16/2015  Added code to better deal with compressed records and null column values.
         01/13/2016  Added code to ignore updates where nothing changed.
         01/22/2016  Added support for the Kafka Schema Registry.
         01/26/2016  Added calls to trim() when retrieving properties.
         03/01/2016  Changes to support GGforBD 12.2 and schema change events.
         03/10/2016  Automatic generation of Avro schemas based on schema change events.
         03/23/2016  Support specifying numeric output type in Avro schema generation.
         04/07/2016  Added default null for dynamic avro schemas. Fixed issue with avro table name valid chars.
1.1.0    04/09/2016  Reworked schema registry encoding to pass Avro GenericRecord to serializers.
         4/21/2016   Refactored code for publishers to potentially allow for more selective building at some point. Added new KafkaRegistryPublisher that supports registering avro schemas with the schema registry.
1.2.0    5/15/2016   Added Cassandra support.
         5/27/2016   Improved shutdown logic to clean up more quickly and not block on take() calls.
         6/1/2016    Changed queue “take” logic to use drainTo() to hopefully reduce latch waits.
         6/8/2016    Changed dynamic avro schema generation to make all columns nullable.
         6/8/2016    Initial release to GitHub.


Introduction
“Big Data Glue” (a.k.a. BDGlue) was developed as a general purpose library for delivering data from Java
applications into various Big Data targets in a number of different data formats. The idea was to create a
“one stop shop” of sorts to facilitate easy exploration of different technologies to help users identify
what might be the most appropriate approach in any particular case. The overarching goal was to allow
this experimentation to occur without the user having to write any Big Data-specific code. Big Data
targets include Flume, Kafka, HDFS, Hive, HBase, Oracle NoSQL, Cassandra, and others.
Hadoop and other Big Data technologies are by their very natures constantly evolving and infinitely
configurable. It is unlikely that BDGlue will exactly meet the requirements of a user’s intended
production architecture, but it will hopefully provide a good starting point for many and at a minimum
should prove sufficient for early point proving exercises.
The code was developed using Oracle’s “Big Data Lite” virtual machine [1] to ensure compatibility with
Oracle’s engineered Big Data solution, the Big Data Appliance (BDA) [2]. If you are not familiar with the
BDA, it is an extremely well thought-out and cost effective solution that is certainly worthy of
consideration as you start to scale from the lab and into production. However, BDGlue does not leverage
any capabilities specific to BDA and will work equally well with any standard Hadoop distribution,
including those from Cloudera, Hortonworks, MapR, and Apache.

Source Code
The source code for BDGlue is freely available and may be found on GitHub at:
http://github.com/bdglue/bdglue

Licensing
BDGlue is developed as open source and released under the Apache License, Version 2.0. Most external
components that it interfaces with are licensed in this fashion as well, with a few exceptions which are
called out explicitly in the LICENSE and NOTICE files that accompany the source code. In those situations,
their corresponding licenses have been deemed compatible with Apache 2.0. A copy of the LICENSE and
NOTICE files are also included at the end of this document.
There is no license fee associated with BDGlue itself. It is up to the user to determine if source and/or
target environments are subject to license fees from their vendors. For example, the GoldenGate
Adapter for Java must be licensed if you make use of the GoldenGate source as described at the end of
this document, as well as the GoldenGate CDC capabilities in the source environment. In short, just
because BDGlue provides an interface to a technology, it doesn’t imply that access is inherently free.

[1] The Big Data Lite virtual machine may be downloaded from
http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html.
[2] For information on the Big Data Appliance, see
https://www.oracle.com/engineered-systems/big-data-appliance/index.html


Disclaimer
While it is intended to be useful out of the box, this code is not formally supported by Oracle. The code
is provided as is and as an example only. While quite functional, it is not warranted or supported as
“production ready”, and responsibility for making it so remains with the customer. Source code is
available so that customers may alter it as needed to meet their specific needs.


Architectural Approach
The BDGlue architecture is modular in its approach, based on the idea of sources, encoders, and
publishers, with the goal of mitigating the impact of change as new capabilities are added. For example,
“encoders” are independent of their upstream source and their downstream publisher so that new
encoding formats can be implemented without requiring change elsewhere in the code.
The following diagram illustrates the high level structure of BDGlue.

Encoding is the process of translating data received from the source into a particular format to facilitate
downstream use. Publishing is the process of writing data to a target environment via RPC. You will note
that BDGlue was designed to support two separate and distinct thread pools to provide scalability for
the “encoding” and “publishing” processes, each of which can be somewhat time-consuming.
Note that while the process of encoding data is multi-threaded and essentially asynchronous, the
encoded records are actually delivered to the publishers in the same order that they were received from
the source. This is to ensure that data anomalies don’t get introduced as a result of a race condition that
might arise if multiple changes to a particular record occur in rapid succession, but it will likely have a
small impact on encoder throughput.
In the same way, data is delivered to individual publishers based on a hash of a string value … either:

- the “table” name, ensuring all records from a particular “table” will be processed in order by the same publisher; or
- the “key” associated with the record, in this case ensuring that records based on the same key value are processed in order by the same publisher.
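To make that routing concrete, the following is a minimal, hypothetical Java sketch (these are not BDGlue’s actual class names) of how encoded records might be assigned to a fixed pool of publisher queues by hashing either the table name or the row key; the bdglue.publisher.hash property shown in later examples selects between the two.

// Hypothetical illustration only; PublisherPool and EncodedRecord are not BDGlue classes.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

interface EncodedRecord {
    String tableName();
    String rowKey();
    byte[] body();
}

public final class PublisherPool {
    private final List<BlockingQueue<EncodedRecord>> queues = new ArrayList<>();

    public PublisherPool(int publisherThreads) {
        for (int i = 0; i < publisherThreads; i++) {
            queues.add(new LinkedBlockingQueue<>());
        }
    }

    // Route by table name (bdglue.publisher.hash = table) or by row key (= rowkey).
    public void publish(EncodedRecord record, boolean hashOnTable) throws InterruptedException {
        String hashValue = hashOnTable ? record.tableName() : record.rowKey();
        // Mask the sign bit so the index is never negative (String.hashCode() can be negative).
        int index = (hashValue.hashCode() & 0x7fffffff) % queues.size();
        // Each publisher thread drains exactly one queue, so records sharing a table (or key)
        // are always processed in order by the same publisher.
        queues.get(index).put(record);
    }
}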


The targets themselves are completely external to BDGlue. In most cases, they are accessed via an RPC
connection (i.e. a socket opened on a specified port). A “target” might be a streaming technology such
as Flume or Kafka, or it might be an actual big data repository such as HBase, Oracle NoSQL, Cassandra,
etc. From Flume we can deliver encoded data at a very granular level to both HDFS files and Hive.
Last but not least in the BDGlue conversation are the sources themselves. BDGlue was designed initially with
the idea of delivering data sourced from a relational database, leveraging Oracle GoldenGate in
particular. We quickly came to realize that BDGlue had the potential of being more generally useful than
that, so we made a deliberate effort to decouple the data sources specific to GoldenGate from the rest
of BDGlue to the greatest degree possible. BDGlue looks at things from the perspective of table-like
structures – essentially tables with a set of columns – but the reality is that any sort of data source could
likely be mapped into them without requiring a lot of imagination or effort.


Installing BDGlue
For convenience, BDGlue can be obtained from GitHub in source form and can easily be compiled from
there. The net result of the compilation process will be a bdglue-specific *.jar file, jar file dependencies
needed to compile and execute, as well as documentation, example properties files, etc.
Note that BDGlue is configured to build with Maven, and a suitable pom.xml file is included for this
purpose. For those unfamiliar with Maven, a traditional Makefile is also provided which invokes Maven
under the covers. While Maven is great for compiling everything and assembling the dependencies, we
(being somewhat old school) prefer calling Maven from make (gmake, actually): the install step is a bit
more straightforward to comprehend, as it copies all of the relevant build artifacts to a “deploy”
directory.
Note that in either case, you will need to set two environment variables:
# GGBD_HOME is the directory where GG for Big Data is installed.
# For example, if GG for Big Data is installed at /u01/ggbd12_2, then you would set
export GGBD_HOME=/u01/ggbd12_2

And
# GGBD_VERSION is an environment variable set to the version of the
# ggdbutil-VERSION.jar file found in the $GGBD_HOME/ggjava/resources/lib directory.
# For example, if the file is named ggdbutil-12.2.0.1.0.012.jar, then you would set
export GGBD_VERSION=12.2.0.1.0.012

Download:

# create a directory where you want to install the files
[ogg@bigdatalite ~]$ mkdir bdglue
[ogg@bigdatalite ~]$ cd bdglue
[ogg@bigdatalite ~]$ git clone https://github.com/bdglue/bdglue
[ogg@bigdatalite ~]$ export GGBD_HOME=/path/to/gg4bigdata
[ogg@bigdatalite ~]$ export GGBD_VERSION=12.2.0.1.0.12

Build with Make:
[ogg@bigdatalite ~]$ make
mvn package -Dggbd.VERSION=12.2.0.1.0.012 -Dggbd.HOME=/u01/ggbd12_2
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building bdglue 1.2.0.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ bdglue ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 2 resources
[INFO] Copying 2 resources
< -- snip -- >
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.987 s
[INFO] Finished at: 2016-06-08T16:02:03-04:00
[INFO] Final Memory: 40M/314M
[INFO] ------------------------------------------------------------------------
mkdir -p ./deploy/lib/dependencies ./deploy/doc
cp ./target/bdglue*.jar ./deploy/lib
cp ./target/dependencies/*.jar ./deploy/lib/dependencies
cp -R ./target/apidocs ./deploy/doc
cp ./*.pdf ./deploy/doc
[ogg@bigdatalite ~]$

Build with Maven:
[ogg@bigdatalite ~]$
[ogg@bigdatalite ~]$ mvn package -Dggbd.VERSION=$GGBD_VERSION -Dggbd.HOME=$GGBD_HOME
[ogg@bigdatalite ~]$

CAUTION: Be sure that the versions of the java dependencies (i.e. Kafka, Avro, Cassandra, HBase, etc.)
that BDGlue builds with are compatible with the versions of those target solutions deployed in your
environment. In many cases, the dependencies will be forward / backward compatible, but not
always. If you have difficulties at run time, whether exceptions about methods not being found or other
unidentifiable failures, incompatible dependency versions could very well be the cause. You may need
to alter the java dependencies in the pom.xml file (used for building), or install newer versions of the
target solution in your environment.
Finally, note that if you are working directly with your Oracle sales team, you might be provided with a
newer version of bdglue.jar than the one provided in this installation. If that is the case, you will want to
replace the version of bdglue.jar in the installation directory with the version provided to you. You can
confirm the version of bdglue.jar by typing the following from the command line:

[ogg@bigdatalite ~]$ java -jar bdglue*.jar
BDGlue Version: 1.2.0.0 Date: 2016/06/08 13:45


Encoders
There are a number of encoders that are inherently part of BDGlue: null, delimited text, Avro, and JSON.
For the most part, these should prove to be sufficient for just about any use case, but BDGlue was
designed to be extended with additional encoders if needed. This can be accomplished simply by
implementing a Java interface. New Encoders can be developed and deployed without requiring changes
to BDGlue itself. More information pertaining to creating new encoders can be found in Building a
Custom Encoder in the “Developers Guide” section later in this document.

The “Null” Encoder
The “null” encoder is just what it sounds like … it actually does no encoding at all. It is designed to simply
take the data that was provided by the source, encapsulate it with a little meta-data related to the work
that needs to be done downstream, and then pass the data along to the publisher. Consequently, the
null encoder is the most lightweight of the encoders and is intended for use against those targets that:

- BDGlue will connect to directly (i.e. not via Flume, Kafka, etc.); and
- require data to be applied via API at the field (or column) level rather than at the record level.

Targets for which the null encoder is appropriate include HBase, the Oracle NoSQL “table” API,
Cassandra, etc.
To tell BDGlue to make use of the Null Encoder, simply specify the encoder in the bdglue.properties file
as follows:

bdglue.encoder.class = com.oracle.bdglue.encoder.NullEncoder

The Delimited Text Encoder
The “delimited text encoder” is also just what it sounds like … it takes the data that is passed in from
the source and encodes it as delimited text.
Delimited text is the simplest and most straightforward way to transmit data from BDGlue to a
target. It is also likely the least useful. The column values are added to a buffer that will become the
“body” of the data that is sent downstream by a publisher. Columns are added to the buffer in the order
they are represented in the table metadata, with each column separated from the one preceding it in
the buffer by a delimiter. By default, the delimiter is the default delimiter recognized by Hive, which is
\001 (^A). That value can be overridden in the bdglue.properties file by specifying the
bdglue.encoder.delimiter property.
Delimited text is fast and somewhat compact, but it contains no metadata regarding the structure of the
data (i.e. the names of the columns). This requires downstream consumers of the data to know the
structure when it comes time to make use of the data later. This could in theory be a challenge,
particularly if the schema has evolved over time.
To tell BDGlue to make use of the Delimited Text Encoder, simply specify the encoder in the
bdglue.properties file as follows:

bdglue.encoder.class = com.oracle.bdglue.encoder.DelimitedTextEncoder
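As an illustration only (this is not the BDGlue DelimitedTextEncoder itself, and the class and method names are hypothetical), the following sketch captures the essence of the encoding: column values concatenated in metadata order, separated by the delimiter, with no column names included.

// A minimal sketch of delimited-text encoding; illustrative only.
import java.util.Arrays;
import java.util.List;

public final class DelimitedTextSketch {
    public static String encode(List<String> columnValues, char delimiter) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < columnValues.size(); i++) {
            if (i > 0) {
                body.append(delimiter);               // \001 (^A) by default in BDGlue
            }
            String value = columnValues.get(i);
            body.append(value == null ? "" : value);  // values only, no column names
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("2871", "Dane Nash", "Male", "Le Grand-Quevilly");
        // Shown with '^' substituted for the unprintable \001 delimiter.
        System.out.println(encode(row, '\u0001').replace('\u0001', '^'));
    }
}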

The JSON Encoder
“JSON” is short for “JavaScript Object Notation”. It is a lightweight data interchange format that uses
human-readable text to transmit data comprised of attribute-value pairs. It is language-independent and
has proven to be quite useful in many Big Data use cases. In the case of this encoder, the “attributes”
are the column/field names, and the “values” are the actual data values associated with those names.
Here is an example of JSON-encoded data:

{"ID":"2871","NAME":"Dane Nash","GENDER":"Male","CITY":"Le Grand-Quevilly",
"PHONE":"(874) 373-6196","OLD_ID":"1","ZIP":"81558-771","CUST_DATE":"2014/04/13"}

Column/field names are ID, NAME, GENDER, CITY, and so on.
To tell BDGlue to make use of the JSON Encoder, simply specify the encoder in the bdglue.properties file
as follows:

bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder

The Avro Encoder
This data encoding is a bit more advanced than the others. Avro is a data serialization format that
supports rich data structures in a compact binary data format. It has proven to be quite useful, and is
understood directly by Hive, Oracle NoSQL, and other targets. Avro also supports the notion of “schema
evolution”, albeit in a more limited sense than might be supported by a relational database.
Unlike JSON, which is text-based and self-describing, Avro data is actually transmitted downstream to
recipients in a more compact binary format based on an “Avro schema” that describes the contents. Like
JSON-formatted data, this data also has a clearly defined structure, but it is different in that the
“schema” that describes the data must be made available to the recipient so that the data can be
understood. Avro schemas are actually defined using JSON.
understood. Avro schemas are actually defined using JSON.
Here is an example of what an Avro schema file looks like. As mentioned, it is a JSON format that
describes the columns and their data types. Notice the “union” entries that contain “null” and a data
type. These indicate that those columns may be null. Note also the specification of default values: “null”
for columns that may be null; -1 for the OLD_ID column which in this case may not be null; etc. Inclusion
of the null column information and default values is optional and specified in the properties file. It is
recommended that these always be enabled as the information assists the target repository (HDFS, Hive,
NoSQL, etc.) in the schema evolution process.
{
  "type" : "record",
  "name" : "CUST_INFO",
  "namespace" : "bdglue",
  "doc" : "SchemaDef",
  "fields" : [ {
    "name" : "ID",
    "type" : "int",
    "doc" : "keycol"
  }, {
    "name" : "NAME",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "GENDER",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "CITY",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "PHONE",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "OLD_ID",
    "type" : "int",
    "default" : -1
  }, {
    "name" : "ZIP",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "CUST_DATE",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}


For relational database sources, a utility called “SchemaDef” is provided with BDGlue that will generate
the Avro schema file corresponding to a table from that table’s metadata. SchemaDef will also
generate meta-information in other formats. SchemaDef is described later in this document.
To tell BDGlue to make use of the Avro Encoder, simply specify the encoder in the bdglue.properties file
as follows:

bdglue.encoder.class = com.oracle.bdglue.encoder.AvroEncoder
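For illustration, here is a minimal sketch (not BDGlue’s AvroEncoder; it simply uses the standard Apache Avro API, and the schema file name is an example) of how a single CUST_INFO row could be serialized to Avro binary with a schema file like the one shown above. Unset nullable fields simply remain null.

// Minimal sketch using the standard Apache Avro API; not BDGlue's actual encoder code.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public final class AvroEncodeSketch {
    public static byte[] encode(File avscFile) throws IOException {
        Schema schema = new Schema.Parser().parse(avscFile);   // e.g. bdgluedemo.CUST_INFO.avsc

        GenericRecord row = new GenericData.Record(schema);
        row.put("ID", 2871);              // non-nullable int
        row.put("NAME", "Dane Nash");     // union ["null","string"]; may also be left null
        row.put("OLD_ID", 1);             // non-nullable int with default -1

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(row, encoder);
        encoder.flush();
        return out.toByteArray();         // compact binary body; field names live in the schema
    }
}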


Publishers
A publisher is responsible for understanding how to interface with an external “target”. Another way of
saying that is that a publisher is specific to its intended target. A publisher takes the data and associated
meta-data handed off from the encoder and delivers it to the target. As mentioned previously,
publishers are part of a “pool”, with each publisher having its own independent connection to the
target, most typically via an RPC.
In some cases, the publisher will deliver the data received from the encoder “as is” to the target. It will
hand off these encoded records without really understanding their contents, just knowing that it needs
to pass them along. Examples of publishers where encoded data would likely be passed along as
provided by the encoder without further interpretation include Flume, Kafka, the Oracle NoSQL KV API,
etc.
The primary exception to this would be data passed along via the “null encoder”. In this particular case,
it is intended that the publisher process the data field-by-field as it writes to the target. Examples of
publishers that would leverage data passed along from the “null encoder” include HBase, the Oracle
NoSQL Table API, Cassandra, etc. In each of these cases, data is added to stored records on a field-by-field basis, so a pre-formatted record based on JSON, Avro, etc. is likely not appropriate.
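For instance, a publisher for a field-level target might apply each column individually. The sketch below uses the standard HBase client Put API purely to illustrate that pattern; it is not BDGlue’s HBase publisher, and the column family name is a placeholder.

// Illustrative only; uses the standard HBase client API, not BDGlue's publisher classes.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;

public final class FieldLevelSketch {
    // Apply one source record to HBase column by column, as a field-level publisher would.
    public static void applyRow(Table table, String rowKey, Map<String, String> columns)
            throws IOException {
        byte[] family = "data".getBytes(StandardCharsets.UTF_8);   // placeholder column family
        Put put = new Put(rowKey.getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, String> column : columns.entrySet()) {
            put.addColumn(family,
                          column.getKey().getBytes(StandardCharsets.UTF_8),
                          column.getValue().getBytes(StandardCharsets.UTF_8));
        }
        table.put(put);   // one Put per source record, one cell per column
    }
}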
We will explore one publisher, the “Console Publisher”, in the next section. We’ll look at how to
configure BDGlue publishers to deliver to supported targets later in the document.
Finally, just as it was designed to support development of new “encoders”, BDGlue was designed to be
extended to support new publishers as well. Just as with Encoders, this is done by implementing a Java
interface, and just as with new encoders this can be done without the need to make changes elsewhere
in the code. More information pertaining to creating new publishers can be found in Building a Custom
Publisher in the “Developers Guide” section found later in this document.

The Console Publisher
The first, and simplest, publisher we will cover is the “Console Publisher.” It was developed to assist with
certain troubleshooting processes, particularly in areas pertaining to ensuring that things are configured
properly before we actually start trying to “publish” data to a target. The Console Publisher simply takes
the records that are passed to it and writes them to standard out (the “console”). Because it is writing to
what could very well be a display screen, configuring the JSON Encoder when using the Console
Publisher is probably best … records are more easily readable.
Here is how you might configure BDGlue’s properties file to use the Console Publisher.

# bdglue.properties to make use of the ConsolePublisher.
#
bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.encoder.threads = 2

bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = false
bdglue.event.header-avropath = false
bdglue.publisher.class = com.oracle.bdglue.publisher.console.ConsolePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = table


BDGlue Targets
In this section, we’ll explore delivery to a variety of target environments supported by BDGlue. We’ll
start specifically with using Flume to deliver files to HDFS. Flume is actually quite flexible and powerful.
There are a number of other targets we can deliver to from Flume. We’ll look at specific information for
using Flume to deliver to those targets in their respective sections.

Connecting to Targets via the Flume Publisher
Apache Flume is a streaming mechanism that fits naturally with the BDGlue architecture. Flume provides
a number of out-of-the-box benefits that align well with the BDGlue use case:

- It supports RPC connections from locations that are not physically part of the Hadoop cluster.
- It is modular and thus extremely flexible in terms of how data “streams” are configured.
- There are many out-of-the-box components that can be leveraged directly without need for modification or customization.
- In particular, Flume does an outstanding job with its HDFS file handling. If there is a need to stream data from outside of Hadoop into files in HDFS, there may be no better mechanism for doing this.
- It provides a pluggable architecture that allows custom components to be developed and deployed when needed.
Figure 1: Basic architecture of the BDGlue Flume implementation. (Diagram: a data source feeds BDGlue, which is driven by the BDGlue properties file; BDGlue delivers records to a Flume agent's source, channel, and sink, driven by the Flume config file, for delivery into HDFS, Hive, HBase, or NoSQL.)

To really understand what is going on behind the scenes, it is important that the user have a good
understanding of Flume and its various components.
A good reference on Flume is:

- Apache Flume: Distributed Log Collection for Hadoop, by Steve Hoffman (©2013 Packt Publishing)


While the book is focused predominantly on streaming data collected from log files, there is a lot of
excellent information on configuring Flume to take advantage of the flexibility it offers. Despite the
public perception, there is much more that Flume brings to the table than the scraping of log files.
An excellent introductory reference on Hadoop and Big Data in general is:

- Hadoop: The Definitive Guide (Fourth Edition), by Tom White (©2015 O’Reilly Media)

This book not only provides information on Hadoop technologies such as HDFS, Hive, and HBase, but it
also provides some good detail on Avro serialization, a technology that proved to be quite useful in
practice in any number of customer environments.
Configuring Flume
As mentioned previously, Flume is incredibly flexible in terms of how it can be configured. Topologies
can be arbitrarily complex, supporting fan-in, fan-out, etc. Data streams can be intercepted, modified in
flight, and rerouted. There really is no end to what might be configured.
There are two basic topologies for Flume that we feel will be most commonly useful in BDGlue use
cases:

- A single stream that handles multiple tables via an agent that consists of a single source-channel-sink combination. This is easiest to implement, and we think it will be most common.
- A multiplexing stream where a single source fans out into a separate channel and sink for each table being processed. This is more complex to implement, but might be an approach when there are a few high-volume tables. There might be a separate channel and sink for each of the high-volume tables, and then another “catch all” that handles the rest.

The Flume RPC Client
The BDGlue Flume publisher is implemented to support both Avro and Thrift for RPC communication
with Flume. It is possible to switch between the two via a property in the bdglue.properties file. Most of
the testing of BDGlue was done using Avro RPCs, and all examples in this user guide leverage Avro for
RPC communication. If you wish to use Thrift instead, you will need to configure:

- bdglue.flume.rpc.type = thrift-rpc in the bdglue.properties file; and
- bdglue.sources.<source>.type = thrift in the Flume configuration file.

Do not confuse Avro RPC with Avro Serialization, which we also make good use of in BDGlue. While they
share a common portion of their name, the two are essentially independent of one another. For the
examples in this user guide, we will configure:

- bdglue.flume.rpc.type = avro-rpc in the bdglue.properties file; and
- bdglue.sources.<source>.type = avro in the Flume configuration file.


Flume Events
Note that data moves through a Flume Agent as a series of “events”, where in the case of GoldenGate
each event represents a captured database operation, or source record otherwise. We’ll just generally
refer to the data as “source data” or “source record” going forward. The body of the event contains an
encoding of the contents of the source record. Several encodings are supported: Avro binary, JSON, and
delimited text. All are configurable via the bdglue.properties file.
In addition to the body, each Flume event has a header that contains some meta-information about the
event. For BDGlue, the header will always contain the table name. Depending on other options that are
configured, additional meta-information will also be included.
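To make this concrete, here is a minimal sketch using the standard Flume client SDK rather than BDGlue’s own FlumePublisher: the event body is a JSON-encoded record, and the header carries the table name. The host, port, and table name are simply the example values used elsewhere in this guide.

// Illustrative only; uses the standard Flume client SDK, not BDGlue's publisher classes.
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public final class FlumeEventSketch {
    public static void main(String[] args) throws EventDeliveryException {
        // Avro RPC client pointed at the Flume source (matches bdglue.flume.host / port).
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            Map<String, String> headers = new HashMap<>();
            headers.put("table", "bdgluedemo.CUST_INFO");   // the header BDGlue always sets

            byte[] body = "{\"ID\":\"2871\",\"NAME\":\"Dane Nash\"}"
                    .getBytes(StandardCharsets.UTF_8);       // JSON-encoded record in this example

            Event event = EventBuilder.withBody(body, headers);
            client.append(event);                            // one event per source record
        } finally {
            client.close();
        }
    }
}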
Standard Flume Agent Configuration
As mentioned above, the most typical Flume agent configuration will be relatively simple: a single
source-channel-sink combination that writes data to specific destinations for each table that is being
processed.
Figure 2: Typical Flume agent configuration. (Diagram: source records flow from BDGlue over Avro RPC into a single Flume source, channel, and sink; the sink writes to a separate target location per table. The BDGlue properties file and the Flume config file configure their respective sides.)

The various “Targets” might be files in HDFS, or perhaps Hive or HBase tables, based on how the
properties and configuration files are set up. Our examples in subsequent sections will be based on this
configuration and we’ll look at the details of the bdglue.properties and Flume configuration files at that
time.
Multiplexing Flume Agent Configuration
Before that, however, we’ll take a quick look at one other configuration that might prove useful. This
configuration is one where a single Flume “source” multiplexes data across multiple channels based on
table name, and each channel has its own sink to write the data into Hadoop.


Figure 3: Multiplexing Flume Agent. (Diagram: source records flow from BDGlue over Avro RPC into a single Flume source, s1, which multiplexes by table name across channels c1, c2, and c3; each channel has its own sink, k1, k2, or k3, writing to its own target.)

To configure in this fashion, you’ll need to specify a separate Flume configuration for each channel and
sink. If there are a lot of tables that you want to process individually, this could get fairly complicated in
a hurry. The following will give you an idea of what such a configuration file might look like. Note that
this example is not complete, but it will give you an idea of what might be required to configure the
example above.

21

# list the sources, channels, and sinks for the agent
bdglue.sources = s1
bdglue.channels = c1 c2 c3
bdglue.sinks = k1 k2 k3
# Map the channels to the source. One channel per table being captured.
bdglue.sources.s1.channels = c1 c2 c3
# Set the properties for the source
bdglue.sources.s1.type = avro
bdglue.sources.s1.bind = localhost
bdglue.sources.s1.port = 41414
bdglue.sources.s1.selector.type = multiplexing
bdglue.sources.s1.selector.header = table
bdglue.sources.s1.selector.mapping.default = c1
bdglue.sources.s1.selector.mapping.<table-name-1> = c2
bdglue.sources.s1.selector.mapping.<table-name-2> = c3
# Set the properties for the channels
# c1 is the default ... it will handle unspecified tables.
bdglue.channels.c1.type = memory
bdglue.channels.c1.capacity = 1000
bdglue.channels.c1.transactionCapacity = 100
bdglue.channels.c2.type = memory
bdglue.channels.c2.capacity = 1000
bdglue.channels.c2.transactionCapacity = 100
bdglue.channels.c3.type = memory
bdglue.channels.c3.capacity = 1000
bdglue.channels.c3.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
bdglue.sinks.k1.channel = c1
bdglue.sinks.k2.channel = c2
bdglue.sinks.k3.channel = c3
# k1 is the default. Logs instead of writes.
bdglue.sinks.k1.type = logger
bdglue.sinks.k2.type = hdfs
bdglue.sinks.k2.serializer = avro_event
bdglue.sinks.k2.serializer.compressionCodec = gzip
bdglue.sinks.k2.hdfs.path = hdfs://bigdatalite.localdomain/flume/gg-data/%{table}
bdglue.sinks.k2.hdfs.fileType = DataStream
# avro files must end in .avro to work in an Avro MapReduce job
bdglue.sinks.k2.hdfs.filePrefix = bdglue
bdglue.sinks.k2.hdfs.fileSuffix = .avro
bdglue.sinks.k2.hdfs.inUsePrefix = _
bdglue.sinks.k2.hdfs.inUseSuffix =
bdglue.sinks.k3.type = hdfs
bdglue.sinks.k3.serializer = avro_event
bdglue.sinks.k3.serializer.compressionCodec = gzip
bdglue.sinks.k3.hdfs.path = hdfs://bigdatalite.localdomain/flume/gg-data/%{table}
bdglue.sinks.k3.hdfs.fileType = DataStream
# avro files must end in .avro to work in an Avro MapReduce job
bdglue.sinks.k3.hdfs.filePrefix = bdglue
bdglue.sinks.k3.hdfs.fileSuffix = .avro
bdglue.sinks.k3.hdfs.inUsePrefix = _
bdglue.sinks.k3.hdfs.inUseSuffix =

22

In the example above, note that the channel/sink pair c1/k1 is configured as a “default” (i.e.
catch-all) path. As configured here, it simply logs the records it receives rather than writing them to a
target. That channel could also be configured to process rather than log those tables, while still allowing
“special” handling of channel/sink pairs c2/k2 and c3/k3.
Running Flume
Once it is actually time to start the flume agent, you’ll do so by executing a statement similar to the
following example.

flume-ng agent --conf conf --conf-file bdglue.conf --name bdglue
--classpath /path/to/lib/bdglue.jar
-Dflume.root.logger=info,console

Several things to note above:

- bdglue.conf is the name of your configuration file for Flume. It can have any name you wish.
- --name bdglue: “bdglue” is the name of your agent. It must exactly match the name of your agent in the configuration file. You’ll note that each line in the example configuration file above begins with “bdglue”. Your agent can have any name you wish, but this name must match.
- --classpath *.jar gives the name of your jar file that contains any custom source-channel-sink code you may have developed. It is not required otherwise. In the case of BDGlue, it will only be needed when delivering to HBase and Oracle NoSQL as custom sink logic was developed for those targets.

Using Flume to Deliver Data into HDFS files
Flume has actually proven to be an excellent way to deliver data into files stored within HDFS. This may
seem counterintuitive in some ways, but unless you have an entire file ready to go at once, the idea of
streaming data into those files actually makes a lot of sense, particularly if you might have the need to
write to multiple files simultaneously. The Flume HDFS “sink” can support thousands of open files
simultaneously (say, one for each table being delivered via transactional CDC – change data capture),
and provides excellent control over directory structure, and when to roll to a new file based on size,
number of records, and/or time.
BDGlue supports delivery to HDFS via Flume in several different encoded file formats:

- Delimited text. This is just as it sounds, with column values separated by a delimiter. By default, that delimiter is \001 (^A), which is the default delimiter for Hive, but that can be overridden in the bdglue.properties file.
- JSON-formatted text. This is basically a “key-value” description of the data where the key is the name of the column, and the value is the value stored within that column.
- Avro binary.

Each of these formats was described previously in the section on Encoders.
Delimited Text
# bdglue.properties
#
bdglue.encoder.class = com.oracle.bdglue.encoder.DelimitedTextEncoder
bdglue.encoder.threads = 2
bdglue.encoder.delimiter = 001
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc

Here is the Flume configuration file:


# list the sources, channels, and sinks for the agent
bdglue.sources = s1
bdglue.channels = c1
bdglue.sinks = k1
# Map the channels to the source.
bdglue.sources.s1.channels = c1
# Set the properties for the source
bdglue.sources.s1.type = avro
bdglue.sources.s1.bind = localhost
bdglue.sources.s1.port = 41414
bdglue.sources.s1.selector.type = replicating
# Set the properties for the channels
bdglue.channels.c1.type = memory
# make capacity and transactionCapacity much larger
# (i.e. 10x or more) for production use
bdglue.channels.c1.capacity = 1000
bdglue.channels.c1.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
bdglue.sinks.k1.channel = c1
bdglue.sinks.k1.type = hdfs
bdglue.sinks.k1.serializer = text
# each table written to separate directory named ‘tablename’
bdglue.sinks.k1.hdfs.path = hdfs://bigdatalite.localdomain/user/flume/gg-data/%{table}
bdglue.sinks.k1.hdfs.fileType = DataStream
bdglue.sinks.k1.hdfs.filePrefix = bdglue
bdglue.sinks.k1.hdfs.fileSuffix = .txt
bdglue.sinks.k1.hdfs.inUsePrefix = _
bdglue.sinks.k1.hdfs.inUseSuffix =
# number of records the sink will read per transaction.
# Higher numbers may yield better performance.
bdglue.sinks.k1.hdfs.batchSize = 10
# the size of the files in bytes.
# 0=disable (recommended for production)
bdglue.sinks.k1.hdfs.rollSize = 1048576
# roll to a new file after N records.
# 0=disable (recommended for production)
bdglue.sinks.k1.hdfs.rollCount = 100
# roll to a new file after N seconds. 0=disable
bdglue.sinks.k1.hdfs.rollInterval = 30


JSON Encoding
Under the covers, when writing to HDFS, JSON-encoded data is handled in the same way as delimited
text. The fundamental difference is that the data is formatted in such a way that the column
names are included along with their contents.
The bdglue.properties file needed for this might look something like the following.

# bdglue.properties
#
bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc

The corresponding Flume configuration file would look the same as it did for delimited text as the data is
handled in exactly the same fashion by the Flume agent’s sink when we are writing to HDFS. We will use
JSON-formatted data again later to write data into HBase. There will definitely be differences in the
configuration files at that point.


# list the sources, channels, and sinks for the agent
bdglue.sources = s1
bdglue.channels = c1
bdglue.sinks = k1
# Map the channels to the source.
bdglue.sources.s1.channels = c1
# Set the properties for the source
bdglue.sources.s1.type = avro
bdglue.sources.s1.bind = localhost
bdglue.sources.s1.port = 41414
bdglue.sources.s1.selector.type = replicating
# Set the properties for the channels
bdglue.channels.c1.type = memory
# make capacity and transactionCapacity much larger
# (i.e. 10x or more) for production use
bdglue.channels.c1.capacity = 1000
bdglue.channels.c1.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
bdglue.sinks.k1.channel = c1
bdglue.sinks.k1.type = hdfs
bdglue.sinks.k1.serializer = text
# each table written to separate directory named ‘tablename’
bdglue.sinks.k1.hdfs.path = hdfs://bigdatalite.localdomain/user/flume/gg-data/%{table}
bdglue.sinks.k1.hdfs.fileType = DataStream
bdglue.sinks.k1.hdfs.filePrefix = bdglue
bdglue.sinks.k1.hdfs.fileSuffix = .txt
bdglue.sinks.k1.hdfs.inUsePrefix = _
bdglue.sinks.k1.hdfs.inUseSuffix =
# number of records the sink will read per transaction.
# Higher numbers may yield better performance.
bdglue.sinks.k1.hdfs.batchSize = 10
# the size of the files in bytes.
# 0=disable (recommended for production)
bdglue.sinks.k1.hdfs.rollSize = 1048576
# roll to a new file after N records.
# 0=disable (recommended for production)
bdglue.sinks.k1.hdfs.rollCount = 100
# roll to a new file after N seconds. 0=disable
bdglue.sinks.k1.hdfs.rollInterval = 30


Having the metadata transmitted with the column data is handy, but it does take up more space in HDFS
when stored this way.
Configuring for Binary Avro Encoding
As mentioned earlier, an advantage of Avro encoding over JSON is that it is more compact, but it is also
a little more complex as Avro schema files are required. It is possible to have BDGlue generate Avro
schema files on the fly from the metadata that is passed in from the source, but this is not
recommended as the files are needed downstream before data is actually landed. Instead, it is
recommended that for data coming from relational database sources you utilize the SchemaDef utility
to generate these schemas. See Generating Avro Schemas with SchemaDef later in this document for
more information on how to do this.
Once we have the Avro schema files where we need them, we can think about configuring BDGlue and
Flume to handle Avro encoded data.
The first step, as before, is to set the appropriate properties in the bdglue.properties file. You will see
here that we are introducing a couple of new properties associated with the location of the *.avsc files
locally and in HDFS.

# configuring BDGlue for Avro encoding
#
bdglue.encoder.class = com.oracle.bdglue.encoder.AvroEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = true
# The URI in HDFS where schemas will be stored.
# Required by the Flume sink event serializer.
bdglue.event.avro-hdfs-schema-path = hdfs:///user/flume/gg-data/avro-schema/
# local path where bdglue can find the avro *.avsc schema files
bdglue.event.avro-schema-path = /local/path/to/avro/schema/files
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc


And of course, we also need to configure Flume to handle this data as well. Again, you’ll see some
differences in the properties for the agent’s sink … specifically a non-default serializer that properly
creates the *.avro files with the proper schema.


# list the sources, channels, and sinks for the agent
bdglue.sources = s1
bdglue.channels = c1
bdglue.sinks = k1
# Map the channels to the source. One channel per table being captured.
bdglue.sources.s1.channels = c1
# Set the properties for the source
bdglue.sources.s1.type = avro
bdglue.sources.s1.bind = localhost
bdglue.sources.s1.port = 41414
bdglue.sources.s1.selector.type = replicating
# Set the properties for the channels
# c1 is the default ... it will handle unspecified tables.
bdglue.channels.c1.type = memory
# make capacity and transactionCapacity much larger
# (i.e. 10x or more) for production use
bdglue.channels.c1.capacity = 1000
bdglue.channels.c1.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
bdglue.sinks.k1.channel = c1
bdglue.sinks.k1.type = hdfs
bdglue.sinks.k1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
bdglue.sinks.k1.hdfs.path = hdfs://bigdatalite.localdomain/user/flume/gg-data/%{table}
bdglue.sinks.k1.hdfs.fileType = DataStream
# avro files must end in .avro to work in an Avro MapReduce job
bdglue.sinks.k1.hdfs.filePrefix = bdglue
bdglue.sinks.k1.hdfs.fileSuffix = .avro
bdglue.sinks.k1.hdfs.inUsePrefix = _
bdglue.sinks.k1.hdfs.inUseSuffix =
# number of records the sink will read per transaction.
# Higher numbers may yield better performance.
bdglue.sinks.k1.hdfs.batchSize = 10
# the size of the files in bytes.
# 0=disable (recommended for production)
bdglue.sinks.k1.hdfs.rollSize = 1048576
# roll to a new file after N records.
# 0=disable (recommended for production)
bdglue.sinks.k1.hdfs.rollCount = 100
# roll to a new file after N seconds. 0=disable
bdglue.sinks.k1.hdfs.rollInterval = 30


And that’s it. We are now all set to deliver Avro encoded data into *.avro files in HDFS.
Making Data Stored in HDFS Accessible to Hive
So now we have built and demonstrated the foundation for what comes next … making the data
accessible via other Hadoop technologies. In this section, we’ll look at accessing data from Hive.
You may be wondering why we went to the trouble we did in the previous section. It certainly seems like
a lot of work just to put all that data into HDFS. The answer to that question is: “Hive.” It turns out that
once data has been properly serialized and stored in Avro format, Hive can make use of it directly … no
need to Sqoop the data into Hive, etc. By approaching things this way, we avoid an extra “Sqoop”
step and eliminate any potential performance impact of writing the data directly into Hive tables on
the fly. Of course, you can always choose to import the data into actual Hive storage later if you wish.
Configuration
The configuration for doing this is exactly the same as in the previous section. Since there are no
differences in the Flume configuration, we won’t repeat it here. There are no differences in the
bdglue.properties file either, but we repeat it below to highlight one property. The value of this
property must match the corresponding value specified to the SchemaDef utility when generating the
Hive Query Language DDL for the corresponding tables.

# configuring BDGlue to create HDFS-formatted files that
# can be accessed by Hive.
#
bdglue.encoder.class = com.oracle.bdglue.encoder.AvroEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = true
# The URI in HDFS where schemas will be stored.
# Required by the Flume sink event serializer.
bdglue.event.avro-hdfs-schema-path = hdfs:///user/flume/gg-data/avro-schema/
# local path where bdglue can find the avro *.avsc schema files
bdglue.event.avro-schema-path = /local/path/to/avro/schema/files
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc


Accessing *.avro Files From Hive
Hive is smart enough to be able to access *.avro files where they live, and in this section we’ll show you
how that works.
The first, and only real step, is to create a table in Hive and tell it to read data from *.avro files. You’ll
notice a couple of key things:

- We do not need to specify the columns, their types, etc. All of this information is found in the Avro schema metadata, so all we have to do is point Hive to the schema and we’re all set.
- This process is making use of Hive’s Avro SerDe (serializer and deserializer) mechanism to decode the Avro data.

What is especially nice is that we can use the SchemaDef utility to generate the Hive table definitions
like the following example. See Generating Hive Table Definitions for Use with Avro Schemas for more
information.

DROP TABLE CUST_INFO;
CREATE EXTERNAL TABLE CUST_INFO
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/gg-data/bdgluedemo.CUST_INFO/'
TBLPROPERTIES (
'avro.schema.url'=
'hdfs:///user/flume/gg-data/avro-schema/bdgluedemo.CUST_INFO.avsc'
);

And as a validation that this all works as expected, review the following.

hive> describe CUST_INFO;
OK
Id          int       from deserializer
Name        string    from deserializer
Gender      string    from deserializer
City        string    from deserializer
Phone       string    from deserializer
old_id      int       from deserializer
zip         string    from deserializer
cust_date   string    from deserializer
Time taken: 0.545 seconds, Fetched: 8 row(s)
hive> select * from CUST_INFO limit 5;
OK
1601  Dane Nash         Male    Le Grand-Quevilly   (874) 373-6196  1  81558-771  2014/04/13
1602  Serina Jarvis     Male    Carlton             (828) 764-7840  2  70179      2014/03/14
1603  Amos Fischer      Male    Fontaine-l'Evique   (141) 398-6160  3  9188       2015/02/06
1604  Hamish Mcpherson  Male    Edmonton            (251) 120-8238  4  T4M 1S9    2013/12/21
1605  Chadwick Daniels  Female  Ansfelden           (236) 631-9213  5  38076      2015/04/05
Time taken: 0.723 seconds, Fetched: 5 row(s)
hive>

Delivering Data to Kafka
Kafka is a fast, scalable, and fault-tolerant publish-subscribe messaging system that is frequently used in
place of more traditional message brokers in “Big Data” environments. As with traditional message
brokers, Kafka has the notion of a “topic” to which events are published. Data is published by a Kafka
“producer”.
Data written to a topic by a producer can further be partitioned by the notion of a “key”. The key serves
two purposes: to aid in partitioning data that has been written to a topic for reasons of scalability, and in
our case to aid downstream “consumers” in determining exactly what data they are looking at.
In the case of BDGlue, by default all data is written to a single topic, and the data is further partitioned
by use of a key. The key in this case is the table name, which can be used to route data to particular
consumers, and additionally tells those consumers what exactly they are looking at.
Finally, Kafka supports the notion of “batch” or “bulk” writes using an asynchronous API that accepts
many messages at once to aid in scalability. BDGlue takes advantage of this capability by writing batches
of messages at once. The batch size is configurable, as is a timeout specified in milliseconds that will
force a “flush” in the event that too much time passes before a batch is completed and written.
When publishing events, Kafka is expecting three bits of information (illustrated in the sketch after this
list):

• Topic – which will be the same for all events published by an instance of the Kafka Publisher.
• Key – which will correspond to the table name that relates to the encoded data.
• Body – the actual body of the message that is to be delivered. The format of this data may be
  anything. In the case of the Kafka publisher, any of the encoded types are supported: Delimited
  Text, JSON, and Avro.
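
To make those three pieces concrete, here is roughly what they look like when handed directly to the
Kafka producer API. This is an illustrative, hand-written sketch rather than BDGlue code; the topic, table
name, and JSON body simply mirror the examples used elsewhere in this document.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TopicKeyBodyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // the broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // topic, key (the table name), and body (the encoded row)
            producer.send(new ProducerRecord<>("goldengate",
                    "bdgluedemo.CUST_INFO",
                    "{\"ID\": 1601, \"NAME\": \"Dane Nash\"}"));
            producer.flush();   // roughly analogous to BDGlue flushing a batch on size or timeout
        }
    }
}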

Note that there are some additional Java dependencies required to execute the Kafka publisher beyond
those required to compile BDGlue; these must be added to the classpath in the Java Adapter properties
file. In this case, the specific order of the dependencies listed is very important. If you make a mistake
here, you will likely find the wrong entry point into Kafka and the results will be indeterminate.

#Adapter Logging parameters.
#log.logname=ggjavaue
#log.tofile=true
log.level=INFO

#Adapter Check pointing parameters
goldengate.userexit.chkptprefix=GGHCHKP_
goldengate.userexit.nochkpt=true
# Java User Exit Property
goldengate.userexit.writers=javawriter
# this is one continuous line
javawriter.bootoptions= -Xms64m -Xmx512M
-Dlog4j.configuration=ggjavaue-log4j.properties
-Dbdglue.properties=bdglue.properties
-Djava.class.path=./gghadoop:./ggjava/ggjava.jar
#
#Properties for reporting statistics
# Minimum number of {records, seconds} before generating a report
javawriter.stats.time=3600
javawriter.stats.numrecs=5000
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
#Hadoop Handler.
gg.handlerlist=gghadoop
gg.handler.gghadoop.type=com.oracle.gghadoop.GG12Handler
gg.handler.gghadoop.mode=op
gg.classpath=./gghadoop/lib/*:/kafka/kafka_2.10-0.8.2.1/libs/kafka-clients-0.8.2.1.jar:/kafka/kafka_2.10-0.8.2.1/libs/*

Configuring the Kafka Publisher
Configuring the Kafka Publisher is actually very straightforward:

• Configure an encoder (note that the "NullEncoder" is not supported by this publisher). The
  encoder must be for one of the actual supported data formats: Avro, JSON, or Delimited.
• Configure the KafkaPublisher.

# bdglue.properties file for delivery to Kafka
#
bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.tx-position = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-columnfamily = true
bdglue.event.header-longname = false
bdglue.publisher.class = com.oracle.bdglue.publisher.kafka.KafkaPublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = table
bdglue.kafka.topic = goldengate
bdglue.kafka.batchSize = 3
bdglue.kafka.flushFreq = 500
bdglue.kafka.metadata.broker.list = localhost:9092

Note that there are several "bdglue.kafka" properties located toward the bottom of the example
above. Only one of those is actually required, and that is the broker list. This is defined in the Kafka
documentation and tells the KafkaPublisher which Kafka broker(s) to deliver events to. Information
about these and a few other Kafka-related properties can be found in the appendix at the end of this
document.
Using Flume to Deliver Data to Kafka
While in most situations users will configure BDGlue to deliver data to Kafka directly, BDGlue also
supports the delivery of data to Kafka by way of Flume. This approach might be useful if a more
complicated flow of data is required than either BDGlue or Kafka can provide on its own. Flume's
ability to fork and merge data flows, or augment the flow with additional processors (called
‘interceptors’) can prove to be extremely powerful when defining the architecture of a data flow.
Configuring BDGlue
First we must configure BDGlue to deliver the data to Flume. Just as with the KafkaPublisher, the data
must be encoded in one of the supported formats: Delimited Text, Avro, or JSON. You’ll see that the
bdglue.properties file is simpler than some as there isn’t much for BDGlue to do other than encode the
data and hand it on.


# Configuring BDGlue to deliver data to Kafka
# by way of Flume (bdglue.properties)
#
bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc

Configuring the BDGlue Sink for Kafka
You will see that for a number of targets that we can deliver to via Flume, we have developed a custom
Flume “sink” to process the data as we expect. Configuration is still relatively straight forward.

# list the sources, channels, and sinks for the agent
ggflume.sources = s1
ggflume.channels = c1
ggflume.sinks = k1
# Map the channels to the source. One channel per table being captured.
ggflume.sources.s1.channels = c1
# Set the properties for the source
ggflume.sources.s1.type = avro
ggflume.sources.s1.bind = localhost
ggflume.sources.s1.port = 41414
ggflume.sources.s1.selector.type = replicating
# Set the properties for the channels
# c1 is the default ... it will handle unspecified tables.
ggflume.channels.c1.type = memory
ggflume.channels.c1.capacity = 1000
ggflume.channels.c1.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
ggflume.sinks.k1.channel = c1
ggflume.sinks.k1.type = com.oracle.bdglue.publisher.flume.sink.kafka.KafkaSink
ggflume.sinks.k1.batchSize = 3
ggflume.sinks.k1.brokerList = localhost:9092
ggflume.sinks.k1.topic = goldengate

You will see that there are some required properties specific to this Kafka delivery:

• type – identifies the Flume Sink we are calling.
• batchSize – identifies the number of Flume events we should queue before actually delivering to
  Kafka. Note that this is a "pull" architecture, and if the sink looks for and doesn't find another
  event to process, it will deliver what it has accumulated to Kafka.
• brokerList – is required and identifies the broker(s) that we should deliver to.
• topic – provides the name of the Kafka topic that we should publish the events to.

And there are some optional properties as well (not specified in the example above):

• requiredAcks – defines the sort of acknowledgement we should expect before continuing: 0 =
  none, 1 = wait for acknowledgement from a single broker, and -1 = wait for acknowledgement
  from all brokers.
• kafka.serializer.class – override the default serializer when delivering the message body. This
  capability is present, but it is not likely that you will need to do so.
• kafka.key.serializer.class – override the default serializer used for encoding the key (the table
  name in our case). This capability is present, but it is not likely that you will have reason to
  override this property.

Once configured, you simply start Flume and BDGlue as you otherwise would. See the next section to
get ideas on how to validate that data is successfully being delivered.
Validating Delivery to Kafka
Note that for data to be delivered successfully, the Kafka broker must be running when BDGlue attempts
to write to it. The broker may be installed and running as a service, or if not, will need to be started by
hand. There is a script to do this that can be found in the “bin” directory of the Kafka installation, and a
default set of properties can be found in the “config” directory:

./bin/kafka-server-start.sh config/server.properties

In the Kafka architecture, both the BDGlue KafkaPublisher and the Flume Kafka “sink” serve the role of
“Kafka Producer”. In order to see what has been delivered to Kafka, there will need to be a consumer.
Kafka has a sample consumer, called the “Console Consumer” which is great for smoke testing the
environment. The Console Consumer basically reads messages that have been posted to a topic and
writes them to the screen.


./bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic goldengate \
    --from-beginning --property print.key=true

The “print.key” property causes the consumer to print the topic “key” (in our case, the table name) to
the console along with the message. Note that if you are going to use the Console Consumer, it would
probably be best to configure the JsonEncoder during this time as the data that is output will be in a
text-based format. Data encoded by the AvroEncoder can contain binary data and will not be as legible
on your screen.
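
A small Java consumer can serve the same smoke-testing purpose as the Console Consumer. The sketch
below is illustrative only and uses the modern KafkaConsumer API (kafka-clients 2.x; the 0.8.2 client
shown earlier in this section predates this consumer). The group id is an arbitrary name chosen for this
example.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ValidateKafkaTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "bdglue-validation");           // arbitrary consumer group for this test
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");           // read from the beginning, like --from-beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("goldengate"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            for (ConsumerRecord<String, String> record : records) {
                // the key is the table name; the value is the encoded row (JSON is easiest to read)
                System.out.println(record.key() + " : " + record.value());
            }
        }
    }
}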

Delivering Data to HBase
Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed
File System (HDFS). An HBase system comprises a set of tables. Each table contains rows and columns,
much like a traditional database, and it also has an element defined as a key.
All access to HBase tables must use the defined key. While similar in nature to a primary key in a
relational database, a key in HBase might be used a little differently … defined and based specifically on
how the data will be accessed after it has been written.
An HBase column represents an attribute of an object and in our case likely a direct mapping of a
column from a relational database. HBase allows for many columns to be grouped together into what
are known as column families, such that the elements of a column family are all stored together. This is
different from a row-oriented relational database, where all the columns of a given row are stored
together.
With HBase you must predefine the table schema and specify the column families. However, it is very
flexible in that new columns can be added to families at any time, making the schema flexible and
therefore able to adapt to changing application requirements.
Currently in BDGlue, we map each source table into a single column family of a corresponding table in
HBase, creating a key from the key on the relational side. This may not be the best approach in some
circumstances, however. The very nature of HBase cries out for keys that are geared toward the actual
way you are likely to access the data via map reduce (which is likely quite different from how you would
access a relational table). In some cases, it would be most optimal to combine relational tables that
share a common key on the relational side into a single table in HBase, having a separate column family
for each mapped table. This is all possible in theory, and a future version of this code may support a
JSON-based specification file to define the desired mappings.
Connecting to HBase via the Asynchronous HBase Publisher
Just as with Kafka, configuring the Asynchronous HBase Publisher is very straightforward:

• Configure the NullEncoder.
• Configure the AsyncHbasePublisher.

This is done by setting bdglue.properties as follows:

#
# bdglue.properties file for the AsyncHbasePublisher
#
bdglue.encoder.class = com.oracle.bdglue.encoder.NullEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.tx-position = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-columnfamily = true
bdglue.event.header-longname = false
bdglue.publisher.class = com.oracle.bdglue.publisher.asynchbase.AsyncHbasePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.async-hbase.batchSize = 5
bdglue.async-hbase.timeout = 5000

Using Flume to Deliver Data to HBase
Connecting to HBase via Flume was a bit more complicated than working with HDFS and Hive. While we
were able to make things work properly with multiple tables using the out-of-the-box Flume agent
components with HDFS and Hive, we weren’t able to do that with HBase. To accomplish our goal of
supporting multiple tables with HBase via a single channel, we had to take advantage of the flexibility of
Flume and implement a custom Flume sink and sink serializer. This wasn’t particularly hard to
accomplish, however.
As with our other examples, we first need to configure the bdglue.properties file with the appropriate
properties. In this case, we will transmit the data in JSON format, which the custom sink was designed to
expect, and we will specify HBase as the target.

# bdglue.properties for writing to HBase via Flume
#
bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.encoder.json.text-only = false
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.tx-position = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-columnfamily = true
bdglue.event.header-longname = false
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc

With the exception of specifying the custom sink information, the Flume configuration properties are
actually a little simpler than prior examples.


# list the sources, channels, and sinks for the agent
bdglue.sources = s1
bdglue.channels = c1
bdglue.sinks = k1
# Map the channels to the source. One channel per table being captured.
bdglue.sources.s1.channels = c1
# Set the properties for the source
bdglue.sources.s1.type = avro
bdglue.sources.s1.bind = localhost
bdglue.sources.s1.port = 41414
bdglue.sources.s1.selector.type = replicating
# Set the properties for the channels
bdglue.channels.c1.type = memory
bdglue.channels.c1.capacity = 1000
bdglue.channels.c1.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
bdglue.sinks.k1.channel = c1
bdglue.sinks.k1.type =
com.oracle.bdglue.publisher.flume.sink.asynchbase.BDGlueAsyncHbaseSink
bdglue.sinks.k1.batchSize = 100
bdglue.sinks.k1.timeout = 6000

Finally, remember to add the bdglue.jar file to the Flume class path as described in Running Flume
earlier in this document.
Basic HBase Administration
There are a couple of things we need to do to make sure that HBase is ready to receive data.
First off, we need to make sure that HBase is running. This requires ‘sudo’ access on Linux/Unix.

#> sudo service hbase-master start
Starting HBase master daemon (hbase-master):                  [ OK ]
HBase master daemon is running                                [ OK ]
#>
#> sudo service hbase-regionserver start
Starting Hadoop HBase regionserver daemon: starting regionserver, logging to
/var/log/hbase/hbase-hbase-regionserver-bigdatalite.localdomain.out
hbase-regionserver.
#>


And we also need to create the tables in HBase to receive the information we want to write there. If you
are not aware, HBase has something called “column families”. All columns reside within a column family,
and each table can have multiple column families if desired. For the purpose of this adapter, we are
assuming a default name of ‘data’ for the column family, and are putting all columns from the relational
source in there.
The example below creates table CUST_INFO having a single column family called ‘data’. Before doing
that, we check the status to be sure that we have a region server up and running.

[ogg@bigdatalite ~]$
[ogg@bigdatalite ~]$ hbase shell
2014-10-03 17:40:40,439 INFO [main] Configuration.deprecation: Hadoop.native.lib is
deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.96.1.1-cdh5.0.3, rUnknown, Tue Jul 8 15:35:51 PDT 2014
hbase(main):001:0> status
1 servers, 0 dead, 3.0000 average load
hbase(main):007:0> create 'CUST_INFO', 'data'
0 row(s) in 0.8300 seconds

hbase(main):006:0> scan 'CUST_INFO'
ROW                   COLUMN+CELL
 /7021                column=data:CITY, timestamp=1434739949881, value=Le Grand-Quevilly
 /7021                column=data:CUST_DATE, timestamp=1434739949901, value=2014/04/13
 /7021                column=data:GENDER, timestamp=1434739949899, value=Male
 /7021                column=data:ID, timestamp=1434739949843, value=7021
 /7021                column=data:NAME, timestamp=1434739949844, value=Dane Nash
 /7021                column=data:OLD_ID, timestamp=1434739949847, value=1
 /7021                column=data:PHONE, timestamp=1434739949846, value=(874) 373-6196
 /7021                column=data:ZIP, timestamp=1434739949847, value=81558-771
 /7022                column=data:CITY, timestamp=1434739949834, value=Carlton
 /7022                column=data:CUST_DATE, timestamp=1434739949838, value=2014/03/14
 /7022                column=data:GENDER, timestamp=1434739949826, value=Male
 /7022                column=data:ID, timestamp=1434739949892, value=7022
 /7022                column=data:NAME, timestamp=1434739949825, value=Serina Jarvis
 /7022                column=data:OLD_ID, timestamp=1434739949835, value=2
 /7022                column=data:PHONE, timestamp=1434739949845, value=(828) 764-7840
 /7022                column=data:ZIP, timestamp=1434739949836, value=70179
 totalEvents          column=data:eventCount, timestamp=1434739949985, value=\x00\x00\x00\x00\x00\x00\x00\x02
2 row(s) in 0.1020 seconds
hbase(main):009:0> exit
[ogg@bigdatalite ~]$
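
As an alternative to the hbase shell, a row can also be read back programmatically with the HBase Java
client. The following is a minimal, illustrative sketch (not BDGlue code) written against the HBase 1.x
client API (older 0.9x clients use HTable instead); it assumes hbase-site.xml is on the classpath and uses
the table, 'data' column family, and row key from the example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadBack {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("CUST_INFO"))) {
            // the row key is the concatenated relational key, "/7021" in the scan above
            Get get = new Get(Bytes.toBytes("/7021"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("NAME"));
            System.out.println("NAME = " + Bytes.toString(name));
        }
    }
}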

Delivering Data to Oracle NoSQL
The Oracle NoSQL Database[3] is a leading player in the NoSQL space. Oracle NoSQL Database provides a
powerful and flexible transaction model that greatly simplifies the process of developing a NoSQL-based
application. It scales horizontally with high availability and transparent load balancing even when
dynamically adding new capacity, bringing industrial strength into an arena where it is often found to be
lacking.
Some key benefits that the product brings to the Big Data "table" are:

• Simple data model using key-value pairs with secondary indexes
• Simple programming model with ACID transactions, tabular data models, and JSON support
• Application security with authentication and session-level SSL encryption
• Integrated with Oracle Database, Oracle Wallet, and Hadoop
• Geo-distributed data with support for multiple data centers
• High availability with local and remote failover and synchronization
• Scalable throughput and bounded latency

Oracle NoSQL can be used with or without Hadoop. It supports two APIs for storing and retrieving data:
the KV (key-value) API, and the Table API. Each API has its own strengths. The KV API is more
“traditional”, but the Table API is gaining a lot of momentum in the market. This adapter supports
interfacing with Oracle NoSQL with both APIs.
KV API Support
The KV API writes data to Oracle NoSQL in key-value pairs, where the key is a text string that looks much
like a file system path name, with each “node” of the key preceded by a slash (‘/’). For example, keys
based on customer names might look like:

/smith/john
/smith/patty
/hutchison/don

[3] More information on Oracle's NoSQL Database can be found here:
http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html


BDGlue creates the key for each row by concatenating a string representation of each relational column
that makes up the primary key in the order that the columns are listed in the relational table’s
metadata.
A “value” is data of some sort. It may be text-based, or binary. The structure obviously must be
understood by the application. Oracle NoSQL itself is very powerful, however, and there is much
database “work” it is able to do if it can understand the data. As it turns out, Oracle NoSQL supports
Avro schemas, something that we have already discussed in the context of other Big Data targets.
BDGlue makes good use of this Avro encoding.
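
To make the key format concrete, the sketch below shows roughly how a key such as /smith/john maps
onto the Oracle NoSQL KV Java API and how the stored value bytes would be fetched. It is illustrative
only (not BDGlue code); the store name, host, and port match the configuration examples later in this
section.

import java.util.Arrays;
import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.Key;
import oracle.kv.ValueVersion;

public class KvKeyExample {
    public static void main(String[] args) {
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "localhost:5000"));

        // "/smith/john" expressed as its major path components
        Key key = Key.createKey(Arrays.asList("smith", "john"));
        ValueVersion vv = store.get(key);
        if (vv != null) {
            // BDGlue stores the value as Avro-encoded bytes described by the table's schema
            byte[] bytes = vv.getValue().getValue();
            System.out.println("value is " + bytes.length + " bytes of Avro-encoded data");
        }
        store.close();
    }
}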
Table API Support
The Oracle NoSQL Table API is a different way of storing and accessing data. In theory, you can leverage
the same data via the KV and Table APIs, but we are not approaching things in that fashion. The Table
API maps data in a “row” on a column-by-column basis.
BDGlue maps the source tables and their columns directly to tables in Oracle NoSQL of essentially the
same structure. Key columns are also mapped one-for-one.
NoSQL Transactional Durability
Before we get to specific configurations, we should also mention at this point the "durability" property,
which is applicable to all aspects of this adapter: direct to NoSQL, or via Flume; and for both the Table
and KV APIs. Durability effectively addresses the "guarantee" that data is safe and sound in the event of
a badly timed failure. Oracle NoSQL supports different approaches to syncing transactions once they are
committed (i.e. durability). BDGlue supports three sync models (see the sketch after this list):

• SYNC: Commit onto disk at the master and replicate to a simple majority of replicas. This is the
  most durable. When the commit returns to the caller, you can be absolutely certain that the data
  will still be there no matter what the failure situation. It is also the slowest.
• WRITE_NO_SYNC: Commit onto disk at the master but do not wait for data to replicate to other
  nodes. This is of medium performance as it writes to the master, but doesn't wait for the data to
  be replicated before returning to the caller after a commit.
• NO_SYNC: Commit only into master memory and do not wait for the data to replicate to other
  nodes. This is the fastest mode as it returns to the caller immediately upon handing the data to
  the NoSQL master. At that point, the data has not been synced to disk and could be lost in the
  event of a failure at the master.
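
For reference, these settings correspond to the sync policies exposed by the Oracle NoSQL Java driver.
BDGlue itself is configured simply by setting bdglue.nosql.durability, but the sketch below is one
plausible illustration (not BDGlue's actual internal mapping) of what WRITE_NO_SYNC means at the
driver level; the replica acknowledgement policy chosen here is an assumption.

import oracle.kv.Durability;
import oracle.kv.KVStoreConfig;

public class DurabilityExample {
    public static void main(String[] args) {
        // One plausible rendering of WRITE_NO_SYNC: the master commits to disk, replicas are
        // not synced, and (as an assumption here) no replica acknowledgement is waited for.
        Durability writeNoSync = new Durability(
                Durability.SyncPolicy.WRITE_NO_SYNC,     // master sync policy
                Durability.SyncPolicy.NO_SYNC,           // replica sync policy
                Durability.ReplicaAckPolicy.NONE);       // replica ack policy (assumption)

        // The durability can be applied as the default for a store handle.
        KVStoreConfig config = new KVStoreConfig("kvstore", "localhost:5000");
        config.setDurability(writeNoSync);
        System.out.println("default durability: " + writeNoSync);
    }
}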

Connecting Directly to Oracle NoSQL via the NoSQL Publisher
Connecting and delivering data to Oracle NoSQL is not particularly complicated.
Configuring for Delivery to the KV API
Delivery to the KV API is straightforward … the key is a concatenated string based on the columns from
the source table that comprise the primary key, and the value is an Avro-encoded record containing all
of the columns that have been captured, including the key columns.


The first step, obviously, is to configure the bdglue.properties file.

# bdglue.properties file for direct connection to
# Oracle NoSQL via the KV API.
#
bdglue.encoder.class = com.oracle.bdglue.encoder.AvroEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.tx-position = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-columnfamily = true
bdglue.event.header-longname = true
bdglue.event.avro-schema-path = ./gghadoop/avro
bdglue.publisher.class = com.oracle.bdglue.publisher.nosql.NoSQLPublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.nosql.host = localhost
bdglue.nosql.port = 5000
bdglue.nosql.kvstore = kvstore
bdglue.nosql.durability = WRITE_NO_SYNC
bdglue.nosql.api = kv_api

The above properties are all that is required. See the admin section Basic Oracle NoSQL Administration
for some basic information regarding how to define tables in Oracle NoSQL, etc.
Configuring for Delivery via the Table API
BDGlue maps the source tables and their columns directly to tables in Oracle NoSQL of essentially the
same structure. Key columns are also mapped one-for-one.
Just as always, the first (and in this case the only) thing we need to do is configure the adapter to format
and process the data as we expect via the bdglue.properties file. For the NoSQL Table API, we configure
the NullEncoder because BDGlue writes the data to NoSQL on a column-by-column basis.

# bdglue.properties for delivering directly to
# the Oracle NoSQL Table API.
#
bdglue.encoder.class = com.oracle.bdglue.encoder.NullEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.tx-position = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-columnfamily = true
bdglue.event.header-longname = false
bdglue.publisher.class = com.oracle.bdglue.publisher.nosql.NoSQLPublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.nosql.host = localhost
bdglue.nosql.port = 5000
bdglue.nosql.kvstore = kvstore
bdglue.nosql.durability = WRITE_NO_SYNC
bdglue.nosql.api = table_api

Using Flume to Deliver Data into the Oracle NoSQL Database
Just as it did to integrate with HBase, BDGlue also requires a custom Flume sink in order to
communicate with Oracle NoSQL. Communication with Oracle NoSQL occurs via RPC, and this document
assumes that Oracle NoSQL is already up and running, and configured to listen on the specified port.
Configuring for Delivery via the KV API
As with the other target environments, the first step is to configure BDGlue itself via the
bdglue.properties file.

# bdglue.properties for writing to NoSQL KV API via Flume
#
bdglue.encoder.class = com.oracle.bdglue.encoder.AvroEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-longname = true
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc

It is pretty much the same as the other targets.
Next we need to configure Flume itself. Configuration of the Flume Source and Flume Channel is the
same as before, but configuration of the sink is much different.

# list the sources, channels, and sinks for the agent
ggflume.sources = s1
ggflume.channels = c1
ggflume.sinks = k1
# Map the channels to the source. One channel per table being captured.
ggflume.sources.s1.channels = c1
# Set the properties for the source
ggflume.sources.s1.type = avro
ggflume.sources.s1.bind = localhost
ggflume.sources.s1.port = 41414
ggflume.sources.s1.selector.type = replicating
# Set the properties for the channels
ggflume.channels.c1.type = memory
ggflume.channels.c1.capacity = 1000
ggflume.channels.c1.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
ggflume.sinks.k1.channel = c1
ggflume.sinks.k1.type = com.oracle.bdglue.target.flume.sink.nosql.BDGlueNoSQLSink
ggflume.sinks.k1.kvHost = localhost
ggflume.sinks.k1.kvPort= 5000
ggflume.sinks.k1.kvStoreName = kvstore
ggflume.sinks.k1.durability = WRITE_NO_SYNC
# kv_api or table_api
ggflume.sinks.k1.kvapi= kv_api

Sink configuration is the same for both the KV and Table APIs. The only difference is the last line: kv_api
in this case.
As mentioned previously, BDGlue assumes that Oracle NoSQL is up, running and listening on the
specified port.


Configuring for Delivery via the Table API
Usage and access to Oracle NoSQL over Flume via the Table API is somewhat similar to how we interface
with HBase. In fact, just as with HBase, we will pass the data into Flume in a JSON format so that we can
manipulate it directly.
BDGlue maps the source tables and their columns directly to tables in Oracle NoSQL of essentially the
same structure. Key columns are also mapped one-for-one.
Just as always, the first thing we need to do is configure the adapter to format and process the data as
we expect via the bdglue.properties file.

# bdglue.properties for writing to NoSQL Table API via Flume
#
bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tx-optype = false
bdglue.encoder.tx-timestamp = false
bdglue.encoder.user-token = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-longname = false
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 41414
bdglue.flume.rpc.type = avro-rpc

The key difference to note for the Table API vs. the KV API is that we specified the JsonEncoder rather
than the AvroEncoder. The custom sink implementation actually parses the JSON-formatted data to
deliver the data to the NoSQL Table API column-by-column as the API expects.
And of course, we also need to configure Flume as well.


# list the sources, channels, and sinks for the agent
ggflume.sources = s1
ggflume.channels = c1
ggflume.sinks = k1
# Map the channels to the source. One channel per table being captured.
ggflume.sources.s1.channels = c1
# Set the properties for the source
ggflume.sources.s1.type = avro
ggflume.sources.s1.bind = localhost
ggflume.sources.s1.port = 41414
ggflume.sources.s1.selector.type = replicating
# Set the properties for the channels
ggflume.channels.c1.type = memory
ggflume.channels.c1.capacity = 1000
ggflume.channels.c1.transactionCapacity = 100
# Set the properties for the sinks
# map the sinks to the channels
ggflume.sinks.k1.channel = c1
ggflume.sinks.k1.type = com.oracle.bdglue.target.flume.sink.nosql.BDGlueNoSQLSink
ggflume.sinks.k1.kvHost = localhost
ggflume.sinks.k1.kvPort= 5000
ggflume.sinks.k1.kvStoreName = kvstore
ggflume.sinks.k1.durability = WRITE_NO_SYNC
# kv_api or table_api
ggflume.sinks.k1.kvapi= table_api

As mentioned previously, the only difference in the configuration between this and the KV API is the last
line. In this case, we specify “table_api”.
Basic Oracle NoSQL Administration
Oracle NoSQL comes in two flavors, a “lite” version for basic testing, and an Enterprise version that
contains more robust capabilities that enterprise deployments might require. Administratively, they are
virtually identical as far as the features we are leveraging in BDGlue … so we keep it simple and test with
the “lite” version, called KVLite.
Starting Oracle NoSQL from the Command Line
Oracle NoSQL runs as a Java process.

#
# starts the kvlite NoSQL instance, listening on default port of 5000
#
[nosqlhome]$ KVHOME="/u01/nosql/kv-ee"
[nosqlhome]$ java -Xmx256m -Xms256m -jar $KVHOME/lib/kvstore.jar kvlite
Opened existing kvlite store with config:
-root ./kvroot -store kvstore -host bigdatalite.localdomain -port 5000 -admin 5001

Running the KVLite Administration Command Line Interface
The command line utility is where you will define tables, review data stored in Oracle NoSQL, etc.

#
# run the kvlite command line interface
#
[nosqlhome]$ KVHOME="/u01/nosql/kv-ee"
[nosqlhome]$ java -Xmx256m -Xms256m -jar $KVHOME/lib/kvstore.jar runadmin -port 5000
-host localhost
kv->
kv-> connect store -name kvstore
Connected to kvstore
Connected to kvstore at localhost:5000.
kv->

KV API: Creating Tables in Oracle NoSQL
The KV API relies on delivery of key-value pairs. In the case of BDGlue, the key is a concatenation of the
columns that make up the key in the relational database source. The “value” is an array of bytes that
contain the data we are storing. In our case, we are choosing to encode the “value” in Avro format both
because it is a compact representation of the data, and because it self-describes to Oracle NoSQL.
KV API: Preparing the Avro Schemas
Preparing the Avro schemas is a two-step process:

• Generate the schemas from the source table metadata.
• Load the schemas into Oracle NoSQL.

The steps to generate the schemas are exactly as described in Generating Avro Schemas. Refer to that
section for more information.
Loading the generated schemas into NoSQL is a relatively straightforward process. First you must log
into NoSQL with the admin utility. Once you are logged in and have the command prompt, do the
following:

kv-> ddl add-schema -file ./bdglue.CUST_INFO.avsc -force
Added schema: bdglue.CUST_INFO.1
8 warnings were ignored.     << Ignore if you see this: the result of not setting default values
kv-> ddl add-schema -file ./bdglue.MYCUSTOMER.avsc -force
Added schema: bdglue.MYCUSTOMER.2
22 warnings were ignored.    << Ignore if you see this: the result of not setting default values
kv-> show schemas
bdglue.CUST_INFO
  ID: 1  Modified: 2014-10-21 19:19:22 UTC, From: bigdatalite.localdomain
bdglue.MYCUSTOMER
  ID: 2  Modified: 2014-10-21 19:19:53 UTC, From: bigdatalite.localdomain
kv->

Once complete, you are ready to capture data from a source database and deliver into the Oracle NoSQL
data store.
KV API: Validating Your Data
Oracle NoSQL doesn’t provide an easy way to query data stored in KV pairs from the command line. To
do this, you need to have the key to the row you want to see.

kv-> get kv -key /2978
{
"ID" : 2978,
"NAME" : "Basia Foley",
"GENDER" : "Female",
"CITY" : "Ichtegem",
"PHONE" : "(943) 730-2640",
"OLD_ID" : 8,
"ZIP" : "T1X 1M5",
"CUST_DATE" : "2015/01/16"
}
kv->

In the example above, the key was "/2978" (note the preceding slash). Also note that Oracle NoSQL
understood the structure of the stored value object. This is because we generated and used the Avro
schema when writing the key-value pair to the database.

Table API: Creating Tables in Oracle NoSQL
Just as with a relational database, you have to create tables in Oracle NoSQL in order to use the Table
API. Table creation commands can be quite cumbersome, but we have simplified the process somewhat
by configuring the SchemaDef utility (discussed later in this document in Generating Avro Schemas with
SchemaDef) to generate the NoSQL DDL for us. Just as for Avro, the utility connects
to the source database via JDBC. Everything is essentially the same as before except for the output
format.


Here is what the schemadef.properties file might look like:

# jdbc connection information
schemadef.jdbc.driver = com.mysql.jdbc.Driver
schemadef.jdbc.url = jdbc:mysql://localhost/bdglue
# Oracle JDBC connection info
#schemadef.jdbc.driver = oracle.jdbc.OracleDriver
#schemadef.jdbc.url = jdbc:oracle:thin:@//:/
schemadef.jdbc.username = root
schemadef.jdbc.password = welcome1
# output format: avro, nosql
schemadef.output.format = nosql
schemadef.output.path = ./output
# encode numeric/decimal types as string, double, float
schemadef.numeric-encoding = double
schemadef.set-defaults = true
schemadef.tx-optype = false
schemadef.tx-timestamp = false
# whitespace delimited list of schema.table pairs
schemadef.jdbc.tables = bdglue.MYCUSTOMER bdglue.CUST_INFO \
bdglue.TCUSTORD

And the utility is executed just as before:

DIR=/path/to/jars
CLASSPATH="$DIR/bdglue.jar"
CLASSPATH="$CLASSPATH:$DIR/slf4j-api-1.6.1.jar"
CLASSPATH="$CLASSPATH:$DIR/slf4j-simple-1.7.7.jar"
CLASSPATH="$CLASSPATH:$DIR/commons-io-2.4.jar"
CLASSPATH="$CLASSPATH:$DIR/jackson-core-asl-1.9.13.jar"
CLASSPATH="$CLASSPATH:$DIR/mysql-connector-java-5.1.34-bin.jar"
java -Dschemadef.properties=schemadef.properties -cp $CLASSPATH \
com.oracle.bdglue.utility.schemadef.SchemaDef

Here is a sample of a generated output file. Each output file contains the script needed to create a table
in Oracle NoSQL that corresponds to the source table.


## enter into table creation mode
table create -name CUST_INFO
add-field -type INTEGER -name ID
primary-key -field ID
add-field -type STRING -name NAME
add-field -type STRING -name GENDER
add-field -type STRING -name CITY
add-field -type STRING -name PHONE
add-field -type INTEGER -name OLD_ID
add-field -type STRING -name ZIP
add-field -type STRING -name CUST_DATE
## exit table creation mode
exit
## add the table to the store and wait for completion
plan add-table -name CUST_INFO -wait

And then we have to add the tables into Oracle NoSQL. We do this from the command prompt in the
Oracle NoSQL admin utility.

kv->
kv-> load -file ./output/CUST_INFO.nosql
Table CUST_INFO built.
Executed plan 5, waiting for completion...
Plan 5 ended successfully
kv-> load -file ./output/MYCUSTOMER.nosql
Table MYCUSTOMER built.
Executed plan 6, waiting for completion...
Plan 6 ended successfully
kv->

And now we are ready to capture data and deliver it into Oracle NoSQL.
Table API: Validating Your Data
Looking at data stored with the Table API is a little easier than with data stored with the KV API, but
don’t expect the power you might have with a SQL query.
Here is sample output from a single row stored in the CUST_INFO table.


kv-> get table -name CUST_INFO -field ID -value 3204 -pretty
{
  "ID" : "3204",
  "NAME" : "Adria Bray",
  "GENDER" : "Female",
  "CITY" : "Anklam",
  "PHONE" : "(131) 670-1907",
  "OLD_ID" : "94",
  "ZIP" : "27665",
  "CUST_DATE" : "2014/06/01"
}
kv->

In this case, we knew the ID column's value was "3204". If you leave off the -field and -value options,
you can get all rows in the table, by the way.
Just to further prove the point, here is some example output from the MYCUSTOMER table.

kv-> get table -name MYCUSTOMER -field id -value 2864 -pretty
{
"id" : "2864",
"LAST_NAME" : "Barnes",
"FIRST_NAME" : "Steel",
"STREET_ADDRESS" : "Ap #325-5990 A Av.",
"POSTAL_CODE" : "V0S 7A8",
"CITY_ID" : "14819",
"CITY" : "Reus",
"STATE_PROVINCE_ID" : "316",
"STATE_PROVINCE" : "CA",
"COUNTRY_ID" : "137",
"COUNTRY" : "Iran",
"CONTINENT_ID" : "1",
"CONTINENT" : "indigo",
"AGE" : "25",
"COMMUTE_DISTANCE" : "14",
"CREDIT_BALANCE" : "2934",
"EDUCATION" : "Zolpidem Tartrate",
"EMAIL" : "feugiat.nec@ante.com",
"FULL_TIME" : "YES",
"GENDER" : "MALE",
"HOUSEHOLD_SIZE" : "3",
"INCOME" : "116452"
}
kv->


And there you have it. We have validated that we successfully delivered data into the Oracle NoSQL
database via the Table API.
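
The same validation can also be performed from a Java application using the Oracle NoSQL Table API.
This is a minimal, illustrative sketch (not BDGlue code); the store, table, and key value follow the
examples above.

import oracle.kv.KVStore;
import oracle.kv.KVStoreConfig;
import oracle.kv.KVStoreFactory;
import oracle.kv.table.PrimaryKey;
import oracle.kv.table.Row;
import oracle.kv.table.Table;
import oracle.kv.table.TableAPI;

public class TableApiReadBack {
    public static void main(String[] args) {
        KVStore store = KVStoreFactory.getStore(
                new KVStoreConfig("kvstore", "localhost:5000"));
        TableAPI tableAPI = store.getTableAPI();
        Table table = tableAPI.getTable("CUST_INFO");

        // look up the row whose ID is 3204, as in the 'get table' example above
        PrimaryKey key = table.createPrimaryKey();
        key.put("ID", 3204);
        Row row = tableAPI.get(key, null);               // null = default read options
        if (row != null) {
            System.out.println(row.toJsonString(true));  // pretty-printed JSON, much like the CLI output
        }
        store.close();
    }
}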

Delivering Data to Cassandra
Cassandra is a ‘flavor’ of NoSQL database that has a very tabular feel. In fact, the syntax for CQL
(Cassandra Query Language) is very similar to SQL. Cassandra has become quite popular for a number of
reasons:

• It has a peer-to-peer architecture rather than one based on master-slave configurations. Any
  number of server nodes can be added to the cluster in order to increase reliability, as there is no
  single point of failure.
• It boasts elastic scalability by adding or removing nodes from the cluster.
• Cassandra's architecture delivers high availability and fault tolerance.
• Cassandra delivers very high performance on large sets of data.
• Cassandra is column-oriented, giving a tabular feel to things. Cassandra rows can be extremely
  wide.
• It has a tunable consistency model ranging from "eventual consistency" to "strong consistency"
  (which ensures that updates are written to all nodes).

The BDGlue Cassandra Publisher makes use of the Cassandra Java API published as Open Source by
DataStax. Make sure that the version of Cassandra you are using is compatible with the DataStax Java
API. Currently, DataStax claims compatibility with the latest stable release, Cassandra 3.0.x. It is not
known at this time if the monthly development releases (currently version 3.5) are compatible or not.
Feel free to experiment.
Each column in a Cassandra table will correspond to a column of the same name found in the relational
source. Data types are mapped as closely as possible, and default to ‘text’ in situations where there is no
direct mapping. Key columns are also mapped directly and in the order they are specified in the DDL
(and in the order they are returned by JDBC if you use the SchemaDef utility to generate the DDL). The
source schema name will correspond to the Cassandra “key space.”
Connecting to Cassandra via the Cassandra Publisher
To deliver data to Cassandra, you need to configure the Cassandra Publisher as follows:

• Configure the NullEncoder.
• Configure the CassandraPublisher.

This is done by setting the values in the bdglue.properties file as follows:
bdglue.encoder.class = com.oracle.bdglue.encoder.NullEncoder
bdglue.encoder.threads = 2
bdglue.encoder.tablename = false
bdglue.encoder.txid = false
bdglue.encoder.tx-optype = true
bdglue.encoder.tx-timestamp = false
bdglue.encoder.tx-position = false
bdglue.encoder.user-token = false
bdglue.encoder.include-befores = false
bdglue.encoder.ignore-unchanged = false
bdglue.event.generate-avro-schema = false
bdglue.event.header-optype = false
bdglue.event.header-timestamp = false
bdglue.event.header-rowkey = true
bdglue.event.header-avropath = false
bdglue.event.header-columnfamily = true
bdglue.event.header-longname = true
bdglue.publisher.class = com.oracle.bdglue.publisher.cassandra.CassandraPublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.cassandra.node = localhost
bdglue.cassandra.batch-size = 5
bdglue.cassandra.flush-frequency = 500
bdglue.cassandra.insert-only = false

Basic Cassandra Administration
This section briefly explains how to start Cassandra from the command line, run the CQL shell, generate
DDL that corresponds to the relational source tables, and apply that DDL into Cassandra.
Running Cassandra and the CQL Shell
First off, we have to make sure that Cassandra is running. To run Cassandra from the command line,
start it using 'sudo':

#> cd ./apache-cassandra-3.0.5      # the Cassandra installation directory
#> sudo ./bin/cassandra -f          # runs in the console window. CTRL-C to end.

Running the Cassandra shell follows much the same process, but ‘sudo’ is not required.

#> cd ./apache-cassandra-3.0.5      # the Cassandra installation directory
#> ./bin/cqlsh                      # runs in the console window. CTRL-C to end.
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.5 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh>

Note that ‘cqlsh’ is implemented using Python version 2.7. Most Linux systems currently have Python
2.6 installed as the default. Various utilities such as ‘yum’ seem to rely on this version of Python. Note
that these versions of Python are not compatible. If you don’t have version 2.7, you will have to install it.
The easiest way is to download the Python 2.7 source, build, and then install it into /usr/local/bin. Be
careful not to overwrite the default Python 2.6 (likely installed in /usr/bin/…).
Creating Tables in Cassandra
As with any “tabular” database, you need to create tables in Cassandra. Cassandra DDL looks much like
DDL for a relational database and we have simplified the process of creating table definitions that
correspond to the source tables that we will be capturing. This is done using the SchemaDef utility
discussed later in this document. See The “SchemaDef” Utility for more information on how to
configure and run SchemaDef for Cassandra and other targets.
Here is an example of generated Cassandra DDL:
CREATE KEYSPACE IF NOT EXISTS "bdglue"
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
DROP TABLE IF EXISTS bdglue.CUST_INFO;
CREATE TABLE bdglue.CUST_INFO
(
txoptype text,
ID int,
NAME text,
GENDER text,
CITY text,
PHONE text,
OLD_ID int,
ZIP text,
CUST_DATE text,
PRIMARY KEY (ID)
);

To define the target tables in Cassandra, do the following for each generated table definition:

#> cd 
#> bin/cqlsh < ~/ddl/bdglue.CUST_INFO.cql
#>

Validating Your Schema and Data
To check your schema in Cassandra, you simply do a “describe” as you would against a relational
database:
cqlsh> describe bdglue.CUST_INFO;
CREATE TABLE bdglue.cust_info (
id int PRIMARY KEY,
city text,
cust_date text,
gender text,
name text,
old_id int,
phone text,
txoptype text,
zip text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold':
'32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
cqlsh>

And to view some data:
cqlsh> select * from bdglue.CUST_INFO limit 2;

 id   | city              | cust_date  | gender | name         | old_id | phone          | txoptype | zip
------+-------------------+------------+--------+--------------+--------+----------------+----------+-----------
 4460 | Le Grand-Quevilly | 2014/04/13 |   Male |    Dane Nash |      1 | (874) 373-6196 |   INSERT | 81558-771
 4462 | Fontaine-l'Evique | 2015/02/06 |   Male | Amos Fischer |      3 | (141) 398-6160 |   INSERT |      9188

(2 rows)
cqlsh>
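
The same data can also be read from a Java application with the DataStax Java driver mentioned
earlier. The sketch below is illustrative only and is written against the 3.x driver API; the contact point,
keyspace, and column names follow the examples above.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraReadBack {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("localhost").build();
             Session session = cluster.connect()) {
            ResultSet rs = session.execute("SELECT id, name, city FROM bdglue.cust_info LIMIT 2");
            for (Row row : rs) {
                // column names match the generated Cassandra DDL shown earlier
                System.out.println(row.getInt("id") + " | " + row.getString("name")
                        + " | " + row.getString("city"));
            }
        }
    }
}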

Other Potential Targets
We have done some cursory examination and believe that publishers for other potential target
technologies including Impala, MongoDB, Couchbase, Elastic Search, and others are feasible and might
be developed in the future.


Source Configuration
The concept for BDGlue originated as we were building a Java-based adapter to Oracle GoldenGate to
Big Data targets. It quickly became evident that BDGlue had the potential of being far more generally
useful, however, so we made a deliberate effort to decouple the code that is used to tie BDGlue to a
source from all of the downstream logic that interfaces with Big Data targets to encourage broader use
of the solution.
However, BDGlue is of little use until it has been integrated with a data “source”. The data that is
delivered to a Big Data target has to originate somewhere. While the “source” obviously comes first in
any data pipeline, we have saved discussion of the source integrations until last in hope of truly
decoupling the discussion related to configuring BDGlue from the discussion pertaining to configuring a
source, whether GoldenGate at the present time, or some other source to be implemented in the
future.

GoldenGate as a Source for BDGlue
When linked with BDGlue, the GoldenGate Java Adapter becomes a fully functional GoldenGate Adapter
that will deliver database operations that have been captured by Oracle GoldenGate from a relational
database source into various target “Big Data” repositories and formats. Target repositories include
HDFS, Hive, HBase, Oracle NoSQL, and others supported by BDGlue.
BDGlue is intended as a starting point for exploring Big Data architectures from the perspective of real-time change data capture (CDC) as provided by GoldenGate. As mentioned in the introduction, Hadoop
and other Big Data technologies are by their very natures constantly evolving and infinitely configurable.
A GoldenGate Java Adapter is referred to as a “Custom Handler” in the GoldenGate Java Adapter
documentation. This “custom handler” integration with BDGlue is developed using Oracle GoldenGate's
Java API.
A custom handler is deployed as an integral part of an Oracle GoldenGate PUMP process. The PUMP
and the custom handler are configured through a PUMP parameter file and the adapter's properties file.
We will discuss the various properties files in more detail later in this document.
The PUMP process executes the adapter in its address space. The PUMP reads the trail file created by
the Oracle GoldenGate EXTRACT process and passes the transactions into the adapter. Based on the
configuration in the properties file, the adapter will write the transactions in one of several formats.
Please refer to the Oracle GoldenGate Adapters Administrator’s Guide for Java (which can be found on
http://docs.oracle.com) for details about the architecture and developing a custom adapter.

[Figure: BDGlue adapter architecture. The source database feeds the GoldenGate capture (EXTRACT)
process, which writes trail files. A PUMP process, driven by its parameter file, reads the trail files and
runs the BDGlue adapter in its address space; the adapter, configured by its properties files, delivers
data over RPC to Big Data targets such as Flume, HDFS, Hive, NoSQL, HBase, etc.]

Configuring GoldenGate for BDGlue
There are three basic steps to getting GoldenGate properly configured to deliver data to BDGlue:

• Configure GoldenGate to capture the desired tables from the source database and write them to
  a trail file.
• Execute the "defgen" command to generate a sourcedefs file that defines the structure of the
  captured tables to the pump and Java Adapter.
• Configure the PUMP itself to reference the trail file, the sourcedefs file, and execute the Java
  Adapter.

Configure the GoldenGate EXTRACT
This User Guide makes no attempt to explain details of configuring GoldenGate itself. Please refer to the
GoldenGate documentation for that information.
Simplistically speaking, however, a GoldenGate EXTRACT process has a parameter file that tells
GoldenGate how to log into the source database to obtain table metadata, and what tables it should be
concerned about capturing.
Here is a very basic example of a parameter file for connecting to a MySQL source database. The most
important things to note are the tables we care about and the fact that there is nothing specific to
configuration of the Java Adapter found there.


EXTRACT erdbms
DBOPTIONS HOST localhost, CONNECTIONPORT 3306
SOURCEDB bdgluedemo, USERID root, PASSWORD welcome1
EXTTRAIL ./dirdat/tc
GETUPDATEBEFORES
NOCOMPRESSDELETES
TRANLOGOPTIONS ALTLOGDEST /var/lib/mysql/log/bigdatalite-bin.index
TABLE bdgluedemo.MYCUSTOMER;
TABLE bdgluedemo.CUST_INFO;
TABLE bdgluedemo.TCUSTORD;

Please do make note of the parameters “GETUPDATEBEFORES” and “NOCOMPRESSDELETES”. If you
think about it, in most cases it wouldn’t make sense to propagate a partial record downstream in the
event of an update or a delete operation in the source database when dealing with Big Data targets.
These parameters ensure that all columns are propagated downstream even if they are unchanged
during an update operation on the source, and that all columns are propagated along with the key in the
case of a delete.
Configure the GoldenGate PUMP
Unlike the EXTRACT, there are things specific to the Java Adapter found in the parameter file for the
PUMP. This is because the Java Adapter is invoked by and runs as a part of the PUMP.
What follows is a simple PUMP parameter file. There are several things to note there:

• The CUSEREXIT statement, which causes the PUMP to invoke code to run the Java Adapter we
  are providing. Note also on this statement the parameter "INCLUDEUPDATEBEFORES". This
  ensures that the PUMP passes the "before image" of columns downstream along with the
  captured data.
• The actual tables that this PUMP will be capturing. Simplistically, this might be the same as the
  tables specified in the SOURCEDEFS file and in the EXTRACT, but in more complicated
  environments it is possible that we might configure multiple PUMPs, each handling a subset of
  the tables we are capturing.

extract ggjavaue
CUSEREXIT ./libggjava_ue.so CUSEREXIT PASSTHRU INCLUDEUPDATEBEFORES
TABLE bdgluedemo.MYCUSTOMER;
TABLE bdgluedemo.CUST_INFO;
TABLE bdgluedemo.TCUSTORD;

The Java Adapter itself has a properties file that contains the configuration information the Java Adapter
needs to get going. Most of the information is fairly generic to the Java Adapter itself and how it
executes. The properties file resides in the GoldenGate "dirprm" directory along with the parameters for
the various GoldenGate processes. Note that the name of the properties file is based on the name of the
GoldenGate process it is associated with. In this case, the pump process is an instance of the Java
Adapter called "ggjavaue". The parameter file for the process is called "ggjavaue.prm", and the
properties file shown below would be called "ggjavaue.properties".
There are a number of things to make specific note of:

• -Dbdglue.properties=./gghadoop/bdglue.properties (highlighted below). This defines a Java
  "system property" that the GoldenGate BDGlue "source" looks for so that it can locate a
  properties file that is specific to what it needs to configure itself to run. If the system property is
  not defined, BDGlue will look for a file called bdglue.properties somewhere in a directory
  pointed to by the Java classpath. If the system property is used, calling the properties file
  "bdglue.properties" is not strictly required.
• gg.handlerlist=gghadoop gives a name to the handler, which is then used to identify the
  properties to pass into it.
• gg.handler.gghadoop.type=com.oracle.gghadoop.GG12Handler identifies the class that is the
  entry point into BDGlue. [Note: as of this writing, the GG12Handler supports the GoldenGate
  12.2 release of GoldenGate for Big Data.]
• gg.handler.gghadoop.mode=op sets the Java Adapter to "operation mode" (rather than
  transaction mode), which is most appropriate for Big Data scenarios. Since all data in the trail file
  has been committed, this "eager" approach is far more efficient.
• gg.classpath=./gghadoop/lib/* points to a directory containing all of the Java dependencies for
  compiling and running BDGlue.

#Adapter Logging parameters.
#log.logname=ggjavaue
#log.tofile=true
log.level=INFO

#Adapter Check pointing parameters
goldengate.userexit.chkptprefix=GGHCHKP_
goldengate.userexit.nochkpt=true
# Java Adapter Properties
goldengate.userexit.writers=javawriter
goldengate.userexit.utf8mode=true
# NOTE: bootoptions are all placed on a single line
javawriter.bootoptions= -Xms64m -Xmx512M
-Dlog4j.configuration=ggjavaue-log4j.properties
-Dbdglue.properties=./gghadoop/bdglue.properties
-Djava.class.path=./gghadoop:./ggjava/ggjava.jar
#
#Properties for reporting statistics
# Minimum number of {records, seconds} before generating a report
javawriter.stats.time=3600
javawriter.stats.numrecs=5000
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE
#Hadoop Handler.
gg.handlerlist=gghadoop
# the GG11Handler handler supports GoldenGate versions 11.2 and 12.1.2
gg.handler.gghadoop.type=com.oracle.gghadoop.GG12Handler
gg.handler.gghadoop.mode=op
# all dependent jar files should be placed here
gg.classpath=./gghadoop/lib/*

Here is a sample properties file that provides configuration properties for BDGlue:

# configure BDGlue properties
bdglue.encoder.threads = 3
bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.event.header-optype = true
bdglue.event.header-timestamp = true
bdglue.event.header-rowkey = true
bdglue.publisher.class = com.oracle.bdglue.publisher.flume.FlumePublisher
bdglue.publisher.threads = 2
bdglue.publisher.hash = rowkey
bdglue.flume.host = localhost
bdglue.flume.port = 5000
bdglue.flume.rpc.retries = 5
bdglue.flume.rpc.retry-delay = 10

We won't go into details on the contents of this file here as they will vary a fair amount depending on
what the target of BDGlue is: HDFS, Hive, HBase, NoSQL, etc. We'll look at specific configurations in
more detail in subsequent sections of this document.


The “SchemaDef” Utility
SchemaDef is a java-based utility that connects to a source database via JDBC and generates metadata
relevant to the BDGlue encoding process, the target repository, or both.
Running SchemaDef
The SchemaDef utility can be found in the jar file for BDGlue. Here is how you would run it:

DIR=/path/to/jars
CLASSPATH="$DIR/bdglue.jar"
CLASSPATH="$CLASSPATH:$DIR/slf4j-api-1.6.1.jar"
CLASSPATH="$CLASSPATH:$DIR/slf4j-simple-1.7.7.jar"
CLASSPATH="$CLASSPATH:$DIR/commons-io-2.4.jar"
CLASSPATH="$CLASSPATH:$DIR/jackson-core-asl-1.9.13.jar"
CLASSPATH="$CLASSPATH:$DIR/mysql-connector-java-5.1.34-bin.jar"
java -Dschemadef.properties=schemadef.properties -cp $CLASSPATH \
com.oracle.bdglue.utility.schemadef.SchemaDef

Note that the last jar file listed is specific to the database you will be connecting to. In this case, it is
MySQL. Replace this jar file with the jar file that is appropriate for your database type and version.
NOTE: It is up to you to obtain the appropriate JDBC driver for your database platform and version,
and to identify the appropriate connection URL and login credentials.
Details on the various properties that can be configured for SchemaDef can be found in the appendix.
Generating Avro Schemas with SchemaDef
The first step in the process of generating binary-encoded Avro data is to create the Avro schemas for
the tables we want to capture. (Note that it is possible to let BDGlue generate these schemas on the
fly, but be aware that new schema files must be copied into HDFS before we start writing the Avro data
there that corresponds to that version of the schema.) The Avro (de)serialization process requires that
the path to the schema file be included in the Avro event header information, and in turn it puts a copy
of the schema in the metadata of each generated *.avro file. This allows various Big Data technologies
(HDFS, Hive, and more) to always know the structure of the data, and in turn also allows Avro to support
schema evolution on a table.
So, the best way to approach this is to generate the Avro schema files (*.avsc) and then copy them into
place in HDFS before we start passing data through Flume to HDFS.
To facilitate generation of the Avro Schema files, we created a simple Java utility, SchemaDef, that
parses connects to the source database via JDBC and generates the Avro schema files directly from the
table definitions of the tables you specify.


Like most Java applications, the utility is configured via a properties file that looks like this:

# jdbc connection information
schemadef.jdbc.driver = com.mysql.jdbc.Driver
schemadef.jdbc.url = jdbc:mysql://localhost/bdgluedemo
# Oracle JDBC connection info
#schemadef.jdbc.driver = oracle.jdbc.OracleDriver
#schemadef.jdbc.url = jdbc:oracle:thin:@//:/
schemadef.jdbc.username = root
schemadef.jdbc.password = welcome1
# output format: avro, nosql, hive_avro
schemadef.output.format = avro
schemadef.output.path = ./avro
# encode numeric/decimal types as string, double, float
schemadef.numeric-encoding = double
schemadef.set-defaults = true
schemadef.tx-optype = false
schemadef.tx-timestamp = false
schemadef.user-token = false
# whitespace delimited list of schema.table pairs
schemadef.jdbc.tables = bdgluedemo.MYCUSTOMER bdgluedemo.CUST_INFO \
bdgluedemo.TCUSTORD

Details on the properties can be found in the Appendix.
Once the schema files have been created, you then need to copy them from the local file system into
HDFS. The following command will do that for you.

hdfs dfs -copyFromLocal -f ./output/*.avsc /user/flume/gg-data/avro-schema
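
For reference, a generated schema file (*.avsc) is a plain JSON document. The example below is
hand-written for illustration only and is not actual SchemaDef output; the real files will differ in
detail (field list, nullability unions, defaults, and any tx-optype/tx-timestamp columns you have
enabled):

{
  "type" : "record",
  "name" : "CUST_INFO",
  "namespace" : "bdgluedemo",
  "fields" : [
    { "name" : "ID",    "type" : "int" },
    { "name" : "NAME",  "type" : [ "null", "string" ], "default" : null },
    { "name" : "CITY",  "type" : [ "null", "string" ], "default" : null },
    { "name" : "PHONE", "type" : [ "null", "string" ], "default" : null }
  ]
}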

Generating Hive Table Definitions for Use with Avro Schemas
We can generate the Hive table definitions (DDL) needed to read *.avro files written to HDFS by simply
changing the output format from “avro” to “hive_avro” in the schemadef.properties file and rerunning
the utility.

# jdbc connection information
schemadef.jdbc.driver = com.mysql.jdbc.Driver
schemadef.jdbc.url = jdbc:mysql://localhost/bdgluedemo
# Oracle JDBC connection info

#schemadef.jdbc.driver = oracle.jdbc.OracleDriver
#schemadef.jdbc.url = jdbc:oracle:thin:@//:/
schemadef.jdbc.username = root
schemadef.jdbc.password = welcome1
# output format: avro, nosql, hive_avro
schemadef.output.format = hive_avro
schemadef.output.path = ./avro
# encode numeric/decimal types as string, double, float
schemadef.numeric-encoding = double
schemadef.set-defaults = true
schemadef.tx-optype = false
schemadef.tx-timestamp = false
schemadef.user-token = false
schemadef.avro-url = hdfs:///user/flume/gg-data/avro-schema
schemadef.data-location = /user/flume/gg-data
# whitespace delimited list of schema.table pairs
schemadef.jdbc.tables = bdgluedemo.MYCUSTOMER bdgluedemo.CUST_INFO \
bdgluedemo.TCUSTORD

This will generate an hql file for each specified table that looks like this:

CREATE SCHEMA IF NOT EXISTS bdgluedemo;
USE bdgluedemo;
DROP TABLE IF EXISTS CUST_INFO;
CREATE EXTERNAL TABLE CUST_INFO
COMMENT "Table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/gg-data/bdgluedemo.CUST_INFO/'
TBLPROPERTIES ( 'avro.schema.url'='hdfs:///user/flume/gg-data/avro-schema/bdgluedemo.CUST_INFO.avsc' );
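
Assuming the generated file is named bdgluedemo.CUST_INFO.hql (the actual file name may vary), the
table definition can be applied with the Hive command-line client, for example:

hive -f bdgluedemo.CUST_INFO.hql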

Generating Cassandra Table Definitions
SchemaDef can generate DDL for Cassandra that maps to the source tables, just as it does for other
targets. All you need to do is specify cassandra as the output format in the schemadef.properties file.
# jdbc connection information
# mysql

schemadef.jdbc.driver = com.mysql.jdbc.Driver
schemadef.jdbc.url = jdbc:mysql://localhost/bdglue
schemadef.jdbc.username = root
schemadef.jdbc.password = welcome1
#schemadef.jdbc.password = prompt
#
#schemadef.jdbc.driver = oracle.jdbc.OracleDriver
#schemadef.jdbc.url = jdbc:oracle:thin:@//:/
#schemadef.jdbc.url = jdbc:oracle:thin:@//localhost:1521/orcl
#schemadef.jdbc.username = moviedemo
#schemadef.jdbc.password = welcome1

# output format: avro, nosql, hive_avro, cassandra
schemadef.output.format = cassandra
schemadef.output.path = ./ddl
schemadef.cassandra.replication-strategy = { 'class' : 'SimpleStrategy',
'replication_factor' : 1 }
schemadef.set-defaults = false
schemadef.tablename = false
schemadef.tx-optype = true
schemadef.tx-timestamp = false
schemadef.tx-position = false
schemadef.user-token = false
# whitespace delimited list of schema.table pairs
schemadef.jdbc.tables = bdglue.CUST_INFO bdglue.MYCUSTOMER \
                        bdglue.TCUSTORD bdglue.my$Table

This will generate a cql file for each table specified. A generated cql file will look like this:
CREATE KEYSPACE IF NOT EXISTS "bdglue"
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
DROP TABLE IF EXISTS bdglue.CUST_INFO;
CREATE TABLE bdglue.CUST_INFO
(
txoptype text,
ID int,
NAME text,
GENDER text,
CITY text,
PHONE text,
OLD_ID int,
ZIP text,
CUST_DATE text,
PRIMARY KEY (ID)
);
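
Assuming the generated file is named bdglue.CUST_INFO.cql (again, the actual file name may vary), the
DDL can be applied with the cqlsh command-line tool, for example:

cqlsh -f bdglue.CUST_INFO.cql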


BDGlue Developer’s Guide
This section will contain information related to building custom Encoders and Publishers.

Building a Custom Encoder
Information pertaining to building a custom Encoder will go here. Encoders are created by implementing
the interface com.oracle.bdglue.encoder.BDGlueEncoder.

package com.oracle.bdglue.encoder;

import com.oracle.bdglue.meta.transaction.DownstreamOperation;
import java.io.IOException;

public interface BDGlueEncoder {
    /**
     * @param op
     * @return the encoded operation
     * @throws IOException
     */
    public EventData encodeDatabaseOperation(DownstreamOperation op) throws IOException;

    /**
     * @return the EncoderType for this encoder.
     */
    public EncoderType getEncoderType();
}

More specific details will follow in a future revision of this document. You can of course review the
source code for examples.
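
In the meantime, the following minimal skeleton illustrates the overall shape of a custom encoder. The
class and package names are invented for illustration, and the construction of the EventData return
value is deliberately left as a TODO because its constructor is not shown in this document; mirror the
pattern used by one of the built-in encoders (for example, JsonEncoder) in the BDGlue source.

package com.example.bdglue;  // hypothetical package for this sketch

import com.oracle.bdglue.encoder.BDGlueEncoder;
import com.oracle.bdglue.encoder.EncoderType;
import com.oracle.bdglue.encoder.EventData;
import com.oracle.bdglue.meta.transaction.DownstreamOperation;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

/**
 * A skeletal custom encoder. It implements the two methods required by
 * BDGlueEncoder but leaves the EventData construction to be filled in from
 * the patterns found in the built-in encoders.
 */
public class MyCustomEncoder implements BDGlueEncoder {

    @Override
    public EventData encodeDatabaseOperation(DownstreamOperation op) throws IOException {
        // Encode the operation however the downstream target requires. As a
        // placeholder, this simply renders the operation as UTF-8 text.
        byte[] body = String.valueOf(op).getBytes(StandardCharsets.UTF_8);

        // TODO: wrap 'body' (plus any required header metadata) in an EventData
        // instance, following the usage in com.oracle.bdglue.encoder.JsonEncoder.
        return null;
    }

    @Override
    public EncoderType getEncoderType() {
        // TODO: return the EncoderType value that corresponds to this encoding
        // (see com.oracle.bdglue.encoder.EncoderType in the source).
        return null;
    }
}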

Building a Custom Publisher
Information pertaining to building a custom Publisher will go here. Publishers are created by
implementing the interface com.oracle.bdglue.publisher.BDGluePublisher.

package com.oracle.bdglue.publisher;

import com.oracle.bdglue.encoder.EventData;

public interface BDGluePublisher {
    /**
     * Connect to the target.
     */
    void connect();

    /**
     * Format the event and write it to the target.
     *
     * @param threadName the name of the calling thread.
     * @param evt the encoded event.
     */
    void writeEvent(String threadName, EventData evt);

    /**
     * Close connections and clean up as needed.
     */
    void cleanup();
}

More specific details will follow in a future revision of this document. You can of course review the
source code for specific examples of how to implement a BDGlue publisher.
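
In the meantime, the following minimal skeleton illustrates the overall shape of a custom publisher. The
class and package names are invented for illustration, and the console output is only a stand-in; a real
publisher would open a connection to its target in connect(), deliver the event payload in writeEvent(),
and release resources in cleanup(), as the built-in publishers do.

package com.example.bdglue;  // hypothetical package for this sketch

import com.oracle.bdglue.encoder.EventData;
import com.oracle.bdglue.publisher.BDGluePublisher;

/**
 * A skeletal custom publisher that simply logs each event to stdout.
 * Replace the method bodies with calls to your target's client API.
 */
public class MyCustomPublisher implements BDGluePublisher {

    @Override
    public void connect() {
        // Open connections / initialize the target's client library here.
        System.out.println("MyCustomPublisher: connected");
    }

    @Override
    public void writeEvent(String threadName, EventData evt) {
        // Deliver the encoded event to the target. Here we just log it; the
        // payload and header accessors on EventData should be taken from the
        // built-in publishers in the BDGlue source.
        System.out.println(threadName + ": received event " + evt);
    }

    @Override
    public void cleanup() {
        // Flush any buffered work and close connections here.
        System.out.println("MyCustomPublisher: cleaned up");
    }
}

To use a custom publisher, set bdglue.publisher.class to its fully qualified class name and make sure
the compiled class is available on the gg.classpath configured for the adapter.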


Prerequisite Requirements
Be sure you have taken care of the following before attempting to run the GoldenGate Java Adapter:

•  Download, install, and configure GoldenGate to capture from the source database.
   o  Configure GoldenGate to capture all columns (uncompressed updates and deletes). This will add
      some overhead to the capture process and require additional space in the trail files, but it
      eliminates the need to do any downstream reconciliation in the Hadoop environment later.
•  Download, install, and configure the current version of GoldenGate for Big Data (version 12.2.x as
   of this writing).
   o  This obviously requires Java to be installed and available in the GoldenGate environment. If it
      is not present, you will have to download and install it separately. The GoldenGate Java adapter
      requires Java SE 1.7 or later. BDGlue was built with Java SE 1.8, and it is recommended that you
      use that version of Java. Refer to the documentation for the GoldenGate Java adapter and
      GoldenGate for Big Data for more information.
•  Identify the target technology that you will be delivering data to, and ensure that the latest
   version of that technology has been installed and configured. You will likely need to know:
   o  The host name and port number to which BDGlue will connect.
   o  The directory path where GoldenGate will write data if delivering to HDFS.
   o  The directory path where the Avro schema files will be placed in the HDFS environment if you
      will be configuring for Avro serialization.


Appendix
bdglue.properties
The following table lists the properties that can be specified in the bdglue.properties file.
bdglue.encoder.threads  (Required: No; Type: Integer; Default: 2)
    The number of encoder threads to run in parallel.

bdglue.encoder.class  (Required: Yes; Type: String; Default: com.oracle.bdglue.encoder.JsonEncoder)
    The fully qualified class name (FQCN) of the class that will be called to encode the data. These Encoders, and any that are custom built, implement the interface com.oracle.bdglue.encoder.BDGlueEncoder. Built-in options are:
    • com.oracle.bdglue.encoder.AvroEncoder (encode in an Avro-formatted byte array)
    • com.oracle.bdglue.encoder.AvroGenericRecordEncoder (encode an instance of an Avro GenericRecord)
    • com.oracle.bdglue.encoder.DelimtedTextEncoder (encode in delimited text format)
    • com.oracle.bdglue.encoder.JsonEncoder (encode in JSON format)
    • com.oracle.bdglue.encoder.NullEncoder (does not encode the data. This is used when the publisher will not pass along the data as encoded, and instead will apply the data to the target "column-by-column". Example targets that approach things this way include HBase, Oracle NoSQL Table API, Cassandra, and others.)

bdglue.encoder.delimiter  (Required: No; Type: Integer; Default: 001)
    Default is ^A (001). Enter the numeric representation of the desired character (i.e. a semicolon is 073 in octal, 59 in decimal).

bdglue.encoder.tx-optype  (Required: No; Type: Boolean; Default: true)
    Include the transaction operation type in a column in the encoded data. Note that this configuration must match the corresponding property in the schemadef.properties file.

bdglue.encoder.tx-optype-name  (Required: No; Type: String; Default: txoptype)
    The name of the column to populate the operation type value in. Note that this configuration must match the corresponding property in the schemadef.properties file.

bdglue.encoder.tx-timestamp  (Required: No; Type: Boolean; Default: true)
    Include the transaction timestamp in a column in the encoded data. Note that this configuration must match the corresponding property in the schemadef.properties file.

bdglue.encoder.tx-timestamp-name  (Required: No; Type: String; Default: txtimestamp)
    The name of the column to populate the transaction timestamp value in. Note that this configuration must match the corresponding property in the schemadef.properties file.

bdglue.encoder.tx-position  (Required: No; Type: Boolean; Default: true)
    Include information pertaining to the position of this operation in the transaction flow. This is used to allow sorting of operations when they are occurring more frequently than the granularity of the tx-timestamp.

bdglue.encoder.tx-position-name  (Required: No; Type: String; Default: txposition)
    The name of the column to populate the transaction position value in. Note that this configuration must match the corresponding property in the schemadef.properties file.

bdglue.encoder.user-token  (Required: No; Type: Boolean; Default: true)
    Populate a field that will contain a comma-delimited list of any user tokens that accompany the record in the form of "token1=value, token2=value, ...". This property must be the same as the corresponding property found for schemadef.

bdglue.encoder.user-token-name  (Required: No; Type: String; Default: usertokens)
    The name of the field that will contain the list of user-defined tokens. This property must be the same as the corresponding property found for schemadef.

bdglue.encoder.tablename  (Required: No; Type: Boolean; Default: false)
    Populate a field with the name of the source table. This will be the "long" table name in schema.table format.

bdglue.encoder.tablename-col  (Required: No; Type: String; Default: tablename)
    The name of the field to populate with the name of the source table.

bdglue.encoder.txid  (Required: No; Type: Boolean; Default: false)
    Populate a field with a transaction identifier.

bdglue.encoder.txid-col  (Required: No; Type: String; Default: txid)
    The name of the field to populate with the transaction identifier.

bdglue.encoder.replace-newline  (Required: No; Type: Boolean; Default: false)
    Replace newline characters found in string fields with another character. This is needed because newlines can cause problems in some downstream targets.

bdglue.encoder.newline-char  (Required: No; Type: String; Default: " " (a space))
    The character to substitute for newlines in string fields. The default is " " (a space). Override with another character if needed.

bdglue.encoder.json.text-only  (Required: No; Type: Boolean; Default: true)
    Whether or not to represent all column values as quoted text strings. When 'true', a numeric field would be represented as "ID":"789". When 'false', that same field would be represented as "ID":789 (no quotes around the value), which allows the downstream JSON parser to know to parse this as a number.

bdglue.encoder.include-befores  (Required: No; Type: Boolean; Default: false)
    Include the before-image representation of all columns when encoding an operation. This option is only supported for JSON encoding at this time and will be ignored by other encoders.

bdglue.event.header-optype  (Required: No; Type: Boolean; Default: true)
    Include the operation type in the Flume event header.

bdglue.event.header-timestamp  (Required: No; Type: Boolean; Default: true)
    Include the transaction timestamp in the Flume event header.

bdglue.event.header-rowkey  (Required: No; Type: Boolean; Default: true)
    Whether or not to include a value for the row's key, as a concatenation of the key columns, in the event header information. HBase and the NoSQL KV API need this. It is also needed if the publisher hash is based on key rather than table name.

bdglue.event.header-longname  (Required: No; Type: Boolean; Default: true)
    Whether or not to include the "long" table name in the header. The long name is normally in the form "schema.tablename". FALSE will cause the "short" name (table name only) to be included. Most targets prefer the long name; HBase and NoSQL prefer the short name.

bdglue.event.header-columnfamily  (Required: No; Type: Boolean; Default: true)
    Whether or not to include a "columnFamily" value in the header. This is needed for HBase.

bdglue.event.header-avropath  (Required: No; Type: Boolean; Default: false)
    Whether or not to include the path to the Avro schema file in the header. This is needed for Avro encoding where Avro-formatted files are created in HDFS, including those that will be leveraged by Hive.

bdglue.event.avro-hdfs-schema-path  (Required: No; Type: String; Default: hdfs:///user/flume/gg-data/avro-schema/)
    The URI in HDFS where Avro schemas can be found. This information is passed along as the header-avropath and is required by Flume when writing Avro-formatted files to HDFS.

bdglue.event.generate-avro-schema  (Required: No; Type: Boolean; Default: false)
    Whether or not to generate the Avro schema on the fly. This is really intended for testing and should likely always be false. It might be useful at some point in the future to support Avro schema evolution. Note that the current built-in schema generation capabilities are not on par with those in SchemaDef.

bdglue.event.avro-namespace  (Required: No; Type: String; Default: default)
    The namespace to use in Avro schemas if the actual table schema name is not present. The table schema name will override.

bdglue.event.avro-schema-path  (Required: No; Type: String; Default: ./gghadoop/avro)
    The path on local disk where we can find the Avro schemas and/or where they will be written if we were to generate them on the fly.

bdglue.publisher.class  (Required: Yes; Type: String; Default: com.oracle.bdglue.publisher.console.ConsolePublisher)
    The fully qualified class name (FQCN) of the class that will be called to publish the data. These Publishers, and any that are custom built, implement the interface com.oracle.bdglue.publisher.BDGluePublisher. Built-in options are:
    • com.oracle.bdglue.publisher.console.ConsolePublisher (writes the encoded data to the console. Useful for smoke testing upstream configurations before worrying about actually delivering data to a target. JSON encoding is perhaps most useful for this.)
    • com.oracle.bdglue.publisher.flume.FlumePublisher (delivers encoded data to Flume.)
    • com.oracle.bdglue.publisher.hbase.HBasePublisher (delivers data to HBase. The NullEncoder should be used for this publisher.)
    • com.oracle.bdglue.publisher.nosql.NoSQLPublisher (delivers to Oracle NoSQL. Use the AvroEncoder for the KV API, and the NullEncoder for the Table API.)
    • com.oracle.bdglue.publisher.kafka.KafkaPublisher (delivers to Kafka. The AvroEncoder and JsonEncoder are perhaps most useful for this publisher. Note: this publisher uses an older Kafka API and is included for reasons of compatibility.)
    • com.oracle.bdglue.publisher.kafka.KafkaRegistryPublisher (delivers to Kafka using the newer Kafka API. This publisher is also compatible with the Confluent "schema registry", although interfacing with the registry is not strictly required to use this publisher.)
    • com.oracle.bdglue.publisher.cassandra.CassandraPublisher (delivers data to Cassandra. The NullEncoder should be used for this publisher.)

bdglue.publisher.threads  (Required: No; Type: Integer; Default: 2)
    The number of publishers to run in parallel.

bdglue.publisher.hash  (Required: No; Type: String; Default: rowkey)
    Select the publisher thread to pass an encoded event to based on a hash of either the table name ("table") or the row key ("rowkey"). This ensures that changes made to the same row are always handled by the same publisher to avoid any sort of race condition.

bdglue.nosql.host  (Required: No; Type: String; Default: localhost)
    The hostname that we will connect to for NoSQL.

bdglue.nosql.port  (Required: No; Type: String; Default: 5000)
    The port number where the NoSQL KVStore is listening.

bdglue.nosql.kvstore  (Required: No; Type: String; Default: kvstore)
    The name of the NoSQL KVStore to connect to.

bdglue.nosql.durability  (Required: No; Type: String; Default: WRITE_NO_SYNC)
    The NoSQL durability model for these transactions. Options are: SYNC, WRITE_NO_SYNC, NO_SYNC.

bdglue.nosql.api  (Required: No; Type: String; Default: kv_api)
    Specify whether to use the "kv_api" or "table_api" when writing to Oracle NoSQL.

bdglue.kafka.topic  (Required: No; Type: String; Default: goldengate)
    The name of the Kafka topic that GoldenGate will publish to.

bdglue.kafka.batchSize  (Required: No; Type: Integer; Default: 100)
    The number of Kafka events to queue before publishing. The default value should be reasonable for most scenarios, but should be decreased for low-volume situations, and perhaps made larger in extremely high-volume situations. This property only applies to the KafkaPublisher, as batching is handled by that publisher directly. Use bdglue.kafka.producer.batch.size for the KafkaRegistryPublisher, as batching is handled by the actual Kafka producer logic in that case.

bdglue.kafka.flushFreq  (Required: No; Type: Integer; Default: 500)
    The number of milliseconds to allow events to queue before forcing them to be written to Kafka in the event that 'batchSize' has not been reached.

bdglue.kafka.serializer.class  (Required: No; Type: String; Default: kafka.serializer.DefaultEncoder)
    The serializer to use when writing the event to Kafka. The DefaultEncoder passes the encoded data received verbatim to Kafka in a byte-for-byte fashion. It is not likely that there will be a need to override the default value.

bdglue.kafka.key.serializer.class  (Required: No; Type: String; Default: kafka.serializer.StringEncoder)
    The serializer to use when encoding the topic "key". It is not likely that the default value will need to be overridden.

bdglue.kafka.metadata.broker.list  (Required: Yes; Type: String; Default: localhost:9092)
    A comma-separated list of host:port pairs of Kafka brokers that may be published to. Note that this is for the Kafka broker, not for Zookeeper.

bdglue.kafka.metadata.helper.class  (Required: No; Type: String; Default: com.oracle.bdglue.publisher.kafka.KafkaMessageDefaultMeta)
    A simple class that implements the KafkaMessageHelper interface. Its purpose is to allow customization of message "topic" and message "key" behavior. Current built-in options:
    • com.oracle.bdglue.publisher.kafka.KafkaMessageDefaultMeta (writes all messages to a single topic specified in the properties file; the key is the table name.)
    • com.oracle.bdglue.publisher.kafka.KafkaMessageTableKey (publishes each table to a separate topic, where the topic name is the table name, and the message key is a concatenated version of the key columns from the table in this format: /key1/key2/...)

bdglue.kafka.request.required.acks  (Required: No; Type: Integer; Default: 1)
    0: write and assume delivery; don't wait for a response (potentially unsafe). 1: write and wait for the event to be accepted by at least one broker before continuing. -1: write and wait for the event to be accepted by all brokers before continuing.

bdglue.cassandra.node  (Required: No; Type: String; Default: localhost)
    The Cassandra node to connect to.

bdglue.cassandra.batch-size  (Required: No; Type: Integer; Default: 5)
    The number of operations to group together with each call to Cassandra.

bdglue.cassandra.flush-frequency  (Required: No; Type: Integer; Default: 500)
    Force writing of any queued operations that haven't been flushed due to batch-size after this many milliseconds.

bdglue.cassandra.insert-only  (Required: No; Type: Boolean; Default: false)
    Convert update and delete operations to an insert. Note that the default key generated by SchemaDef may need to be changed to include the operation type and timestamp if this is set to 'true'.

bdglue.flume.host  (Required: Yes; Type: String; Default: localhost)
    The name of the target host that we will connect to.

bdglue.flume.port  (Required: Yes; Type: Integer; Default: 5000)
    The port number on the host where the target is listening.

bdglue.flume.rpc.retries  (Required: No; Type: Integer; Default: 5)
    The number of times to retry a connection after encountering an issue before aborting.

bdglue.flume.rpc.retry-delay  (Required: No; Type: Integer; Default: 10)
    The number of seconds to delay after each attempt to connect before trying again.

bdglue.flume.rpc.type  (Required: No; Type: String; Default: avro-rpc)
    Currently only pertinent for Flume. Defines the type of event RPC protocol being used for communication. Options are avro-rpc and thrift-rpc. Avro is most common. Do not confuse Avro RPC communication with Avro encoding of data: same name, entirely different things, and one does not require the other.

schemadef.replace.invalid_char  (Required: No; Type: String; Default: _ (underscore))
    Replace non-alphanumeric "special" characters that are supported in table and column names in some databases with the specified character or characters. This is needed because most of the big data targets are much more limited in terms of the characters that are supported. Note that this property begins with schemadef and should be identical to the property specified to the schemadef utility.

schemadef.replace.invalid_first_char  (Required: No; Type: String; Default: x (lower case x))
    Prepend this string to table and column names that begin with anything other than an alpha character. This is needed because of limitations on the big data side of things. Set to a null value to avoid this functionality. Note that this property begins with schemadef and should be identical to the property specified to the schemadef utility.

schemadef.replace.regex  (Required: No; Type: String; Default: [^a-zA-Z0-9_\\.])
    This is a regular expression that contains the characters that *are* supported in the target. (Note: the ^ is required, just as in the default.) All characters not in this list will be replaced by the character or characters specified in schemadef.replace.invalid_char. Note that this property begins with schemadef and should be identical to the property specified to the schemadef utility.
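
As an illustration of how these properties combine, here is a minimal sketch of a bdglue.properties
fragment for publishing JSON-encoded events to Kafka via the KafkaPublisher. The topic name and broker
address are placeholders; adjust them for your environment.

bdglue.encoder.class = com.oracle.bdglue.encoder.JsonEncoder
bdglue.encoder.threads = 2
bdglue.publisher.class = com.oracle.bdglue.publisher.kafka.KafkaPublisher
bdglue.publisher.threads = 2
bdglue.kafka.topic = goldengate
bdglue.kafka.metadata.broker.list = localhost:9092
bdglue.kafka.batchSize = 100
bdglue.kafka.flushFreq = 500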

schemadef.properties
The following table lists the properties that can be specified in the schemadef.properties file.

schemadef.jdbc.driver  (Required: Yes; Type: String; Default: com.mysql.jdbc.Driver)
    The fully qualified class name of the JDBC driver.

schemadef.jdbc.url  (Required: Yes; Type: String; Default: jdbc:mysql://localhost/bdglue)
    The connection URL for JDBC.

schemadef.jdbc.username  (Required: Yes; Type: String; Default: root)
    The database user that we will connect as.

schemadef.jdbc.password  (Required: Yes; Type: String; Default: prompt)
    The database user's password. If this property is set to the value "prompt", SchemaDef will prompt the user to enter the password from the command line.

schemadef.jdbc.tables  (Required: Yes; Type: String; Default: N/A)
    A whitespace-delimited list of schema.table pairs that we should generate schema/DDL information for. More than one table may be specified per line, and a line may be continued by placing a backslash ('\') as the last character of the current line in the file.

schemadef.output.format  (Required: No; Type: String; Default: avro)
    The type of metadata / DDL to generate. Options are: avro, hive_avro, nosql, and cassandra.

schemadef.output.path  (Required: No; Type: String; Default: ./output)
    The directory where we should store the generated files.

schemadef.numeric-encoding  (Required: No; Type: String; Default: double)
    How to encode numeric, non-integer fields (decimal, numeric types) in the schema: string, double, float.

schemadef.set-defaults  (Required: No; Type: Boolean; Default: true)
    Whether or not to set default values in the generated Avro schema.

schemadef.tx-optype  (Required: No; Type: Boolean; Default: true)
    Include the transaction operation type in a column in the encoded data. Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.tx-optype-name  (Required: No; Type: String; Default: txoptype)
    The name of the column to populate the operation type value in. Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.tx-timestamp  (Required: No; Type: Boolean; Default: true)
    Include the transaction timestamp in a column in the encoded data. Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.tx-timestamp-name  (Required: No; Type: String; Default: txtimestamp)
    The name of the column to populate the transaction timestamp value in. Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.tx-position  (Required: No; Type: Boolean; Default: true)
    Include details of the operation's position in the replication flow in a column in the encoded data, to allow sorting when transactions are occurring more rapidly than the granularity of the transaction timestamp can support. Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.tx-position-name  (Required: No; Type: String; Default: txposition)
    The name of the column to populate the transaction position information in. Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.user-token  (Required: No; Type: Boolean; Default: true)
    Populate a field that will contain a comma-delimited list of any user tokens that accompany the record in the form of "token1=value, token2=value, ...". Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.user-token-name  (Required: No; Type: String; Default: usertokens)
    The name of the field that will contain the list of user-defined tokens. Note that this configuration must match the corresponding property in the bdglue.properties file.

schemadef.tablename  (Required: No; Type: Boolean; Default: false)
    Populate a field that will contain the long version of the table name (schema.table format).

schemadef.tablename-col  (Required: No; Type: String; Default: tablename)
    The name of the field that will contain the table name.

schemadef.txid  (Required: No; Type: Boolean; Default: false)
    Populate a field that will contain a transaction identifier.

schemadef.txid-col  (Required: No; Type: String; Default: txid)
    The name of the field that will contain the transaction identifier.

schemadef.avro-url  (Required: No; Type: String; Default: /path/to/avro/schema)
    Tells the Hive Avro SerDe where to find the Avro schema for this table. Required for hive_avro schema generation.

schemadef.data-location  (Required: No; Type: String; Default: /path/to/avro/data)
    Tells the Hive Avro SerDe where to find the Avro-encoded data files for this table. Required for hive_avro schema generation.

schemadef.cassandra.replication-strategy  (Required: No; Type: String; Default: { 'class' : 'SimpleStrategy', 'replication_factor' : 1 })
    The replication strategy for the table. Note that this string is passed verbatim into the CQL that SchemaDef generates, so it must be syntactically correct.

schemadef.replace.invalid_char  (Required: No; Type: String; Default: _ (underscore))
    Replace non-alphanumeric "special" characters that are supported in table and column names in some databases with the specified character or characters. This is needed because most of the big data targets are much more limited in terms of the characters that are supported. This value must be the same as the value specified for the equivalent property in bdglue.properties.

schemadef.replace.invalid_first_char  (Required: No; Type: String; Default: x)
    Prepend this string to table and column names that begin with anything other than an alpha character. This is needed because of limitations on the big data side of things. Set to a null value to avoid this functionality. This value must be the same as the value specified for the equivalent property in bdglue.properties.

schemadef.replace.regex  (Required: No; Type: String; Default: [^a-zA-Z0-9_\\.])
    This is a regular expression that contains the characters that *are* supported in the target. (Note: the ^ is required, just as in the default.) All characters not in this list will be replaced by the character or characters specified in schemadef.replace.invalid_char. This value must be the same as the value specified for the equivalent property in bdglue.properties.

Helpful Reference Sources

•  Flume Developer Guide: https://flume.apache.org/FlumeDeveloperGuide.html
•  Flume User Guide: https://flume.apache.org/FlumeUserGuide.html
•  Hoffman, Steve. Apache Flume: Distributed Log Collection for Hadoop. N.p.: Packt, 2013.
•  White, Tom. Hadoop: The Definitive Guide (fourth edition). Beijing: O'Reilly, 2015.
•  Oracle NoSQL Documentation: http://docs.oracle.com/cd/NOSQL/html/index.html

License and Notice Files
LICENSE
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

===============
async-1.4.0.jar and subsequent versions
===============
Copyright (c) 2010-2012 The SUAsync Authors. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
- Neither the name of the StumbleUpon nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

====================
asynchbase-1.5.0.jar and subsequent versions
====================
Copyright (C) 2010-2012 The Async HBase Authors. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
- Neither the name of the StumbleUpon nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

===================
slf4j-api-1.6.1.jar and subsequent versions
===================
Copyright (c) 2004-2013 QOS.ch
All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
NOTICE
Oracle BDGlue
Copyright (c) 2015 Oracle and/or its affiliates. All rights reserved.
This product includes software developed at
Oracle (http://www.oracle.com/)

======================
kafka_2.10-0.8.2.1.jar and subsequent versions
======================
Kafka
This product includes software developed by the Apache Software Foundation
(http://www.apache.org/).
This product includes jopt-simple, a library for parsing command line options (http://joptsimple.sourceforge.net/).
This product includes junit, developed by junit.org.
This product includes zkclient, developed by Stefan Groschupf, http://github.com/sgroschupf/zkclient
This produce includes joda-time, developed by joda.org (joda-time.sourceforge.net)
This product includes the scala runtime and compiler (www.scala-lang.org) developed by EPFL, which
includes the following license:
This product includes zookeeper, a Hadoop sub-project (http://hadoop.apache.org/zookeeper)
This product includes log4j, an Apache project (http://logging.apache.org/log4j)
This product includes easymock, developed by easymock.org (http://easymock.org)
This product includes objenesis, developed by Joe Walnes, Henri Tremblay, Leonardo Mesquita
(http://code.google.com/p/objenesis)
This product includes cglib, developed by sourceforge.net (http://cglib.sourceforge.net)
This product includes asm, developed by OW2 consortium (http://asm.ow2.org)
-----------------------------------------------------------------------
SCALA LICENSE
Copyright (c) 2002-2010 EPFL, Lausanne, unless otherwise specified.
All rights reserved.
This software was developed by the Programming Methods Laboratory of the
Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland.
Permission to use, copy, modify, and distribute this software in source
or binary form for any purpose with or without fee is hereby granted,
provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the name of the EPFL nor the names of its contributors
may be used to endorse or promote products derived from this
software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
-----------------------------------------------------------------------

==============
avro-1.7.7.jar and subsequent versions
==============
Apache Avro
Copyright 2010 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

===============================================================
flume-ng-sdk-1.5.0.1.jar, flume-ng-core-1.5.0.1.jar,
flume-ng-configuration-1.5.0.1.jar, flume-hdfs-sink-1.5.0.1.jar
and subsequent versions of each
===============================================================
Apache Flume
Copyright 2012 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
Portions of this software were developed at
Cloudera, Inc. (http://www.cloudera.com/).

=======================
hadoop-common-2.5.1.jar and subsequent versions
=======================
This product includes software developed by The Apache Software
Foundation (http://www.apache.org/).

=================================
hbase-common-0.98.6.1-hadoop2.jar and subsequent versions
=================================
Apache HBase
Copyright 2007-2015 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
-This product incorporates portions of the 'Hadoop' project
Copyright 2007-2009 The Apache Software Foundation
Licensed under the Apache License v2.0
-Our Orca logo we got here: http://www.vectorfree.com/jumping-orca
It is licensed Creative Commons Attribution 3.0.
See https://creativecommons.org/licenses/by/3.0/us/
We changed the logo by stripping the colored background, inverting
it and then rotating it some.
Later we found that vectorfree.com image is not properly licensed.
The original is owned by vectorportal.com. The original was
relicensed so we could use it as Creative Commons Attribution 3.0.
The license is bundled with the download available here:
http://www.vectorportal.com/subcategory/205/KILLER-WHALE-FREEVECTOR.eps/ifile/9136/detailtest.asp
-This product includes portions of the Bootstrap project v3.0.0
Copyright 2013 Twitter, Inc.
Licensed under the Apache License v2.0
This product uses the Glyphicons Halflings icon set.
http://glyphicons.com/
Copyright Jan Kovařík
Licensed under the Apache License v2.0 as a part of the Bootstrap project.
-This product includes portions of the Guava project v14, specifically
'hbase-common/src/main/java/org/apache/hadoop/hbase/io/LimitInputStream.java'
Copyright (C) 2007 The Guava Authors
Licensed under the Apache License, Version 2.0

===========================
jackson-core-asl-1.9.13.jar and subsequent versions
===========================
# Jackson JSON processor
Jackson is a high-performance, Free/Open Source JSON processing library.
It was originally written by Tatu Saloranta (tatu.saloranta@iki.fi), and has
been in development since 2007.
It is currently developed by a community of developers, as well as supported
commercially by FasterXML.com.
## Licensing
Jackson core and extension components may licensed under different licenses.
To find the details that apply to this artifact see the accompanying LICENSE file.
For more information, including possible other licensing options, contact
FasterXML.com (http://fasterxml.com).
## Credits
A list of contributors may be found from CREDITS file, which is included
in some artifacts (usually source distributions); but is always available
from the source code management (SCM) system project uses.
