Hortonworks Data Platform
Apache Flume Component Guide
(December 15, 2017)

docs.hortonworks.com

Hortonworks Data Platform: Apache Flume Component Guide
Copyright © 2012-2017 Hortonworks, Inc. Some rights reserved.
The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open
source platform for storing, processing and analyzing large volumes of data. It is designed to deal with
data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks
Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop
Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the
major contributor of code and patches to many of these projects. These projects have been integrated and
tested as part of the Hortonworks Data Platform release process and installation and configuration tools
have also been included.
Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our
code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and
completely open source. We sell only expert technical support, training and partner-enablement services.
All of our technology is, and will remain, free and open source.
Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For
more information on Hortonworks services, please visit either the Support or Training page. Feel free to
Contact Us directly to discuss your specific needs.
Except where otherwise noted, this document is licensed under
Creative Commons Attribution ShareAlike 4.0 License.
http://creativecommons.org/licenses/by-sa/4.0/legalcode

Table of Contents
1. Feature Updates
2. Introduction
   2.1. Flume Concepts
   2.2. HDP and Flume
   2.3. A Simple Example
   2.4. Flume 1.5.2 Documentation
3. Using Apache Flume for Streaming
   3.1. Kafka Sink
   3.2. Hive Sink

List of Tables
1.1. Features by HDP Version

1. Feature Updates
Apache Flume version 1.5.2 includes cumulative 1.6 features and selected 1.7 features. The
table below indicates the features added to Flume 1.5.2 with each release of Hortonworks
Data Platform (HDP).

Note
As of HDP 2.6.0, Flume is deprecated but still supported. It will be removed and
no longer supported as of HDP 3.0.0.

Table 1.1. Features by HDP Version

HDP 2.6.2
• New Kafka Source: Provides support for the new Kafka Consumer API, including reading
  multiple Kafka topics.
• New Kafka Sink: Provides support for the new Kafka Producer API.
• Preserve event headers: Preserves event headers when using Kafka Sink and Kafka Source
  together. Avro datums can be read from and written to Kafka.
• Log privacy utility: Helps determine whether or not logging potentially sensitive
  information is allowed.

HDP 2.5.0
• Kafka Channel: Uses a single Kafka topic. Provides greater reliability and better
  performance.

HDP 2.4.0
• TailDir Source: Greater data reliability, even with rotating file names. Can restart tailing
  at the point where Flume stopped, while continuing data ingest.

HDP 2.3.0
• Kafka Source: Reads messages from a Kafka topic. Multiple Kafka sources can run in
  parallel, each configured to read a unique set of partitions for the topic.
• Kafka Sink: Publishes data to a Kafka topic. Supports pull-based processing from various
  Flume sources.
• Hive Sink: Streams events containing delimited text or JSON data directly into a Hive
  table or partition. Provided as a preview feature; not recommended for use in production.

2. Introduction
Flume is a top-level project at the Apache Software Foundation. While it can function as
a general-purpose event queue manager, in the context of Hadoop it is most often used
as a log aggregator, collecting log data from many diverse sources and moving it to a
centralized data store.
The following information is available in this chapter:
• Section 2.1, “Flume Concepts”
• Section 2.2, “HDP and Flume”
• Section 2.3, “A Simple Example”
• Section 2.4, “Flume 1.5.2 Documentation”

2.1. Flume Concepts
Flume Components
A Flume data flow is made up of five main components: Events, Sources, Channels, Sinks,
and Agents:
Events: An event is the basic unit of data that is moved using Flume. It is similar to a
message in JMS and is generally small. It is made up of headers and a byte-array body.

Sources: The source receives the event from some external entity and stores it in a channel.
The source must understand the type of event that is sent to it: an Avro event requires an
Avro source.

Channels: A channel is an internal passive store with certain specific characteristics. An
in-memory channel, for example, can move events very quickly, but does not provide
persistence. A file-based channel provides persistence. A source stores an event in the
channel, where it stays until it is consumed by a sink. This temporary storage lets the
source and sink run asynchronously.

Sinks: The sink removes the event from the channel and forwards it either to a destination,
such as HDFS, or to another agent or data flow. The sink must output an event that is
appropriate to the destination.

Agents: An agent is the container for a Flume data flow. It is any physical JVM running
Flume. An agent must contain at least one source, channel, and sink, but the same agent
can run multiple sources, sinks, and channels. A particular data flow path is set up through
the configuration process.
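
As a rough sketch of how these components map onto a Flume properties file (the agent
and component names agent1, src1, ch1, and sink1 below are placeholders, and each
component still needs a type and its own settings), the basic wiring looks like this:

# name one source, one channel, and one sink for the agent called agent1
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# bind the source and the sink to the channel
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1

Note that a source lists the channels it writes to (channels, plural), while a sink is bound to
exactly one channel (channel, singular).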

2.2. HDP and Flume
Flume ships with many source, channel, and sink types. The following types have been
thoroughly tested for use with HDP:

Sources
• Exec (basic, restart)
• Syslogtcp
• Syslogudp
• TailDir
• Kafka
Channels
• Memory
• File
• Kafka
Sinks
• HDFS: secure, nonsecure
• HBase
• Kafka
• Hive
See the Apache Flume 1.5.2 documentation for a complete list of all available Flume
components.
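
As an illustration of one of the tested sources above, a minimal TailDir source definition
might look like the following. This sketch assumes the backported source uses the Apache
Flume 1.7 property names (TAILDIR, positionFile, filegroups), and the file paths are
examples only:

a1.sources = r1
a1.sources.r1.type = TAILDIR
# the position file records how far each file has been read, so tailing resumes after a restart
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
# one file group matching rotating application logs
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/myapp/app.*.log
a1.sources.r1.channels = c1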

2.3. A Simple Example
The following snippet shows some of the kinds of properties that can be set using the
properties file. For more detailed information, see the Apache Flume 1.5.2 documentation.
agent.sources = pstream
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.sources.pstream.channels = memoryChannel
agent.sources.pstream.type = exec
agent.sources.pstream.command = tail -f /etc/passwd
agent.sinks = hdfsSink
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path = hdfs://hdp/user/root/flumetest
agent.sinks.hdfsSink.hdfs.fileType = SequenceFile
agent.sinks.hdfsSink.hdfs.writeFormat = Text

The source here is defined as an exec source. The agent runs a given command on startup,
which streams data to stdout, where the source gets it. In this case, the command tails the
/etc/passwd file. The channel is defined as an in-memory channel and the sink is an HDFS
sink.
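
Assuming the properties above are saved to a file such as /etc/flume/conf/example.conf
(the path is only an example), the agent can be started with the flume-ng launcher, naming
the agent defined in the file:

flume-ng agent --conf /etc/flume/conf --conf-file /etc/flume/conf/example.conf --name agent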

2.4. Flume 1.5.2 Documentation
See the complete Apache Flume 1.5.2 documentation for details about using Flume with
Hortonworks Data Platform. The documentation is updated to reflect support for new
features.
The Flume 1.5.2 documentation is also included with the Flume software. You can access
the documentation in the Flume /docs directory.

3. Using Apache Flume for Streaming
Kafka Sink and Hive Sink are integrated with Flume to provide streaming capabilities for
Hive tables and Kafka topics. For more information about Flume, see the Apache Flume
1.5.2 documentation.

3.1. Kafka Sink
This is a Flume Sink implementation that can publish data to a Kafka topic. One of the
objectives is to integrate Flume with Kafka so that pull-based processing systems can
process the data coming through various Flume sources. It currently supports the Kafka
0.8.x series of releases.

The Kafka sink supports the following configuration properties:

• type (no default): Must be set to org.apache.flume.sink.kafka.KafkaSink.
• brokerList (no default): List of brokers the Kafka sink connects to in order to get the list
  of topic partitions. This can be a partial list of brokers, but we recommend at least two
  for HA. The format is a comma-separated list of hostname:port entries.
• topic (default: default-flume-topic): The topic in Kafka to which the messages are
  published. If this parameter is configured, messages are published to this topic. If the
  event header contains a “topic” field, the event is published to that topic, overriding the
  topic configured here.
• batchSize (default: 100): How many messages to process in one batch. Larger batches
  improve throughput while adding latency.
• requiredAcks (default: 1): How many replicas must acknowledge a message before it is
  considered successfully written. Accepted values are 0 (never wait for acknowledgement),
  1 (wait for the leader only), and -1 (wait for all replicas). Set this to -1 to avoid data loss
  in some cases of leader failure.
• Other Kafka producer properties: These properties are used to configure the Kafka
  producer. Any producer property supported by Kafka can be used. The only requirement
  is to prepend the property name with the prefix kafka., for example kafka.producer.type.

Kafka Sink uses the topic and key properties from the FlumeEvent headers to send events
to Kafka. If the topic exists in the headers, the event is sent to that specific topic, overriding
the topic configured for the Sink. If key exists in the headers, the key is used by Kafka to
partition the data between the topic partitions. Events with the same key are sent to the
same partition. If the key is null, events are sent to random partitions.
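
One way to populate such headers is with an interceptor on the source. The sketch below
uses the built-in static interceptor to stamp every event with a fixed key header; the source
name r1 and the key value are illustrative only, and a fixed key routes all events to a single
partition:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = key
a1.sources.r1.interceptors.i1.value = customer-42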
An example configuration of a Kafka sink is given below. Properties starting with the
prefix kafka. are used when instantiating the Kafka producer. The properties that are
passed when creating the Kafka producer are not limited to the properties given in this
example. It is also possible to include your custom properties here and access them inside
the preprocessor through the Flume Context object passed in as a method argument.
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = mytopic
a1.sinks.k1.brokerList = localhost:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1
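
Additional producer settings can be passed through by adding the kafka. prefix; the two
values below are illustrative choices for the 0.8.x producer API, not required settings:

a1.sinks.k1.kafka.producer.type = async
a1.sinks.k1.kafka.compression.codec = snappy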

3.2. Hive Sink
This sink streams events containing delimited text or JSON data directly into a Hive table
or partition. Events are written using Hive transactions. As soon as a set of events are
committed to Hive, they become immediately visible to Hive queries. Partitions to which
Flume will stream can either be pre-created or, optionally, Flume can create them if they
are missing. Fields from incoming event data are mapped to corresponding columns in the
Hive table.

The Hive sink supports the following configuration properties:

• channel (no default)
• type (no default): The component type name; needs to be hive.
• hive.metastore (no default): Hive metastore URI (e.g. thrift://a.b.com:9083).
• hive.database (no default): Hive database name.
• hive.table (no default): Hive table name.
• hive.partition (no default): Comma-separated list of partition values identifying the
  partition to write to. May contain escape sequences. For example, if the table is
  partitioned by (continent: string, country: string, time: string), then
  ‘Asia,India,2014-02-26-01-21’ indicates continent=Asia, country=India,
  time=2014-02-26-01-21.
• hive.txnsPerBatchAsk (default: 100): Hive grants a batch of transactions instead of single
  transactions to streaming clients like Flume. This setting configures the number of desired
  transactions per transaction batch. Data from all transactions in a single batch end up in
  a single file. Flume will write a maximum of batchSize events in each transaction in the
  batch. This setting, in conjunction with batchSize, provides control over the size of each
  file. Note that eventually Hive will transparently compact these files into larger files.
• heartBeatInterval (default: 240): Interval, in seconds, between consecutive heartbeats
  sent to Hive to keep unused transactions from expiring. Set this value to 0 to disable
  heartbeats.
• autoCreatePartitions (default: true): Flume will automatically create the necessary Hive
  partitions to stream to.
• batchSize (default: 15000): Maximum number of events written to Hive in a single Hive
  transaction.
• maxOpenConnections (default: 500): Allow only this number of open connections. If this
  number is exceeded, the least recently used connection is closed.
• callTimeout (default: 10000): Timeout, in milliseconds, for Hive and HDFS I/O operations,
  such as openTxn, write, commit, and abort.
• serializer (no default): The serializer is responsible for parsing out fields from the event
  and mapping them to columns in the Hive table. The choice of serializer depends upon
  the format of the data in the event. Supported serializers: DELIMITED and JSON.
• roundUnit (default: minute): The unit of the round-down value: second, minute, or hour.
• roundValue (default: 1): Rounded down to the highest multiple of this value (in the unit
  configured using roundUnit) that is less than the current time.
• timeZone (default: Local Time): Name of the time zone that should be used for resolving
  the escape sequences in the partition, e.g. America/Los_Angeles.
• useLocalTimeStamp (default: false): Use the local time (instead of the timestamp from the
  event header) while replacing the escape sequences.
The following serializers are provided for the Hive sink:
• JSON: Handles UTF-8 encoded JSON (strict syntax) events and requires no configuration.
  Object names in the JSON are mapped directly to columns with the same name in the
  Hive table. Internally uses org.apache.hive.hcatalog.data.JsonSerDe but is independent
  of the SerDe of the Hive table. This serializer requires HCatalog to be installed.
• DELIMITED: Handles simple delimited textual events. Internally uses LazySimpleSerde but
  is independent of the SerDe of the Hive table. It accepts the following properties:
• serializer.delimiter (default: ,): (Type: string) The field delimiter in the incoming data. To
  use special characters, surround them with double quotes like “\t”.
• serializer.fieldnames (no default): The mapping from input fields to columns in the Hive
  table. Specified as a comma-separated list (no spaces) of Hive table column names,
  identifying the input fields in order of their occurrence. To skip fields, leave the column
  name unspecified. For example, ‘time,,IP,message’ indicates that the 1st, 3rd, and 4th
  fields in the input map to the time, IP, and message columns in the Hive table.
• serializer.serdeSeparator (default: Ctrl-A): (Type: character) Customizes the separator
  used by the underlying SerDe. There can be a gain in efficiency if the fields in
  serializer.fieldnames are in the same order as the table columns, serializer.delimiter is
  the same as serializer.serdeSeparator, and the number of fields in serializer.fieldnames
  is less than or equal to the number of table columns, because the fields in the incoming
  event body then do not need to be reordered to match the order of the table columns.
  Use single quotes for special characters like ‘\t’. Ensure input fields do not contain this
  character. Note: if serializer.delimiter is a single character, preferably set this to the same
  character.

The following escape sequences are supported:

• %{host}: Substitutes the value of the event header named “host”. Arbitrary header
  names are supported.
• %t: Unix time in milliseconds.
• %a: Locale’s short weekday name (Mon, Tue, ...)
• %A: Locale’s full weekday name (Monday, Tuesday, ...)
• %b: Locale’s short month name (Jan, Feb, ...)
• %B: Locale’s long month name (January, February, ...)
• %c: Locale’s date and time (Thu Mar 3 23:05:25 2005)
• %d: Day of month (01)
• %D: Date; same as %m/%d/%y
• %H: Hour (00..23)
• %I: Hour (01..12)
• %j: Day of year (001..366)
• %k: Hour (0..23)
• %m: Month (01..12)
• %M: Minute (00..59)
• %p: Locale’s equivalent of am or pm
• %s: Seconds since 1970-01-01 00:00:00 UTC
• %S: Second (00..59)
• %y: Last two digits of year (00..99)
• %Y: Year (2015)
• %z: +hhmm numeric time zone (for example, -0400)

Example Hive table:
create table weblogs ( id int , msg string )
partitioned by (continent string, country string, time string)
clustered by (id) into 5 buckets
stored as orc;

Example for agent named a1:
a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.channel = c1
a1.sinks.k1.hive.metastore = thrift://127.0.0.1:9083
a1.sinks.k1.hive.database = logsdb
a1.sinks.k1.hive.table = weblogs
a1.sinks.k1.hive.partition = asia,%{country},%y-%m-%d-%H-%M
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames =id,,msg

Tip
For all of the time-related escape sequences, a header with the key
“timestamp” must exist among the headers of the event (unless
useLocalTimeStamp is set to true). One way to add this automatically is to use
the TimestampInterceptor.
The above configuration will round down the timestamp to the last 10th
minute. For example, an event with a timestamp header set to 11:54:34 AM,
June 12, 2012 and a ‘country’ header set to ‘india’ will evaluate to the partition
(continent=’asia’, country=’india’, time=‘2012-06-12-11-50’). The serializer is configured to
accept tab-separated input containing three fields and to skip the second field.
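
A minimal sketch of adding the TimestampInterceptor to a source (assuming a source
named r1 that feeds channel c1, which the example above does not define) looks like this:

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

With this in place, every event carries a timestamp header, so the time-related escape
sequences in hive.partition resolve without setting useLocalTimeStamp to true.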


