The Ultimate Guide to Switching Careers to Big Data
The Ultimate Guide to Switching Careers to Big Data
Upgrading Your Skills for the Big Data Revolution

Jesse Anderson

© 2017 Smoking Hand LLC. All Rights Reserved.
Version 1.0.87fcead

Contents

1 Introduction
    About This Book
    About Big Data
    A Little About Me
    Warnings and Success Stories
    Who Should Read This
    Navigating the Book Chapters
    Conventions Used in This Book

2 The Benefits of Getting Into Big Data
    Pay and Career Advancement
    Interesting Work
    High Demand/Low Supply
    How Can You Improve Your Skills and Be in Demand?
    Industries Using Big Data

3 What Are the Skills Needed on a Data Engineering Team?
    What Is a Data Engineer?
    What Is a Data Engineering Team?
    Qualified Data Engineers
    Data Scientists and Data Science Teams
    Multidisciplinary
    Where Do You Fit In These Teams?
    Data Engineering Team Skills

4 What You Should Know About Big Data
    What Putting in Your Own Lawn Has to Do With Big Data
    Why Is Big Data So Much More Complicated?
    How Long Will It Take to Learn?
    Changes You'll Need to Make
    Not Just Beginner Skills
    Which Technology Should You Learn?
    Hadoop Is Dead
    Future Proofing Your Career
    Is Big Data Going to Last?

5 Switching Careers
    Retooling
    Newly Graduated
    General Advice on Switching
    Programmers and Software Engineers
    Managers
    Data Analysts and Business Intelligence
    DBAs and SQL-focused Positions
    Operations
    What if You're Not on this List?

6 How Can You Do It?
    General Learning
    Learning to Program
    Which Technologies to Learn?
    Can You Do This At Your Current Organization?
    Can You Do This Without a Degree?
    Can You Do This in Your Country?
    How Diverse Is Big Data?

7 How Do You Get a Job?
    No Experience Necessary?
    Where Do You Fit in the Data Engineering Ecosystem?
    Personal Project
    Networking
    Getting a Job Fast(er)

8 What Are You Going to Do?
    Questions to Answer Before Starting to Learn Big Data
    Your Checklist For Starting to Learn Big Data
    Parting Advice
    About the Author

A Appendix: Are Your Programming Skills Ready for Big Data?
    Example Code
    Are your programming skills ready?

CHAPTER 1 Introduction

About This Book

This book is for individuals who want to change careers to Big Data. You want to join a data engineering or Big Data team. You're seeing that Big Data skills are in high demand and you want to become in high demand yourself — but you're lost. You're lost in a sea of weird names and don't know where to start, and you don't want to waste your time. This book is for everyone who wants to join a data engineering team or work with Big Data.
This includes:

• Programmers who want to become Data Engineers
• Managers who want to lead data engineering teams
• Data Analysts and Business Intelligence professionals who want to start using Big Data and eventually become Data Scientists
• Data Warehouse Engineers, DBAs, and similar SQL-focused roles that want to become part of a data engineering team
• Operations specialists who want to get into the Big Data side of operations

This book focuses on what you need to do to get a job. It calls out some of the technologies to focus on, but doesn't go too deep. I don't show the actual code that you'll need to learn; that's another product altogether. I want you to come away from this guide with a very good understanding of what it's going to take to switch careers to Big Data. From there, you can make an educated decision about whether this is the right path for you.

This is an ultimate guide. That means it goes into every question I get. No section or chapter is too verbose, but everything is covered. I don't want you to have to email me asking about a subject. It's all here.

About Big Data

First there's the definition of Big Data itself. The most orthodox definition is the 3 V's, 4 V's, or 5 V's. The actual number of V's depends on which company is trying to sell you something. These V's generally mean:

• Storing massive datasets
• Large user bases
• Data made up of a variety of types
• Running computationally complex algorithms on large datasets

The direct definition of the 3 V's is volume, variety, and velocity. Personally, I don't like this definition. It's far too difficult to quantify and measure. This definition makes everything become Big Data and the term gets misused.

I prefer the definition of can't. If you are experiencing Big Data problems, you will start to say can't.
You can't do something because:

• The processing will take too long
• There aren't enough resources or memory to process the data
• There is a technical limitation, like the RDBMS, that prevents you from doing it

This is what Big Data aims to solve. I like to tell people that Big Data puts the limitation on your imagination instead of on technical limitations. This opens up some really interesting possibilities. Now that those small data constraints are lifted, you can go as deep as you want to. This includes processing all customer data from the beginning of time to see what their lifetime interactions are.

This allows your data and results to become the lifeblood of the organization. With data, you'll be interacting with all parts of your organization.

A Little About Me

Let me tell you a little bit about myself. I've seen, helped, and mentored thousands of people and hundreds of companies as they've gone through the process of learning Big Data. I know the things that work, and more importantly, I know the things that don't. I'm going to tell you the exact skills that every person needs, depending on which position they're looking for.

If you follow the strategies and plans I outline in this book, you'll be way ahead of the others trying to get into Big Data who've failed. When other people interview for the same job, you'll be far ahead of them. You'll understand the fundamental differences between small data and Big Data jobs. When others just focus on a single technology, you'll give employers what they're looking for with your ability to show knowledge of all the right technologies.

I'll show you what to do because I've interfaced with hundreds of companies and taught thousands of students. I've helped these students get their data engineering and Big Data dream jobs.

Warnings and Success Stories

This book is the result of years of carefully observing teams, individuals, and entire companies.
I've also spent years teaching at hundreds of companies and to thousands of students. These companies span the world and are in all kinds of industries. My work starts with my very first interaction with teams and continues as I follow up with those teams to see the final outcome of their project. Other times, I start with students who are absolute beginners to Big Data and help them get their first data engineering job. From there, I analyze what went right, what went wrong, and why.

Throughout this book, I'm going to share some of these interactions and stories. They'll look something like this:

A cautionary story
Learn from this person's mistakes.

These will be cautionary tales from the past. Learn from and avoid these mistakes.

Follow this example
Learn from this person's success.

These will be success stories from the past. Learn from and imitate their success.

My story
Here's a story that will give you more color on the subject.

These will be stories or comments from my experience in the field teaching individuals. They will give you more background or color on a topic.

In Their Own Words
Here's a story that comes directly from a person.

These will be stories or comments from former students or someone who's been kind enough to share their own experiences with me. These are people who've shared the same journey as you. They will give you their own point of view on a topic.

Who Should Read This

This book is primarily written for individuals. These are people like:

• Programmers and Software Engineers
• Managers
• Data Analysts
• Business Intelligence Analysts
• Data Warehouse Engineers
• DBAs
• ETL and SQL Developers
• Operations and Administrative Specialists
• Enterprise, Data, and Software Architects
• Others with a general desire to get into Big Data

It will help you understand why some people succeed at getting a Big Data position while many others fail.
Navigating the Book Chapters

I highly recommend you read the entire book from start to finish to understand every concept and point. Without the entire background and all of the concepts, you may not fully understand why I recommend a technique or make a specific suggestion.

Here are the chapters and what we'll be covering in each one:

• Chapter 2 shares the benefits of getting into Big Data. There are many good reasons to start your Big Data journey and I'll share them.
• Chapter 3 tells you the skills that you'll need to have before you can join a data engineering team.
• Chapter 4 talks about what you should know about Big Data before you start going down that path. These are things every person should know before they start a Big Data journey.
• Chapter 5 tells you the exact changes you'll need to make depending on your current position and skill set.
• Chapter 6 covers how to get the skills and knowledge to join a data engineering team or start working with Big Data. I talk about any of the extenuating circumstances you may have, such as industry, country, or education.
• Chapter 7 shares the secrets to getting a Big Data job. These are the secrets I share with my students to stand out and get the best jobs.
• Chapter 8 goes step by step through the questions you should answer before starting to learn Big Data and gives you a checklist to verify you're ready to go.

Conventions Used in This Book

A DBA (Database Administrator) is someone who maintains a database and often writes SQL queries. For the purposes of this book, I'm going to group several different titles that write SQL queries together for simplicity's sake. In addition to DBA, these titles would be Data Warehouse Engineer, SQL Developer, and ETL Developer.

I will use the terms Software Engineer and programmer interchangeably. These are individuals who have programming or software engineering skills.
They are the team members responsible for writing the project's code.

With all of the housekeeping out of the way, let's get started and learn how to switch careers and get into Big Data!

CHAPTER 2 The Benefits of Getting Into Big Data

There are some great reasons to switch careers and get into Big Data. These are some of the outward reasons people start in Big Data.

Pay and Career Advancement

Big Data is one of the leading areas where companies are increasing their investment and resources. My students looked at other people advancing their careers by getting into Big Data and thought that they could do it too. They looked at the other teams in the company and saw the data engineering team had open positions. Meanwhile, their own team didn't have any open positions or was laying people off.

Pay and career advancement are some of the biggest motivators for people. You can make more, sometimes substantially more, with Big Data skills. People are taking their stagnating or disappearing career path and finding an expanding career path in Big Data.

How Much Higher?
In my previous career path there was a cap on how much I could progress, and the salary could reach maybe $100,000 to $130,000 — in data it's 60% higher and I'm no longer stuck on that other track. — Robert H.

Want to start interfacing with the CxOs on a regular basis? A data engineering team frequently interacts with the CxOs and VPs of the company. They're creating the data products that these people consume, and they're vital. My students advance faster because they're interfacing with the top people in the company and they're making a direct impact on the company's bottom line.

Interesting Work

Let's face it, most enterprise software is downright boring. There are only so many times you can write a CRM or the same select query. It just gets old after a while when you're doing the same thing over and over again.
An enterprise software developer's day is more like Groundhog Day than Independence Day.

Working with data is different. Yes, there is some drudgery, but much of the job gives you the freedom to experiment. I find that good candidates for being a Data Engineer are people who are bored with the routine and want to start working on something that doesn't have a deterministic ending. You get to analyze, create the data pipelines, and consume the data pipelines that give you cool insights into what's happening within the organization.

Burning Out
My transition was due to being burnt out at work. I was a software support engineer for a company that made software tools that encrypted apps, obfuscated keys, and other tools that kept people from ripping off licenses, etc. The work was very technical, but really demanding. I wanted more 'creative time'. — Stephan W.

High Demand/Low Supply

If you've ever taken an economics class, you know that the best position in a market is when there is high demand and low supply for an item. In Big Data, there is a very high demand for qualified people who know Big Data technologies. However, there is a low supply of these people (we'll talk more about why the supply is so low in Chapter 4 "What You Should Know About Big Data"). This inequality in the market means several good things for you:

• You will have fewer people competing with you for the same job
• Pay for the positions will go up as companies compete for qualified people
• There is a decent barrier to entry for new people, as not everyone can just up and learn Big Data
• People tend to get promoted quickly as their data engineering teams are newly established

How Can You Improve Your Skills and Be in Demand?

You've read the good things about Big Data. Now there's a gap between your current skills and the skills that will get you a Big Data job. I see this gap all the time because I teach Big Data.
I've taught thousands of students who are in your shoes right now. I've seen some students succeed and some students fail. My successful students are not unique snowflakes with tons of money or classes or special circumstances that have allowed them to be successful at learning Big Data. These students have taken an open and honest look at themselves and asked the following:

• Do I have a desire to learn Big Data?
• Do I have some of the prerequisite skills?
• Do I have the time to dedicate to learning?
• Do I have an expert (or experts) to guide me through the experience?

You will need all four of those points, in their entirety. If you're missing one of those points, it will take forever to learn Big Data and you'll give up. I've seen this many times when people talk to me at conferences or email me. They lack one of the essential points and never make any progress. I want you to really look over these points so you don't waste your time pursuing something you can't fully realize. Let's go through them in more detail.

A Desire to Learn Big Data

You will have to put in the effort to learn Big Data. Just hoping and trying to passively learn isn't going to get you anywhere. I see this in the comments section of Big Data videos. "I sat there, learned a little passively, and wasn't challenged. This was easy." Six months later, this person still hasn't switched careers.

Some of the Prerequisite Skills

Depending on the position, you will need some skills in your toolbox. In Chapter 5 "Switching Careers," I go through these skills and positions in more depth. These required skills can range from Linux to programming skills. It all depends on the position you're seeking.

The Time to Dedicate to Learning

Let's say for whatever reason a person doesn't have the time to dedicate to learning. They've been misguided by their previous experience with small data technologies.
They think they can get to an intermediate, or maybe even an advanced, level in a week or two. Those sorts of timelines don't carry over into Big Data. They will miss out on job opportunities because they aren't willing to put in the time and effort necessary to switch careers.

An Expert to Guide You Through the Experience

Learning Big Data isn't easy, and it's even harder without someone who is a recognized expert to learn from. You probably can't look through a class or syllabus and spot the signs of a massive waste of time. You will need an expert to guide you through a complex Big Data landscape. Spend the extra time and money to find the right person to learn from.

Industries Using Big Data

Virtually every industry is using Big Data. Some organizations and industries have more data than others. Still others have been using Big Data for a longer period of time, and some organizations are just starting out with Big Data. Let's talk about how a few industries are using Big Data and where you'd fit into their team with a Big Data background. All of these industries are looking for qualified people and are having difficulty finding them.

IoT

The Internet of Things (IoT) is an exciting usage of Big Data. There are two general things most IoT companies need.

IoT companies need to ingest or acquire data, and this ingestion needs to happen very fast. This is because so many devices are sending in data at all times, and the company has to make sure that important data doesn't get lost.

Next, they need to analyze that data in some way. This analysis could happen as the data comes in (real-time), later on as a file (batch), or both. The actual analysis will be driven by the use case.

A data engineering team is responsible for both the ingestion and analysis of incoming data. They may work with other parts of the organization to write the analysis or understand the sort of data they're working with.
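To make the real-time versus batch distinction concrete, here is a toy sketch in plain Python. The `StreamingAverage` class and `batch_average` function are invented for this illustration; a real IoT pipeline would run the same two styles of computation on a distributed framework rather than a single process.

```python
class StreamingAverage:
    """Real-time style: update a running statistic as each reading arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def ingest(self, reading):
        # Each device reading is processed the moment it arrives,
        # so the answer is always up to date.
        self.count += 1
        self.total += reading
        return self.total / self.count  # current average so far


def batch_average(readings):
    """Batch style: analyze a whole file of readings after the fact."""
    return sum(readings) / len(readings)


# Both styles answer the same question at different times.
stream = StreamingAverage()
for r in [20.0, 22.0, 24.0]:
    latest = stream.ingest(r)

print(latest)                             # 22.0 (streaming, after 3 readings)
print(batch_average([20.0, 22.0, 24.0]))  # 22.0 (same answer, computed in batch)
```

The point of the sketch is that "real-time or batch" is a choice about *when* the analysis runs, not *what* it computes; the use case decides which one (or both) a team builds.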
Finance

Financial organizations make extensive use of Big Data. These organizations were some of the first with Big Data problems. They have all sorts of data that needs to be processed. This could be doing end-of-day reports and calculations. It could be providing the data for trading and predicting when to trade based on large amounts of input data.

Working at a financial organization requires a great deal of domain knowledge. If you're able to mix your in-depth domain knowledge with Big Data technical knowledge, you'll be in high demand.

Financial Companies
I spend a good portion of my time teaching at financial companies. There are a few reasons for this. They have very specific Big Data needs and they can't just go out and hire new people. Usually, it's more time efficient to train their existing staff on Big Data because they already know the existing systems. Financial organizations may not be the most exciting places, but they pay well and they're stable.

Social

Have you ever wondered why social media companies are worth so much despite having a product that is free? The answer is that companies like Facebook, Twitter, and LinkedIn are interested in your data. By making wise use of this data, they can market products to you. That's the very basic description of their business model, but how do they do it from a technical perspective? They take their Big Data and process it to understand who you are and what you like.

These companies are at the forefront of Big Data and often create their own Big Data technologies to handle their use cases. A few examples of these are Presto from Facebook and Apache Storm from Twitter. These companies want people with the latest skills. If you know the latest cutting-edge technology, a social media company can make faster and better analyses about their users.

Marketing and eCommerce

Marketing and eCommerce companies share a common goal.
Both types of organizations use data to sell to their customers. They have vast quantities of data about their customers. Most companies will track every online interaction with a site. The real value comes in analyzing those interactions.

Which pages or products did you visit before you chose the product you bought? Given a category, which product is bought most often? Given a product, which products are the most similar to it? These might sound like easy questions to answer, but at the scale of Big Data you have a more difficult problem. These interactions are spread across hundreds of web servers' log files and on many different systems. These companies have 50 to 200 million active customers. They'll be dealing with tens of millions to 250 million different products. Your knowledge of Big Data will help them answer these questions, no matter what the scale.

Government and Non-profits

Big Data is at all levels of government. It isn't just a federal or state/provincial problem. The datasets are entire countries. Some government organizations deal with data from the entire world.

Your knowledge of Big Data helps the government find value in a sea of data. Better yet, you'll be able to join other datasets together to process them all at once. Some of the biggest insights for government use cases come from taking data from different silos and processing them together.

Not Just Military and Spies
Usually the military and spy agencies get the most press about their Big Data efforts, but all parts of government use Big Data. One of my friends processed commercial and consumer data for consumer protection and consumer confidence. There are all different use cases.

Others

These are just a few of the high points of industry usage. If your industry isn't here, don't worry. I didn't want this section to become an exhaustive list of every single industry. I do want to share some common use cases that exist in almost every industry.
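As a toy illustration of one of these analyses, here is the eCommerce question above ("given a category, which product is bought most often?") answered in plain Python. The record layout and function name are invented for this sketch; at real scale the identical group-and-count logic would run as a distributed group-by on a Big Data engine rather than an in-memory dictionary.

```python
from collections import Counter

# Hypothetical purchase records: (category, product SKU) pairs, as they
# might be reassembled from many web servers' log files.
purchases = [
    ("books", "sku-1"), ("books", "sku-2"), ("books", "sku-1"),
    ("music", "sku-9"), ("music", "sku-9"), ("music", "sku-7"),
]

def top_product_per_category(records):
    """Group purchases by category, then count purchases per product."""
    counts = {}  # category -> Counter of purchase counts per product
    for category, product in records:
        counts.setdefault(category, Counter())[product] += 1
    # most_common(1) returns [(product, count)] for the best seller
    return {cat: c.most_common(1)[0][0] for cat, c in counts.items()}

print(top_product_per_category(purchases))  # {'books': 'sku-1', 'music': 'sku-9'}
```

The question itself is simple; what makes it a Big Data problem is only the volume and spread of the input records.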
It's very common to take logging data and process it. This could be logs from your custom application, Apache web logs, or system logs. Once you have a large enough infrastructure, it's difficult to know what's running and what's about to encounter problems. These logging use cases bring all the data together so it can be processed.

Most companies have a website. They'll often want to know more information about what's happening than their off-the-shelf software can provide. Or they'll want to know something custom to their use case that's happening. Sometimes they'll want deeper analytics into what their users did over large periods of time. Either way, they'll need Big Data and custom analytics.

Other companies need a customer service backend that operates at large scale. These backends are for companies with many customers. For each of those customers, there are all sorts of data points, and the data may be coming from many different sources. The Big Data solution allows the company to combine all of the data in a single place so the customer service representatives have a complete view of the customer. This leads to happier customers and enables companies to process this data for other purposes.

CHAPTER 3 What Are the Skills Needed on a Data Engineering Team?

Before I can explain the exact skills you need, I need to give you some definitions. These definitions will help you understand how a Big Data team or data engineering team is set up and works. I'll also talk about how a data engineering team interacts with other parts of the organization.

If you are a team lead, manager, VP, or CxO and want to learn more about how to run data engineering teams, I've written an entire book on the subject titled Data Engineering Teams. Please visit http://tiny.bdi.io/detbook for more information about the book.
Don't worry, I'll give specific information about positions, titles, and the specific skills each member of the team needs in Chapter 5 "Switching Careers".

What Is a Data Engineer?

Let me start off by giving the title Data Engineer a formal definition. A Data Engineer is someone with specialized skills in creating software solutions around data. Their skills are predominantly based around Hadoop, Spark, and the open source Big Data ecosystem projects. Data Engineers come from a Software Engineering background and program in Java, Scala, or Python.

A Data Engineer has realized the need to go from being a general Software Engineer to specializing in Big Data as a Data Engineer. This is because Big Data is changing and they need to keep up with the changes. Also, there is a copious amount of knowledge that a Data Engineer needs to know, and there isn't enough time to keep up with both Big Data and other general software topics.

What Is a Data Engineering Team?

Next, we need to define a data engineering team, talk about where it lives in the organization, and how the team interacts with the rest of the organization. I call the data engineering team the hub of the wheel for data. It shows how a data engineering team becomes an essential part of the business process.

Figure 3.1: The data engineering team as the hub of data pipeline information for the organization

Being the hub of the wheel means the data engineering team needs to understand the whole data pipeline and to disseminate this information to other teams. The data engineering team will also need to help other teams know what data is available and what that data's format is. Finally, they'll have to review code or write the code for other teams.

People often confuse data engineering teams and Data Engineers as being one and the same. They aren't. A data engineering team isn't made up of a single type of person or title.
Rather, there are required skills that every data engineering team needs. The sorts of titles on a data engineering team are Data Engineer, Data Architect, DBA (or a variation thereof, like Data Warehouse Engineer), and DevOps Engineer.

This multidisciplinary approach is required because the team doesn't just handle data. They aren't programmers behind the scenes that no one interacts with. The team interacts with virtually every part of the company, and the team's products become the company's lifeblood. This isn't your average backend software; this will be how your company improves its bottom line and starts making money from its data. These results require people who aren't just ordinary programmers. They're programmers who have cross-trained in other fields and skills. The data engineering team also has some members or skillsets that aren't normally in an orthodox software engineering team.

The team is almost entirely Big Data focused. This is where the team's specialty is consistent. Everyone on the data engineering team needs to understand how Big Data systems are created and how they work. That isn't to say that a data engineering team can't or won't create small data solutions. Some small data technologies will be part of a Big Data solution. Using the data engineering team for small data is entirely possible, but it is generally a waste of their specialty.

Qualified Data Engineers

I often talk about qualified Data Engineers. This means that they have shown their abilities in at least one real-world Big Data deployment. Their code is running in production, and they've learned from this experience. These engineers also know 10 to 30 different Big Data technologies. A qualified Data Engineer's value is knowing the right tool for the job. They understand the subtle differences between use cases and between technologies, and they can create data pipelines.
These people are the ones you rely on to make the difficult technology decisions.

Data Scientists and Data Science Teams

I will briefly define a Data Scientist and data science team before I talk about their relationship to data engineering teams. A Data Scientist is someone with a math and probability background who also knows how to program. They often know Big Data technologies in order to run their algorithms at scale. A data science team is multidisciplinary, just like a data engineering team. The team has the variety of skills needed to prevent any gaps. It's unusual to have a single person with all of these skills, and you'll usually need several different people.

A Data Engineer is different from a Data Scientist in that a Data Engineer is a much better programmer and distributed systems expert than a Data Scientist. A Data Scientist is more skilled at the math, analysis, and probabilities than a Data Engineer. That isn't to say there isn't some crossover, but my experience is that Data Scientists usually lack the hardcore engineering skills to create Big Data solutions. Conversely, Data Engineers lack the math backgrounds to do advanced analysis. Hence, the teams are more complementary than heavily overlapping.

Multidisciplinary

Data engineering teams are multidisciplinary in nature. In contrast to other teams, not everyone on the data engineering team will have the title Data Engineer. This is because the skills that are needed on a data engineering team aren't always found in just Data Engineers. The team will be predominantly made up of Data Engineers, but some companies will embed a few non-Data Engineer positions. These titles include:

• Data Warehouse Engineer
• DevOps Engineer
• Data Scientist
• Business Intelligence or Data Analyst
• DBA
Figure 3.2: The data engineering team interacting with the data science team as the hubs of data for the organization

Where Do You Fit In These Teams?

Looking at this list of titles and positions on a data engineering team, you may not see your title. Does that mean you shouldn't be doing Big Data? No, the data engineering team creates the data pipelines. These pipelines are then consumed by the rest of the organization. These other teams will need Big Data skills, though not at the same technical level as their Data Engineer counterparts, to process or analyze the data. You or your team may exist on the outside of the hub, making extensive usage of the data created inside the hub.

Data Engineering Team Skills

Now that we've defined some terms and helped you understand how you'll be interacting with the team, let's talk about the specific skills that are needed on a data engineering team. Every person on the team should have at least one of these skills, and ideally several. The skills needed on a data engineering team are:

• Distributed systems
• Programming
• Analysis
• Visual communication
• Verbal communication
• Project veteran
• Schema
• Domain knowledge

Distributed Systems

Big Data is a subspecialty of distributed systems. Distributed systems are hard. You're taking many different computers and making them work together. This requires systems to be designed a different way — you actually have to design how data moves around those computers. Having taught distributed systems for many years, I know this is something that takes people time to understand. It takes time and effort to get right.

Common titles with this skill: Software Architect, Software Engineer

Programming

This is the skill for someone who actually writes the code.
They are tasked with taking the use case and writing the code that executes it on the Big Data framework. The actual code for Big Data frameworks isn't difficult. Usually the biggest difficulty is keeping all of the different APIs straight; programmers will need to know 10 to 30 different technologies. I also look to the programmers to give the team its engineering fundamentals. They're the ones expecting continuous integration, unit tests, and engineering processes. Sometimes data engineering teams forget that they are still doing engineering and operate as if they've forgotten their fundamentals.

Common titles with this skill: Software Engineer

Analysis

A data engineering team produces data analysis as a product. This analysis can range from simple counts and sums all the way up to more complex products. The actual bar or skill level can vary dramatically on data engineering teams; it will depend entirely on the use case and organization. The quickest way to judge the skill level needed for the analysis is to look at the complexity of the data products. Are they equations that most programmers wouldn't understand, or are they relatively straightforward? Other times, a data product is a simple report that's given to another business unit. This could be done with SQL queries. Very advanced analysis is often the purview of a Data Scientist and may not be directly part of the data engineering team.

Common titles with this skill: Software Engineer, Data Analyst, Business Intelligence Analyst, DBA

Visual Communication

A data engineering team needs to communicate its data products visually. This is often the best way to show what's happening with data, especially vast amounts of it, so others can readily use it. You'll often have to show data over time and with animation. This function combines programming and visualization.
A team member with visual communication skills will help you tell a graphic story with your data. They can show the data not just in a logical way, but with the right aesthetics too.

Common titles with this skill: Software Engineer, Business Intelligence Analyst, UX Engineer, UI Engineer, Graphic Artist

Verbal Communication

The data engineering team is the hub in the wheel where many spokes of the organization come in. You need people on the team who can communicate verbally with the other parts of your organization. Your verbal communicator is responsible for helping other teams be successful in using the Big Data platform or data products. They'll also need to speak to these teams about what data is available. Some data engineering teams will operate like internal solutions consultants. This skill can mean the difference between increasing internal usage of the cluster and the work going to waste.

Common titles with this skill: Software Architect, Software Engineer, Technical Manager

Project Veteran

A project veteran is someone who has worked with Big Data and has had their solution in production for a while. This person is ideally someone who has extensive experience in distributed systems or, at the very least, extensive multithreading experience. This person brings a great deal of experience to the team. The project veteran is the person who holds the team back from bad ideas. They have the experience to know when something is technically feasible but a bad idea in the real world. They will give the team some footing or a long-term viewpoint on distributed systems. This translates into better design that saves the team money and time once things are in production.

Common titles with this skill: Senior Software Architect, Senior Software Engineer, Senior Technical Manager

Schema

The schema skill is an odd skill for a data engineering team, because it's often missing.
Members with this skill help teams lay out data. They're responsible for creating the data definitions and designing the data's representation when it is stored, retrieved, transmitted, or received. The importance of this skill really manifests as data pipelines mature. I tell my classes that this is the skill that makes or breaks you as your data pipelines become more complex. When you have 1 PB of data saved in Hadoop, you can't rewrite it every time a new field is added. This skill helps you look at the data you have and the data you need in order to define what your data looks like.

Often, teams will choose a small data format like JSON or XML. Having 1 PB of XML, for example, means that 25% to 50% of the information is just the overhead of tags. It also means that data has to be serialized and deserialized every time it needs to be used. The schema skill goes beyond simple data modeling. Practitioners will understand the difference between saving data as a string and as a binary integer. They also advocate for binary formats like Avro or Protobuf. They know to do this because data usage grows as other groups in a company hear about its existence and capabilities. A format like Avro will keep the data engineering team from having to type-check everyone's code for correctness.

Common titles with this skill: DBA, Software Architect, Software Engineer

Domain Knowledge

Some jobs and companies aren't technology focused; they're actually domain expertise focused. These jobs focus 80% of their effort on understanding and implementing the domain. They focus the other 20% on getting the technology right. Domain-focused jobs are especially prevalent in finance, health care, consumer products, and other similar companies. Domain knowledge needs to be part of the data engineering team. This person will need to know how the whole system works throughout the entire company.
They'll need to deeply understand the domain for which you're creating data products; these data products will need to reflect this domain and be usable within it.

Common titles with this skill: DBA, Software Architect, Software Engineer, Technical Manager, The Graybeard, Project Manager

CHAPTER 4 What You Should Know About Big Data

What Putting in Your Own Lawn Has to Do With Big Data

I'm going to tell you a story about the time I put in my own lawn and sprinkler system. Trust me, it all relates to Big Data.

What's a Lawn and Sprinkler System?

For the folks around the world who don't know what lawns and sprinkler systems are, let me explain. A lawn is something (most) Americans want. It's an area of grass that you spend your weekends mowing and caring for. It doesn't rain enough where I live to keep a lawn green, and yet my homeowner's association requires it to be green. I had to install a system that sprays water on the lawn. See also:

• American obsession with lawns
• Homeowner associations
• Sprinkler systems

I'm a programmer, not a landscaper. I read a book and checked a few sites on how to install sprinkler systems and lay sod. Within a short amount of time I could go from an amateur to an intermediate level and get things done. I didn't have to spend a copious amount of time learning; the majority of the time was spent on the backbreaking labor. As long as the sprinkler system generally worked, I was happy. Any mistakes that I made during the installation would be localized to a small area and underground. If I had to fix something, I would simply dig up that small part of the yard, fix the issue, and put it back to normal.

Figure 4.1: My backyard after putting the sod down

With these skills, I could do 99% of lawns. I could have helped other friends install their lawns.
The same sorts of principles will apply.

Small Data

Working with small data has given you a similar comfort level. You didn't need expert-level knowledge of every technology to use it right. You could implement things the same way even in different circumstances. You have a software stack, be it Linux-Apache-MySQL-PHP or Linux-Tomcat-MySQL-Java. This same software stack can handle 99% of whatever is thrown at it.

A Common Frustration

This is one of the most common reasons people get frustrated learning Big Data. You don't have a simple software stack for everything. You'll need to know and understand more technologies at a much deeper level.

If there was a problem, the problem just had to be metaphorically dug up and fixed in one place. There weren't far-reaching effects on the system, and the system was relatively uncomplicated.

Big Data

Working with Big Data will get you outside of your comfort zone. You'll need to know 10 to 30 different technologies to create a data pipeline. This knowledge will need to range from cursory to expert level. Instead of having a standard stack, you'll be focused on the use case. Every decision, from technology choices to implementation, will be predicated on your use case.

When you make a mistake, you'll have to metaphorically dig up the entire backyard. The system is integrated, and data pipelines have far-reaching effects on the system. Likewise, the mistakes are exaggerated too. Mistakes with Big Data can take weeks or months to fix instead of hours or days. There is a really big difference in how Big Data systems are built and updated.

Why Is Big Data So Much More Complicated?

Let's start off with a diagram that shows a sample architecture for a mobile product with a database back end. Figure 4.1 illustrates a run-of-the-mill mobile application that uses a web service to store data in a database.
Figure 4.1: Simple architecture diagram for a mobile application with a web backend.

Let's contrast the simple architecture for a mobile app with Figure 4.2, which shows a starter Hadoop solution.

Figure 4.2: Simple architecture diagram for a Hadoop project.

As you can see, Hadoop weighs in at double the number of boxes in our diagram. Why? What's the big deal? It's that a "simple" Hadoop solution actually isn't very simple at all. You might think of this more as a starting point or, to put it another way, as "crawling" with Hadoop. The "Hello World!" of a Hadoop solution is more complicated than other domains' intermediate-to-advanced setups. Just look at the source code for a "Hello World!" in Big Data to see it's not so simple.

Now, let's look at a complicated mobile project's architecture diagram in Figure 4.3.

Figure 4.3: Complex architecture diagram for a mobile application with a web backend (yes, it's the same as the simple one).

Note: That's the same diagram as the simple mobile architecture. A more complex mobile solution usually requires more code or more web service calls, but no additional technologies are added to the mix.

Let's contrast a simple Big Data/Hadoop solution with a complex Big Data/Hadoop solution in Figure 4.4. Yes, that's a lot of boxes, representing the various types of components you may need when dealing with a complex Big Data solution. This is what I call the "running" phase of a Big Data project. You might think I'm exaggerating the number of technologies to make a point; I'm not. I've taught at companies where this is their basic architectural stack. I've also taught at companies with twice as much complexity as this.

Instead of just looking at boxes, let's consider how many days of training it would take between a complex mobile and a complex Big Data solution, assuming you already know Java and SQL.
Based on my experience, a complex mobile course would take four to five days, compared to a complex Big Data course, which would take 18 to 20 days, and this estimate assumes you can grok the distributed systems side of all of this training.

Figure 4.4: More complex architecture diagram for a Hadoop project with real-time components.

In my experience teaching courses, a Data Engineer can learn mobile, but a Mobile Engineer has a very difficult time learning data engineering. You'll see me say that Data Engineers need to know 10 to 30 different technologies in order to choose the right tool for the job. Data engineering is hard because we're taking 10 complex systems, for example, and making them work together at a large scale. There are about 10 shown in Figure 4.4.

To make the right decision in choosing, for example, a NoSQL cluster, you'll need to have learned the pros and cons of five to 10 different NoSQL technologies. From that list, you can narrow it down to two to three for a more in-depth look. During this period, you might compare, for example, HBase and Cassandra. Is HBase the right one, or is Cassandra? That comes down to you knowing what you're doing. Do you need ACID guarantees? There are a plethora of questions you'd need to ask to choose one. Don't get me started on choosing a real-time processing system, which requires knowledge of and comparison among Kafka, Spark, Flink, Storm, Heron, Flume, and the list goes on.

Distributed Systems Are Hard

Distributed systems frameworks like Hadoop and Spark make it easy, right? Well, yes and no. Yes, distributed systems frameworks make it easy to focus on the task at hand. You're not thinking about how and where to spin up a task or threading. You're not worried about how to make a network connection, serialize some data, and then deserialize it again.
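To see what the frameworks are sparing you, here is a hedged sketch of hand-rolled serialization: turning a record into bytes and reading it back. The (userId, score) record and its wire layout are invented for illustration; they are not any real framework's format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: manually serialize a (userId, score) record to bytes and back,
// the kind of plumbing a distributed framework normally handles for you.
// Layout: 4-byte name length, UTF-8 name bytes, 8-byte long score.
public class HandSerialization {
    public static byte[] serialize(String userId, long score) {
        byte[] name = userId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + name.length + 8);
        buf.putInt(name.length); // length prefix so the reader knows where the name ends
        buf.put(name);
        buf.putLong(score);
        return buf.array();
    }

    public static String deserializeUserId(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        return new String(name, StandardCharsets.UTF_8);
    }

    public static long deserializeScore(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int nameLength = buf.getInt();
        buf.position(buf.position() + nameLength); // skip over the name bytes
        return buf.getLong();
    }

    public static void main(String[] args) {
        byte[] wire = serialize("alice", 42L);
        System.out.println(deserializeUserId(wire)); // alice
        System.out.println(deserializeScore(wire));  // 42
    }
}
```

Multiply this by every message between every pair of machines, plus retries and partial failures, and the value of having a framework do it becomes obvious.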
Distributed systems frameworks allow you to focus on what you want to do rather than coordinating all of the pieces to do it. No, distributed systems frameworks don't make everything easy. They make it even more important to know the weaknesses and strengths of the underlying system. They assume that you know and understand the unwritten rules and design decisions they made. One of those assumptions is that you already know distributed systems or can learn the fundamentals quickly.

I think Kafka is a great example of how, in making a system distributed, you add 10x the complexity. Think of your favorite messaging system, such as RabbitMQ. Now, imagine you added 10x the complexity to it. This complexity isn't added through some kind of over-engineering, or my PhD thesis would fit nicely here. It's simply the result of making it distributed across several different machines. Making a system distributed, fault tolerant, and scalable adds the 10x complexity.

How Long Will It Take to Learn?

Given this level of complexity and the sheer number of technologies you need to learn, it's going to take a while to learn everything you need. In my own experience, it took me four months before I really felt comfortable with all of the things I needed to know. Keep in mind that I had a decent background in distributed systems before moving over to Big Data. This background helped a great deal, but didn't completely alleviate the complexity of learning. Honestly, nothing prepared me for the sheer number of technologies I had to learn. In my experience teaching, the average person is going to take 6 to 12 months to feel comfortable with everything. They'll take about one to four months to learn enough about the various frameworks to be productive.

Ability Gap

There's another crucial part to how long it will take you to learn Big Data technologies. Instead of a question of when, it's a question of if.
As a trainer, I've lost count of the number of students and companies I've taught. One thing is common throughout my teaching: there is an ability gap. Some people simply won't understand Big Data concepts on their best day.

Most industry analysts usually talk about skills gaps when referring to a new technology. They believe it's a matter of an individual simply learning and eventually mastering that technology. I believe that too, except when it comes to Big Data. Big Data isn't your average industry shift. Just like all of the shifts before it, it's revolutionary. Unlike the previous shifts, the level of complexity is vastly greater. This manifests itself in the individuals trying to learn Big Data. I'll talk to people where I've figured out they have no chance of understanding the Big Data concept I'm discussing. They've simply hit their ability gap. They'll keep asking the same question over and over in a vain attempt to understand. Yes, this is harsh.

The technical bar for Data Engineers is pretty high. This ability gap is specific to Data Engineers, who have to create these data pipelines. It isn't something that other members of the data engineering team or consumers of data pipelines will have to deal with as much. These positions can rely on the Data Engineers to help them with the really difficult parts.

Changes You'll Need to Make

Oddly enough, not every change you'll need to make will be purely technical. Some changes will be around your mindset and perception of data. The biggest change is around scale. When thinking about scale, I encourage students to think in terms of:

• 100 billion rows or events
• Processing 1 PB of data
• Jobs taking 10 hours to complete

I'll talk about each one of these thoughts in turn. When you are processing some data, in real time or batch, you need to imagine that you're processing 100 billion rows. These can be 100 billion rows of whatever you want them to be.
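Work at that scale is usually expressed as map and reduce style steps. As a small-data sketch of the pattern in plain Java (no framework; the input text is invented), the map phase emits a (word, 1) pair per word and the reduce phase sums the values per key:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the map/reduce pattern on small data, without Hadoop. Real
// frameworks distribute these phases across machines; the shape is the same.
public class WordCountSketch {
    public static Map<String, Integer> wordCount(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            // Map phase: emit (word, 1). Reduce phase: sum the 1s per key.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount("big data big pipelines"));
        // {big=2, data=1, pipelines=1}
    }
}
```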
This affects how you think about reduces, or the equivalent of reduces in your technology of choice. If you reduce inefficiently or when you don't have to, you'll experience scaling issues.

When you are thinking about amounts of data, think in terms of 1 PB. Although you may have substantially less data stored, you'll want to make sure your processing is thought about in those terms. As you're writing a program to process 100 GB, you'll want to make sure that same code can scale to 1 PB. One common manifestation of this problem is caching data in memory for an algorithm. While that may work at 100 GB, it will probably get an out-of-memory error at 1 PB.

When you are thinking about long-running processes, I encourage teams to think of them as taking 10 hours to complete. This thought has quite a few manifestations. The biggest manifestation is about coding defensively. If you have an exception 9.5 hours into a 10-hour job, you have two problems: to find and fix the error, and to rerun the 10-hour job. A job should be coded to check assumptions, whenever possible, to avoid an exception from exiting a job. A common example of these issues is when a team is dealing with string-based formats, like JSON and XML, and expecting a certain format. This could be casting a string to a number. If you don't check that string beforehand with a regular expression, you could find yourself with an exception.

Given the 10-hour job considerations, the team needs to decide what to do about data that doesn't fit the expected input. Most frameworks won't handle data errors by themselves. This is something the team has to solve in code. Some common options are to skip the data and move on, to log the error and the data, or to die immediately. Each of these decisions is very use case dependent. If losing data or not processing every single piece of data is the end of the world, you'll end up having to fix any bad data manually.
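Putting the defensive-coding advice together, here is a hedged sketch of parsing a numeric string field without letting one bad record kill the job. The record values and the skip-and-log policy are illustrative choices, not something any framework prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: defensively parse numeric string fields instead of letting one bad
// record throw an exception 9.5 hours into a 10-hour job.
public class DefensiveParse {
    private static final Pattern INTEGER = Pattern.compile("-?\\d+");

    // Returns the parsed values; malformed records are logged and skipped.
    public static List<Long> parseAll(List<String> rawFields) {
        List<Long> parsed = new ArrayList<>();
        for (String raw : rawFields) {
            if (raw != null && INTEGER.matcher(raw).matches()) {
                try {
                    parsed.add(Long.parseLong(raw));
                } catch (NumberFormatException e) {
                    // Regex passed but the value overflows a long; still don't die.
                    System.err.println("Skipping out-of-range field: " + raw);
                }
            } else {
                // Use case decides: skip, log, or die. Here we log and move on.
                System.err.println("Skipping malformed field: " + raw);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(List.of("42", "oops", "-7"))); // [42, -7]
    }
}
```

The same shape works whether the policy is skip, log, or fail fast; only the branch bodies change.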
Every engineering decision needs to be made through these lenses. I teach this to every team, even if their data isn't at these levels. If they are truly experiencing Big Data problems, they will hit these levels eventually.

Not Just Beginner Skills

Big Data is complex. Organizations that are hiring for teams have been bitten by hiring people who've hit their ability gap. They hired someone with beginner skills thinking that they'd be able to progress over time. That person didn't progress due to their ability gap, and the organization was forced to hire another person. Suffice it to say, the organization doesn't repeat that mistake. Now, organizations require intermediate to advanced skills before hiring new people on the team. This is to verify that the new person doesn't have an ability gap beforehand. Organizations simply cannot afford to hire people with the wrong skills and ability.

Many people haven't internalized this fact — they continue to think their beginner skills will get them a job. They fail interview after interview, or they never get the call for the interview in the first place. A person wanting to work with Big Data will need intermediate to advanced skills. It's important to note that just memorizing interview questions is another route people take. A competent interviewer will see through this quickly.

The Hiring Manager Conversation

"Pass," he said while on the phone. "Pass," he said again into the phone. Very curious, I waited until he finished with his phone call. He's the hiring manager for a large company that is looking for Data Engineers. He was telling their recruiter to pass (not hire) on the job candidates they had just interviewed. Why was he passing on these candidates? "They weren't qualified enough," he said. "They either didn't have the experience or breadth of knowledge to do the job. It's better to not even hire them."
You may be reading this and starting to worry about a chicken-and-egg scenario. How will you get the skills without getting a job? In this case, people are equating skill with experience. You can gain and demonstrate skills without having on-the-job experience (see Chapter 7, "How Do You Get a Job"). Companies do realize that data engineering is new and that not everyone will have job experience. They do expect that you have spent the time beforehand getting the skills through your own learning. Understanding the distinction between skill and experience is the difference between getting a job and wasting countless hours applying and interviewing.

Which Technology Should You Learn?

People who are new to Big Data fall into a trap from their time in small data. They think they just need to know a stack or one technology. So, they look through the marketing materials and the blogosphere for what people are recommending. Others go on LinkedIn and ask what the latest technology is that they should learn. The sad truth is that these people fall into another trap that wastes their time. Their perception is that everything just needs one technology. That perception differs from the reality at companies. The reality is that companies use many different technologies all working together to create a data pipeline. Even those consuming and analyzing the data pipelines will need to work with several different technologies. The answer is that Data Engineers will need to know 10 to 30 different technologies. Their value is that they know the right tool for the job, and that saves the organization time and money. This book doesn't go deeply into the technologies themselves, but we go a little deeper in Chapter 6, "How Can You Do It."

Hadoop Is Dead

You'll read that some technology is dead. That leads smart people to always be searching for the technology that isn't dead, and they get into an infinite loop searching for what isn't dying.
The reality is that every technology is dying; some are simply dying faster than others.

X Is Dead

Most of these "technology X is dead" articles come from a vendor whose product competes with that "dead" technology. The vendor has a vested interest in furthering the perception that the technology is dead. If you use this as a reason to not learn a technology, you'll never get anywhere. Trust me, this is a great way to never get anything done — and I see it happen all the time.

Hadoop MapReduce

Is there any value to learning MapReduce now? I still teach MapReduce. I find it's a great way to see how things are working behind the scenes, and with a simpler API than others. The newer technologies like Spark and Flink are more like supersets that add more functionality. Understanding MapReduce makes learning Spark and Flink easier. Many companies have legacy MapReduce code that Data Engineers need to maintain or use. Still other companies haven't made the jump, and they're waiting for the next technology after Spark to get adopted.

Future-Proofing Your Career

The Big Data landscape is changing constantly. Some technologies are changing faster than others. There is no such thing as future-proofing your career by choosing the right technology. The right technology will eventually become the wrong technology. You will need to spend time and effort to keep up with what's happening. This is part of the reason that Data Engineers must specialize. They can't be general Software Engineers anymore. They need to specialize in Big Data technologies in order to keep up with the sheer number of changes and new technologies coming out every day.

Is Big Data Going to Last?

Since Big Data is changing so much, does that mean it's going to burn out like so many fads before it? Personally, I went down the path of a few fads (ahem, Ruby on Rails).
I know the pain and wasted time of going down a path of something that showed so much promise and burnt out faster than you can say fad. I've spent a good amount of time looking at Big Data's life span and thinking about it. I obviously have a dog in this fight, but I did all of my analysis before I went heavily into Big Data. I redid my analysis before I started my own company focused on Big Data.

I'm also at the forefront of Big Data. Almost every week, I'm teaching a company that is either starting its Big Data journey or expanding it. There is an obvious bias in this data — the companies that stopped or cancelled their Big Data projects aren't bringing me in to train their staff (aka survivor bias). I've found these companies stop for three reasons:

• They thought Big Data was a silver bullet that would cover over the shortcomings of their company.
• They didn't have real Big Data problems in the first place, could replace Big Data technologies with small data technologies, and be better off.
• They didn't know how to run a Big Data project, and it failed.

If an organization gets past these issues and has true Big Data needs, it can't do anything else but use Big Data technologies. Think about the companies that have real Big Data needs, like Google, Apple, and Facebook. They've never said, "We don't have Big Data problems anymore. Let's switch back to these small data technologies and fire the teams." That doesn't happen because they can't go back to small data technologies and teams with only small data skills.

Companies like Google have been researching how to make Big Data easier for years. They've made it easier, but they haven't lowered the bar so much that anyone can do it. There's still a high technical bar for the Data Engineers who can create these systems. It's somewhat easier for these skilled people to create these systems, and they don't have to write these systems from scratch anymore.
Why I Don't Think Big Data Is Going Away

In my view, Big Data isn't going away and will increase in usage because:

• When you have real Big Data problems, you can't switch back to small data.
• Other organizations are seeing what's possible and want to mimic that success.
• The supply of Data Engineers will go up, but it won't become saturated because the technical bar is much higher. You're not going to wake up one day and be replaced by a bunch of Web Developers who learned Big Data technologies.
• Data pipelines aren't a one-and-done project. The project and data pipeline are constantly changing. Organizations will need people to continually update and evolve these pipelines.

CHAPTER 5 Switching Careers

Retooling

A good portion of my students are people who've spent several years in small data. They're looking to retool their careers and switch to Big Data. In these scenarios, you'll need to focus on what you bring that others don't. You may bring more domain experience — and that's difficult to train a new person on. You'll have more systems and coding experience than others. Once you add Big Data skills to existing experience, you become an asset to the organization. Most data engineering teams skew towards the senior-level titles. Other teams who aren't part of the data engineering team tend to be evenly distributed on seniority.

When Layoffs Happen

I've had the unfortunate luck of teaching while layoffs were happening. They literally sent the email while I was lecturing. I know the face of teams who've just found out that there are layoffs. I tell them to focus on the class even more. This Big Data class is how they're going to get their next job. If you've been laid off or feel like layoffs are imminent, focus on your learning. The organization can fire you, but they can't take your acquired knowledge.
Use your existing domain knowledge and your new Big Data skills to get your next job.

Newly Graduated

While data engineering teams skew towards senior people, that doesn't mean that newly graduated people or people who are fresh out of school don't work on data engineering teams. There are fewer of these people, and they'll need to improve their skills. I've had interns and recent graduates in my classes. They tend to have their Master's degrees in Computer Science. Those with Bachelor's degrees had a focus on distributed systems and multithreading. Other teams who aren't part of the data engineering team tend to be evenly distributed on seniority and have more junior engineers.

General Advice on Switching

Reading this book will put you ahead of others. You'll actually know going in what you're up against. You'll have a much better idea of how much effort and work it will be given your existing skills. Setting up your expectations is a major reason I wrote this book. Unrealistic expectations are a major reason I see people fail at learning Big Data. They come from their small data background and think Big Data will follow suit. It doesn't. Learning Big Data will be difficult and time consuming.

People who are new to Big Data think I'm exaggerating when I say that Data Engineers need to know 10 to 30 different technologies. This is how people don't get jobs. They know one technology and no one will hire them; their perception doesn't match reality. You will need to know multiple technologies because a data pipeline is made up of multiple technologies all working together. Those who aren't part of a data engineering team but still interact with data pipelines have a lower technical bar. You will still need to know various technologies, but not at the depth a Data Engineer needs to know them. Now, I'll give specific advice and recommendations for different positions and skills.
Programmers and Software Engineers

Programmers and Software Engineers generally become Data Engineers. These Data Engineers take their existing knowledge of programming and augment it with Big Data.

I put architects in this section too. In my experience, architects need to do some level of coding with Big Data. They may help create the PoC, but not do the majority of the coding. I've found that architects trying to design Big Data pipelines without a coding background don't understand the technology tradeoffs well enough to make the correct decisions.

Some Software Engineers have a math or statistics background. These people are often good candidates to become Data Scientists. I've had students who've joined data science teams to improve the data science team's programming abilities while still knowing the math behind the algorithms.

Data Science Teams' Engineering Skills

Most data science teams' software engineering skills vary from fair to absent. They're people without engineering backgrounds doing engineering. A programmer with a good engineering background is an asset to the team.

Required Skills

First and foremost, you will need programming skills. You should have at least intermediate programming skills. People who are brand new to programming struggle as Data Engineers.

You should have a general understanding of Linux. You'll need to be relatively familiar with using a Linux command line to issue commands. Most work can be done in the GUI, but some Hadoop-specific interactions are on the command line.

More helpful, but not required, is a background in distributed systems. Big Data frameworks like Hadoop are distributed systems. These frameworks make it easier to work with distributed systems, but don't completely mask all of the complexity. At some point in your journey, you'll need to learn these concepts to really master Big Data frameworks.
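These distributed-systems concepts can be previewed in miniature with plain Java. This is a hedged sketch, not any framework's API: the class name and data are invented for illustration, and the "split the work, process the pieces independently, combine the results" shape runs here across threads in a single JVM, where a Big Data framework runs it across processes on many nodes.

```java
import java.util.Arrays;
import java.util.List;

// A toy "divide, process in parallel, combine" computation.
// Big Data frameworks apply this same shape across many machines;
// here it runs across the threads of one JVM.
public class PartialSums {
    public static long totalLength(List<String> records) {
        return records.parallelStream()          // split the work across threads
                      .mapToLong(String::length) // process each record independently
                      .sum();                    // combine the partial results
    }

    public static void main(String[] args) {
        System.out.println(totalLength(Arrays.asList("alpha", "beta", "gamma"))); // prints 14
    }
}
```

The point isn't the parallel stream itself; it's that the same decompose-and-aggregate reasoning carries over when the "threads" become nodes in a cluster.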
Another possibility is to build on your existing multi-threading skills. If you have done cross-thread and concurrency work, some of Big Data's concepts will be familiar. With Big Data, you're doing concurrency on many processes spread out across many nodes instead of threads in a single process.

Which Languages?

The majority of Hadoop and the Hadoop ecosystem is written in Java. You should have an intermediate to advanced level of Java knowledge. You need to understand concepts like generics, inheritance, and abstract classes. Java 8 introduced lambda functions, and most of the new Big Data frameworks are moving to using lambda functions throughout. It's well worth your time to learn them, as they improve your Big Data code significantly.

Hadoop and its ecosystem support other languages to varying extents. Scala is a popular language. Since Scala is a JVM language, you can use it with any other JVM-based framework. Some Big Data frameworks, such as Apache Spark, are written in Scala. In my experience, Data Engineers and data engineering teams are 95% Java-based. Scala is more popular with data science teams because of its concise, expressive nature.

Python is another popular language with Data Engineers and Data Scientists. Support for Python is improving in the various frameworks out there. Apache Spark and Apache Beam have native Python support. Python is usually supported as a quasi-first-class citizen: everything gets added and tested in Java/Scala first and then ported to Python. This means that Python will lag in support and won't have access to everything.

Other languages will work with varying degrees of effectiveness and gotchas. If you mostly program in a language that I didn't mention, check to see if your target industry or company is using that language. Otherwise, you may want to learn one of the predominant languages above.

SQL Skills

I'm seeing a push to add more SQL support in Big Data frameworks.
That's because it's just easier to express some things in SQL; joins are one great example. To that end, I highly recommend that Data Engineers learn SQL if they don't already know it.

Cloud vs Open Source/On Prem

As you're choosing a path, you'll have to decide between learning Cloud-specific technologies and open source technologies. Cloud-specific technologies will limit your job search to companies that are either currently using the cloud or will be using the cloud in the future. They have the benefit that you won't have to deal with or learn any of the operational side.

Keys to Getting the Job

• Programming skills
• In-depth knowledge of the technologies, to know the right tool for the job
• An awesome personal project
• A desire and interest in data

How I See People Fail

• Using the wrong materials to learn by cheaping out on learning ("I'll use YouTube to learn this")
• Underestimating the complexity ("I don't believe Jesse that it's really that hard")
• Underestimating or not allocating the time to learn
• A lackluster or Hello World-level personal project

How I See People Succeed

• Saving time with the right learning materials
• Having an awesome personal project with a beginning, middle, and end that you can demonstrate to the interviewer
• Showing an interest in the company's data and results during the interview
• Taking the time and effort to learn these technologies thoroughly

Managers

Why would a manager need to change anything for Big Data? Isn't it all the same thing as a small data project? This is how I've seen so many Big Data projects fail. The management team doesn't internalize that Big Data projects need to be run and resourced differently. To give you an idea of how differently, I've written an entire book, Data Engineering Teams, just on how teams should be skilled and how projects should be run. Every manager should read that book.

A small data manager usually becomes a Data Engineering Team Manager.
Required Skills

Managers don't need to know the technology at the same depth as their Data Engineers. That said, managers need at least a cursory knowledge of the technology behind the scenes. Managers who don't understand the basics won't understand what their Data Engineers are talking about. A technical background is helpful, but not required, for this position.

Keys to Getting the Job

• Knowing the skills that need to be on a data engineering team
• Knowing how to run a Big Data project
• A cursory understanding of the technologies you'll be using
• Reading my Data Engineering Teams book

How I See People Fail

• Thinking that running a Big Data project is just like running a small data project
• Trying to get to the same knowledge level as the Data Engineers
• Not gaining at least a cursory level of understanding
• Not listening to their Data Engineers on technical decisions

How I See People Succeed

• Truly internalizing how Big Data is different
• Knowing the skills on a data engineering team and knowing how to do a skills gap analysis (covered in Data Engineering Teams)
• Having concrete ways that the team can be improved
• Not being afraid to make tough choices on the team (some team members may have an ability gap)

Data Analysts and Business Intelligence

Data Analysts and Business Intelligence Analysts are usually consumers of a data pipeline. As such, they're usually not members of the data engineering team. Sometimes, though, they are part of the data engineering team. I've seen this happen when the company is small or when an organization's analyst team isn't technically proficient enough to write or code what they need.

The Difference Between a Data Scientist and an Analyst

I'm often asked what the difference is between a Data Scientist and a Data Analyst/Business Intelligence Analyst. Both of those positions have backgrounds in math.
They differ in two key areas. First, a Data Scientist has better programming skills than an Analyst, usually ranging from intermediate to those of a Data Engineer. They predominantly use Python and Scala, and they know how to use Big Data frameworks to run their code and models at scale. Second, a Data Scientist knows how to apply their statistical background to problems such as machine learning. It is a common progression for Analysts to improve their programming skills and learn Big Data frameworks to become Data Scientists.

Required Skills

Whether they're part of a data engineering team or not, Data Analysts and Business Intelligence Analysts need to have the technical skills to use the Big Data frameworks. The technical bar is lower, and the Data Engineers will have created the data pipeline. You will need to have enough technical knowledge to run your analysis.

A common question is whether Analysts need to learn to program. I highly suggest that Analysts learn to program. This will set you apart from other people going for the same position. Python is a language that's commonly used by Analysts, and it has expanding support in the Big Data ecosystem. R is another popular language with Analysts. Its support in the Big Data ecosystem is emerging; most projects will have little to no support for R.

SQL Skills

I'm seeing a push to add more SQL support in Big Data frameworks. That's because it's just easier to express some things in SQL. It's also easier for Analysts to query the data themselves. I highly recommend that Analysts learn SQL if they don't already know it.
Keys to Getting the Job

• Understanding the Big Data technologies you'll be using
• Having either SQL or programming skills
• A personal project that shows your analytic and technical skills
• A true interest in data and finding newfound insights in it

How I See People Fail

• Not understanding the technology behind the scenes
• Thinking that the analysis/math is the only hard part (the technology is just as hard)
• Not learning the Big Data technologies aimed at Analysts
• A personal project that doesn't show the insights you'd expect a good Analyst to find

How I See People Succeed

• Having SQL or programming skills (ideally both)
• A personal project that shows a true understanding of the domain and the insights you found
• A good knowledge of the Big Data technologies for Analysts, like Apache Impala, Hive, Spark SQL, etc.
• Being able to complement a data engineering team with your analytics background

DBAs and SQL-focused Positions

In this section I discuss the positions that primarily focus on SQL. These are titles like:

• DBA
• Data Warehouse Engineer
• SQL Developer
• ETL Developer

For ease of reading, I'll collectively refer to these titles as DBAs.

Of all the careers, DBAs are facing the biggest crisis from Big Data. They're faced with a changing landscape of technologies. Data used to be their purview, and now they're finding a brand new team and title emerging: the data engineering team and the Data Engineer.

DBAs are faced with a difficult decision. They'll need to dramatically increase their technical skills to become part of a data engineering team. By learning to program and learning the Big Data frameworks, I've seen some become Data Engineers. I've seen other DBAs put the time and effort into their technical skills and handle the analysis part of a data engineering team. Still others will go into more of an operations role outside the data engineering team. All of this boils down to your programming skills.
Programming skills are the determining factor when a DBA is figuring out whether to go into operations or join a data engineering team.

The Traditional DBA

The traditional DBA role as we know it is gradually decreasing. The title isn't going away completely, but we're going to see a gradual decrease in the size of traditional DBA teams.

Part of the reason for sizable DBA teams was that we were using the wrong tool for the job. Doing a full table scan brings the RDBMS to its knees, but we didn't have a better choice. With Hadoop and Spark, we have systems that are purpose-built to do the equivalent of full table scans with ease. This isn't to say that DBAs or RDBMSs are going away. Rather, their use for the wrong jobs will diminish. That will, in turn, reduce the number of DBAs required for the job; instead of a team of 10 DBAs, there will be 5 DBAs on the team. The other 5 laid-off DBAs are faced with the dilemma of what to do. I strongly encourage all DBAs to start learning Big Data frameworks now.

Required Skills

You should have advanced SQL skills. There are technologies in the Hadoop ecosystem that use SQL. However, these technologies aren't enough to create a full-fledged data pipeline. Being limited to SQL as your only means of querying leads to data pipelines that don't really accomplish their goals.

I highly suggest you learn how to program. DBAs often get stuck in an endless loop figuring out which programming language to learn. For those coming from a PL/SQL background, Python is a general recommendation. If you want to learn the most widely used language, Java is my recommendation. Learning to program isn't the memorization of APIs; it's the application of an API to solve a problem.

I find that DBAs fill the schema skill on a data engineering team the best. DBAs have spent their careers dealing with the manifestations and requirements of schemas.
Programmers, on the other hand, often don't understand schemas as well.

It's critical that DBAs learn the Big Data frameworks. They may not need to learn the systems as deeply as a Data Engineer, but they still need to understand them. These new systems aren't RDBMSs with a few differences; they're completely different systems with no RDBMS equivalent.

In the Big Data space, DBAs will often gravitate to the NoSQL databases. I've worked with DBA teams who've designed and implemented NoSQL databases without really learning the systems. The projects tend to fail because they're implemented like an RDBMS, and that's a recipe for disaster.

There are some WYSIWYG-style (What You See Is What You Get) frameworks for Big Data. They layer on top of the existing Big Data frameworks, and they vary in maturity and price.

WYSIWYGs

I don't have enough data points to comment one way or another. My initial data shows that you still need to understand the systems underneath. A WYSIWYG helps you connect and do some pre-canned processing. It doesn't help you design or create the data pipeline. It also remains to be seen how often you have to write some of your own code to get things done. The examples I've seen left the DBA with weird and inefficient workarounds to get things done.
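To make "the application of an API to solve a problem" concrete for a DBA, here is a hedged sketch of a familiar SQL idea, GROUP BY with COUNT(*), rebuilt in plain Java. The table, column, and class names are invented for illustration; this is a learning exercise, not any particular Big Data framework's API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Roughly the Java equivalent of:
//   SELECT region, COUNT(*) FROM orders GROUP BY region;
public class GroupByCount {
    public static Map<String, Long> countByRegion(List<String> regions) {
        return regions.stream()
                      .collect(Collectors.groupingBy(
                          region -> region,         // the GROUP BY key
                          Collectors.counting()));  // the COUNT(*) aggregate
    }

    public static void main(String[] args) {
        // Four "orders" across three regions.
        System.out.println(countByRegion(Arrays.asList("west", "east", "west", "north")));
    }
}
```

Small translations like this, taking a query you already know by heart and rebuilding it in code, are a practical way to move from SQL-only work toward the programming a data engineering team needs.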
Keys to Getting the Job

• Filling the data engineering team's schema and domain knowledge requirements
• Learning Hadoop and the Hadoop ecosystem
• Understanding and learning about the NoSQL ecosystem
• Learning to program

How I See People Fail

• Thinking Data Warehousing is the same level of complexity as a Big Data pipeline
• Not increasing their knowledge of these new Big Data systems
• Thinking that RDBMS knowledge alone will get them on a data engineering team
• Thinking that every problem can be solved with SQL

How I See People Succeed

• Changing their mindset about Big Data
• Getting serious about coding
• Showing your coding skills with an awesome personal project
• Deeply understanding NoSQL and when a team should be using it

Operations

Operations people are usually the maintainers of the clusters. They're tasked with keeping the various processes running and maintaining the health of the nodes. They serve as the first line of troubleshooting when something goes wrong in the cluster.

Even if you're doing everything in the cloud, I still suggest teams have at least one operations person. They may not be part of the data engineering team, but they would be assigned cluster maintenance as their primary task. Sometimes, I see operations people who are part of a data engineering team. This happens when the team or organization is doing a DevOps model. In DevOps, I find that the team members are more on the operations side of the spectrum than on the development side.

Doing both operations and development puts DevOps Engineers in a difficult situation. They have to know both the operational parts and the API parts of the system. This can be a massive undertaking.

Required Skills

You will need to learn Big Data frameworks from an administrator's point of view. These frameworks require daemons to be running, and there is a plethora of configuration properties you will need to know and tune.
Often, operations is tasked with the actual running of the jobs and queries that the Big Data framework processes.

Hadoop runs on Linux. The majority of clusters run on RHEL/CentOS or Ubuntu. The Big Data nature of things will stress your systems in ways you might not have seen before and will expose weird problems you only see at scale. To diagnose and fix these issues, you'll need to be very good with Linux, especially from the command line. Most of the computers in your Hadoop cluster will be sitting in a data center's rack or in the cloud. Some of the Hadoop companies, like Cloudera and Hortonworks, are making cluster administration easier with web-based GUIs. These help with monitoring Hadoop clusters and detecting problems. Despite these programs, you'll still need to know how to troubleshoot a computer with a Linux command line.

If you're planning on administrating an enterprise cluster, you'll probably be dealing with security. This includes everything from authentication with Kerberos, to line encryption, to at-rest encryption. It's the administrator's job to set all of this up and keep things secure. Security is becoming a key part of operations as hacks expose entire data pipelines.

Some Big Data technologies highly benefit from a DevOps model. Apache HBase, for example, is one of those technologies. In order to really be successful with HBase, the team needs to have equal parts operations and programming knowledge.

SQL Skills?

Operations teams don't need SQL skills, but it definitely helps. You don't need to know every operation and feature, but I suggest you know some basic SELECT statements.

Cloud vs Open Source/On Prem

Cloud does not eliminate operations; it reduces the number of people you need for operations. I would never let 90% of the developers I've taught near a production system. As you decide between learning cloud technologies or open source technologies, remember that cloud-specific skills are inherently limiting.
Between the options, learning open source gives you the most possibilities.

Keys to Getting the Job

• Knowing how to set up a cluster from scratch
• Knowing how to troubleshoot issues, whether they're in the hardware, custom software, or the framework itself
• In-depth knowledge of Linux
• Knowledge of the operations of a wide breadth of Big Data technologies

How I See People Fail

• Thinking that managing a Big Data framework is easy
• Failing to understand the various issues of running a 100+ node cluster
• Thinking that maintaining a cluster just means making sure the hardware and network are working
• Thinking that cloud means no operations

How I See People Succeed

• Having excellent troubleshooting skills
• Showing demonstrable mastery of Linux, especially from the command line
• Being able to operate a cluster from both the web GUIs and the command line
• Providing excellent first-level support so that only the right things get escalated

What if You're Not on This List?

In the previous list of titles and positions, I tried to capture 80-90% of what I've encountered in the real world. That means that I didn't mention everyone. I'll try to give some general suggestions here.

I've taught a number of Physicists. They have a background in data and large-scale data processing. Often, they'll become Data Scientists after improving their programming skills and learning the Big Data frameworks.

Other times, people don't come from a computer science background. I've had a few Electrical Engineers learn programming and then learn Big Data technologies. They've said some of the electrical engineering concepts are very similar to the Big Data concepts.

A few people have a general math background. They're going for an Analyst or Data Science position. Once again, they'll need to improve or learn programming and then learn Big Data.
Keys to Getting the Job

• Clearly identifying which position you're going after
• Figuring out what skills you have now and what skills you'll eventually need
• Learning and applying the skills you need to get a job
• Continuing to learn and improve those skills

How I See People Fail

• Thinking that programming is just writing out equations in code
• Thinking that the technical skills don't matter
• Failing to learn to code well before moving on to more complicated problems
• Not taking an honest look at your abilities and skills when self-evaluating

How I See People Succeed

• Taking enough time to get to an intermediate-level programming skill
• Finding the best learning resources to guide you
• Being honest about your skill level during an interview
• Having a personal project that showcases your skills and how your different point of view will complement a data science or data engineering team

CHAPTER 6 How Can You Do It?

General Learning

There's a common misconception that because the Big Data frameworks are open source, everything else is free and open source too. When it comes to learning, the best resources aren't free. I find that people who try to go the free route eventually stop learning because they don't progress and just give up. Your choice of learning materials directly affects the amount of time and wasted effort you'll spend. There are no fact checkers on the internet. I find that many of the "Learn Fast" or "Learn Cheap" sites are only cheap and light on the learning. They waste your time with materials that are either completely wrong or really bad.

The goal of this guide is to get you a job doing Big Data. Learning about Big Data for pleasure is one thing. Learning enough to get a job on a data engineering team, or on a team using Big Data, is another level altogether. Learning Big Data to get a job means that you can't just learn the basics.
You will need to find a learning resource that teaches both the foundational concepts and the advanced materials. Companies don't hire people with an introductory knowledge; they want people with advanced skills, but not necessarily advanced experience.

Will These Materials Get You a Job?

If there are testimonials or comments for the materials, read through them. Do they say "great video" or "thanks!"? Those are materials that will waste your time. If the testimonials say "I got a job," those are the materials that will get you a job too.

If a learning resource doesn't go deep enough for you to get a job, you're wasting your time. I see this scenario too often. Someone is trying to go the free or cheap route to learning and getting a job. The problem is that they spend all of their time trying to find and make sense of poorly written and poorly instructed videos. If they had just spent the money up front for world-class materials, they would have been much better off.

Let's go through an example. Say you could make $20,000 per year more with Big Data skills. If you go the free route and it takes 9 months, you will only make an extra $5,000 that year and lose out on $15,000. Those free materials will likely only give you a beginner-level knowledge, and your job search will be very difficult to impossible. You run a real risk of failing to get a job. If you go a premium route and it takes 2 months, you will make an extra $16,600 and lose out on only $3,400. Say those premium materials cost $5,000; they give you an advanced-level knowledge, making your job search much easier. After paying for the class, you will have made an extra $11,600 and saved 7 months of your time.

How Much Is Your Time Worth?

One of the biggest mistakes I made during my career was not valuing my time correctly. I would spend hours and hours toiling with free resources because they were free and I was too cheap to pay for them.
I would have saved so much time and earned significantly more had I used better resources.

Let's talk about the most common ways of learning Big Data.

Books and Blogs

Books are one of the most common ways to learn Big Data frameworks. This is how I personally learned some of them. I had a considerable background in distributed systems and network programming that helped me; I wasn't learning the concepts from scratch, and I was using the books as more of a reference guide than anything. I think that's where most books on Big Data have a problem. Few books really teach the concepts. Most are there to serve as a reference guide and to show advanced pieces that aren't used as often. The people who can pick up a book and learn Big Data are few and far between.

Blogs cover a wide range of topics. I find that blogs are best for a single concept or a deep dive into a specific use case. That's how I write my blogs: they dive deeply into something specific. In order for you to make any sense of the blog, you'll need a deeper foundation. Without this foundation, these blogs won't make sense. After getting a solid foundation, these blogs serve as a great tool to learn about new features and use cases.

Books and Blogs I Like

Once you have your fundamentals, here are some blogs that I like:

• http://www.jesse-anderson.com/
• https://www.confluent.io/blog/
• https://blog.cloudera.com/
• https://hortonworks.com/blog/

From O'Reilly, I like the entire Definitive Guide series. These aren't great books for learning the fundamentals, but they're great references for going deeper into a specific topic. The authors are top notch too. For learning languages, I like the Pragmatic Programmers' books. They're to the point and have a great voice.

Twitter is another great way to keep up on what's happening. Most of the discussion about what's happening is on Twitter. I generally digest all of this on my @jessetanderson account.
Classes

There are a plethora of virtual and in-person classes. The quality of these classes varies dramatically. A class can be broken down into two parts: the course materials and the delivery/instruction.

I've seen some course materials that were just copy-and-pastes of the Apache documentation (which is terrible). I've seen course materials that were woefully out of date. Still other materials will never compile or are so simple you'll be left with a Hello World-level knowledge.

The instructor has a great deal to do with the quality of a class. The instructor should specialize in Big Data; otherwise, they won't really know what's happening and what's coming. The instructor should be able to answer your questions, whether they are about code or concepts.

An average person won't be able to select the best training. Just know that there's a big difference in quality between a $2,000 per person class and a $3,000 per person class. The corner cutting extends to both the course materials and the instructor.

Virtual Classes

I only specialize in Big Data courses. You can see my courses at http://tiny.bdi.io/courses.

Boot Camps and Intensives

I've taught boot camps and interacted with people who've come out of boot camps. The major issue is being jobless for a while. I taught a boot camp that was 2 months long. My students came out of it with jobs as Data Engineers and Data Scientists. That's not to say these people didn't take a big risk quitting their jobs and having to find new ones.

In Their Own Words

I waffled for a while because I realized I would have to quit my job and look for another after the immersive. In the past, that search took 6 months. I finally took time to talk about this with some friends and decided to take the class. The journey was interesting. After the class, I kept studying. — Stephan W.

The upside to these intensives is the level of interaction you get. As opposed to short-term classes, you will have months of time to interact with an instructor. The downside to these intensives is the cost. It's expensive to have that much access to an instructor. These courses start at $20,000 or so.

Online Courses

Online courses vary greatly in quality. Before you take the plunge, you'll want to be very certain that you've chosen the right one for you. There are all sorts of companies seeking to cash in on the Big Data market. That means they'll put anything out there.

Look at the testimonials and comments for the product. Do the people say that they got a job, or that they liked the course? When someone says they liked the course, that usually means they learned passively and didn't get a job. If they say they got a job, they actually learned and applied the materials to accomplish their goals. There is a big difference in price and quality between a passive learning course and a course that teaches you how to do a new job.

Shameless Plug

I have an entirely online course called Professional Data Engineering. It is about 8 weeks long and guides you through the technologies you need to know. To make sure people are completely happy with it, I give a 60-day unconditional money-back guarantee. You can purchase it at http://tiny.jesse-anderson.com/pdesales.

MOOC

Massive Open Online Courses (MOOCs) are massive online classes. They can have thousands of students. These courses vary even more in quality. Since they attract so many students, they can offer super cheap prices. These classes focus on introductory materials. Any interaction with the instructor is limited or nonexistent. This leads many students to get stuck and never progress in their learning. Since they're so cheap, there's also an "I'll do it later" trap that people fall into.

YouTube/Free

This is a route I see many people taking, and I don't understand why.
I have a YouTube channel at http://tiny.jesse-anderson.com/youtube where I show videos of Big Data concepts. My channel gives you some concepts in 5-10 minute videos. I've gone through many of the other Big Data videos on YouTube. They offer some coverage of basic concepts. Most of these videos are a total waste of time because they're either technically incorrect or too high level. If you're starting out with Big Data, learning basic concepts won't get you a job. You're going to need more substantial learning materials.

No Really, Does It Work?

One of my goals for this guide is to keep you from wasting your time and getting discouraged. I think a YouTube-centric route is exactly that. In preparation for writing this section, I went through the people who'd responded to surveys or emailed me directly saying that they're using YouTube as their primary learning resource. I compared their titles on LinkedIn to see if they'd actually switched to a Big Data role. None of these people accomplished their goal.

This isn't to say YouTube is a bad resource. I have some advanced videos on my channel, but without a solid foundation, you're not going to understand what I'm talking about. A YouTube-centric approach doesn't give you a solid enough foundation to keep building.

Learning to Program

In Chapter 5, "Switching Careers," I talked about the need for some titles and positions to either learn to program or advance their programming skills. To be honest, Big Data won't challenge your syntax and knowledge of the language. In Java, for example, you're not going to be using the synchronized keyword, but you might be using the transient keyword. You're not going to be making extensive use of design patterns or doing heavy lifting with a repository pattern. You will be making extensive use of system architectural patterns.

For Data Engineers, you should have intermediate to advanced programming skills.
You should know at least one compiled language like Java or Scala and one dynamic language like Python. For other positions that need programming skills, you should have beginner to intermediate skills.

Which Technologies to Learn?

You've read that people interacting with data pipelines need to know 10-30 different technologies. Here is just a small example of what you should know:

• Apache Hadoop
• Apache Spark
• Apache Hive
• Apache HBase
• Apache Impala
• Apache Kafka
• Apache Crunch
• Hue
• Apache Oozie
• Apache NiFi
• Apache Flink
• Apache Apex
• Apache Storm
• Heron
• Apache Beam
• Apache Cassandra

Just Apache?

You'll notice that this list is made up mostly of Apache projects. That's no accident. In my dealings with organizations, they're either using or moving to Apache projects. Even the ones that aren't Apache projects are Apache licensed. This isn't an exhaustive list. These are just some of the high points that you should know. I'm also not including the technologies that are still in use, but are dying a slow death. How do you figure out which ones you should know? How do you understand how all of these fit together? How do you make a data pipeline with all of these technologies? You're going to need a guide. This guide will have to be an expert.

Which One Is Better?

Which is better, Apache Hadoop or Apache Spark? Which is better, Apache HBase or Apache Hive? These questions come from a small data mindset that each technology is interchangeable and does about the same thing. There are several direct answers to these questions:

• The choice of technology is entirely dependent on the use case.
• Some of these technologies are complementary and don't do the same thing. You may need to use both of them in the same data pipeline.
• You should learn all of them to choose the right tool for the job.
Over Analyzing

A common issue when starting out is to spend all of your time trying to figure out what the next big thing will be. Which one is the single technology that's going to get you a job? You get too focused on the minutiae and never make any progress. The answer is that you'll need to know a little to a lot about every one of these technologies. For Data Engineers, it will be a lot. For other teams, it will be a little, with a focus on certain technologies.

Can You Do This At Your Current Organization?

Your organization may have Big Data needs. It's critical that you establish that your organization really has Big Data needs. I've seen too many people start using Big Data technologies just to get that on their resume. That's not a good idea. Once you've established the need, you may be able to convince your boss of the need for you to become a Data Engineer. You will still need to make sure you get the skills, but you won't have to get a new job.

Can You Do This Without a Degree?

You don't need a degree in Computer Science or a Master's degree to start using Big Data. I've taught people with all different education levels. They've ranged from university professors to the completely self-taught. If you show the skills with a personal project, you can do it.

My Education

This may surprise you, but I'm completely self-taught. I've never attended a single college class. The only time I've spent at universities is to guest lecture on Big Data.

Can You Do This in Your Country?

I've taught and worked with people from many different countries. The majority of countries are experiencing Big Data problems. If you're worried that your country doesn't have Big Data jobs, take some time to look at job postings and ask around. Chances are, you'll be surprised to find that someone or an entire company is doing it.
Countries Accessing My Blog

To help you understand which countries you'll want to double-check, I took a look at the countries that have accessed my blog over the past 3 months. This should give you an idea of where interest in Big Data lies, since my blog is entirely about Big Data. In North and South America, only Paraguay, Guyana, French Guiana, and Suriname don't have hits. In Europe, Asia, and Australia, these countries don't have hits: Papua New Guinea, Mongolia, Tajikistan, Kyrgyzstan, Turkmenistan, Yemen, and Greenland. In Africa, about half the countries have hits.

How Diverse Is Big Data?

I find Big Data to be more diverse than other specialties in technology. I really do want to see that improve. To make that happen, I have a scholarship that is open to African-Americans, Hispanic-Americans, and Native Americans. You must live in the United States. If you meet these qualifications, go to http://tiny.jesse-anderson.com/diversity to apply.

CHAPTER 7 How Do You Get a Job?

Your goal should be to get a job on a data engineering team or a team that's using Big Data. Every chapter, section, and tip in this book is leading you up to this goal. If you aren't making specific and appreciable progress, you are not going to meet your goal. Maybe you have the goal of getting into Big Data, but aren't willing to put the time, money, and effort into actually doing it. You're the person who's sending out resumes to Big Data companies and never getting a call back. Or you're getting the interviews, but failing miserably. What's happening? You have to come in with a full stack. The first person looking at your resume will verify that you have the right frameworks. For example, if you just learn Apache Spark, the person looking at your resume is going to look at that and say, "Where's the rest of the stack?" On the other hand, my students have put the work in beforehand.
They've learned the technologies that are actually used at companies and in production. They get the jobs.

No Experience Necessary?

Other students try and fail during interviews and chalk it up to a lack of experience. How does someone who's brand new get experience? You have a chicken-and-egg problem. The real root of the problem is that you have to show skill, not necessarily experience. You can show skill without experience with an awesome personal project.

Experience, Skill, and Personal Projects

I verify this assertion when I teach at a company. I ask the hiring manager, "If someone has no experience, but they can show skill with a personal project, will you hire them?" The resounding answer is yes, they will. The key is that you have to show skill if you lack the experience.

Where Do You Fit in the Data Engineering Ecosystem?

Not every person and position will be a fit on the data engineering team. Some positions will be on a team that is consuming data from the data engineering team. You will need to take a very honest look at where you fit. If you're looking at becoming a Data Engineer, the technical bar will be much higher for you than for a person on a different team. Either way, you will need the Big Data skills to use the data. Once you've figured this out, you can closely target the position. This involves creating a plan to acquire the skills and technologies to get this position.

Personal Project

How do you show skill without experience? You do it with a personal project. You saw several references to personal projects in Chapter 5 "Switching Careers." A personal project is a project where you show your skills by creating something with the relevant Big Data technologies. A great personal project takes away all of the "can you do it?" questions because you did it all yourself. This personal project should have the full source code on GitHub (the interviewers may or may not look it over).
Another key is to demonstrate this personal project. You should be able to open your laptop and show it fully running. This could be running on a VM on your laptop or perhaps on a cluster in the cloud. If you do go the cloud route, make sure you have internet access during the interview. Some people try to do a Hello World as their personal project. That doesn't show any skill and is worse than nothing. It actually shows that maybe you don't really have the skills and can only accomplish Hello World. If you are applying for a senior position, your personal project should show the skills and code of a senior engineer. It should show some creativity and mastery of the technology.

Datasets

One of the more difficult parts of a personal project is finding a dataset that interests you. I have a list of unique and interesting datasets on my blog at http://tiny.jesse-anderson.com/datasources.

What Have Personal Projects Done For Me?

Personal projects have been a game changer for me. As a direct result of awesome personal projects, I've gotten job offers and jobs. These aren't just job offers from unknown companies, but from well-known companies like Google. I've also received extensive notoriety. I've been in the Wall Street Journal, CNN, BBC, NPR, and virtually every major tech media outlet.

Why Didn't Your Personal Project Work?

More than likely, your project wasn't compelling and didn't show any technical prowess. I've followed up with people who've looked through others' personal projects. The common thread was that the projects were too simple, boring, or didn't show any creativity. They looked like they were thrown together at the last minute instead of showing the meticulous planning and execution someone would really put in if they wanted a job.

Should You Get Certified?

People often ask me if they should get certified. I've talked to many hiring managers about this. Managers say they generally distrust certifications.
They all know that certifications are just a multiple-choice test and don't show any true mastery. They also wonder if the person took the test themselves or cheated in some fashion to pass it. There is a trend towards certifications that require coding or doing something specific. These are generally considered more prestigious, but chances are the hiring manager won't know anything about them. Sometimes a certification will get you in the door for an interview. If companies are involved in government contracting, they'll often favor certified people.

Are They Worth It?

In my experience and conversations, most certifications just don't make enough of a difference to be worth it. You'd be better off putting your time and money into a better personal project that shows true mastery.

Networking

People forget the personal touch. If you want to be sure that your resume isn't deleted by some person in HR who doesn't know what's going on, then you need to meet the hiring manager in person. Chances are, the manager or a member of their team attends the local meetups. Attending these meetups gives you the opportunity to meet and interact with these people. They'll see that you have the technical skills and a genuine interest in working at their company. This is something that is difficult to convey in an email or phone call.

Referral Bonus

Companies often give a referral bonus to employees who refer a candidate that gets hired. Be sure to ask the people you meet if they'll pass on your resume. If you made a good impression, they will pass it on for the referral bonus.

Not So Fast!

At one meetup, the presenter didn't show. The organizers said, "no talk, but stay and enjoy the food." Being the kind of guy I am ... I stood up and said - Not so fast! Please, if anyone is hiring or wants to be hired ... meet over here. I got 3 pings and I followed up on one at [company name]. — Stephan W.

Conferences are another great place to meet people.
They'll often have job boards filled with positions. Better yet, the hiring managers are often attending. Take the time to talk to them and understand what their company is doing. This is one of the best ways to get a job.

Getting a Job Fast(er)

I've talked about the need to really learn and plan ahead of time so that you can pace yourself. Sometimes, you just get blindsided by a layoff. You'll need to dust yourself off and decide if this is a good time to switch careers over to Big Data. Here is the advice I've given to others in that same situation. Now that you're not working, you need to spend that time learning and preparing for job seeking. Here's what I'd do:

• Update your resume to focus on the impact your work had on the organization. Your resume shouldn't say things like "Analyzed data with SQL." It should say "Found a way to reduce costs by 20% by analyzing our logistics data." There's a night and day difference between these two resumes.
• Spend as much time as possible going through your learning materials. Yes, you may have to buy expensive materials to make this jump.
• Practice interviewing with a friend. Have them ask you really hard questions, and not just programming questions. They should be questions like "Why do you want to work here?" and "Why did you leave your last position?" Practice the answers to these questions.
• Make a kick-ass personal project. Put the full code on GitHub. Make sure that it's easy to pull up with the VM on your laptop so you can show it actually running. Another option is to sign up for Google Cloud or Amazon Web Services and have it running there. The main points are that you need to show your code and that it works.
• Network at local meetups and conferences. Practice beforehand how you're going to introduce yourself and the position you're looking for.

You'll notice the theme through all of these tips and strategies is to practice beforehand!
CHAPTER 8 What Are You Going to Do?

Questions to Answer Before Starting to Learn Big Data

These are questions you should think about before you start switching careers to Big Data:

• What is your specific goal for learning Big Data? Do you want to switch careers and get a new job? Do you just love to learn new things?
• Where will you be in relation to the data engineering team? Will you be on the data engineering team or using a pipeline created by the data engineering team?
• Honestly, how long will it take you to accomplish this goal of learning Big Data?
• Do you think you're giving yourself enough time?
• What source(s) are you going to use for your learning materials?
• Do these learning materials actually give you the Big Data skills you need? Did others say the learning materials helped them accomplish their goal?
• How are you going to stand out from the crowd?
• What do you think would make for an awesome personal project?
• What meetups and conferences happen in your area to network?

Your Checklist For Starting to Learn Big Data

These are specific items that you should have covered before you start switching careers to Big Data:

1. Have a clear and attainable goal for switching to Big Data.
Example: You are going to switch careers to Big Data and get a job as a Data Engineer.
2. Identified the skills you need to acquire.
Example: You need to improve your programming skills from an intermediate level to an advanced level.
3. Identified the technologies you need to learn.
Example: You are learning Apache Hadoop, Spark, Kafka, etc.
4. Purchased the materials that will teach you the skills and technologies.
Example: You have purchased Professional Data Engineering from http://tiny.jesse-anderson.com/pdesales
5. Created a realistic and maintainable schedule to learn.
Example: You are going to take 10-20 hours a week for the next 2 months to learn.
6.
Started thinking about your personal project.
Example: You're looking for interesting datasets to see how they could be augmented. You're looking for ways to wow a company with your Big Data mastery.
7. Started to network with others in the Big Data space.
Example: There is a local meetup that you are going to attend to get a feel for who is hiring and what they're looking for.

Parting Advice

By reading this book, you've already done more preparation than most people. My goal for this book was to help you decide if switching careers to Big Data is right for you, show you the steps to switch, and show you how to get a job once you're ready. In Chapter 2, I outlined the 4 things I've seen people succeed with:

• A desire to learn Big Data
• Some of the prerequisite skills
• The time to dedicate to learning
• An expert (or experts) to guide you through the experience

You can switch careers to Big Data if you have these 4 things. Don't waste your limited time on things that get you nowhere. Do make a plan and stick to it. Most of all, make an awesome personal project and blow your interviewers away. Always keep up to date, because this field is changing rapidly. Best of luck on your Big Data journey.

About the Author

Jesse Anderson is a Data Engineer, Creative Engineer, and Managing Director of Big Data Institute. He trains at companies ranging from startups to Fortune 100 companies on Big Data. This includes training on cutting-edge technologies and Big Data management techniques. He's mentored hundreds of companies on their Big Data journeys. He has taught thousands of students the skills to become Data Engineers. He is widely regarded as an expert in the field and for his novel teaching practices. Jesse is published with O'Reilly and Pragmatic Programmers. He has been covered in prestigious publications such as The Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired.
CHAPTER A Appendix: Are Your Programming Skills Ready for Big Data?

This appendix is just for those of you who will be programming. This is especially relevant for Data Engineers and Data Scientists. You will need to make sure that you're ready to program. The question arises: how good do a person's programming skills need to be? This is because programming skills are on a wide spectrum. There are people who:

• Are brand new to programming
• Have never programmed before and will have to learn how to program
• Program in a language other than Java/Scala/Python
• Have been programming in Java for many years

Another dimension is your role on the team. For example, a Data Engineer will need far better programming skills than a Data Analyst or a Data Scientist. Usually, people with a solid Java and Scala background will have their programming skills ready. People with a solid non-Java background will need to learn enough Java to get by. As you'll see in the next sections, the Java code is not the most complex syntactically. The programming side does give those who are new to programming, or have never programmed, the most difficulty.

Example Code

To give you an idea of what some Big Data code looks like, here is an example Mapper class from my Uno Example.
public class CardMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static Pattern inputPattern = Pattern.compile("(.*) (\\d*)");

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String inputLine = value.toString();
    Matcher inputMatch = inputPattern.matcher(inputLine);

    // Use regex to throw out Jacks, Queens, Kings, Aces and Jokers
    if (inputMatch.matches()) {
      // Normalize inconsistent case for card suits
      String cardSuit = inputMatch.group(1).toLowerCase();
      int cardValue = Integer.parseInt(inputMatch.group(2));

      context.write(new Text(cardSuit), new IntWritable(cardValue));
    }
  }
}

You'll notice a few things about this code. The class is relatively small and self-contained. This is true for most Big Data code, even production code. This is because the framework, in this case Hadoop MapReduce, is doing all sorts of things behind the scenes for us. We also see that we're using regular expressions to parse incoming data. It's very common to use regular expressions in processing data. They're so necessary that I cover them in my Professional Data Engineering course. You'll also notice this class isn't doing anything exotic. The code and syntax itself isn't going to stress most people's knowledge of Java. The closest thing to exotic is the occasional use of the transient keyword. In this sense, an intermediate knowledge of syntax is enough for Big Data. As you just saw, the programming side is necessary, but not extremely difficult. You will need to know how to program. I've seen people come from other languages without significant difficulties.

What Is Difficult Then?

There are two main difficulties on the programming side. They are understanding the framework and the algorithms you need to write when creating a distributed system.

The Framework

Looking back at the code above:

• How does the map function get data?
• Where does the key's data come from?
• Where does the value's data come from?
• What happens when you do a write?
• What should you use for your output key and value?

Some of these questions are answered by knowing and understanding what the framework is doing for you. In this example code, Hadoop MapReduce is doing several things for you. What should come in and out of the map function is dependent on what you're trying to do. At its core, you need to have a deep understanding of the framework before you can use it or code for it. This lack of realization of where the difficulty lies is a common issue for people starting out with Big Data. They think they can use their existing small data background and not make a concerted effort to learn Big Data technologies. They're dead wrong in this thinking, and it causes people to fail in switching careers to Big Data. I talk more about these necessary realizations and what to do in my book The Ultimate Guide to Switching Careers to Big Data.

The Algorithms

With Big Data, you're doing things in parallel and across many different computers. This causes you to change the way you process and work with data. As you saw in the code above, you will need to decide what should come in and out of your map function. But how do you do this in a distributed system? A simple example of the difference can be shown with calculating an average. Let's say we want to calculate the average of these numbers:

88 91 38 3 98 79 3 31 23 61

On a single computer, that's easy. You iterate through all 10 values, and the answer is 51.5. Now let's distribute the data out to 3 computers.

Computer 1: 88 91 38
Computer 2: 3 98 79
Computer 3: 3 31 23 61

Now, we run the averages on all 3 computers.

Computer 1: 72.3
Computer 2: 60
Computer 3: 29.5

But we don't have the average of the dataset. We average out the results from all three computers to get 53.94. Now we're off by 2.44. Why?
Because an average of averages isn't correct. In order to distribute out data and run an algorithm in parallel, we need to change the way we'd calculate the average.

Are Your Programming Skills Ready?

The answer comes down to your current programming skill in Java and what your position on the team will be. If you looked over the code and readily understood it, you're probably ready. If you struggled to understand the code, you need to spend some time on your programming skills before embarking on your Big Data career. Remember that programming is just one piece of the puzzle. You will need to learn and understand the Big Data frameworks. You'll also need to understand how algorithms are done at scale. For this, you'll need materials and help to learn.
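To make the averaging pitfall concrete, here is a minimal sketch in plain Java. It is not from the Uno Example and uses no Big Data framework; the nested arrays simply stand in for the three computers, and the class and method names are my own invention for illustration. Each "computer" returns a partial sum and count instead of a local average; combining the partials gives the exact answer, while averaging the averages does not.

```java
public class DistributedAverage {
    // Each computer computes a partial (sum, count) pair instead of a local average.
    public static long[] partial(int[] values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new long[] { sum, values.length };
    }

    public static void main(String[] args) {
        int[][] computers = {
            { 88, 91, 38 },    // Computer 1
            { 3, 98, 79 },     // Computer 2
            { 3, 31, 23, 61 }  // Computer 3
        };

        double averageOfAverages = 0;
        long totalSum = 0;
        long totalCount = 0;
        for (int[] shard : computers) {
            long[] p = partial(shard);
            averageOfAverages += (double) p[0] / p[1]; // local average per computer
            totalSum += p[0];
            totalCount += p[1];
        }
        averageOfAverages /= computers.length;

        // Averaging the per-computer averages gives about 53.94 -- wrong.
        System.out.println("Average of averages: " + averageOfAverages);
        // Combining partial sums and counts gives the true average, 51.5.
        System.out.println("True average: " + (double) totalSum / totalCount);
    }
}
```

Running it reproduces the numbers worked out above: an average-of-averages near 53.94 versus the true average of 51.5. This sum-and-count pattern is exactly the shape of change the appendix is describing: restructure the algorithm so that partial results from each machine can be combined without losing information.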