Big Data Engineering: 4 Critical Considerations for Automating Your Processes
Jonathan Bentz July 31, 2018
As more organizations begin to comprehend the powerful insights and value of big data, adoption is steadily increasing -- evidenced by the growing revenues of big data platform vendors both on premise and in the cloud.
However, countless challenges come with leveraging big data. Moving a project out of the sandbox takes considerable work: only 15 percent of big data projects ever reach production, and many teams are looking wherever they can to improve their processes.
Whether it’s to help meet business goals or anticipate when systems will fail, big data analytics can be just the ticket to drive an organization forward in a meaningful way. A key benefit of big data is its ability to empower an organization by making it more agile in response to new data and its trends.
The problem with big data right now is that implementing big data solutions is inherently complex. Projects require a high level of expertise that is difficult to find and expensive to hire when you do find it.
In addition, these projects tend to suffer “death by 1,000 paper cuts”: many little issues have to be addressed before you can implement successfully. No single issue will kill you, but together they add up to a mountain of small challenges that is difficult to overcome.
As a result, when an organization is trying to deploy big data projects, it’s not uncommon for projects to take several months before ever reaching production, if they reach production at all. This doesn’t have to be the case if you can eliminate the many underlying factors that can extend the project timeline. But the only way to do that is to automate away the complexity and fundamentally simplify the project deployment.
Automation of data engineering can help reduce resources needed and time to implement big data projects.
Here are four considerations when trying to decide if big data automation tools are right for your organization.
1. Workflow complexity
Big data is complicated at every turn, more complicated than most people realize until they attempt to fully operationalize a big data project. It takes a talented team of data scientists and engineers to manage it appropriately.
For instance, simply loading data from existing data sources into Hadoop ends up being a non-trivial exercise that typically requires a lot of manual, hand-written code, and that coding effort eats up time and focus that could be spent on other work.
For example, Hadoop and other big data environments don’t handle change data capture (CDC), incremental merging, or syncing of data into the data lake. Hadoop was designed to load an entire table each time, which is inefficient for large tables. As a result, developers have to hand-write code to deal with CDC and slowly changing dimensions (SCD Type I and II).
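To make that hand-written effort concrete, here is a minimal Python sketch of applying a batch of change records to an existing snapshot instead of reloading the whole table. The record layout (an `op` field holding `insert`, `update`, or `delete`) and the `apply_changes` name are assumptions made for illustration; real CDC feeds and SCD Type II history tracking involve considerably more.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(target: Dict[Any, Row], changes: Iterable[Row], key: str = "id") -> Dict[Any, Row]:
    """Apply insert/update/delete change records to a snapshot keyed by primary key."""
    merged = dict(target)  # shallow copy so the original snapshot is left untouched
    for change in changes:
        # Drop the hypothetical "op" marker; everything else is row data.
        row = {k: v for k, v in change.items() if k != "op"}
        if change["op"] == "delete":
            merged.pop(row[key], None)
        else:
            # Inserts and updates are both treated as upserts.
            merged[row[key]] = row
    return merged


snapshot = {1: {"id": 1, "balance": 100}, 2: {"id": 2, "balance": 250}}
changes = [
    {"op": "update", "id": 1, "balance": 120},
    {"op": "delete", "id": 2},
    {"op": "insert", "id": 3, "balance": 75},
]
result = apply_changes(snapshot, changes)
```

Only three records are processed here, however large the snapshot is, which is the efficiency argument for incremental loads over full-table reloads.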
Other tasks, such as parallelizing the load of large amounts of data, aren’t handled automatically either. Tools like Sqoop let you ingest data, but one data pipe at a time. Sqoop does support running multiple pipes, but not without a lot of coding to determine the number of pipes and mappers.
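As a rough illustration of what that coding involves, the sketch below divides a numeric key range into contiguous chunks, one per mapper, so each mapper can pull its own slice in parallel. This is a simplified stand-in written for this article, not Sqoop's actual splitting logic, and `split_ranges` is a hypothetical helper name.

```python
from typing import List, Tuple


def split_ranges(lo: int, hi: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [lo, hi] into contiguous, near-equal chunks."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if hi < lo:
        raise ValueError("hi must be greater than or equal to lo")
    total = hi - lo + 1
    chunks = min(num_mappers, total)  # never create more chunks than keys
    base, extra = divmod(total, chunks)
    ranges = []
    start = lo
    for i in range(chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        ranges.append((start, start + size - 1))
        start += size
    return ranges


ranges = split_ranges(1, 10, 4)
```

Choosing the mapper count is the hard part: too few leaves the cluster idle, too many overloads the source database, which is why it tends to be tuned by trial and error when done by hand.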
In a real-life example that colleagues in the industry recently told me about, a data team at a financial services firm spent two data engineers’ time for a month coding the ingestion process to load many terabytes of data into its data lake. They succeeded, but the load took 18 hours against a service level agreement of five hours, and the process did not handle a CDC scenario.
Using data automation software designed specifically for data ingestion, which used machine learning algorithms to automatically tune the number of mappers and configuration parameters, the team brought the load time down to four hours after just one hour of effort. In addition, the automated solution included change data capture, so subsequent data loads didn’t require loading the entire table.
This is just one example of how automation replaces manual efforts that would otherwise take months to code from scratch.
2. Automation of auditability
Many big data teams are aware of how critically important compliance and full audit trails are. If something fails or goes wrong in a manual process, providing an accurate trail of what happened is extremely difficult, if not impossible. Audit trails are especially important in highly regulated industries like banking or health care, and they are vital for the finance department of pretty much any organization.
Tracking lineage as data flows from one system to another requires organizations to trace the path that data, such as financial data, took. Without a full audit trail, there’s very little hope of identifying what failed, because hand-written code is hard to trace. Data lineage is easiest to follow when you can visualize it graphically. Hand-written code, especially code written by different developers, doesn’t lend itself to being easily traceable.
Automation and visual development tools significantly improve auditability because traceability is built into most modern tools. These solutions have a visual component that represents exactly how a piece of data moved and who wrote the data movement code along the way. Not only is it easy to determine the source of a particular piece of data, it is also easy to determine automatically who wrote or changed a data pipeline.
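The idea behind built-in traceability can be sketched in a few lines of Python: if every data movement is recorded as an edge with its pipeline and author, answering "where did this come from, and who touched it?" becomes a simple graph walk. The `LineageEdge` record, the `trace_upstream` function, and the dataset names are all hypothetical, chosen only to illustrate the concept rather than any particular tool's data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineageEdge:
    source: str    # dataset the data was read from
    target: str    # dataset the data was written to
    pipeline: str  # name of the pipeline that moved it
    author: str    # who wrote or last changed that pipeline


def trace_upstream(edges: List[LineageEdge], dataset: str) -> List[LineageEdge]:
    """Return every edge on a path leading into `dataset`, nearest hop first."""
    found = []
    frontier = [dataset]
    seen = {dataset}
    while frontier:
        current = frontier.pop(0)  # breadth-first walk back toward the sources
        for edge in edges:
            if edge.target == current:
                found.append(edge)
                if edge.source not in seen:
                    seen.add(edge.source)
                    frontier.append(edge.source)
    return found


edges = [
    LineageEdge("crm.accounts", "lake.accounts_raw", "ingest_accounts", "alice"),
    LineageEdge("lake.accounts_raw", "lake.accounts_clean", "clean_accounts", "bob"),
    LineageEdge("lake.accounts_clean", "reports.balances", "build_report", "carol"),
    LineageEdge("erp.orders", "lake.orders_raw", "ingest_orders", "dave"),
]
path = trace_upstream(edges, "reports.balances")
authors = [edge.author for edge in path]
```

With hand-written scripts, this metadata usually has to be reconstructed after the fact from code and logs; a tool that records it at write time makes the audit question a lookup.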
3. Productivity of data engineers
Managing big data requires the skill set and experience of not only data scientists but also data engineers. These two roles have many overlapping skills, and they depend on each other to make big data work.
Let’s say that a data scientist creates a breakthrough algorithm but has no team of data engineers to put it into production. As the saying goes, if a tree falls in the forest, does it make a sound? Similarly, if the world’s greatest algorithm can’t be deployed into production, does it have any value? If you can’t deploy the algorithm, it is relatively worthless.
It turns out that companies need two to five data engineers for every data scientist, because creating a production workflow that is repeatable, supportable, and highly available is a lot of work. Unfortunately, many organizations struggle to fill these positions.
Because data engineers are in high demand, automation can help by eliminating many of the more tedious aspects of their roles. In turn, data engineers become far more productive, focusing more on the logic of data pipelines and less on the mundane details that make those pipelines bullet-proof.
4. Future-proofing big data investment
Change is a constant in the realm of big data, and companies that have invested in it need to be flexible enough to deal with future evolution. The latest big data trends all started with Hadoop as the preferred big data environment, but now people are talking about using Spark. Plus, we are quickly progressing to discussing serverless environments.
In addition, many organizations are looking to cloud vendors for their big data needs. In many cases, the environments of these cloud solutions vary drastically from one to the next and often require code be rewritten to accommodate these environments.
Many cloud vendors attempt to lock in customers to their services by providing “value added” services that might grant additional worth, but also ensure a lack of portability. Automation can alleviate these issues by taking these differences into account and making the move from one environment to the next relatively painless.
As information technology (IT) and operations teams are constantly assessing efficiency and costs, it’s not uncommon for a company to change data storage and computing solutions several times over a few years. The result is that portability across big data environments is key if you want to be able to take advantage of the latest and greatest technologies, as well as have the ability to play your big data platform vendors against each other to get the best price.
Automation, once again, can enable portability and ensure you are future-proofing your big data investments.
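One common way to get that portability is to keep environment-specific details behind a narrow interface, so the pipeline definition itself never changes when the platform does. The Python sketch below shows the shape of the idea; the `StorageBackend` interface, both backend classes, the `lake/` path layout, and the bucket name are assumptions for illustration and do not describe any specific vendor or automation product.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class StorageBackend(ABC):
    """Hypothetical interface: the only environment-specific piece of the pipeline."""

    @abstractmethod
    def target_uri(self, dataset: str) -> str:
        """Return where this environment stores the given dataset."""


class HdfsBackend(StorageBackend):
    def target_uri(self, dataset: str) -> str:
        return f"hdfs:///lake/{dataset}"


class ObjectStoreBackend(StorageBackend):
    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def target_uri(self, dataset: str) -> str:
        return f"s3://{self.bucket}/lake/{dataset}"


def plan_loads(datasets: List[str], backend: StorageBackend) -> List[Tuple[str, str]]:
    """Pipeline logic written once; only the backend differs per environment."""
    return [(name, backend.target_uri(name)) for name in datasets]


datasets = ["accounts", "orders"]
on_prem = plan_loads(datasets, HdfsBackend())
in_cloud = plan_loads(datasets, ObjectStoreBackend("example-bucket"))
```

Moving platforms then means writing one new backend class rather than rewriting every pipeline, which is the kind of work an automation layer can take on for you.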
Big data is a figurative mountain of valuable insights. It’s up to each organization to determine how it can best turn those insights into impactful decisions. But managing big data is not easy, considering the large number of variables that can affect how long projects take to reach full production quality.
Automation in big data is by far the best approach to reducing the natural complexity of big data: it streamlines tedious processes, improves auditability, future-proofs your big data investment, and makes data engineers far more productive.
All images credit to Pixabay user mohamed_hassan