How to build smarter data integration in a multicloud world

used with permission from IBM Big Data & Analytics Hub
by Bharath Chari

Let’s say you’re the Chief Technology Officer of a bank or retailer that is struggling to infuse AI into its efforts to improve customer experiences. You likely face three main challenges:

Data sprawl: Your customer data is currently spread across multiple environments, including on-premises systems and a cloud data lake storage repository. The data often exists in different formats in Db2, Oracle and other databases, and those databases are being used by transactional systems in different departments. Data latency across clouds and systems can be a serious concern in a multicloud environment, especially when dealing with large data volumes.
It takes too long to build and update applications: Your team has tools in place to help with CI/CD – continuous integration and continuous deployment, a method to more rapidly deliver apps to customers by introducing automation into the stages of app development for containerized applications. But the team is struggling with managing the complexity of containers across your entire IT infrastructure.
Lack of quality and governance: The data in the data lake is not governed, and it’s managed by field sales reps who aren’t always vetting it first to determine whether it pertains to existing customers. This lack of quality control has turned your data lake into more of a “data swamp.” Populating the lake is straightforward, but most of the data is either unusable or needs to be cleansed or modified before it can deliver value.

Adding to the complexity, there are a number of ways to solve these challenges. There’s a plethora of products and open source tools available, including Kafka, Docker and Red Hat OpenShift. Other products can perform data virtualization, data transformation, data replication, data quality or governance. With so many options, it can be easy to fall into “paralysis by analysis” and postpone the work you know you need to do.

If we take a step back from this scenario and look at overall market trends, these are actually common challenges that many businesses face. According to Forrester, 62 percent of public cloud adopters work with at least two cloud providers.

Gartner predicts that 70 percent of enterprise workloads currently access open source databases on cloud environments. As a result of this data sprawl, it can be a big challenge to put together a business-ready data foundation for AI initiatives to speed up the time-to-value for your business.

So, facing this data complexity and these common challenges, where do you go from here?

What does a platform or solution for your actual business look like?

Let’s start by looking at it systematically. Your platform would need to support the following key capabilities:

Multicloud data integration: You could deploy integration components through containers on any cloud environment to reduce latency due to large data volumes. Your ETL (extract, transform and load) tool needs strong performance characteristics and a variety of prebuilt functions and connectors to address the challenge of data sprawl and data in different formats. ETL is the backbone for transforming your data and integrating it across multiple clouds. Ideally, you would write a job once and run it in every environment, rather than rewriting it for each one. Your ETL tool would also have machine learning-based capabilities that assist users, even non-technical ones, in building flows and stages within a job.
Diverse data integration styles and real-time access to data for analytics: Another component to addressing the challenge with the data sprawl is a data integration tool with built-in data replication. This would allow you to provide real-time access to data at scale for transactional data in your marketing and sales systems with data from your data lakes and databases. Then you can share this transactional data for analytics purposes. In addition to supporting traditional ETL, which supports batch processing, your tool would need to support more complex data delivery styles such as streaming data integration.
Reduce time to build and update applications: To address the challenge of managing the number of containerized applications across different operating systems, you need a robust open source tool such as Red Hat OpenShift. This platform helps you scale and provision containers to support key IT initiatives such as microservices and cloud migration strategies. Specifically, your ETL tool needs to support a CI/CD pipeline by integrating with source control tools such as GitHub, so that jobs can be published frequently and released to production.
In-line data quality and governance: Your data integration platform should include capabilities which enable you to resolve data quality issues according to the data policies and rules in place when the data is being delivered to a target environment such as a data lake. Ideally, such a platform should also allow for active metadata and policy enforcements to prevent giving unauthorized users access to your sensitive data.
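
To make the idea of in-line data quality more concrete, here is a minimal Python sketch of a job that applies quality rules while records are on their way to the target. It is an illustration under stated assumptions, not how any particular ETL product works: the names QualityRule, LoadResult, build_rules and run_job, the field names, and the three rules themselves are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass(frozen=True)
class QualityRule:
    """Hypothetical rule: a name plus a predicate that is True when a record passes."""
    name: str
    check: Callable[[dict], bool]


@dataclass
class LoadResult:
    loaded: list = field(default_factory=list)       # records delivered to the target
    quarantined: list = field(default_factory=list)  # (record, failed rule names) pairs


def transform(record: dict) -> dict:
    """Normalize fields that arrive in different formats from different sources."""
    return {
        "customer_id": str(record.get("customer_id", "")).strip(),
        "email": str(record.get("email", "")).strip().lower(),
        "source": record.get("source", "unknown"),
    }


def build_rules(known_customer_ids: set) -> list:
    """Assumed example policies; a real platform would load these from governance metadata."""
    return [
        QualityRule("has_customer_id", lambda r: bool(r["customer_id"])),
        QualityRule("valid_email", lambda r: "@" in r["email"]),
        QualityRule("existing_customer", lambda r: r["customer_id"] in known_customer_ids),
    ]


def run_job(records: Iterable[dict], rules: list) -> LoadResult:
    """Transform each record, then apply every rule before it reaches the target."""
    result = LoadResult()
    for raw in records:
        rec = transform(raw)
        failed = [rule.name for rule in rules if not rule.check(rec)]
        if failed:
            # Keep bad records out of the target so the lake does not become a swamp.
            result.quarantined.append((rec, failed))
        else:
            result.loaded.append(rec)
    return result


if __name__ == "__main__":
    sample = [
        {"customer_id": " 1001 ", "email": "Ana@Example.com", "source": "db2"},
        {"customer_id": "9999", "email": "lead-without-email", "source": "field_sales"},
    ]
    outcome = run_job(sample, build_rules({"1001"}))
    print(len(outcome.loaded), "loaded;", len(outcome.quarantined), "quarantined")
```

The design choice worth noting is that records failing a rule are quarantined along with the names of the rules they failed, rather than silently dropped, so someone can review and cleanse them later.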

Examples of businesses that have successfully delivered benefits from a robust data platform include:

A retailer that reduced the time it took to build a customer affinity analysis from 20 days to less than 24 hours, helping their decision-makers to supercharge sales and marketing strategies.
A bank that increased data processing speed by 30 percent, helping it make decisions more efficiently and reduce its operating costs.

Are you ready to deliver a winning data integration solution to your business?
