One of the biggest myths is that big data solutions are only for large companies, are only suitable for huge amounts of data, and cost a fortune. That is no longer true: the past few years have brought significant technological developments.
The first revolution concerns maturity and quality. It’s no secret that, ten years ago, big data technologies required real effort just to make them work, or to make all the pieces work together.
There have been countless stories of developers wasting 80 percent of their time fixing glitches with Spark, Hadoop, Kafka, etc. Nowadays, these technologies have become more reliable, have overcome teething problems, and have learned to work together.
Today, an infrastructure failure is far more likely than discovering an internal bug in these frameworks. Even infrastructure problems are rarely a big deal anymore, since most Big Data frameworks are designed to be fault-tolerant. In addition, these technologies provide stable, powerful, and simple abstractions over computations, allowing developers to focus on the business side of development.
The Diversity Of Big Data Technologies
The second revolution is happening right now: myriads of open source and proprietary technologies have been developed over the past few years – Apache Pinot, Delta Lake, Hudi, Presto, ClickHouse, Snowflake, Upsolver, and many more. The creative energy and ideas of thousands of developers made fantastic solutions possible.
Let’s talk about a typical analytical data platform (ADP). It consists of four main levels:
- Dashboards and Visualization – the “facade” of ADP that exposes analytical summaries to end-users.
- Data Processing – data pipelines that validate, enrich, and convert data from one format to another.
- Data Warehouse – a place to keep well-organized data – rollups, data marts, etc.
- Data Lake – where pure raw data resides, the base for the data warehouse.
Each level offers alternatives for every need and taste. Half of these technologies have been developed in the past five years.
What matters is that these technologies are developed to be compatible with each other. For example, a typical low-cost small ADP might consist of Apache Spark as the basis of the processing components, AWS S3 or similar as the data lake, ClickHouse as the warehouse and OLAP engine for low-latency queries, and Grafana for a nice dashboard.
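To make the processing layer of such a stack concrete, here is a minimal sketch of its core step: validating raw events from the data lake and aggregating them into a rollup for the warehouse. The field names (`user_id`, `amount`) are hypothetical, and a real setup would use Spark's DataFrame API over S3 rather than plain Python, but the shape of the work is the same.

```python
from collections import defaultdict

def validate(event):
    """The 'validate' step: keep only well-formed events."""
    return isinstance(event.get("user_id"), str) and \
        isinstance(event.get("amount"), (int, float))

def build_rollup(raw_events):
    """Aggregate raw data-lake events into a per-user rollup for the warehouse."""
    totals = defaultdict(float)
    for event in raw_events:
        if validate(event):
            totals[event["user_id"]] += event["amount"]
    return dict(totals)

# Raw events as they might land in the data lake.
events = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u1", "amount": 5.0},
    {"user_id": "u2", "amount": 7.5},
    {"user_id": None, "amount": 3.0},  # malformed: dropped by validation
]
rollup = build_rollup(events)
```

In the real stack, `build_rollup` becomes a Spark job and `rollup` lands in a ClickHouse table that the dashboards query.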
More complex ADPs can be assembled differently. For example, introducing Apache Hudi on S3 as the warehouse layer can support a much larger scale, while ClickHouse remains available for low-latency access to aggregated data.
The third revolution is related to the cloud. Cloud services have become a real game-changer: they offer Big Data as a ready-to-use platform (Big-Data-as-a-Service), allowing developers to focus on building features while the cloud takes care of the infrastructure.
AWS is just one example here. This ADP could also be built using any other cloud provider.
Developers can choose specific technologies and the degree of serverlessness. The less serverless the stack, the more the solution can be customized; on the other hand, going serverless creates a strong bond with the vendor. Solutions locked to a specific cloud provider and serverless stack can be quick to market, and a wise choice of serverless technologies can make the solution cost-effective.
However, this option isn’t always useful for startups, which tend to rely on typical $100,000 cloud credit programs and hop back and forth between AWS, GCP, and Azure. That constraint should be clarified upfront, and more cloud-agnostic technologies proposed instead.
Usually, engineers distinguish the following costs:
- development costs
- maintenance costs
- change costs
Let’s address them one by one.
Cloud technologies simplify the work. There are several areas where they have a positive impact.
On the one hand, there are architectural and design decisions. The serverless stack provides a variety of patterns and reusable components that provide a solid and consistent foundation for the solution’s architecture.
There is only one concern that could slow down the design phase: big data technologies are inherently distributed, so solutions must be designed with possible errors and failures in mind to ensure data availability and consistency. As a bonus, solutions designed this way require less effort to scale.
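Designing with failures in mind usually comes down to two habits: making writes idempotent and retrying transient errors. A minimal sketch (the warehouse stub, key name, and failure count are invented for illustration):

```python
import time

def with_retries(fn, attempts=3, delay=0.01):
    """Retry a flaky operation. Safe only because the wrapped write is idempotent."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay)

# A warehouse stub and a write that fails twice before succeeding,
# simulating transient infrastructure errors.
warehouse = {}
failures = {"left": 2}

def flaky_write():
    if failures["left"] > 0:
        failures["left"] -= 1
        raise ConnectionError("transient network error")
    # Idempotent upsert: re-running it yields the same final state.
    warehouse["daily_rollup/2024-01-01"] = {"revenue": 42.0}

with_retries(flaky_write)
```

Because the write is an upsert keyed by date, a retry after a partial failure cannot duplicate data, which is exactly the property fault-tolerant frameworks rely on.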
Integration and end-to-end testing come second. The serverless stack allows for isolated sandboxing, replay, testing, and troubleshooting, reducing development loops and time.
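One way replay-style testing works in practice is to run a recorded sample of production events through a pipeline step in isolation and assert on the output. A tiny sketch, with hypothetical events and an invented enrichment step:

```python
def enrich(event, country_by_user):
    """Enrichment step: attach a country code from a reference table."""
    return {**event, "country": country_by_user.get(event["user_id"], "unknown")}

# Replay a recorded sample of events through the step in a sandbox.
recorded = [{"user_id": "u1"}, {"user_id": "u9"}]
reference = {"u1": "DE"}
replayed = [enrich(e, reference) for e in recorded]
```

Because the step is a pure function, the same sample can be replayed locally, in CI, or in a cloud sandbox with identical results, which is what shortens the development loop.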
Another benefit is that the cloud enforces automation of the solution deployment process, which is a hallmark of any successful team.
One of the main problems cloud providers set out to solve was reducing the overhead of monitoring and maintaining production environments. They aimed to create an almost ideal abstraction that involves nearly no developers. The reality, however, is somewhat different: maintenance usually still requires some effort. The table below highlights the most important areas.
In addition, the overall bill also depends very much on the infrastructure and the license costs. The design phase is extremely important as it provides an opportunity to challenge specific technologies and estimate their run-time costs in advance.
Another important issue with big data technologies is the cost of change. Our experience shows no difference between big data and other technologies here. If the solution is not over-provisioned, the price of a change is entirely comparable to a non-big-data stack. However, big data brings one advantage: big data solutions are designed to be decoupled. A properly designed solution doesn’t look like a monolith, so local changes can be made quickly when needed, with little risk of disrupting production.
In summary, we think big data can be affordable. Big Data proposes new design patterns and approaches for developers to compose any analytical data platform that meets the most stringent business requirements while remaining cost-effective.
Big data-driven solutions could be an excellent foundation for fast-growing startups that want to be flexible, apply rapid changes, and have a short time to market. When companies need to process larger amounts of data, the big data solutions can grow along with the companies.
Big data technologies enable near-real-time analytics on a small or large scale, while classic solutions struggle with performance.
Cloud providers have taken big data to the next level, offering reliable, scalable, and out-of-the-box capabilities. It’s never been easier to get affordable ADPs. Grow your business with Big Data!