Open-Core And Long-Term Alpha (Pt.3)


  • As the open-core business model is proving successful, we are seeing next-gen startups emerge at rapid rates and public open-core names experience a slower growth decay.
  • In Part 2, we looked at major names that are public or have been taken private, such as MongoDB, Cloudera & Hortonworks, Elastic, and Confluent.
  • In Part 3, we'll cover emerging next-gen open-core startups within the data markets to put on your radar.


Data is one of the most hyped sectors in enterprise tech, with a large number of startups receiving substantial capital and commanding high valuations. In this third part of the Open-Core series, we highlight some names to monitor, as they could eventually emerge as integral players within the Modern Data Stack (MDS).

The Emergence of a Data-Driven World

As we stride further into the digital age, the volume, velocity, and variety of the data we generate are growing exponentially. At the start of the 21st century, technology giants like Google grappled with the daunting task of managing, storing, and processing the colossal amounts of data produced by their burgeoning array of services. Traditional enterprise data solutions, offered by leading companies like Oracle, IBM, and EMC, failed to meet Google's specific requirements in terms of performance, scalability, and cost-effectiveness.

Below are some of the milestones that shaped the Big Data industry we know today.

Google's Ground-breaking Paradigm Shift

Faced with these challenges, Google ventured into uncharted territories, building innovative data systems to tackle the issues. The company introduced new architectural models that disrupted the existing paradigms:

  • Google File System (GFS, 2003): This system laid the foundation for distributed storage across inexpensive, commodity servers, popularizing concepts like sharding (splitting files into chunks spread across machines) and replication (keeping several copies of each chunk).
  • MapReduce (2004): With MapReduce, Google revolutionized the way massive datasets could be processed in a distributed and parallel manner, greatly simplifying large-scale computation.
  • Bigtable (2006): Building upon GFS, Bigtable presented a distributed storage system for structured data, designed to scale to petabytes across thousands of commodity servers.

These trailblazing systems not only addressed Google's data management issues but also formed the basis of the company's competitive edge in harnessing the potential of big data. Google's approach underscored the benefits of scaling data infrastructure across horizontally networked commodity hardware, shifting away from expensive, monolithic servers.
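
To make the sharding and replication ideas above concrete, here is a minimal sketch in Python. It is not GFS's actual placement logic: the function names, the round-robin placement rule, and the toy chunk size are all assumptions chosen for illustration. It only shows the general idea of splitting a file into fixed-size chunks and keeping several copies of each chunk on different commodity servers.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Sharding: cut a file into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, servers: list[str], replicas: int = 3) -> dict[int, list[str]]:
    """Replication: assign each chunk to `replicas` distinct servers.

    Round-robin placement is a hypothetical rule used only for this sketch.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        chunk_id: [servers[(chunk_id + r) % len(servers)] for r in range(replicas)]
        for chunk_id in range(num_chunks)
    }


def readable_after_failure(placement: dict[int, list[str]], failed: set[str]) -> bool:
    """A file stays readable if every chunk has at least one surviving replica."""
    return all(any(s not in failed for s in holders) for holders in placement.values())


chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
placement = place_replicas(len(chunks), servers=["s0", "s1", "s2", "s3"])
survives = readable_after_failure(placement, failed={"s1", "s2"})
```

In this toy setup the file remains readable even with two of the four servers down, which is the property that lets cheap, failure-prone hardware stand in for expensive monolithic servers.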

The Hadoop Era: Drawing Inspiration from Google

Google's game-changing techniques weren't confined within its walls. The publication of Google's papers inspired the development of open-source Hadoop, a platform that replicated parts of Google's architectural model. The Hadoop Distributed File System (HDFS) and Hadoop MapReduce offered open-source distributed storage and processing. HDFS leveraged clusters of commodity x86 servers for cost-effective storage, while MapReduce was the parallel processing framework: it distributed the computation across the cluster, processed each piece of data where it was stored, and then recombined the partial results into the final output.
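
The distribute, process, and recombine flow described above can be sketched in a few lines of plain Python. This is a single-process illustration of the MapReduce programming model, not Hadoop's API: the function names and the word-count task are assumptions chosen for the example, and in a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks: list[str]) -> dict[str, int]:
    # Each chunk is mapped independently, which is what makes the work parallelizable.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big", "data systems", "big systems"])
```

Because the map step touches each chunk in isolation and the reduce step touches each key in isolation, both can be spread over many servers; only the shuffle requires moving data between them.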

With the support of tech giant Yahoo and others, Hadoop started gaining traction in big data analytics, particularly for use cases that required scaling beyond traditional databases while remaining cost-effective.

However, as data volumes and the complexity of use cases increased, Hadoop's limitations became apparent. The platform's batch processing framework lacked the interactivity necessary for efficient analytics. MapReduce code was intricate, and data ingestion into HDFS was cumbersome. These issues gradually started to act as constraints on Hadoop's utility.

The Spark Revolution

Spark, created at UC Berkeley in 2009, addressed Hadoop's drawbacks. Hadoop's MapReduce wrote intermediate results to disk between processing steps, which added significant latency. This was not much of an issue for batch jobs that were not time sensitive. However, as growing data flows pushed enterprises toward more frequent processing, this disk-bound design became a limitation.

Spark addressed this by introducing in-memory processing. Keeping intermediate data in memory, rather than writing it to disk and reading it back, led to faster processing. Spark later added a DataFrame API for greater developer productivity.
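
The difference between disk-based and in-memory pipelines can be illustrated with a small Python sketch. It uses neither Hadoop nor Spark: the two runner functions and the three toy stages are hypothetical, and the sketch only counts how many times intermediate results make a round trip through disk.

```python
import json
import os
import tempfile


def run_disk_pipeline(data, stages):
    """Disk-based style: every stage reads its input from disk and writes
    its output back to disk. Returns (result, disk_round_trips)."""
    round_trips = 0
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage_0.json")
        with open(path, "w") as f:
            json.dump(data, f)
        for i, stage in enumerate(stages, start=1):
            with open(path) as f:
                current = json.load(f)
            path = os.path.join(tmp, f"stage_{i}.json")
            with open(path, "w") as f:
                json.dump(stage(current), f)
            round_trips += 1
        with open(path) as f:
            result = json.load(f)
    return result, round_trips


def run_memory_pipeline(data, stages):
    """In-memory style: intermediate results are passed directly between stages."""
    current = data
    for stage in stages:
        current = stage(current)
    return current, 0


stages = [
    lambda xs: [x * 2 for x in xs],           # transform
    lambda xs: [x for x in xs if x % 3 == 0],  # filter
    lambda xs: [sum(xs)],                      # aggregate
]
disk_result, disk_trips = run_disk_pipeline(list(range(10)), stages)
mem_result, mem_trips = run_memory_pipeline(list(range(10)), stages)
```

Both pipelines produce the same answer, but the disk-based version pays one read and one write per stage. On large datasets and multi-step jobs, that repeated I/O is the latency the in-memory approach avoids.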

Offering a unified engine for batch, interactive, and real-time workloads, Spark quickly ascended as the preferred successor to Hadoop's MapReduce for big data processing. It ran workloads 4-10 times faster and proved itself to be a game-changer in the big data landscape.

Spark became the engine behind Databricks, a company founded by Spark's creators that led the way in bringing big data analytics to the cloud. Spark also works closely with other parts of the Modern Data Stack: it integrates with streaming platforms like Kafka, can feed ML frameworks like TensorFlow, and underpins storage layers like Delta Lake, which Databricks originally built on Spark.

Today's Thriving Startup Landscape

In the current enterprise tech space, data-driven startups are garnering significant attention, backed by heavy investments and impressive valuations. The open-core and closed-source data startup sector is booming, meriting a dedicated report due to its scale and potential. In this constantly evolving field, numerous promising startups are emerging and pushing boundaries, enriching the ecosystem and driving further innovation in big data analytics.