Continuing my tour of the Spark ecosystem today’s focus will be on Alluxio, a distributed storage system that integrates nicely with many compute engines – including Spark.
What is Alluxio ?
The official definition of Alluxio is (or at least that’s how one of its author presents it):
Alluxio is an open source memory speed virtual distributed storage
Let’s see what each of these terms actually means:
- open source: Alluxio is open sourced under the Apache license 2.0. It has attracted one of the fastest growing community with thousands of contributors on github. It is also used by many companies (including the big ones) all over the world.
- memory speed: Like Spark, Alluxio leverages the fact that memory IO is much faster than disk IO and improves data access time which is quite important in a big data environment.
- virtual: Alluxio provides a virtual naming making it easy to access any data wherever it might be stored. Much like we can easily access any file from a file system without knowing on which disk (or host) it resides – all we need to know is the path to that file.
- Distributed storage: Alluxio is distributed meaning it scales horizontally and run on commodity hardware.
These all look like promising! But really what is it?
Well if you take a quick look at the big data world today things can be split into 2 categories:
- The compute engine: Spark, Hadoop, Flink, … These are the big data frameworks that allows to run some computation over the data
- The physical storage: All this data must be stored somewhere most likely using some distributed filesystems: HDFS, S3, GlusterFS, Parquet, …
And each compute engine should be able to access any of these file systems.
And this is exactly where Alluxio comes into place, right in the middle, between these 2 categories.
Alluxio architecture is follows a master/workers scheme. It requires a master node (which can be fault tolerant – using zookeeper?).
The master node is in charge of maintaining the unified namespace (i.e. the file system tree) and keeping track of the available workers.
The workers manage local resources (RAM, SSD, disk), stores chunk of data and fetch data from the underlying physical storage. They also respond to client request and report heartbeat to the master node.
With this architecture it’s easy to increase the storage capacity by adding more workers to the system.
This architecture can be also located in the same cluster as the application (e.g. Spark) so that you benefit from data locality.
The most obvious benefit is that is decouples the applications from the physical storage. The application just needs to integrate with Alluxio and they can automatically support any of the physical storages supported by Alluxio.
The integration is pretty simple as Alluxio provides different interfaces to integrate with including HDFS, key/value and file system interfaces. It means that as soon as your framework can “talk” to one of these interfaces it can integrate with Alluxio.
E.g if your framework works only on top of HDFS it can integrate with Alluxio and support S3, GlusterFS, GCS (Google Cloud Storage), …. for free.
As Alluxio sits between the application and the physical storage it can save the data into its own memory as it fetches it from the real storages.
As a result you gain speed: it ‘s much faster to read data from Alluxio’s memory than from S3 going over the internet.
Alluxio provides some additional tiered storage for SSD and disks in addition to memory.
The data stored into the tiered storage is manage by Alluxio and is not directly accessible. (in a similar way that HDFS as not directly readable – data is chunked into many files with some additional metadata).
Usage shows that it’s common to gain an order of magnitude in terms of speed when running computation using Alluxio.
The unified naming really works in the same way as we mount disks and hosts into a file system.
It’s the master job to maintain this unified naming.
This feature is disabled by default but seemed quite interesting to me. The idea – taken from Spark (I suppose) – is to keep track of which jobs produce which data.
So that if some data is no longer accessible Alluxio can re-trigger the jobs that generated the data.
This allows huge saving in terms of space. You don’t need to duplicate all the data several times over the cluster. You can only store it once and recompute it in the seldom case of failure.
I am not sure how useful it is in practice but the idea sounds interesting and worth to explore.