If we look closely at the IT industry in 2017, we’ll all see “containers” and “Docker” as the top buzzwords. We’ve started packaging software in Docker containers in every field. We’re using containers everywhere: from small startups to huge microservices platforms, from CI platforms to Raspberry Pi ARM-targeted research, from database management systems to….
Sorry, WHAT?!! Are you sure you’re going to put your database inside a container? In production? Have you gone mad, dude?!
Unfortunately, this is not fiction. I see many fast-growing projects that manage persistent data in containers. Moreover, they manage it on the same machine as their computing services! Hopefully, mindful, experienced people are not going to choose this solution. But a lot of inexperienced people do.
So, here is my point of view, answering the “Why?” on this subject. Note: this is the situation as of today, January 29th, 2017. We know of some research projects that are trying to figure out how to safely manage databases in Docker. But database containerization is not reasonable today.
Let’s start with the reasons!
1. Data insecurity
This article about Docker makes a totally valid point, starting from the “Banned from DBA” part. Even if you mount a volume from the host where the data is located, it doesn’t guarantee you anything. Yes, Docker volumes are designed to bypass the Union FS image layers to provide persistent storage. But that still doesn’t guarantee you anything.
Docker is still unreliable with the currently available storage drivers. You may corrupt the data if the container crashes and the database doesn’t shut down correctly. And then you lose the most important part of your service.
2. Specific resource requirements
I’ve seen DBMS containers running on the same host as service-layer containers. But these layers are not compatible in terms of hardware requirements.
A database, especially a relational one, requires extra resources, mainly memory and disk I/O provisioning. Database engines generally need a dedicated environment to avoid resource contention. By putting your database inside a container, you’re going to waste your project’s budget. Why? Because you’re giving a lot of extra resources to a single instance, and it gets out of control. In the cloud you have to launch an instance with 64 GB of memory when you need 34. In practice, some of these resources will stay unused.
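To put a number on that waste, here is a rough back-of-the-envelope sketch in shell. The 64 GB and 34 GB figures come from the paragraph above; the `unused_memory` helper is a hypothetical name introduced only for this illustration, and real instance sizes depend on your cloud provider.

```shell
# Hypothetical helper: how much of a rented instance's memory stays unused.
# $1 = memory the instance comes with (GB), $2 = memory the database needs (GB).
unused_memory() {
  provisioned_gb=$1
  needed_gb=$2
  unused_gb=$((provisioned_gb - needed_gb))
  # Shell arithmetic is integer-only, so the percentage is rounded down.
  unused_pct=$((unused_gb * 100 / provisioned_gb))
  echo "${unused_gb} GB unused (${unused_pct}% of the instance)"
}

unused_memory 64 34   # prints "30 GB unused (46% of the instance)"
```

In other words, almost half of what you pay for that instance does nothing.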
What about a workaround? You might separate the layers and spin up multiple instances with fixed resource requirements. Horizontal scaling is always better than vertical. OK, almost always!
3. Network problems
To understand Docker networking, you need deep knowledge of network virtualization. Also, get ready for unexpected cases. Circumstances may force you to fix a bug with no support, or to use an extra tool for a permanent fix.
We also know that a database requires dedicated and consistent throughput to handle higher load. We also know that a container is one more isolation layer on top of the hypervisor and the host virtual machine (we’re talking about the cloud, right?). We also know that the network is critical for replication, which requires a stable 24/7 connection between replicas. And, of course, there are the unsolved Docker networking troubles, still unsolved after the dedicated networking solution was released in version 1.9.
Putting all these things together, we can be sure that database containers are hard to manage. Very hard, if we talk about networking. I know, my friend, you’re a top-class engineer who doesn’t know the words ‘hard’ and ‘impossible’. But how much time will you spend solving Docker networking problems? Wouldn’t it be better to put the database in a dedicated environment and save the time to concentrate on what really matters for the business?
4. State in computing environment
Sometimes we use Docker together with the ‘stateless’ buzzword. By the way, can you see how many buzzwords we’re using now? Please, don’t talk about ‘DevOps’ today!
It’s cool to pack stateless services in Docker, implement orchestration, and stop caring about a single point of failure. What about databases? By putting the database in the same environment, you’re making it stateful. You’re making the application’s failure domain broader. The next time your application instance or the application itself crashes, it will probably affect the database too.
5. They just don’t fit major Docker features
OK, I don’t know anything about Docker. OK, I’m a crappy old-school system administrator. OK, I don’t care about innovation and business value. Still, when thinking about databases in containers, we must estimate the value and the profit. Let’s copy the official answer about what Docker actually is:
Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud.
So, according to the answer, we can easily define the main values of Docker:
- Easy to set up the software
- Easy to redeploy (Continuous Integration)
- Easy to scale horizontally (not from the answer, but from practice)
- Easy to maintain environment parity
Now let’s think about how these features fit the database world.
Easy to set up the database? Is there any BIG difference in time between running…
docker run -d mongo:3.4
and
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
sudo apt-get update && sudo apt-get install -y mongodb-org
If we talk about a MongoDB cluster, there probably is. But what about configuration management systems? They’re designed to solve this kind of routine by running one command. Here is an example of an Ansible role for Mongo setup that you can use for dozens of instances. Coding for your favorite CM system, whatever it is, is not rocket science, as you can see by reading it. So there is no huge increase in value.
Easy to redeploy? How frequently do you redeploy the database to upgrade it to the next version? A database upgrade is not a usability problem but an engineering problem, even in clusters. Think about how your application will work with the new DB engine version, and what may break when the engine changes. That is a much more valuable line of thought. And Docker won’t solve this problem.
Easy to scale horizontally? Do you want to share the data directory between many instances? Aren’t you afraid of concurrent access to the data and possible data corruption? Wouldn’t it be safer to deploy one more instance with a dedicated data environment, and finally set up master-slave replication?
Easy to maintain environment parity? How frequently does your database instance environment change? Do you upgrade the OS every day? Or maybe the database version, or dependent software like libs and modules? Wouldn’t it be much easier to reach a consensus within the engineering team?
OK, let’s imagine it would. But is it possible for a “bad engineer” to spit on this rule and keep developing on a different DB version? Or maybe for a “good engineer” to get confused about what to use? I think the answer to the last two questions is yes, and the Docker approach is not a silver bullet for the environment parity issue.
In the end, there is not a single feature left that justifies database containerization.
6. Extra isolation is costly at the database layer
Actually, I already made the case for this in reason №2 and reason №3. But I’m putting it in a separate topic because I want to point it out once again. The more isolation levels we have, the more resource overhead we get. That’s not an issue when we get a lot of benefits compared with dedicated environments. But in Docker these features are meant for stateless computing services, not for databases.
So, if we don’t see any isolation features that benefit the database, why should we put it in a container?
7. Cloud platform incompatibility
Most of us start our projects in the cloud. The cloud simplifies VM juggling and makes quick termination and replacement possible. For example, why do we need the testing/staging environment at night or on weekends, when nobody works? Why should we worry about a single instance’s uptime when we can spin up another one with the same configuration and service start process?
This feature is the main one. That’s why we pay a lot to our cloud provider. It disappears when we put a database container on the instance. A new instance won’t be compatible with the existing one because of the data mismatch. So we restrict ourselves to a single machine, and such juggling becomes impossible. It’s better to use a non-containerized environment for the DB and leave the juggling and autoscaling to the computing service layer.
Does this apply to all databases?
No, not all. It’s about databases where the data must be persisted. And it’s also about databases with special resource requirements.
If we talk about using Redis as a cache or user session storage, there should not be any problem. Your data is not at risk because you don’t need this kind of data to be persistent. But if we talk about Redis as persistent data storage, you’d better put the database outside of a container, even if you have a constantly refreshed RDB snapshot. It’s going to be complicated to find this snapshot in a rapidly changing computing cluster.
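For the cache case, a minimal sketch could look like the helper below. It assumes the official `redis` image from Docker Hub; the `start_cache_redis` function, the `cache-redis` container name and the `3.2` tag are illustrative choices, not a recommendation. Passing `--save ""` and `--appendonly no` turns off RDB snapshots and the append-only file, so nothing in the container pretends to be persistent.

```shell
# Hypothetical helper: start Redis purely as a throwaway cache.
# Assumes Docker is installed and the official "redis" image is available.
start_cache_redis() {
  # --save ""        disables RDB snapshots
  # --appendonly no  keeps the append-only file turned off
  docker run -d --name cache-redis redis:3.2 \
    redis-server --save "" --appendonly no
}

# Call it once you are fine with losing the data on every restart:
# start_cache_redis
```

If the container dies, you just start a new one and let the cache warm up again.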
We may also talk about Elasticsearch inside a container. We may store only indexes in ES and constantly rebuild them from a persistent data source. But look at the requirements! By default, Elasticsearch takes 2 to 3 GB of memory. Its memory usage is uneven because of the Java garbage collection stage. Are you sure Elasticsearch is the best fit for a container, which is designed for resource limitation? Wouldn’t it be better to set up a separate instance for Elasticsearch, sized for its hardware requirements?
And don’t worry about local development. You’ll save a lot of time and effort by putting the database in containers in your local environment. You’ll also be able to replicate the production OS environment. You know, native Postgres for OS X or Windows is not 100% compatible with the Linux version. By setting up a container instead of a package on your host OS, you’ll cover this gap.
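As a sketch of that local setup, the helper below starts a disposable Postgres for development. It assumes the official `postgres` image; the function name, the container name, the `9.6` tag and the password are placeholders made up for this example, so swap in whatever matches your production version.

```shell
# Hypothetical helper: run a disposable Postgres for local development only.
# Assumes Docker is installed and the official "postgres" image is available.
start_dev_postgres() {
  # POSTGRES_PASSWORD sets the superuser password (placeholder value here).
  # -p 5432:5432 exposes the database on the host's default Postgres port.
  docker run -d --name dev-postgres \
    -e POSTGRES_PASSWORD=devonly \
    -p 5432:5432 \
    postgres:9.6
}

# start_dev_postgres
# docker rm -f dev-postgres   # throw it away when you are done
```

Nothing here is meant to survive: the data lives and dies with the container, which is exactly what you want on a laptop and exactly what you don’t want in production.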
The Docker hype should go down someday. That doesn’t mean people will stop using container virtualization. It means that people will start using containers properly and put VALUE first.
A few days ago I watched an awesome talk about framework survival in the messy Ruby world. What I took from this talk, having no knowledge of Ruby, is the technology life cycle. Frankly speaking, the hype cycle. Looking at this hype cycle, we see that Docker has been at the second stage, the peak of inflated expectations, for too long. The situation will normalize when we see Docker at the last stage. I think we’re responsible for this process and may speed it up. And we have to.
P.S.: yay, this is almost the first post on this blog where I share my own thoughts rather than a “how to” instruction. I still have a lot of similar mindless guides on the schedule. There won’t be many more, and after them I’ll try to write less, but more thoughtfully.