Why databases are not for containers

archer

If we look attentively at the IT industry in 2017, all of us will see “containers” and “Docker” as the top buzzwords ever. We have started to package software in Docker containers in every field. We’re using containers everywhere: from small startups to huge microservices platforms, from CI platforms to Raspberry Pi ARM-targeted research, from database management systems to….

Sorry, WHAT?! Are you sure you’re going to put your database inside a container? In production? Have you gone mad, dude?!

Unfortunately, this is not fiction. I see many fast-growing projects managing persistent data inside containers. Moreover, managing it on the same machine as the computing services! Hopefully, mindful, experienced people are not going to adopt this solution. But a lot of inexperienced people do.

So, here is my point of view, answering the “Why?” on this subject. Note: this is the situation as of today, January 29th, 2017. We know of some research projects trying to figure out how to safely manage databases in Docker. But database containerization is not reasonable today.

Let’s get started with the reasons!

7 reasons


1. Data insecurity

A totally valid point of view appears in this article about Docker, starting from the “Banned from DBA” part. Even if you mount a volume from the host where the data is located, it doesn’t guarantee you anything. Yes, Docker volumes are designed to go around the Union FS image layers to provide persistent storage. But that still doesn’t guarantee you anything.

Docker is still unreliable with the currently available storage drivers. You may corrupt the data when a container crashes and the database doesn’t shut down correctly, and lose the most important part of your service.
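For context, here is what the host-volume approach looks like; a minimal sketch assuming the official Postgres image and a hypothetical host path:

```shell
# Bind-mount a hypothetical host directory so the data files live
# outside the Union FS layers and outlive the container itself.
docker run -d --name db \
  -v /srv/pgdata:/var/lib/postgresql/data \
  postgres:9.6

# The crash scenario described above: an abrupt stop leaves the files
# on the host, but the engine may still face crash recovery on restart.
docker kill db
```

Note that the volume only decides *where* the files live; it does nothing to protect you from an unclean shutdown.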

2. Specific resource requirements

I’ve seen DBMS containers running on the same host as service-layer containers, even though these layers are not compatible in their hardware requirements.

A database, especially a relational one, requires extra resources: memory and provisioned disk I/O. Database engines generally want a dedicated environment to avoid resource contention. By putting your database inside a container on a shared host, you’re going to waste your project’s budget. Why? Because you’re allocating a lot of extra resources to a single instance, and it goes out of control. In the cloud case, you have to launch an instance with 64 GB of memory when you need 34. In practice, some of these resources will stay unused.

What about a workaround? You might separate the layers and spin up multiple instances with fixed resource requirements. Horizontal scaling is always better than vertical. OK, almost always!
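If the layers do end up colocated, Docker can at least cap each container; a sketch with illustrative values, not a tuning recommendation:

```shell
# --memory sets a hard RAM limit for the container;
# --cpu-shares sets a relative CPU weight against the other
# containers on the same host.
docker run -d --name db \
  --memory=4g \
  --cpu-shares=512 \
  -e MYSQL_ALLOW_EMPTY_PASSWORD=yes \
  mysql:5.7
```

Limits contain the noisy neighbour, but they don’t change the budget math above: you still have to size the whole host for the database’s peaks.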

3. Network problems

To understand Docker networking, you must have deep knowledge of network virtualization. Also, get ready for unexpected cases. You may be forced by circumstances to fix a bug with no support, or to bring in an extra tool as a permanent fix.

We also know: a database requires dedicated, consistent throughput to accept higher load. We also know a container is one more isolation layer behind the hypervisor and the host virtual machine (we’re talking about the cloud, right?). We also know the network is critical for replication, where a 24/7 stable connection between replicas is required. And, of course, there are the unsolved Docker networking troubles, still unsolved even after the dedicated networking solution was released in version 1.9.
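The multi-host networking released in 1.9 illustrates the extra moving parts; a sketch assuming two hypothetical Postgres containers (the overlay network alone does not configure replication, and pre-swarm setups also needed an external key-value store such as Consul or etcd to be operated alongside):

```shell
# An overlay network intended to carry replication traffic between hosts.
docker network create -d overlay db-replication

# Two database containers attached to it; actual streaming replication
# still has to be configured inside the engines themselves.
docker run -d --net db-replication --name pg-master postgres:9.6
docker run -d --net db-replication --name pg-replica postgres:9.6
```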

Putting all these things together, we can be sure that database containers are hard to manage. Very hard if we talk about networking. I know, my friend, you’re a top-class engineer who doesn’t know the words ‘hard’ and ‘impossible’. But how much time will you spend solving Docker networking problems? Wouldn’t it be better to put the database in a dedicated environment and save the time to concentrate on what really matters for the business?

4. State in computing environment

Sometimes we use Docker together with the ‘stateless’ buzzword. By the way, can you see how many buzzwords we’re using now? Please, don’t get me started on ‘DevOps’ today!

It’s cool to pack stateless services into Docker, implement orchestration, and stop caring about a single point of failure. What about databases? Putting the database into the same environment makes it stateful and makes the application failure domain broader. Next time your application instance, or the application itself, crashes, it’ll probably affect the database as well.

5. They just don’t fit the major Docker features

OK, I don’t know anything about Docker. OK, I’m an old-school, crappy system administrator. OK, I don’t care about innovations and business value. Still, thinking about a database in containers, we must estimate the value and profit. Let’s quote the official answer about what Docker actually is:

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud.

So, according to this answer, we can easily define the main Docker values:

  • Easy to set up the software
  • Easy to redeploy (Continuous Integration)
  • Easy to scale horizontally (not from the answer, but from practice)
  • Easy to maintain environment parity

Now let’s think about how these features fit the database world.

Easy to set up the database? Is there any BIG difference in time between running…

docker run -d mongo:3.4

and…

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
echo "deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
sudo apt-get update && sudo apt-get install -y mongodb-org

If we talk about a MongoDB cluster, probably there is. But what about configuration management systems? They’re designed to solve exactly this kind of routine by running one command. Here is an example of an Ansible role for MongoDB setup that you can reuse across dozens of instances. Writing code for your favorite CM system, whatever it is, is not rocket science. As you can see, there is no huge increase in value.

Easy to redeploy? How frequently do you redeploy the database to upgrade it to the next version? A database upgrade is not a usability problem but an engineering problem, even in clusters. Think about how your application will work with the new DB engine version and what may break when the engine changes. That is the much more valuable question, and Docker won’t solve this problem.

Easy to scale horizontally? Do you really want to share the data directory between many instances? Aren’t you afraid of direct data concurrency and possible data corruption? Wouldn’t it be safer to deploy one more instance with a dedicated data environment and finally organize master-slave replication?

Easy to maintain environment parity? How frequently does your database instance’s environment change? Do you upgrade the OS every day? Maybe the database version, or dependent software like libraries and modules? Wouldn’t it be much easier to reach a consensus with the engineering team?

OK, let’s imagine it helps. But is it possible for a “bad engineer” to ignore this rule and keep developing on a different DB version? Or for a “good engineer” to get confused about which one to use? I think the answer to both questions is yes, and the Docker approach is not a silver bullet for the environment parity issue.

In the end, there is not a single feature left to make us start thinking about database containerization.

6. Extra isolation is critical at the database layer

Actually, I already gave my arguments for this in reasons №2 and №3, but I put it in a separate topic because I want to state the fact once again: the more isolation levels we have, the more resource overhead we get. That’s not an issue when we get a lot of benefits in exchange, compared to dedicated environments. But in Docker these features are meant for stateless computing services, not for databases.

So, if we don’t get any useful isolation features for the database, why should we put it in a container?

7. Cloud platform incompatibility

The majority of us get started with projects in the cloud. The cloud simplifies VM juggling: quick termination and replacement of instances. For example, why do we need the testing/staging environment at night or on weekends when nobody works? Why should we worry about a single instance’s uptime when we can spin up another one with the same configuration and service start process?

This feature is the main one; it’s why we pay a lot to our cloud provider. And it disappears once we put a database container on the instance: a new instance won’t be compatible with the existing one because of the data mismatch, so we restrict ourselves to a single machine and such juggling becomes impossible. It’s better to use a non-containerized environment for the DB and leave the juggling and autoscaling to the computing service layer only.

Does this apply to all databases?


No, not all of them. It’s about databases whose data should be persisted, and about databases with special resource requirements.

If we talk about using Redis as a cache or user session storage, there shouldn’t be any problem: your data is not at risk of getting lost, because you don’t need this kind of data to be persistent. But if we talk about Redis as a persistent data store, you’d better put the database outside of the container. Even if you have a constantly refreshed RDB snapshot, it’s going to be complicated to find this snapshot in a rapidly changing computing cluster.
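For the cache case, you can make the non-persistence explicit by running Redis with snapshots and AOF disabled; a minimal sketch using the official image:

```shell
# Redis as a pure cache: no RDB snapshots (--save "") and no AOF,
# so there is nothing on disk to lose together with the container.
docker run -d --name cache redis:3.2 \
  redis-server --save "" --appendonly no
```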

We may also talk about Elasticsearch inside a container. We may store only indexes in ES and constantly rebuild them from a persistent data source. But look at the requirements! By default, Elasticsearch takes 2 to 3 GB of memory, and memory usage is inconsistent due to Java garbage collection. Are you sure Elasticsearch is a good fit for a container designed for resource limitation? Wouldn’t it be better to set up a separate instance for Elasticsearch with its hardware requirements arranged?

And don’t worry about local development. You’ll save a lot of time and effort by putting the database into containers in your local environment, and you’ll also be able to reproduce the production OS environment. You know, native Postgres for OS X or Windows is not 100% compatible with the Linux version. By setting up a container instead of a package on your host OS, you’ll cover this gap.
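A local-development sketch, assuming production runs Postgres 9.6 on Linux (the tag and port are illustrative):

```shell
# The database engine runs on the same Linux userland as production,
# while the app on the host connects to it as if it were native.
docker run -d --name dev-db \
  -p 5432:5432 \
  postgres:9.6
```

When the experiment is over, `docker rm -f dev-db` leaves the laptop clean.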

Conclusion


The Docker hype should go down someday. That doesn’t mean people will stop using container virtualization. It means people will start using containers properly and put VALUE on top.

A few days ago I watched an awesome talk about framework survival in the messy Ruby world. What I took away from this talk, having no knowledge of Ruby, was the technology life cycle; frankly speaking, the hype cycle. Looking at this hype cycle, we have seen Docker at the second stage, the peak of inflated expectations, for too long. The situation will normalize when we see Docker at the last stage. I think we’re responsible for this process and can speed it up. And we have to.

P.S.: yay, this is almost the first post on this blog where I share my own thoughts rather than a “how to” instruction. I still have a lot of similar mindless guides on the schedule, so there won’t be many more, and after them I’ll try to write less, but more reasonably.

The image source: https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/


19 thoughts on “Why databases are not for containers”

  1. Docker and containers are as interchangeable as a Volkswagen Passat and a sedan. Just because the first is an example of the second doesn’t mean it’s the only example. Nothing in this article is relevant to LXC or OpenVZ.

    Containers are very much for databases, and Docker isn’t for many things. In fact, I cannot imagine using it in any production scenario, whereas many production systems run on OpenVZ and LXC.


    • I don’t have that much experience with OpenVZ and LXC. Thanks for the remark; surely, Docker doesn’t cover everything we can say about containers.
      The problem is that I haven’t heard anything about container features for databases, only Docker features. And that’s the main topic of my post.


    • Totally agree with eliver0com: I’ve been using containers for all my production stuff, including DBs for thousands of customers, for 12 years now. I started with Linux-VServer, then OpenVZ, and now LXC.

      Please keep in mind that containers existed far before Docker. Docker provides container technology, but with a more opinionated way of doing things. It has its own advantages for sure, but it represents only one container technology, not the whole picture. Maybe changing the article title to “Why databases are not for Docker” would make it more precise?

      Anyway, thanks for the content and the information!


  3. As someone who has been testing and running Postgres with Docker for 16 months, I can say it works without any problem. Many of your points are more or less invalid. And yeah, we also run it in production (and we have heavy traffic on our DBs).

    What you don’t see or know is that there are a lot of professional orchestration tools like Kubernetes. Our DB pods run on the nodes which are designed for them.

    It seems you don’t like Docker and are missing much of the knowledge in this topic. Try to be more neutral about stuff you don’t understand. (Points 1 to 5 are bullshit, sorry.)

    From your point of view, you would still deploy DBs on physical hosts. If you want to throw your money away, OK, do it! Otherwise, using Docker on VMs does not add a big isolation layer. Don’t think Docker is just another kind of virtualization. Besides, setting up and configuring a cluster the right way is more than just installing the binaries. You use config management, cool. But what happens when a security patch on a newly staged node does weird things? In fact, you will mix different binary states of the software. Config management is not proper deployment. DevOps means much more than just using Docker. With DevOps you create a process to test and deploy exactly the same binaries and configurations, and it doesn’t matter when you scale, because everything is identical.


    • Thanks for the feedback.
      Actually, I didn’t mention physical hosts. Absolutely true, we don’t want to throw money away; we use all the resources we have. I mentioned virtual hosts. What kind of problem would you expect putting your Postgres in a VM?
      Also, please share the details of how to build a reliable Postgres replica cluster on Docker multi-host networking.


  4. Interesting read.

    It might only be my case, but I use Docker containers only for my DBs locally. I find that it solves a really good problem in separating DB versions for different projects. It also makes it easier to move data around between different people by easily mounting “volumes”.


  5. I witnessed a lot of dedicated DB setups where security patches and updates on the database (config changes and software updates) are basically done blindfolded, without testing, because there is no real test setup. Isolated environments and host system abstraction are a huge win for maintainability and testing. Test setups are easy to handle and deploy. I don’t really care if your abstraction is based on a hypervisor or the Docker platform. For me Docker worked out great, and I set up infrastructure for a living.

    Most points you mention are irrelevant and more or less seem like inexperience. Especially with networking I don’t see the point. The technology used in Docker networking is pretty much the same as in virtualization or dedicated setups. What’s wrong with bridged interfaces and iptables? I would agree with you if you were talking about Docker cluster networking; I had major hiccups with that.


  6. To create a MySQL database instance on a server, using the host’s network (i.e. without any bridging, just like you would do without containers), with your data somewhere safe:

    docker volume create mysql-data
    docker run --restart always -d --network host -v mysql-data:/var/lib/mysql -e MYSQL_ALLOW_EMPTY_PASSWORD=yes mysql:5.5

    You now have a fully operational MySQL instance, ready to go. Remove your container, restart it, upgrade it, whatever. Your data stays where it is, because you’ve made a separate volume for it.
    Want to move the MySQL instance to a different server? No problem, just mount the same volume and go for it. Is the image for your minor version of MySQL updated? Update it with a single command. Did your server restart, albeit on purpose? Docker will start your container automatically for you. Wanna use the server you’re using now for something else? Stop and remove the MySQL container; your system is now clean and can run anything else. Or run your containers alongside your MySQL instance if you want to, without worrying about any process conflicting with the others. Personally, I can’t imagine why you wouldn’t want to run something sensitive like a database within a container. Even when you’re running a server with just one container (containing a database server), the benefits are already there.

    I’ve been running a MySQL database with Docker without _any_ problems. In fact, it works perfectly and reliably.

    None of the arguments in this article makes sense; they seem more like a lack of understanding of, or experience with, Docker.

    “Docker is still unreliable working with currently available storage drivers.”
    Citation needed. I’ve never had a single issue, at least not with official storage drivers (I don’t have experience with unofficial ones). Local drivers seem to work perfectly, and using NFS shares in combination with the local driver means you can have a single central filesystem (which should be redundant in hardware terms and backed up periodically off-site) for all of the containers within your network to write to, without the containers even having to know where they’re actually writing to.

    “You may corrupt the data in case of container crash where database didn’t shutdown correctly.”
    This is not an argument. Anything crashing, whether it’s within a container or not, might (or might not) corrupt your data.

    “To understand Docker network you must have a deep knowledge about network virtualization. Also get ready for unexpected cases.”
    This is simply not true. The whole idea behind Docker networks is that they’re easy and straightforward in both usage and management. It does, however, require experience and knowledge of the tools you’re using. As is the case with any tool we use in devops.


  7. I don’t agree with everything said here, but:
    – there is no such thing as “a database”. There are MANY databases with different use cases, setups, and tolerances.
    – you won’t do the same thing in dev and prod (I hope), so while a single-instance database X for dev is OK, a cluster of database Y may not play well with Docker
    – you usually don’t go plain Docker in prod; you use something else, like Kubernetes, Mesosphere, or OpenShift, which brings another layer of complexity. In this case, setting up a clustered database inside a cluster of Docker inside a pool of servers, maybe inside a virtual env like AWS… is obviously prone to major failure.

    While I would never say “don’t use Docker for databases”, I would certainly advise you to think twice and clearly learn the limitations of your DB: how its clustering works, how you can recover from a failed node, how you’re going to store your data (NFS? local?), how you manage I/O failure, how you make backups, and so on.


    • 1. “there is no such thing as ‘a database’” – totally true, but we are talking about data. Who cares how you store the service data when it’s lost or corrupted?
      2. “you won’t do the same thing in dev and prod” – are you OK with a lifecycle that has inconsistent environments? Isn’t it a 12 Factor rule that all service-related environments (even the local one) should be equal to production?

      The last quote is brilliant, thanks for that. “Think” is the keyword while there is so much hype like this.


