Infrastructure from scratch. Part 0: Challenge

I used to write my posts keeping a little distance between my life and the technologies I’m learning. But my career path has changed. Now I have to describe my thoughts from a technical viewpoint as well as from a social one.

Let me describe the issue. For the first 5 months of 2016 I organized media streaming on the Internet. The main software used there is Wowza Streaming Engine, which I have been describing over the last 6-7 weeks.

Things were going well for me personally. I adapted to this special IT field, learned streaming network protocols, the video transcoding workflow and so on. But there were some pitfalls.

The first one is a restriction on software choice. I had to work only with Wowza, while there are many more interesting tools for building a streaming platform. I won’t go into details here, I’m just stating the fact.

The second one is a restriction on duties. When you work with Wowza, you don’t need to understand databases, cloud, continuous integration etc. That drags you down as an engineer, because you’re not progressing generally. There were many more reasons why I decided to change my Ops path. But I think I’ll write a separate post about changing jobs later.

Somehow I came across an opening for a position called “AWS System Administrator”. I looked at the list of required skills and decided to go for the interview. After that I was amazed by the opportunities and the variety of technologies I could bring to this project. The project is actually a fast-growing startup with infrastructure that needs to be maintained.

So I moved to this job and faced a mess and the absence of tools vital for every software environment. That’s OK, because a development team can’t have a finger in every pie. That’s where Operations engineering begins.


  • build a reliable, scalable and mature system ready for high load;
  • introduce DevOps practices into the software development and deployment process;
  • step forward personally in knowledge and experience.

The subject

Startup project. A web application with a defined idea based on media content delivery. It has 2 applications: the main Node.js app and an admin panel in Java. There is no frontend/backend split: only one application connected to a PostgreSQL database. It requires a stable, cost-optimized environment.


Split the big goal into subject subfields. The subfields in turn are split into separate tasks. Everything is collected and sorted by priority in a “to-do” table. Solve the tasks one by one, mark them as done and move on.

And here are the subfields of the planned service. Don’t be disappointed if some pictures in this post look pseudo-funny to you. I just need to add some media content to the bare text to keep you reading this confession.

Note. Tasks specific to my situation are marked in italics.

1. AWS Infrastructure


Challenge. Sort out all the disorder in the AWS account. Split the whole environment into sub-environments: production, staging and QA. Understand exactly the features of all services provided by Amazon. Build the software infrastructure on the useful ones with minimum expenses.

Basic steps.

  • Deploy a detailed cost tracking system
  • Optimize AWS costs
  • Track costs for each environment (production/staging/QA)
  • Clearly tag all AWS resources
  • Distribute S3 bucket data across sub-environments
  • Privilege separation for the AWS console with least-necessary user rights
  • Autoscaling group customization for production
  • Implement a reserved instances policy to reduce EC2 costs
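
To illustrate why tagging and per-environment cost tracking go together, here is a minimal sketch in plain Python. It assumes the billing data has already been exported into simple records. The record layout, the resource names and the "Environment" tag key are hypothetical and only serve the example.

```python
from collections import defaultdict

# Hypothetical billing records: (resource id, tags, monthly cost in USD).
# Real numbers would come from an AWS cost report; these values are made up.
RECORDS = [
    ("i-app-1", {"Environment": "production"}, 120.0),
    ("i-app-2", {"Environment": "staging"}, 40.0),
    ("bucket-media", {"Environment": "production"}, 35.5),
    ("i-old-test", {}, 18.0),  # resource nobody tagged
]


def cost_by_environment(records, tag_key="Environment"):
    """Sum costs per environment; untagged resources land in 'untagged'."""
    totals = defaultdict(float)
    for _resource_id, tags, cost in records:
        totals[tags.get(tag_key, "untagged")] += cost
    return dict(totals)


def untagged_resources(records, tag_key="Environment"):
    """List resource ids that are missing the environment tag."""
    return [rid for rid, tags, _cost in records if tag_key not in tags]


if __name__ == "__main__":
    print(cost_by_environment(RECORDS))
    print(untagged_resources(RECORDS))
```

The "untagged" bucket is the useful part: every dollar that ends up there is a resource that still has to be sorted into production, staging or QA.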

2. Backup and recovery


Challenge. Build a firm backbone for the infrastructure. Back up everything that can be backed up. Make daily backups automatically. Test system recovery regularly.

Basic steps.

  • Source code backup
  • Operating system configuration backup
  • AWS services configuration backup
  • User data backup
  • Automate the EBS snapshot process
  • Database backup (both physical and logical)
  • Place a backup destination outside of AWS
  • Regular recovery testing
  • Analysis of potential points of failure, followed by backup of their state
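
The daily backup routine above implies two checks: which old snapshots can be dropped, and whether any day was silently skipped. A minimal sketch of both, assuming snapshots are identified only by their creation date and a hypothetical 7-day retention window:

```python
from datetime import date, timedelta


def snapshots_to_delete(snapshot_dates, today, keep_days=7):
    """Return snapshot dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in snapshot_dates if d < cutoff)


def missing_backups(snapshot_dates, today, keep_days=7):
    """Return days inside the retention window that have no snapshot."""
    have = set(snapshot_dates)
    expected = [today - timedelta(days=n) for n in range(keep_days)]
    return sorted(d for d in expected if d not in have)


if __name__ == "__main__":
    today = date(2016, 6, 10)
    snapshots = [date(2016, 6, 10), date(2016, 6, 9), date(2016, 6, 1)]
    print(snapshots_to_delete(snapshots, today))
    print(missing_backups(snapshots, today, keep_days=3))
```

A non-empty result from `missing_backups` is worth an alert of its own, because a backup job that stopped running is easy to miss until the day a restore is needed.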

3. Monitoring


Challenge. Set up a mechanism that can tell you what’s going on in your system and how. Define the most important metrics and processes in your system. Collect and visualize them in one place. Choose which monitoring system to use and be able to explain why it was chosen among the thousands of solutions.

Basic steps.

  • Hardware components monitoring (CPU/memory/network/disk)
  • Main application metrics monitoring
  • Admin application metrics monitoring
  • AWS services monitoring
  • Business-metrics monitoring
  • External services monitoring (e-mail/sms notifications etc.)
  • Database monitoring
  • Web-application failure detection and monitoring
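
Whatever monitoring system is chosen, the core of it is comparing collected metrics with limits. A minimal sketch of that check follows. The metric names and threshold values are hypothetical and would have to be defined for the real system.

```python
# Hypothetical limits for the hardware metrics listed above, in percent.
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 80.0}


def breached(sample, thresholds=THRESHOLDS):
    """Return the names of metrics in `sample` that exceed their limit, sorted."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    )


if __name__ == "__main__":
    sample = {"cpu_percent": 91.2, "memory_percent": 40.0, "disk_percent": 80.0}
    print(breached(sample))
```

Note that a metric missing from the sample is treated as 0 here, which hides a dead collector. A real setup should alert on missing data as well.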

4. Log events management


Challenge. Collect all log events from services in one place. Make it possible to track and filter them easily. Notify the responsible people about any system deviation recorded in the logs.

Basic steps.

  • Deploy a centralized log event storage with a dashboard
  • Implement a log event notification system
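
The notification part boils down to parsing each event and deciding whether somebody should be woken up. Here is a minimal sketch, assuming a hypothetical log line format of "timestamp service LEVEL message". Real services will each need their own pattern.

```python
import re

# Assumed line format: "<timestamp> <service> <LEVEL> <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<service>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$"
)
ALERT_LEVELS = {"ERROR", "CRITICAL"}


def parse_line(line):
    """Split a log line into fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def alerts(lines):
    """Return 'service: message' strings for events that deserve a notification."""
    found = []
    for line in lines:
        event = parse_line(line)
        if event and event["level"] in ALERT_LEVELS:
            found.append(f'{event["service"]}: {event["msg"]}')
    return found


if __name__ == "__main__":
    sample = [
        "2016-06-01T10:00:00Z main-app INFO started",
        "2016-06-01T10:00:05Z admin ERROR db connection refused",
        "garbage",
    ]
    print(alerts(sample))
```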

5. Security


Challenge. Close everything that is not used. Close to the outside everything that should be used only from the inside. Separate user privileges as strictly as possible. Set up a regular process of security testing and analysis. Equip application servers with intrusion prevention software.

Basic steps.

  • Secure every reference to Amazon resources from users
  • Leave Internet access only to the web application
  • Deploy a VPN system
  • Distribute sub-environments across subnets. Permit traffic exchange only inside a subnet.
  • Permit only HTTPS traffic
  • Database privilege separation
  • Install and configure security software (OSSEC, Fail2Ban, BitNinja)
  • Configure a firewall to protect against network attacks (like DDoS and MITM)
  • Launch the main application only as an unprivileged user
  • Remove all unneeded processes and daemons from operating systems
  • Control SSH connections from team members
  • Regularly remove all data and credentials of former team members
  • Move SSH access to another port (say, 222)
  • Build notifications from AWS CloudTrail
  • Perform regular web-application analysis
  • Perform regular network analysis
  • Perform regular pentesting
  • Set up web-application response failover

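
The "close everything" rule from the security list can be checked mechanically. Below is a minimal sketch that flags inbound rules open to the whole Internet on any port other than HTTPS. The rule names and their dictionary layout are hypothetical and only loosely resemble firewall or security group entries.

```python
PUBLIC_CIDR = "0.0.0.0/0"
ALLOWED_PUBLIC_PORTS = frozenset({443})  # only HTTPS may face the Internet

# Hypothetical inbound rules, invented for illustration.
RULES = [
    {"name": "web-https", "port": 443, "cidr": "0.0.0.0/0"},
    {"name": "ssh-anywhere", "port": 22, "cidr": "0.0.0.0/0"},
    {"name": "db-from-app", "port": 5432, "cidr": "10.0.1.0/24"},
    {"name": "web-http", "port": 80, "cidr": "0.0.0.0/0"},
]


def risky_rules(rules, allowed_public_ports=ALLOWED_PUBLIC_PORTS):
    """Return names of rules that expose a non-allowed port to everyone."""
    return [
        rule["name"]
        for rule in rules
        if rule["cidr"] == PUBLIC_CIDR
        and rule["port"] not in allowed_public_ports
    ]


if __name__ == "__main__":
    print(risky_rules(RULES))
```

Running a check like this on a schedule turns the one-time cleanup into the regular network analysis the list asks for.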
6. Service architecture


Challenge. Improve the existing application architecture. Regularly look for better alternatives to the services we use at present. Move from inconvenient and costly to convenient and open-source.

Basic steps.

  • Have clear service architecture blueprints
  • Choose a suitable filesystem
  • Move from ElastiCache to a self-hosted cache service (Memcached)
  • Move from Elastic Transcoder

  • Move thumbnail processing behind the S3 access layer

7. OS software management


Challenge. Make the OS environment independent from dependencies. Keep in mind what’s running on your machines and why. Keep track of all security updates and software updates generally. Automatically patch the Linux kernel up to the latest version. Sync the time between all nodes in the cluster. Regularly check the application’s neighborhood (libraries, frameworks etc.).

Basic steps.

  • Customize the package manager for reasonable updates
  • Regularly check security updates, software/library updates and changelogs
  • Regularly check the application language environment (Node.js and Java in my case)
  • Permanent time sync between all cluster nodes
  • Implement fully automated Linux kernel patching

8. Database


Challenge. Generally, a database management system looks like the control panel of an airplane. You have to understand which button to push and which lever to leave as it is in order to optimise database query speed. One of the tastiest and most interesting subfields for Ops, I guess.

Basic steps.

  • Look for a better alternative to Amazon RDS
  • Tune cache/memory/buffer use and the file system for your needs
  • User privilege separation
  • Keep DB access open only from the application server
  • Choose the most suitable storage for the database
  • Deploy phpPgAdmin to visualise database workflow
  • Replication
  • Sharding
  • Logical partitioning

9. Load testing/Capacity planning


Challenge. Make it a rule to regularly test the system’s behavior under high load. Understand the real capabilities of the system and plan to increase them. Also look at application improvement from the web-app side.

Basic steps.

  • Regularly perform load tests
  • Regularly perform capacity planning
  • Regularly test the web application with web-analytics systems (like Piwik)
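
Load test results are easier to compare between runs when they are reduced to a few numbers. A minimal sketch, assuming latencies are collected in milliseconds and using the nearest-rank percentile method. The sample values and RPS figures are invented.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def headroom(peak_rps, max_sustained_rps):
    """Fraction of measured capacity still unused at peak traffic."""
    return 1 - peak_rps / max_sustained_rps


if __name__ == "__main__":
    latencies_ms = [120, 80, 95, 300, 110, 105, 90, 100, 115, 2000]
    print(percentile(latencies_ms, 50))
    print(percentile(latencies_ms, 95))
    print(headroom(peak_rps=400, max_sustained_rps=1000))
```

The median hides the one request that took two seconds, while the 95th percentile exposes it. That is why capacity planning usually looks at high percentiles rather than averages.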

10. Automation of cluster management


Challenge. Don’t do anything manually. Don’t install/configure/deploy/move environments by hand. Get guru-level knowledge of AWS orchestration tools and the configuration management system. Probably containerize the maintained services.

Basic steps.

  • Full use of a configuration management system (Ansible)
  • AWS infrastructure orchestration (Terraform)
  • Move to the vim editor (a personal one)
  • Container deployment for environment services (Docker/Kubernetes)
  • Set up Service Discovery on application nodes (like Consul or ZooKeeper)

11. Continuous Integration/Continuous Delivery/Continuous Deployment


Challenge. Release anything with one click. Release regularly and as quickly as possible. Build an infrastructure where every piece of software gets tested. Build and package the app there, update dependencies and finally roll it out to the environment. Deploy auto-notifications for each release. In short, make deployment reasonable.

Basic steps.

  • Automate main application deployment
  • Automate admin application deployment
  • Place GitLab in the Continuous Integration scheme
  • Release the application from a deb package
  • Cover the application workflow with tests
  • Automate code review
  • Security analysis of each new build
  • Automate release notifications
  • Optimize the speed of the “from commit to deploy” workflow
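
The "one click" release is a chain of stages where a failure must stop everything that follows. A minimal sketch of that control flow in plain Python. The stage names are hypothetical, and a real pipeline would live in the CI system rather than in a script like this.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Returns (completed stage names, name of the failed stage or None).
    """
    completed = []
    for name, step in stages:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Each step returns True on success; here the test stage is made to fail.
    stages = [
        ("build", lambda: True),
        ("test", lambda: False),
        ("deploy", lambda: True),
    ]
    done, failed = run_pipeline(stages)
    print(done, failed)
```

The returned pair is exactly what a release notification needs: what succeeded and where the release stopped.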

12. Documentation


Challenge. Try to keep every workflow process in a centralized and accessible place. Keep in mind that documentation always matters. Stop keeping every finished thing only in my head. Have clear architecture diagrams. Leave a reasonable legacy for project successors.

Basic steps.

  • Deploy a Wiki engine
  • Regularly draw and visualise architecture solutions
  • Set an example that encourages teammates to document their work

13. DevOps culture


Challenge. Learn to speak with developers. Learn to interact with developers. The same goes for managers and stakeholders. Clearly know the boundaries of your responsibility. Deal with your duties as well as possible. Don’t burden any team member’s mind with thoughts about system unreliability. Reach engineering Zen.

Basic steps.

  • Keep track of new technologies/software/Open Source solutions
  • Create a comfortable environment for development

14. Service improvement and expansion

Challenge. Look at what’s done and enjoy it for a couple of hours. After that, understand that customers are coming and RPS is growing. Think about new architecture solutions and deal with the accumulated data.

Basic steps.

  • Containerization
  • Accelerate system processes with NoSQL solutions
  • Deploy a live media streaming service
  • Network performance tuning
  • Distribute the big service across geographical regions
  • Big Data management

I’ll try to record every step from this list that I master. There could be cheerful reports, tutorials on small things for a particular subfield and so on. It will all be part of a special mini-series. This post is part 0 of the series. If everything gets done, the project should be raised “from scratch to highload”. It will also become a small guide for other Ops engineers. I’d like to answer 3 questions: where to start, what to do and what could be leveraged.

So, if you stumbled upon this post by accident, wish me good luck and nerves of steel. I hope I’ll gain priceless experience working on this wide-ranging product. And also remember how to interact with developers. An amazing and fun journey into the DevOps world is coming!

