EGLUG Systems Development Course
Submitted by bahaa2008 on Tue, 29/06/2010 - 8:20pm.
You can find the course details here. Here is the outline for the course.

EGLUG Systems Development Course (ESDC)
Amr Ali
062810

Copyright Notice

Copyright (C) 2010 by Amr Ali
amr-ali.co.cc
All rights reserved.

This work is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License. To view a copy of this license, visit the Creative Commons website or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Abstract

This course is only a primer on the systems development field and should not be treated as a reference or used as the basis for a study of any sort. It only teaches the principles of system design and development to developers who have invested considerable effort in programming and are willing to go further into the realm and valleys of the hacker.

Quotes

"Software Engineering might be science; but that's not what I do. I'm a hacker, not an engineer." - Jamie Zawinski

"I decry the current tendency to seek patents on algorithms. There are better ways to earn a living than to prevent others from making use of one's contributions to computer science." - Donald E. Knuth

"Science is what we understand well enough to explain to a computer. Art is everything else we do." - Donald E. Knuth

"Always program as if the person who will be maintaining your program is a violent psychopath that knows where you live." - Martin Golding

"Coding styles are like assholes, everyone has one and no one likes anyone else's." - Eric Warmenhoven

Jokes

"Computers are like air conditioners: they stop working when you open WINDOWS."

"unzip; strip; touch; finger; mount; fsck; more; yes; unmount; sleep" - My daily UNIX command list.

"The Internet: where men are men, women are men, and children are FBI agents."

"One of the main causes of the fall of the Roman Empire was that, lacking zero, they had no way to indicate successful termination of their C programs." - Robert Firth

"UNIX is user-friendly. It's just very selective about who its friends are."
"Microsoft is not the answer, Microsoft is the question, NO is the answer." - Erik Naggum

"I would love to change the world, but they won't give me the source code" - Amr Ali? :-)

My Personal Favorite

"There are 10 kinds of people in this world, those that understand trinary, those that don't, and those that confuse it with binary." - Whoever understands that joke will do well in this course :-)

Table Of Contents

1. Introduction
    1.1. Summary
    1.2. Prerequisites
        1.2.1. Programming Experience
        1.2.2. Field Of Experience
        1.2.3. Programming Languages
        1.2.4. Depth Of System Knowledge
        1.2.5. CPU Designs And Architectures
        1.2.6. Mentality
2. UNIX Based Systems Communications
    2.1. User Space To User Space
    2.2. User Space To Kernel Space
    2.3. Kernel Space To Kernel Space
    2.4. InterProcess Communication
        2.4.1. Types Of Communication
            2.4.1.1. Signals
            2.4.1.2. Pipes
            2.4.1.3. Sockets
            2.4.1.4. Message Queues
            2.4.1.5. Semaphores
            2.4.1.6. SpinLocks
            2.4.1.7. Mutexes
            2.4.1.8. Shared Memory
        2.4.2. Synchronization
        2.4.3. Common Problems
3. System Application Design
    3.1. Strategy
    3.2. Daemons
    3.3. Logs
    3.4. Storage
    3.5. Debugging
4. Case Studies
    4.1. Case 1
    4.2. Case 2
    4.3. Case 3
    4.4. Final Project
5. Author
    5.1. Background
    5.2. Contact Information
6. Thanks

1. Introduction

1.1. Summary

ESDC is for whoever wants to develop system level applications and solutions that are specifically designed for UNIX based systems. This is an open-ended course, which means the TOC above may be expanded at any time without prior notice to students.

1.2. Prerequisites

1.2.1. Programming Experience

Whoever applies to this course should have had a fair amount of experience with different programming languages in order to have the mature mentality this course requires. Basically it is a "MUST" that a student has written at least one thousand (1000) lines of code in any language.

1.2.2.
Field Of Experience

It is not required that students have done any system development prior to this course, but it is preferred that they have at least read up on the topic or have a general idea of what the hell we are talking about.

1.2.3. Programming Languages

It is a "MUST" that a student is very fluent in C as a language, though not necessarily in its libraries. General knowledge of them is preferred, however, and knowing how to use the man pages is a must, along with knowing the meaning of "RTFM". ASM is not required at all, but knowing the different syntaxes is preferred, and maybe a little about how ASM works as a language. BASH: I know I'm stating the obvious, but I don't want to be confronted with questions about administering UNIX or how to build a Makefile either, so BASH and M4 are best known beforehand.

1.2.4. Depth Of System Knowledge

This course is built around UNIX based systems, mainly Linux, so it is a "MUST" to know your way around that system, and no, this is not a "how to make windows drivers" course.

1.2.5. CPU Designs And Architectures

It is strongly preferred that a student knows about the different CPU architectures and designs and how they contribute to system development; that's why some ASM is preferred in general. Just know this: SMP is trouble. Lock your code well, or keep debugging day and night until you bleed out of your eye sockets.

1.2.6. Mentality

You should have the mentality of a hunter: you never quit, you never surrender to failure, you keep trying and trying. You are always alert to the tiniest details and ready to adopt new tactics and techniques quickly, which requires dedication and effort.

2. UNIX Based Systems Communications

2.1. User Space To User Space

First let me introduce you to user space. User space is where whatever is made by mere humans runs, for example a background process or daemon, and of course it cannot communicate directly with the physical layer of your computer.
User space communication mainly falls within the area of IPC (InterProcess Communication): shared memory segments, message queues, and named/unnamed pipes. Forget any ideas you had about having a certain file that one process writes to and another reads from. I had these ideas when I was 12; you are an adult, use IPC. The main purpose of IPC is to let two processes run totally independently of each other and still pass information back and forth in an efficient way.

2.2. User Space To Kernel Space

"Kernel: is the central component of most computer operating systems; it is a bridge between applications and the actual data processing done at the hardware level." - Wikipedia

So let me put this in simpler words: the kernel is basically the guy that facilitates the usage of your hardware. Bluntly, if you had no kernel and wanted to print "hello." to your screen, you would have to write the code that talks to your PCI bus and your video card, with all the pedantic op codes and flags necessary to make this 6-character word appear on your dead cold black screen.

But if we have our kernel, how do we communicate with it? Can we just include a couple of C header files that take care of all the above-mentioned code? The answer is of course YES. That's the main purpose of the standard C library, or `libc', which is simply an abstraction over all the assembly (yes ASM, you can't talk to the kernel directly in C) required to communicate back and forth with your virtual/real hardware.

However, what if we still want to communicate directly with our kernel? Aren't there any other ways besides ASM? Of course there are other ways, you silly sally, and they are yet another abstraction over the ASM interfaces the kernel provides: IOCTL, ProcFS (Linux only), NetLink, and System Calls (SysCalls). All of these are called IPC, but I don't like calling them that, so I'll call them ISC (InterSpace Communication).

2.3.
Kernel Space To Kernel Space

Let's imagine that you got so elite that you created two kernel modules, and you would like to exchange information between them. The thing you must understand about the Linux kernel is that it is monolithic: everything in it, loadable modules included, ends up sharing one address space. So when you declare a function and export it (with `EXPORT_SYMBOL'), it lands in what is known as the KST (Kernel Symbols Table). This table contains all the functions and variables that have been exported, so other parts of the kernel can call them.

As a side note, since I want to impress the kernel space image upon you: there is no floating point arithmetic in kernel space, simply because it's not worth it. That tells you how pedantic the process of selecting code for the main kernel repo is, so don't expect the functions/libraries you are used to in user space to exist in kernel space.

2.4. InterProcess Communication

2.4.1. Types Of Communication

There are mainly 8 types of communication, three of which are locking mechanisms: Signals, Pipes, Sockets, Message Queues, Semaphores, SpinLocks, Mutexes, and Shared Memory. Each will be described in the following sub sections, but the general idea is that they are "not" all used equally; each has its own purpose.

2.4.1.1. Signals

These are one of the oldest methods to build interrupt based applications, which means that a signal can be sent to a process or two in case of an event, or on the receipt of certain new data. The interrupt based style is heavily used inside the kernel, so you should understand this type of communication in great depth if you are planning to dive into kernel space. Also note that this type of communication is asynchronous and one-way: the sender does not wait for an answer and there are no two parallel channels. It lives on one wire, as our friends at the electrical engineering department would love to call it.

2.4.1.2.
Pipes

This is a unidirectional byte stream way of communication, which connects the standard output of one process to the standard input of another. Of course that bridge is made using files, but not normal, usual files: they are files on a VFS inode which itself points to a physical page within memory.

Note that these are unnamed pipes. There are also named pipes, which create real files, with the only difference that synchronization (locking, etc.) must be handled by you, so don't expect the magic of unnamed pipes. On the other hand you can set permissions on FIFOs (they are called that because they work on the principle of First In First Out, so what you write first will be read first on the other end). They can be created simply with the command `mkfifo' (RTFM).

2.4.1.3. Sockets

I'm sure most of you have stumbled upon this concept before, read about it somewhere, or even used it. It's simply what makes today's networks and even the Internet work; they all operate on the concept of sockets. Figured it out already? Well, they are simply ports and IP/host addresses/names, but they are not necessarily used only for communication over the network; they can also be used as IPC. Heard of UNIX sockets before? And yes, they are different from TCP/IP sockets. I'm not going into much detail on this here, as there is literally tons of information about it on the Internet, which you sir/ma'am can google on your own.

2.4.1.4. Message Queues

You can think of this type of communication as a one-to-many relationship, which is what you want if you need to send a message to many processes. The only difference between message queues and mailboxes is that message queues restrict the size of each message. They share the same synchronization model as mailboxes, which is asynchronous, meaning that the sender and the receiver do not need to interact with the message queue at the same time. They are also similar in some ways to Pipes, except that it all happens in memory, no files.

2.4.1.5.
Semaphores

These are more locks than a way to communicate, and they are mostly used around some shared resource that resides either in memory or on disk. Simply put, a semaphore is a location in memory whose value can be tested and set by more than one process. The test/set operations are atomic, or uninterruptible, which from a process's point of view means that once one has started, nothing can stop it. You can think of a semaphore as a counter that gets decremented when a process or thread enters the critical region to modify some critical resource (e.g. memory page, file, etc.), and incremented again, atomically, once it is finished with the region in question. Whoever finds the counter at zero has to wait.

Semaphores are not the best solution out there; it's quite expensive to lock and unlock a semaphore. It can take literally thousands of CPU cycles, because of the system calls that have to be made. But they have their uses. For example, if your critical region just sets an integer variable, then using a semaphore is a very bad idea. But if you have an operation like writing to some file, then it's worth putting a process to sleep and waking it up after you finish, which is what semaphores do.

2.4.1.6. SpinLocks

SpinLocks, on the other hand, are the fastest out there, simply because they are built directly on atomic hardware instructions rather than on software layers like the other locking methods. Note that if you are working on a single core/processor system, SpinLocks are useless unless you have a preemptive kernel, that is, preemption is compiled into your kernel. However, take extra care about when exactly you use SpinLocks. As their name says, they do a busy spin: they keep spinning in a while loop, which saves the time of putting the process to sleep and waking it up again. So if the critical section takes longer than a thread quantum, SpinLocks are a very bad idea.

2.4.1.7.
Mutexes

You can think of Mutexes as hybrids of SpinLocks and Semaphores, which explains why they are the most used locks in user space applications. On Linux they are typically built on futexes: locking or unlocking an uncontended Mutex is done without the kernel's help, and the expensive system calls are only needed when a thread actually has to sleep or be woken up, which saves much of the time Semaphores take. Basically, if you don't know what you are doing, your best bet is to use Mutexes. They are widely known and understood, so you won't be bothered with all the technicalities. But then again, if you don't want to be bothered, this course is totally not for you sir/ma'am, go have some windows lecture instead.

2.4.1.8. Shared Memory

Shared pages of memory are what they are; I can't really think of a better way to describe them. They have an id just like MQueues (Message Queues, duh sherlock), but they simply share memory pages. The pages do not necessarily have the same address in each process accessing them, but they do reference the same page. The mechanics of this part are complex and deep, so I'll leave them for later in the course.

2.4.2. Synchronization

We have seen two terms till now, asynchronous and synchronous operations. The two differ by only one character, but in meaning they differ a lot.

Asynchronous operations are operations that do not necessarily expect answers right away. An example would be your email: you send an email to a friend, but do you expect him to answer instantly? No.

Synchronous operations, on the other hand, are operations that do expect answers instantly; basically they block until the answer arrives. An example would be talking with your friend on the phone: when you talk, he listens and responds in the same conversation context.

2.4.3. Common Problems

Most IPC methods of communication are known to share one big common problem, which is synchronization. In the context of computers and applications this means managing access between processes/threads, especially on a system that has more than one CPU (either virtual or physical).
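To make the synchronization problem concrete, here is a minimal sketch in C using POSIX threads. It is only an illustration, not course material: the names `struct counter', `worker' and `run_counter_demo' are made up for this example. Several threads increment one shared counter, and a Mutex makes each increment a critical region so that no update is lost.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Shared state: one counter and the mutex that guards it. */
struct counter {
    pthread_mutex_t lock;
    long value;
    long increments;   /* how many times each thread adds one */
};

/* Thread body: every increment happens inside the critical region. */
static void *worker(void *arg)
{
    struct counter *c = arg;
    for (long i = 0; i < c->increments; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;                 /* read-modify-write, now serialized */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Hypothetical helper: run nthreads workers and return the final count. */
long run_counter_demo(int nthreads, long increments)
{
    struct counter c = { .value = 0, .increments = increments };
    pthread_t *tids = malloc(sizeof(*tids) * (size_t)nthreads);
    assert(tids != NULL);
    pthread_mutex_init(&c.lock, NULL);

    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, worker, &c);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&c.lock);
    free(tids);
    return c.value;
}
```

Calling `run_counter_demo(4, 10000)' from a `main' of your own (build with `-pthread') returns 40000 every time. Take the lock and unlock calls out of `worker' and the increments start racing with each other, so the result will usually come out lower.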
Some of these problems touch security, like race conditions. Race conditions happen when two threads or processes race for a certain operation, like setting or reading a value. When that happens, it can be exploited to corrupt the system memory, or even to gain unauthorized access to the system itself, so you must take extra caution when setting up locks.

Another problem is deadlocks, which happen when a certain process or thread locks and dies before it unlocks, or when two threads each hold a lock the other one needs. That keeps all other processes/threads either waiting on the lock or spinning on it, which ultimately results in the termination of the application.

These kinds of problems are very hard to debug, so I felt like mentioning them in a separate section to show how important they are. Otherwise you will end up producing very bad code, and some guy in a suit and tie who knows next to nothing about computers will be yelling at you real loud, asking for the name of the person that taught you this stuff so he can murder him with a chainsaw.

3. System Application Design

3.1. Strategy

If you have grown up to be the strategy nut I am, I'm sure you will be very good at this stuff, simply because on the fly planning always works and pre-planned stuff always fails. You need to have a vision into things, along with adapting and inventing different strategies of your own, to be able to see bugs and errors at a glance, mitigate them right away, and know exactly what to modify and what to leave as-is. It's really a very good skill to have, and like every other skill it comes with effort and training.

The only method I found very effective for training that kind of skill is to look at other people's code. See and learn how they have done things in their own way and style, try to understand it from the little tiny pieces, put the pieces together to form the mental image they had when they first developed that application, and see if it can be improved in any way.
Bluntly, as I always say, try to be more of a hacker who investigates every single detail and tries to understand the whole of everything. It's not a shame to fail thousands of times, but it is a shame to be ignorant even about the little tiny things that everybody else discards. And always remember that knowledge is power.

3.2. Daemons

So what are these little evil daemons, huh? They are simply what windows people call "background processes" or "services"; if you have never developed one before, it's time to build one. The main difference between a daemon and a process like "top &" is that the former closes all standard descriptors (stdin, stdout, stderr), and forks another process which stays in a conditional loop until the condition keeping the loop running is gone (e.g. a SIGTERM signal arrives). Logging is also a major trait of daemons: most have logs of their own, and those that don't make use of other pre-installed logging systems.

3.3. Logs

Any daemon should have one or more ways to communicate with the system administrator, and if it's not logs, what would it be? You can't communicate through standard output because a daemon is never attached to any ttys or pts's, so it's gotta be logging. There are several mechanisms for logging. Either you design your own logging, in which case you will have to rotate your logs so you won't end up with one file of gigs of bytes (rotating logs isn't hard, you can make use of `logrotate'), or you can save yourself all the trouble and make use of `syslog', a logging system that provides a very simple interface.

3.4. Storage

Some daemons need some way of organized storage, and there are several solutions to fulfill your storage needs. One is to use `SQLite', which is used by many applications, Firefox just to name one. But `SQLite' is only for not so sophisticated storage schemes, and you should only use it in cases that do not require huge data sets to be stored.
If your application requires some heavy duty DB system, I strongly recommend the usage of `MySQL'. It's a very good and well known DBMS that comes along with a very well done API.

3.5. Debugging

Debugging daemons is especially hard because threads are involved and you want to know exactly where a certain bug is. But first you need to learn how to get past the fork done at startup that spawns the background process.

`set follow-fork-mode child' forces GDB to follow the child of a fork, meaning that once your daemon forks a background process, GDB starts debugging the new child, and the parent is detached and left to run (and die) on its own.

To know how many threads are running and which one is currently in context, you issue `info threads', which will display a numbered list. But what if you wanted to switch to one of these threads? Easy, you just issue `thread [threadno]' where [threadno] is the thread number you got from `info threads'.

But I strongly recommend that you learn GDB from the ground up, as it is an essential tool for development under UNIX based systems.

4. Case Studies

4.1. Case 1

Develop an application that creates exactly two threads to calculate the Fibonacci sequence in parallel, starting at any given point in the sequence.

ex. of input file ...

04 08
89 21
55 89
13 21

The first column is where the Fibonacci sequence begins and the second column is where it ends. The results should be written to standard output so that each line in the input file corresponds to one line of standard output. Also note that the columns are not sorted: sometimes the first column is the beginning of the sequence and other times the second column is where it begins.

4.2.
Case 2

Create another application that calculates the Collatz conjecture series based on the previous application's output, only that it communicates with the previous application over shared memory pages. For each outputted line of the Fibonacci sequence, a Collatz series has to be generated for each number in that line and written to standard output in the form of a table in which each column begins with the original number from the first application and ends with one (1) (read Collatz conjecture on Wikipedia).

4.3. Case 3

Create a daemon that forks to the background and closes all standard descriptors, then creates a few threads to pre-calculate a very large number of Fibonacci and Collatz series, and a client application that communicates with the daemon over shared memory pages to get some of these results and display them on standard output. Note that the amount of results requested from the daemon must come from the user and not be hard coded, so your application must ask for the amount of results before getting any.

4.4. Final Project

With a team, develop a server that listens on a specific IP/port and can handle simultaneous connections using forking or threading.

If you choose forking, you will have to implement IPC between the forked processes and the parent process, as the data in each process has to be shared across all other processes in a central fashion: the parent holds all the data in a shared memory page, and all other forked processes access it from there.

If you however decide to go with threading, you have the advantage that you won't have to implement IPC in your system, but you will have to design the threads as a worker thread and a thread pool. The worker thread waits on connections, and once a request for a connection is presented, a thread from the thread pool is assigned this connection. Once the connection ends, the thread is released from the connection and goes back to the thread pool as an available thread again.
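The thread pool idea above can be sketched roughly as follows. This is only one possible shape, not the required design: plain integers stand in for connections, `handle_connection' is a placeholder that just adds them up, and every name here (`struct pool', `pool_start', `pool_submit', `pool_stop', `POOL_SIZE', `QUEUE_SIZE') is made up for the illustration. The thread that accepts connections would call `pool_submit' for each new one.

```c
#include <assert.h>
#include <pthread.h>

#define POOL_SIZE   4
#define QUEUE_SIZE  16

/* A bounded queue of pending "connections" (plain ints in this sketch). */
struct pool {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    pthread_cond_t  not_full;
    int  queue[QUEUE_SIZE];
    int  head, count;
    int  shutting_down;
    long handled;              /* sum of all connection ids handled */
    pthread_t threads[POOL_SIZE];
};

/* Stand-in for real per-connection work (read, broadcast, close...). */
static void handle_connection(struct pool *p, int conn)
{
    pthread_mutex_lock(&p->lock);
    p->handled += conn;
    pthread_mutex_unlock(&p->lock);
}

/* Pool thread: sleep until a connection is queued, handle it, repeat. */
static void *pool_thread(void *arg)
{
    struct pool *p = arg;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (p->count == 0 && !p->shutting_down)
            pthread_cond_wait(&p->not_empty, &p->lock);
        if (p->count == 0) {            /* shutting down and drained */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        int conn = p->queue[p->head];
        p->head = (p->head + 1) % QUEUE_SIZE;
        p->count--;
        pthread_cond_signal(&p->not_full);
        pthread_mutex_unlock(&p->lock);

        handle_connection(p, conn);     /* work happens outside the lock */
    }
}

void pool_start(struct pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->not_empty, NULL);
    pthread_cond_init(&p->not_full, NULL);
    p->head = p->count = p->shutting_down = 0;
    p->handled = 0;
    for (int i = 0; i < POOL_SIZE; i++) {
        int rc = pthread_create(&p->threads[i], NULL, pool_thread, p);
        assert(rc == 0);
        (void)rc;
    }
}

/* Called by the accepting thread for each new connection. */
void pool_submit(struct pool *p, int conn)
{
    pthread_mutex_lock(&p->lock);
    while (p->count == QUEUE_SIZE)      /* queue full: wait for a free slot */
        pthread_cond_wait(&p->not_full, &p->lock);
    p->queue[(p->head + p->count) % QUEUE_SIZE] = conn;
    p->count++;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}

/* Let the threads drain the queue, join them, return the handled total. */
long pool_stop(struct pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->shutting_down = 1;
    pthread_cond_broadcast(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < POOL_SIZE; i++)
        pthread_join(p->threads[i], NULL);
    long total = p->handled;
    pthread_cond_destroy(&p->not_empty);
    pthread_cond_destroy(&p->not_full);
    pthread_mutex_destroy(&p->lock);
    return total;
}
```

In a real server the integer would be the file descriptor returned by `accept', and `handle_connection' would do the actual reading and broadcasting. The point of the sketch is only the hand-off: a Mutex protects the queue, and condition variables put idle threads to sleep instead of letting them spin.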
This server's purpose is to broadcast messages to each and every client connected to it, but also to store the last hour of messages in a queue for offline clients, so that when they log in to the server they receive all of the last hour's messages, meaning all the messages that were sent back and forth between clients. This also means that you will have to develop a client that connects to the server and is able to send messages to and receive messages from the server.

If you want to get fancy, develop a configuration file parser, and put the listening IP/port into a file to be read by the server when it starts. This is not required, it's just my way of saying that if you want to get creative, it's absolutely encouraged.

Good luck :-)

5. Author

5.1. Background

I started coding at the early age of 10, and once I wrote my first few lines on the MSX-170/MSX-350, I never looked back. Programming and being able to have full control over a machine has been an addiction of mine for many years gone and many to come. I started to dwell in the security field by writing my first symmetric encryption algorithm at the age of 14, which got me even more interested in programming but at a totally different level. All I have wanted ever since is to be able to code at the most intimate level of the machine, and so I have done: I'm now able to code some of the BIOS, and I have learned Verilog and am able to design and write FPGA solutions.

As for security, let's just say it became second nature to me and a passion. I see vulnerabilities in humans, let alone code, and I can manipulate about everything from a group of processes to a group of people. It all comes down to this: once you discover this security 7th sense it becomes just like your sense of vision, it changes each and every aspect of your life.

5.2. Contact Information

Please visit http://amr-ali.co.cc

6.
Thanks

I'd like to thank my mother for all the support, encouragement, and love she always gave me in that direction. (Love ya mommy :-P)

I would also like to give thanks to the people that effectively changed my life for the better and were patient all along ...

Gerald M O'Steen - For being the awesome mentor he is, for teaching me everything he could, and for being a very very good friend.

Mark LaDoux - For beating me like a dead cow till I matured and learned the ways of pursuing knowledge, and for being a good friend.

Love you all guys <31337

