A quick look at the overall structure and some interesting aspects of PostgreSQL

To get and compile the code, run the normal commands:

$ git clone git://git.postgresql.org/git/postgresql.git
$ cd postgresql
$ ./configure
$ make

(Interestingly, a top-level Makefile is committed into the repository, but it is only a thin wrapper that hands off to the GNUmakefile generated by configure, so you still need to run the configure script first.)

Before looking at the code, try to imagine what the overall structure might be, based on what it needs to accomplish. There needs to be a network component that hands data to a parser, then maybe an optimizer, and some code to actually run the queries. There is of course a doc directory and README files, but let’s jump right into the src directory.

Structure is the key to understanding code. A diagram might help, but here we’re trying to work out the structure by looking at the code itself.

The broad division in PostgreSQL is between the frontend (libraries and command-line tools making requests over the network) and the backend. The backend is probably more interesting. Run $ cd src/backend; ls and we find these entries (among others):

Makefile	common.mk	main		po		replication
access		executor	nls.mk		port		rewrite
bootstrap	foreign		nodes		postgres	snowball
catalog		lib		optimizer	postmaster	storage
commands	libpq		parser		regex		tcop
tsearch 	utils

Whoa, what is snowball? cd snowball; cat README. It looks like a natural language processing library designed to find the stem of a word: for example, from the word “eating” it will find “eat.” Apparently PostgreSQL lets you search text based on the stems of words. That’s cool, we’re learning already.
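To make the idea of stemming concrete, here is a deliberately naive suffix stripper. It is not the Snowball algorithm and not anything from the PostgreSQL tree; the function name toy_stem and the three-suffix list are made up purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * toy_stem: hypothetical, naive stemmer for illustration only.
 * Strips the first matching suffix in place, provided at least
 * three letters of stem remain. Real stemmers apply many more rules.
 */
void
toy_stem(char *word)
{
        static const char *suffixes[] = {"ing", "ed", "s"};
        size_t      len;
        size_t      i;

        assert(word != NULL);
        len = strlen(word);

        for (i = 0; i < sizeof(suffixes) / sizeof(suffixes[0]); i++)
        {
                size_t      slen = strlen(suffixes[i]);

                /* suffix must match and leave a stem of 3+ letters */
                if (len >= slen + 3 &&
                        strcmp(word + len - slen, suffixes[i]) == 0)
                {
                        word[len - slen] = '\0';
                        return;
                }
        }
}
```

With this sketch “eating” becomes “eat” and “sing” is left alone, but “running” becomes “runn”, which is exactly the kind of case a real stemming library has to handle properly.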

Optimizer and parser are easy enough to figure out, but what about postmaster? Does that sound like a network layer? Run less postmaster/postmaster.c and we see this beautiful comment:

 * postmaster.c
 *        This program acts as a clearing house for requests to the
 *        POSTGRES system.  Frontend programs send a startup message
 *        to the Postmaster and the postmaster uses the info in the
 *        message to setup a backend process.
 *        The postmaster also manages system-wide operations such as
 *        startup and shutdown. The postmaster itself doesn't do those
 *        operations, mind you --- it just forks off a subprocess to do them
 *        at the right times.  It also takes care of resetting the system
 *        if a backend crashes.
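The "forks off a subprocess" pattern described in that comment can be sketched in a few lines of POSIX C. This is not the postmaster's code: run_in_subprocess and sample_task are hypothetical names, and a real server would normally not block waiting for the child the way this sketch does.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * run_in_subprocess: hypothetical helper, illustration only.
 * Runs task() in a forked child and returns the child's exit
 * status to the parent, or -1 if anything goes wrong.
 */
int
run_in_subprocess(int (*task) (void))
{
        pid_t       pid;
        int         status;

        assert(task != NULL);

        pid = fork();
        if (pid < 0)
                return -1;              /* fork failed */

        if (pid == 0)
                _exit(task());          /* child: do the work, then exit */

        /* parent: does none of the work itself, only collects the result */
        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
                return -1;

        return WEXITSTATUS(status);
}

/* Stand-in for "the actual operation"; returns a recognizable code. */
int
sample_task(void)
{
        return 7;
}
```

Calling run_in_subprocess(sample_task) hands the work to a child process, and the parent only sees the exit code. Keeping the supervisor this thin is what lets it survive, and clean up after, a crash in one of its children.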

So PostgreSQL has a high-level message concept, excellent. Running grep socket * -r leads to libpq/pqcomm.c, which contains the low-level network routines. It holds another nice comment:

 *  StreamServerPort        - Open postmaster's server port
 *  StreamConnection        - Create new connection with client
 *  StreamClose             - Close a client/backend connection
 *  TouchSocketFiles        - Protect socket files against /tmp cleaners
 *  pq_init                 - initialize libpq at backend startup
 *  pq_comm_reset           - reset libpq during error recovery
 *  pq_close                - shutdown libpq at backend exit
 * low-level I/O:
 *  pq_getbytes             - get a known number of bytes from connection
 *  pq_getstring            - get a null terminated string from connection
 *  pq_getmessage           - get a message with length word from connection
 *  pq_getbyte              - get next byte from connection
 *  pq_peekbyte             - peek at next byte from connection
 *  pq_putbytes             - send bytes to connection (flushed by pq_flush)
 *  pq_flush                - flush pending output
 *  pq_flush_if_writable    - flush pending output if writable without blocking
 *  pq_getbyte_if_available - get a byte if available without blocking
 * message-level I/O (and old-style-COPY-OUT cruft):
 *  pq_putmessage           - send a normal message (suppressed in COPY OUT mode)
 *  pq_putmessage_noblock   - buffer a normal message (suppressed in COPY OUT)
 *  pq_startcopyout         - inform libpq that a COPY OUT transfer is beginning
 *  pq_endcopyout           - end a COPY OUT transfer
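The phrase "a message with length word" in the pq_getmessage entry is worth a small illustration. The sketch below decodes one length-prefixed message from an in-memory buffer. It is not taken from pqcomm.c: the function name read_length_prefixed is hypothetical, and the layout (a 4-byte big-endian length that counts itself, followed by the payload) is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * read_length_prefixed: hypothetical helper, illustration only.
 * Assumed layout: 4-byte big-endian length word that includes
 * itself, followed by (length - 4) payload bytes.
 * Returns the number of bytes consumed, or 0 if the buffer does
 * not hold a complete, well-formed message.
 */
size_t
read_length_prefixed(const uint8_t *buf, size_t buflen,
                                         const uint8_t **payload, size_t *payload_len)
{
        uint32_t    len;

        assert(buf != NULL && payload != NULL && payload_len != NULL);

        if (buflen < 4)
                return 0;               /* length word not complete yet */

        /* assemble the length word from network byte order */
        len = ((uint32_t) buf[0] << 24) | ((uint32_t) buf[1] << 16) |
                ((uint32_t) buf[2] << 8) | (uint32_t) buf[3];

        if (len < 4 || len > buflen)
                return 0;               /* malformed, or body not fully received */

        *payload = buf + 4;
        *payload_len = len - 4;
        return len;
}
```

A caller that reads from a socket would keep appending bytes to its buffer and retry until the function returns a nonzero count. That is the reason a "get a known number of bytes" primitive sits underneath the message-level routines.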

Moving back up, the lib directory looks interesting. I wonder what’s in it?

$ ls lib
Makefile	    bipartite_match.c    objfiles.txt	   stringinfo.c
README		    hyperloglog.c        pairingheap.c
binaryheap.c	    ilist.c              rbtree.c

Fascinating, PostgreSQL uses a red-black tree. I haven’t used those much since college. Here is some code from rbtree.c. It looks a lot like a college textbook:

/*
 * rb_leftmost: fetch the leftmost (smallest-valued) tree node.
 * Returns NULL if tree is empty.
 *
 * Note: in the original implementation this included an unlink step, but
 * that's a bit awkward.  Just call rb_delete on the result if that's what
 * you want.
 */
RBNode *
rb_leftmost(RBTree *rb)
{
        RBNode     *node = rb->root;
        RBNode     *leftmost = rb->root;

        /* walk down the left links until we hit the sentinel */
        while (node != RBNIL)
        {
                leftmost = node;
                node = node->left;
        }

        if (leftmost != RBNIL)
                return leftmost;

        return NULL;
}
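To try the function above outside the PostgreSQL tree, it needs some types around it. The sketch below supplies simplified stand-ins: the RBNode and RBTree structs here are cut down for illustration (the real ones carry more fields), the int key is a hypothetical payload, and demo_leftmost_key is a made-up helper that builds a three-node tree by hand with no balancing.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; not PostgreSQL's definitions. */
typedef struct RBNode
{
        struct RBNode *left;
        struct RBNode *right;
        int         key;            /* hypothetical payload */
} RBNode;

typedef struct RBTree
{
        RBNode     *root;
} RBTree;

/* One shared sentinel stands in for every missing child. */
static RBNode sentinel = {&sentinel, &sentinel, 0};
#define RBNIL (&sentinel)

RBNode *
rb_leftmost(RBTree *rb)
{
        RBNode     *node;
        RBNode     *leftmost;

        assert(rb != NULL);
        node = rb->root;
        leftmost = rb->root;

        /* follow left links until the sentinel is reached */
        while (node != RBNIL)
        {
                leftmost = node;
                node = node->left;
        }

        return (leftmost != RBNIL) ? leftmost : NULL;
}

/* Hand-built tree: 20 at the root, 10 on the left, 30 on the right. */
int
demo_leftmost_key(void)
{
        RBNode      n10 = {RBNIL, RBNIL, 10};
        RBNode      n30 = {RBNIL, RBNIL, 30};
        RBNode      n20 = {&n10, &n30, 20};
        RBTree      tree = {&n20};

        return rb_leftmost(&tree)->key;
}
```

Using a sentinel node instead of NULL for missing children means the loop never has to special-case a null pointer, and an empty tree is simply one whose root is the sentinel.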
