en

The VICE Channels

    Image: /Shutterstock

    Know Your Language: Gazing Into the Relational Database Void with SQL

    Written by

    Michael Byrne

    Editor

    On its face, SQL is straight up dullsville: An ALL-CAPs chore for managing databases and only for managing databases.

    Specificity doesn't help much: SQL is a set of statements for manipulating and querying a certain category of database known as a relational database. Hrm.

    Deeper still: SQL is capable of representing bafflingly complex operations on many data tables that are simultaneously featuring many intersecting, intricate relationships among many millions of entries. It does this with simple language and intuitive forms. SQL makes much of our data-based world possible. Nonetheless, among those outside the world of relational databases, it's power is almost certain to be understated.

    Before describing SQL, it's worth asking if SQL is a programming language at all. It's a reasonably common question and one that gets at the heart of what SQL even is.

    The answer is in the name. SQL stands for "structured query language." So, we can at least say that it is a type of language, albeit one that is wildly different in syntax and behavior than a more traditional language like Java or C++. Which is fine because you would never use SQL to do traditional programming language tasks, though common contemporary SQL variations have been shown to be Turing complete, e.g. able to do all of the tasks of any other programming language.

    So, what even is SQL?

    SELECT FROM WHERE

    SQL is considered to be a fourth-generation programming language. This is in contrast to third-generation languages like Java and C++, with the difference being in abstraction and human-readability. (The generational distinction among languages is kind of esoteric, so don't stress it too hard.) Human-readability is in the eye of the reader, of course, but at a basic level, yes, SQL kinda/sorta looks like words that would come from the mouth area of a human.

    The main statements employed by SQL are SELECT, FROM, and WHERE, though there are several more. (They don't have to be in all-caps for the SQL engine to process them, but that's the convention.) These are just ways of telling the SQL engine what data we want to extract from a table or tables. For instance, if we wanted to get all of the names in some table called "people" we could just write something like this, assuming "names" is the name of one of the table's columns:

    SELECT names FROM people

    If we wanted to specify some condition that would give us only a smaller subset of those names, we could use the WHERE clause like this, assuming "age" is another column:

    SELECT names FROM people WHERE age > 21

    What we'll get in return from these statements are really new tables. Both of the examples above will consist of only names. We could add another column to the new table, the result of our query, by doing something like this:

    SELECT names, age FROM people WHERE age > 21

    We'll get the same names as in the second table, but we'll also get a second column showing corresponding ages.

    This seems simple and, honestly, a lot of SQL operations are this simple, but two main features can blow the complexity of our queries into the stratosphere: nested queries and JOINs.

    Queries of queries of queries

    As the results of our queries are really just more tables, we can do more queries on the tables we just created. We can do this more or less forever, linking tables to tables to tables to tables.

    Just to be clear, a single nested subquery would look like this:

    SELECT names FROM people WHERE name NOT IN (SELECT names FROM badPeople)

    So, that one would get only the names of people that are not in this other table badPeople. You can keep nesting from there until you run out of tables in the whole universe and then you could subquery your original table and start again, though anyone that has to work with your code will hate you after the first couple.

    The mystique of the JOIN

    There's a deep, almost philosophical problem at the heart of databases, which is how to handle relationships among different data entities. Some attribute of one table might be an attribute in another and those attributes might be unique identifiers of something, aka keys. (A key is some attribute that will be matched with one and only one entity.) So, in situations like this, we wind up with the problem of deleting or altering data in one table and then having to do the same to some other table, possibly resulting in a cascade of data alterations across an entire database.

    This is done according to some predetermined rules, like, yeah, OK, please nuke the other data too. Or don't. Or disassociate the other data with the data being altered or deleted. Sheesh.

    You wind up with eye-crossing webs of dependencies and no real easy way to unravel it all. Entities have unique attributes, which might also be unique attributes of other entities! Ahhh!

    But there's a reason these dependencies are allowed to exist, which is their power to describe relationships between data, even exceedingly far flung data.

    The command that SQL provides for following these paths is called the JOIN. Giving a code example of a JOIN would, I think, just make this more confusing, but the idea is that it's possible to bridge between tables, sometimes a whole lot of them, by matching unique identifiers, table by table.

    By following these JOIN'd bridges of unique identifiers it becomes possible to describe some way distant and obscure relationships. JOINS can get really bonkers really fast. You can imagine the result of a bunch of JOIN operations as a fucked up, kaleidoscopic Venn diagram of overlapping relationships.

    Relational algebra

    SQL has its roots in set theory and pure mathematics. The concepts involved in relational databases extend well beyond JOINs and SELECTs and computers and computer memory: entities, attributes, relationships. In 1970, a computer scientist at IBM named Edward Codd decided to formalize this stuff into a new algebra—a novel system of symbols and symbol manipulations meant to organize data not as tangled webs of hierarchies, but as simple tables. Rows and columns.

    In a paper describing the new system, Codd offered a sort of manifesto:

    Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.

    The data structures of computing—arrays and linked lists and binary trees etc.—are pretty great for doing stuff with data, but not so much for organizing it. This is more or less what's offered by a navigational database, which was the big organizational scheme of the pre-SQL era and one that still exists (for some good reasons). A user or programmer entered a database at a certain point and then sought out data as though they were traveling along a map, using paths and propositions like "next", "previous", "first", "last", "up", "down", etc.

    The progenitor of the navigational database paradigm was Charles Bachmann, who in 1973 offered the following in a now-famous lecture:

    Copernicus completely reoriented our view of astronomical phenomena when he suggested that the earth revolves about the sun. There is a growing feeling that data processing people would benefit if they were to accept a radically new point of view, one that would liberate the application programmer's thinking from the centralism of core storage and allow him the freedom to act as a navigator within a database. To do this, he must first learn the various navigational skills; then he must learn the "rules of the road" to avoid conflict with other programmers as they jointly navigate the database information space.

    "This reorientation will cause as much anguish among programmers as the heliocentric theory did among ancient astronomers and theologians," Bachman promised.

    That same year, after mostly sitting on the idea since its 1970 unveiling, IBM decided to develop the relational database idea. Codd wasn't invited, however, and the SQL database system would be built without him, originally as the SEQUEL Structured English QUery Language. One of its early designers and advocates was the computer scientist Don Chamberlin, who offered this praise:

    ...since I'd been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long, that would navigate through this labyrinth of pointers and stuff. Codd would sort of write them down as one-liners. ... They weren't complicated at all. I said, 'Wow.' This was kind of a conversion experience for me. I understood what the relational thing was about after that.

    IBM finally released the first version, SQL/DS, in 1981. The much more successful DB2 version was released in 1983. Relational Software, Inc., which would go on to become the Oracle Corporation, was at the same time releasing its own variant of the relational database model (the object-relational database), with version 3 of the Oracle database hitting the market in 1983.

    Getting started

    Since the late 80s SQL has been standardized and developed jointly by American National Standards Institute (ANSI) and of the International Organization for Standardization (ISO). There are many open-source implementations, including SQLite, PostgreSQL, and MySQL. There are further sub-variants depending on features and interfaces and storage engines. It's kind of a mess.

    Further complicating things is the fact that SQL is very often used within another language, like PHP, which is what SQL is probably most often associated with. In this context, data can be queried and manipulated within the a regular old computer program—most likely a server.

    Even though I did SQL in a databases class in school, I did Khan Academy's course as well because it's really good and hands-on. You wind up spending much more time actually screwing around with databases than you do listening to lectures, which is a really good thing for SQL in particular.

    Read more Know Your Language.