Dissertation on Table Header
Since I started this project, the table header has always been problematic. At the beginning I liked Walt Hobbs' RDB idea, with a rich header containing all sorts of information. Over the years I experimented with almost every possible header format, including writing the header in a separate file. Every solution had both advantages and disadvantages, and the table header is a topic that can lead to endless discussions on which format is best. To cut a long story short, I will just summarize why I finally settled on the current format:
- The paradigm underlying NoSQL only needs to attach a name to each delimited column in a file. If we were happy with referring to columns by their numerical position, we would be perfectly OK with AWK, cut, sort, join, ... you name them.
- To NoSQL everything is just a string of any size, and it is not up to NoSQL to enforce datatypes, lengths or anything else. If we want those things then we had better turn to another DBMS, as there are plenty of them.
- Having the header inside the same file as the data, instead of in a separate file as other DBMSs do, has the following advantages:
  - easier updates: only one file to lock and less risk of inconsistency, for instance if the system crashes in the middle of an update.
  - on-the-fly views: no need to write a temporary header somewhere.
  - most table manipulations (think of join operations) are much easier if tables are self-contained, without separate pieces scattered all over the place.
  - tables can be manipulated with stock UNIX command-line utilities, without the need to have everything specifically programmed for the job, possibly in C to retain an acceptable speed (the original RDB jointbl operator is an example of what I mean).
- The current header is:
  - simple.
  - close enough to other similar implementations to make it easy to convert back and forth between them with simple one-liners.
  - free from spurious information that is either redundant or does not belong in NoSQL (datatypes, lengths, dashline, ...). The actual length of a piece of data is the true information, no matter what we redundantly declare in any header.
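To make the first point above concrete, here is a minimal sketch of what "attaching a name to a column" buys over plain positional access. It assumes a tab-delimited table whose first line holds the column names, each prefixed with \001; the file name people.tbl and the helper col_by_name are made up for this illustration and are not NoSQL commands.

```shell
# Sample table; the tab delimiter and the \001 name prefix are assumptions.
printf '\001name\t\001city\nalice\tRome\nbob\tOslo\n' > people.tbl

# col_by_name COLUMN FILE: print one column, selected by its header name.
# Hypothetical helper, written with stock awk only.
col_by_name() {
    awk -F '\t' -v col="$1" '
    NR == 1 {                        # header line: map the name to a position
        for (i = 1; i <= NF; i++) {
            name = $i
            sub("^\001", "", name)   # drop the leading \001 marker, if present
            if (name == col) pos = i
        }
        if (!pos) exit 1             # unknown column name
        next
    }
    { print $pos }
    ' "$2"
}

col_by_name city people.tbl
```

With the sample data this prints Rome and Oslo. Because the header travels inside the table itself, the script needs no second file to find out where "city" lives.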
The fact that \001 always sorts at the top makes it possible to perform some operations on tables directly with stock UNIX tools, without even resorting to NoSQL commands. This can be very beneficial to speed, especially when it comes to using NoSQL to back a busy web server, as I often do.
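As a rough illustration of the sorting point, the sketch below runs plain sort on a small table and the header stays on top. It assumes a tab-delimited table in which every column name in the header line is prefixed with \001, and it forces the C locale so that comparison is byte-wise; the file name cities.tbl and the helper sort_table are hypothetical names used only here.

```shell
# Sample table, deliberately unsorted; the header format is an assumption.
printf '\001name\t\001city\nalice\tRome\nbob\tOslo\n' > cities.tbl

# sort_table FIELD FILE: sort on one tab-separated field with stock sort.
# LC_ALL=C gives byte-wise ordering, where \001 precedes printable characters.
sort_table() {
    LC_ALL=C sort -t "$(printf '\t')" -k "$1,$1" "$2"
}

sort_table 2 cities.tbl
```

Sorting on the second field puts the header first, then the Oslo row, then the Rome row. The header would be displaced only if a data value in the sort field itself began with \001, and in other locales the ordering may differ, which is why the sketch pins LC_ALL=C.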
So, in conclusion, of all the headers I've played with, the current one is the most satisfactory compromise I've come up with.