NoSQL Relational Database Management System: Big tables

NoSQL: a non-SQL RDBMS

Recent Pages

Commands at a glance
edittable: trivial tabl ...
indextable: generate ta ...
NoSQL Utilities
Text Formatting Rules
Site SandBox
Change Management
Wiki Editor's How-To
Site Terms of Use
Site Copyright

Links

NoSQL Home

Table of Contents

Wiki Editor's How-To

Text Formatting Rules

Wiki SandBox

www.strozzi.it on Twitter

Google Ads

Please Support NoSQL !

Session

Campaigns

Big tables

Let me begin this section by saying that big tables should be avoided in the first place. By cleverly organizing your data you may be able to keep your tables within a manageable size even if the content of the overall database is large. Please have a look at section 2.9 of The UNIX Shell As a Fouth Generation Language paper. Remember that NoSQL works with UNIX, not in addition to it. The underlying UNIX file system, that more conventional databases tend to disregard, can provide, when used creatively, an extremely powerful way to efficiently pre-organize your tables (relations) and keep them small (where small means roughly within a few hundred kilobytes each).

If you rather prefer/need to work with large, monolithic data-sets, then you can still do it efficiently by applying your changes to a separate file rather than to the actual table. The changes can then be merged back into the big table (and any indices can be re-built) only every now and again, with a batch job that may be run in the background, overnight or when the system activity is low. The following example will try to explain this better.

Suppose we have the large indexed table bigtable and we need to change/delete one or more records. What we can do is:

Create a journaling file, say bigtable.updates, with exactly the same header as bigtable, but containing only those records that we want to update or remove from bigtable. The entries in bigtable.updates will have to be in a format suitable for the 'updtable' operator.
Whenever we fetch data from bigtable we will have to do it in two logical steps: the first step is to use 'searchtable' on bigtable to take advantage of the indices; the second step will be to merge the result of the first step into bigtable.updates.
Any updates to bigtable will be done to bigtable.updates, with the syntax described in the documentation of the 'update' operator. Unlike bigtable, where records will usually be kept sorted on the primary key field (i.e., on the leftmost table column), the records in bigtable.updates do not need to be sorted in any particular order.

This may seem complicated, but it isn't much, really. Say bigtable contains two columns, SSN (for Social Security No.) and SURNAME, where the unique table key is on SSN, and we have built a secondary index on the SURNAME field. If we want to extract all the rows which SURNAME field is equal to "Smith," the query necessary to take advantage of a fast search method on bigtable and to account for the presence of bigtable.updates is:

       printf '\001SURNAME\nSmith\n' |
         searchtable --index bigtable._x.SURNAME |
         updtable --no-insert bigtable.updates

As you can see, the trick is:

Perform an indexed 'searchtable' on bigtable to quickly obtain a much smaller (possibly empty) subset of its data.
Merge the first output with bigtable.updates on-the-fly during the query.

The 'searchtable' operator is very powerful, but being written in Perl it may take a while to load. If the following conditions are met, with respect to the previous example:

bigtable leftmost field is "SURNAME," and
the table is sorted on that field,

then you only need to perform one single lookup of bigtable.

If the above is the case -- a common situation -- then here is a faster alternative based on the look(1) shell utility (which is about twice as fast as 'searchtable'):

       (head -1 bigtable; look Smith bigtable) |
               updtable --no-insert bigtable.updates

For convenience, NoSQL provides the 'keysearch' operator, which is a header-aware front-end to the look(1) utility, so that the above can be rewritten as:

       keysearch Smith bigtable |
                    updtable --no-insert bigtable.updates

Newer versions of 'keysearch' support secondary indices, thus expensive calls to the more feature-full 'searchtable' can almost always be avoided and the following becomes possible:

       keysearch --index bigtable._x.SURNAME Smith bigtable |
                             updtable --no-insert bigtable.updates

For 'keysearch' to use secondary index files built on multiple columns, such as bigtable._x.NAME.SURNAME, the user needs to know how the index was built, as it contains separators in the form of colons ( :), which must be included in the search string passed to 'keysearch'.

If bigtable is subject not only to deletions/updates, but also to the insertion of new keys, the shortcoming of the above example is that it will return no match if the Smith record occurs only in bigtable.updates and not in bigtable. When that may be the case, the above command needs to be made more general, although slightly more expensive in terms of system resources, as follows:

       keysearch --index bigtable._x.SURNAME Smith bigtable |
                   updtable bigtable.updates | getrow 'SURNAME=="Smith"'

This last example shows a general NoSQL tenet: the input stream can be narrowed to a more manageable size by using record pre-selection techniques, such as indexes or other non-sequential structures, but the rest of the processing is done in a linear, i.e. sequential fashion, in line with the underlying Operator-Stream paradigm^?.

Sometimes there may be multiple update tables associated with bigtable, such as bigtable.updates-1, bigtable.updates-2, and so forth. Provided that all of the update tables are union-compatible, i.e. they all have the same fields in the same positions, the last example above can be modified as follows:

       keysearch --index bigtable._x.SURNAME Smith bigtable > result.tmp
       uniontable bigtable.updates-* |
                updtable --stdin result.tmp | getrow 'SURNAME=="Smith"'

Note that indexed access methods, like those provided by 'searchtable' and 'keysearch,' may be worthwhile only if the table being searched is really big, say over a few thousand kilobytes. With a 300 KB table a linear search with 'grep' often is still faster, even on an old PII-233 box. Actually, with 'grep' a lot depends on the complexity of the pattern to be matched, whether we use an extended regular expression, and so forth. Non-sequential access methods always add to complication (indexed files, sorted tables) and should not be used without prior thinking.

Trackbacks (6) | New trackback | Comments (0) | Print