I will here develop some ideas on the platform of Peter Boncz's inaugural lecture mentioned in the previous post. This is a high-level look at where the leading edge of analytics will be, now that the column store is mainstream.
Peter's description of his domain was roughly as follows, summarized from memory:
The new chair is for data analysis and engines for this purpose. The data analysis engine includes the analytical DBMS but is a broader category. For example, the diverse parts of the big data chain (including preprocessing, noise elimination, feature extraction, natural language extraction, graph analytics, and so forth) fall under this category, and most of these things are usually not done in a DBMS. For anything that is big, the main challenge remains one of performance and time to solution. These things are being done, and will increasingly be done, on a platform with heterogenous features, e.g., CPU/GPU clusters, possibly custom hardware like FPGAs, etc. This is driven by factors of cost and energy efficiency. Different processing stages will sometimes be distributed over a wide area, as for example in instrument networks and any network infrastructure, which is wide area by definition.
The design space of database and all that is around it is huge, and any exhaustive exploration is impossible. Development times are long, and a platform might take ten years to be mature. This is ill compatible with academic funding cycles. However, we should not leave all the research in this to industry, as industry maximizes profit, not innovation or absolute performance. Architecting data systems has aspects of an art. Consider the parallel with architecture of buildings: There are considerations of function, compatibility with environment, cost, restrictions arising from the materials at hand, and so forth. How a specific design will work cannot be known without experiment. The experiments themselves must be designed to make sense. This is not an exact science with clear-cut procedures and exact metrics of success.
This is the gist of Peter's description of our art. Peter's successes, best exemplified by MonetDB and Vectorwise, arise from focus over a special problem area and from developing and systematically applying specific insights to a specific problem. This process led to the emergence of the column store, which is now a mainstream thing. The DBMS that does not do columns is by now behind the times.
Needless to say, I am a great believer in core competence. Not every core competence is exactly the same. But a core competence needs to be broad enough so that its integral mastery and consistent application can produce a unit of value valuable in itself. What and how broad this is varies a great deal. Typically such a unit of value is something that is behind a "natural interface." This defies exhaustive definition but the examples below may give a hint. Looking at value chains and all diverse things in them that have a price tag may be another guideline.
There is a sort of Hegelian dialectic to technology trends: At the start, it was generally believed that a DBMS would be universal like the operating system itself, with a few products with very similar functionality covering the whole field. The antithesis came with Michael Stonebraker declaring that one size no longer fit all. Since then the transactional (OLTP) and analytical (OLAP) sides are clearly divided. The eventual synthesis may be in the air, with pioneering work like HyPer led by Thomas Neumann of TU München. Peter, following his Humbolt prize, has spent a couple of days a week in Thomas's group, and I have joined him there a few times. The key to eventually bridging the gap would be compilation and adaptivity. If the workload is compiled on demand, then the right data structures could always be at hand.
This might be the start of a shift similar to the column store turning the DBMS on its side, so to say.
In the mainstream of software engineering, objects, abstractions and interfaces are held to be a value almost in and of themselves. Our science, that of performance, stands in apparent opposition to at least any naive application of the paradigm of objects and interfaces. Interfaces have a cost, and boxes limit transparency into performance. So inlining and merging distinct (in principle) processing phases is necessary for performance. Vectoring is one take on this: An interface that is crossed just a few times is much less harmful than one crossed a billion times. Using compilation, or at least type-and-data-structure-specific variants of operators and switching their application based on run-time observed behaviors, is another aspect of this.
Information systems thus take on more attributes of nature, i.e., more interconnectedness and adaptive behaviors.
Something quite universal might emerge from the highly problem-specific technology of the column store. The big scan, selective hash join plus aggregation, has been explored in slightly different ways by all of HyPer, Vectorwise, and Virtuoso.
Interfaces are not good or bad, in and of themselves. Well-intentioned naïveté in their use is bad. As in nature, there are natural borders in the "technosphere"; declarative query languages, processor instruction sets, and network protocols are good examples. Behind a relatively narrow interface lies a world of complexity of which the unsuspecting have no idea. In biology, the cell membrane might be an analogy, but this is in all likelihood more permeable and diverse in function than the techno examples mentioned.
With the experience of Vectorwise and later Virtuoso, it turns out that vectorization without compilation is good enough for TPC-H. Indeed, I see a few percent of gain at best from further breaking of interfaces and "biology-style" merging of operators and adding inter-stage communication and self-balancing. But TPC-H is not the end of all things, even though it is a sort of rite of passage: Jazz players will do their take on Green Dolphin Street and Summertime.
Science is drawn towards a grand unification of all which is. Nature, on the other hand, discloses more and more diversity and special cases, the closer one looks. This may be true of physical things, but also of abstractions such as software systems or mathematics.
So, let us look at the generalized DBMS, or the data analysis engine, as Peter put it. The use of DBMS technology is hampered by its interface, i.e., declarative query language. The well known counter-reactions to this are the NoSQL, MapReduce, and graph DB memes, which expose lower level interfaces. But then the interface gets put in the whole wrong place, denying most of the things that make the analytics DBMS extremely good at what it does.
We need better and smarter building blocks and interfaces at zero cost. We continue to need blocks of some sort, since algorithms would stop being understandable without any data/procedural abstraction. At run time, the blocks must overlap and interpenetrate: Scan plus hash plus reduction in one loop, for example. Inter-thread, inter-process status sharing for things like top k for faster convergence, for another. Vectorized execution of the same algorithm on many data for things like graph traversals. There are very good single blocks, like GPU graph algorithms, but interface and composability are ever the problem.
So, we must unravel the package that encapsulates the wonders of the analytical DBMS. These consist of scan, hash/index lookup, partitioning, aggregation, expression evaluation, scheduling, message passing and related flow control for scale-out systems, just to mention a few. The complete list would be under 30 long, with blocks parameterized by data payload and specific computation.
By putting these together in a few new ways, we will cover much more of the big data pipeline. Just-in-time compilation may well be the way to deliver these components in an application/environment tailored composition. Yes, keep talking about block diagrams, but never once believe that this represents how things work or ought to work. The algorithms are expressed as distinct things, but at the level of the physical manifestation, things are parallel and interleaved.
The core skill for architecting the future of data analytics is correct discernment of abstraction and interface. What is generic enough to be broadly applicable yet concise enough to be usable? When should the computation move, and when should the data move? What are easy ways of talking about data location? How can protect the application developer be protected from various inevitable stupidities?
No mistake about it, there are at present very few people with the background for formulating the blueprint for the generalized data pipeline. These will be mostly drawn from architects of DBMS. The prospective user is any present-day user of analytics DBMS, Hadoop, or the like. By and large, SQL has worked well within its area of applicability. If there had never been an anti-SQL rebel faction, SQL would not have been successful. Now that a broader workload definition calls for redefinition of interfaces, so as to use the best where it fits, there is a need for re-evaluation of the imperative Vs. declarative question.
T. S. Eliot once wrote that humankind cannot bear very much reality. It seems that we in reality can deconstruct the DBMS and redeploy the state of the art to serve novel purposes across a broader set of problems. This is a cross-over that slightly readjusts the mental frame of the DBMS expert but leaves the core precepts intact. In other words, this is a straightforward extension of core competence with no slide into the dilettantism of doing a little bit of everything.
People like MapReduce and stand-alone graph programming frameworks, because these do one specific thing and are readily understood. By and large, these are orders of magnitude simpler than the DBMS. Even when the DBMS provides in-process Java or CLR, these are rarely used. The single-purpose framework is a much narrower core competence, and thus less exclusive, than the high art of the DBMS, plus it has a faster platform development cycle.
In the short term, we will look at opening the SQL internal toolbox for graph analytics applications. I was discussing this idea with Thomas Neumann at Peter Boncz's party. He asked who would be the user. I answered that doing good parallel algorithms, even with powerful shorthands, was an expert task; so the people doing new types of analytics would be mostly on the system vendor side. However, modifying such for input selection and statistics gathering would be no harder than doing the same with ready-made SQL reports.
There is significant possibility for generalization of the leading edge of database. How will this fare against single-model frameworks? We hope to shed some light on this in the final phase of LDBC and beyond.