Skip to content

wangxiaoying/qpmodel

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A Relational Optimizer and Executor Modeling

This project models a relational optimizer and executor in c#. The main target is the optimizer, and the purpose is to prepare for a more serious production implementation later. The executor part is needed for plan correctness verification. See github project "issues" for bugs and todo items.

Why C#

Optimizer is logic centric, so a high-level language is preferred. After modeling, production may want to turn it into some C/C++ code, so the language must be a close relative of them. C# provides some great features like LINQ, dynamic types to make modeling easy, and it is close enough to C++ (and that's why not python).

Optimizer

The optimizer exercises the following constructs:

  • Top down/bottom up structure: the optimizer does utilize a top down cascades style optimizer structure but optionally you can choose to use bottom up join order resolver. It currently use DPccp ("Analysis of Two Existing and One New Dynamic Programming Algorithm") by G. Moerkotte, et al. It also implements some other join order resolver like DPBushy (TDBasic, GOO), mainly for the purpose of correctness verification. A more generic join resolver DPHyper (Dynamic Programming Strikes Back) is in preparation.
  • Subquery decorrelation: it follows the "Unnesting Arbitrary Queries" and "The Complete Story of Joins (in Hyper)" by T. Neumann et al.
  • CTE inline/non-inline (ongoing): it follows the Optimization of Common Table Expressions in MPP Database Systems by A. El-Helw et al.
  • Cardinality estimation, costing: currently this follows "as-is" textbook implementation. We did not spend much time here because it is a local issue, meaning later improvements shall not impact the architecture. CE also demonstrate upgrade or version management.
  • Distributed plan: follows the remote exchange model (gather, redistributed etc). A naive in-machine parallelism can be modeled with the same scheme.
  • The optimizer also exposes a DataSet like interface which demonstrates language integration. Another example calculating Pi with Monte-Carlo can be found in Unittest/TestDataSet.
	// leverage c#'s sqrt so no reinventing wheels
	double sqroot(double d) => Math.Sqrt(d);
	SQLContext.Register<double, string>("sqroot", sqroot);

	// use sqroot in SQL
	var sql = "SELECT a1, sqroot(b1*a1+2) from a join b on b2=a2 where a1>1";

	// above query in DataSet form
	var a = sqlContext.Read("a");
	var b = sqlContext.Read("b");
	var rows = a.filter("a1>1").join(b, "b2=a2").select("a1", "sqroot(b1*a1+2)").show();
  • Verify the optimizer by some unittests and TPCH/DS. All TPCH queries are runnable. TPCDS we do not support window function and rolling groups. You can find TPCH plan here.

Executor

The executor is implemented for two purpose: (1) verify plan correctness; (2) test codeGen engineering readiness following paper "How to Architect a query compiler, Revisited" by R.Y.Tahboub et al. Executor is not intended to demonstrate implementing specific operator efficiently.

  • It does not use a LMS underneath, but thanks for c#'s interpolated string support, the template is easy to read and construct side by side with interpreted code.
  • To support incremental codeGen improvement, meaning we can ship partial codeGen implementation, it supports a generic expression evaluation function which can fallback to the interpreted implementation if that specific expression is not codeGen ready.
  • It emulates a cross machine execution using parallel threads using remote exchange model. Current setting is 10 machines, so a plan with 1 Gather and 6 redistribution utilizes 70 threads for emulation.
  • The executor utilizes Object and dynamic types to simplify implementation. To port it to C/C++, you can follow the traditional "dispatch" mechanism like a + b => {int4pl(a,b), int2pl(a,b), doublepl(a,b), ...}.

You can see an example generated code for this query here with generic expression evaluation is in this code snippet:

	// projection on PhysicHashAgg112: Output: {a.a2}[0]*2,{count(a.a1)}[1],repeat('a',{a.a2}[0]) 
	Row rproj = new Row(3);
	rproj[0] = ((dynamic)r112[0] * (dynamic)2);
	rproj[1] = r112[1];
	rproj[2] = ExprSearch.Locate("74").Exec(context, r112) /*repeat('a',{a.a2}[0])*/;
	r112 = rproj;

How to Play

Load the project with Visual Studio 2019 community edition (it is free). Program.Main() mainly for debugging purpose. There are several builtin tables like 'a', 'b', 'c', 'd' for testing purpose, and you can find their definition in Catalog.createBuildInTestTables().

Unittest comes up with multiple tests covering major functionalities:

  • TPCH end to end power run with small data set
  • TPCDS, JoBench only exercise the optimizer
  • Other grouped unittest stressing different aspects

About

A Relational Optimizer and Executor Modeling

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • C# 86.0%
  • TSQL 12.0%
  • ANTLR 2.0%