Diffstat (limited to 'docs/tutorial')
-rw-r--r-- | docs/tutorial/LangImpl1.rst |  11
-rw-r--r-- | docs/tutorial/LangImpl4.rst |   2
-rw-r--r-- | docs/tutorial/LangImpl5.rst |   2
-rw-r--r-- | docs/tutorial/LangImpl6.rst |   2
-rw-r--r-- | docs/tutorial/LangImpl7.rst |   4
-rw-r--r-- | docs/tutorial/LangImpl8.rst | 716
-rw-r--r-- | docs/tutorial/LangImpl9.rst | 262
7 files changed, 730 insertions, 269 deletions
diff --git a/docs/tutorial/LangImpl1.rst b/docs/tutorial/LangImpl1.rst index a2c5eee..f4b0191 100644 --- a/docs/tutorial/LangImpl1.rst +++ b/docs/tutorial/LangImpl1.rst @@ -73,14 +73,21 @@ in the various pieces. The structure of the tutorial is: about this is how easy and trivial it is to construct SSA form in LLVM: no, LLVM does *not* require your front-end to construct SSA form! -- `Chapter #8 <LangImpl8.html>`_: Conclusion and other useful LLVM +- `Chapter #8 <LangImpl8.html>`_: Extending the Language: Debug + Information - Having built a decent little programming language with + control flow, functions and mutable variables, we consider what it + takes to add debug information to standalone executables. This debug + information will allow you to set breakpoints in Kaleidoscope + functions, print out argument variables, and call functions - all + from within the debugger! +- `Chapter #9 <LangImpl9.html>`_: Conclusion and other useful LLVM tidbits - This chapter wraps up the series by talking about potential ways to extend the language, but also includes a bunch of pointers to info about "special topics" like adding garbage collection support, exceptions, debugging, support for "spaghetti stacks", and a bunch of other tips and tricks. -By the end of the tutorial, we'll have written a bit less than 700 lines +By the end of the tutorial, we'll have written a bit less than 1000 lines of non-comment, non-blank, lines of code. With this small amount of code, we'll have built up a very reasonable compiler for a non-trivial language including a hand-written lexer, parser, AST, as well as code diff --git a/docs/tutorial/LangImpl4.rst b/docs/tutorial/LangImpl4.rst index aa469ca..cdaac63 100644 --- a/docs/tutorial/LangImpl4.rst +++ b/docs/tutorial/LangImpl4.rst @@ -428,7 +428,7 @@ the LLVM JIT and optimizer. To build this example, use: .. code-block:: bash # Compile - clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core jit native` -O3 -o toy + clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy # Run ./toy diff --git a/docs/tutorial/LangImpl5.rst b/docs/tutorial/LangImpl5.rst index 2a3a4ce..72e34b1 100644 --- a/docs/tutorial/LangImpl5.rst +++ b/docs/tutorial/LangImpl5.rst @@ -736,7 +736,7 @@ the if/then/else and for expressions.. To build this example, use: .. code-block:: bash # Compile - clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core jit native` -O3 -o toy + clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy # Run ./toy diff --git a/docs/tutorial/LangImpl6.rst b/docs/tutorial/LangImpl6.rst index cdceb03..bf78bde 100644 --- a/docs/tutorial/LangImpl6.rst +++ b/docs/tutorial/LangImpl6.rst @@ -729,7 +729,7 @@ the if/then/else and for expressions.. To build this example, use: .. code-block:: bash # Compile - clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core jit native` -O3 -o toy + clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy # Run ./toy diff --git a/docs/tutorial/LangImpl7.rst b/docs/tutorial/LangImpl7.rst index c4c7233..c445908 100644 --- a/docs/tutorial/LangImpl7.rst +++ b/docs/tutorial/LangImpl7.rst @@ -847,7 +847,7 @@ mutable variables and var/in support. To build this example, use: .. 
code-block:: bash # Compile - clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core jit native` -O3 -o toy + clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy # Run ./toy @@ -856,5 +856,5 @@ Here is the code: .. literalinclude:: ../../examples/Kaleidoscope/Chapter7/toy.cpp :language: c++ -`Next: Conclusion and other useful LLVM tidbits <LangImpl8.html>`_ +`Next: Adding Debug Information <LangImpl8.html>`_ diff --git a/docs/tutorial/LangImpl8.rst b/docs/tutorial/LangImpl8.rst index 6f69493..4473035 100644 --- a/docs/tutorial/LangImpl8.rst +++ b/docs/tutorial/LangImpl8.rst @@ -1,267 +1,459 @@ -====================================================== -Kaleidoscope: Conclusion and other useful LLVM tidbits -====================================================== +====================================== +Kaleidoscope: Adding Debug Information +====================================== .. contents:: :local: -Tutorial Conclusion -=================== - -Welcome to the final chapter of the "`Implementing a language with -LLVM <index.html>`_" tutorial. In the course of this tutorial, we have -grown our little Kaleidoscope language from being a useless toy, to -being a semi-interesting (but probably still useless) toy. :) - -It is interesting to see how far we've come, and how little code it has -taken. We built the entire lexer, parser, AST, code generator, and an -interactive run-loop (with a JIT!) by-hand in under 700 lines of -(non-comment/non-blank) code. - -Our little language supports a couple of interesting features: it -supports user defined binary and unary operators, it uses JIT -compilation for immediate evaluation, and it supports a few control flow -constructs with SSA construction. - -Part of the idea of this tutorial was to show you how easy and fun it -can be to define, build, and play with languages. Building a compiler -need not be a scary or mystical process! Now that you've seen some of -the basics, I strongly encourage you to take the code and hack on it. -For example, try adding: - -- **global variables** - While global variables have questional value - in modern software engineering, they are often useful when putting - together quick little hacks like the Kaleidoscope compiler itself. - Fortunately, our current setup makes it very easy to add global - variables: just have value lookup check to see if an unresolved - variable is in the global variable symbol table before rejecting it. - To create a new global variable, make an instance of the LLVM - ``GlobalVariable`` class. -- **typed variables** - Kaleidoscope currently only supports variables - of type double. This gives the language a very nice elegance, because - only supporting one type means that you never have to specify types. - Different languages have different ways of handling this. The easiest - way is to require the user to specify types for every variable - definition, and record the type of the variable in the symbol table - along with its Value\*. -- **arrays, structs, vectors, etc** - Once you add types, you can start - extending the type system in all sorts of interesting ways. Simple - arrays are very easy and are quite useful for many different - applications. Adding them is mostly an exercise in learning how the - LLVM `getelementptr <../LangRef.html#i_getelementptr>`_ instruction - works: it is so nifty/unconventional, it `has its own - FAQ <../GetElementPtr.html>`_! If you add support for recursive types - (e.g. 
linked lists), make sure to read the `section in the LLVM - Programmer's Manual <../ProgrammersManual.html#TypeResolve>`_ that - describes how to construct them. -- **standard runtime** - Our current language allows the user to access - arbitrary external functions, and we use it for things like "printd" - and "putchard". As you extend the language to add higher-level - constructs, often these constructs make the most sense if they are - lowered to calls into a language-supplied runtime. For example, if - you add hash tables to the language, it would probably make sense to - add the routines to a runtime, instead of inlining them all the way. -- **memory management** - Currently we can only access the stack in - Kaleidoscope. It would also be useful to be able to allocate heap - memory, either with calls to the standard libc malloc/free interface - or with a garbage collector. If you would like to use garbage - collection, note that LLVM fully supports `Accurate Garbage - Collection <../GarbageCollection.html>`_ including algorithms that - move objects and need to scan/update the stack. -- **debugger support** - LLVM supports generation of `DWARF Debug - info <../SourceLevelDebugging.html>`_ which is understood by common - debuggers like GDB. Adding support for debug info is fairly - straightforward. The best way to understand it is to compile some - C/C++ code with "``clang -g -O0``" and taking a look at what it - produces. -- **exception handling support** - LLVM supports generation of `zero - cost exceptions <../ExceptionHandling.html>`_ which interoperate with - code compiled in other languages. You could also generate code by - implicitly making every function return an error value and checking - it. You could also make explicit use of setjmp/longjmp. There are - many different ways to go here. -- **object orientation, generics, database access, complex numbers, - geometric programming, ...** - Really, there is no end of crazy - features that you can add to the language. -- **unusual domains** - We've been talking about applying LLVM to a - domain that many people are interested in: building a compiler for a - specific language. However, there are many other domains that can use - compiler technology that are not typically considered. For example, - LLVM has been used to implement OpenGL graphics acceleration, - translate C++ code to ActionScript, and many other cute and clever - things. Maybe you will be the first to JIT compile a regular - expression interpreter into native code with LLVM? - -Have fun - try doing something crazy and unusual. Building a language -like everyone else always has, is much less fun than trying something a -little crazy or off the wall and seeing how it turns out. If you get -stuck or want to talk about it, feel free to email the `llvmdev mailing -list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_: it has lots -of people who are interested in languages and are often willing to help -out. - -Before we end this tutorial, I want to talk about some "tips and tricks" -for generating LLVM IR. These are some of the more subtle things that -may not be obvious, but are very useful if you want to take advantage of -LLVM's capabilities. - -Properties of the LLVM IR -========================= - -We have a couple common questions about code in the LLVM IR form - lets -just get these out of the way right now, shall we? 
- -Target Independence -------------------- - -Kaleidoscope is an example of a "portable language": any program written -in Kaleidoscope will work the same way on any target that it runs on. -Many other languages have this property, e.g. lisp, java, haskell, -javascript, python, etc (note that while these languages are portable, -not all their libraries are). - -One nice aspect of LLVM is that it is often capable of preserving target -independence in the IR: you can take the LLVM IR for a -Kaleidoscope-compiled program and run it on any target that LLVM -supports, even emitting C code and compiling that on targets that LLVM -doesn't support natively. You can trivially tell that the Kaleidoscope -compiler generates target-independent code because it never queries for -any target-specific information when generating code. - -The fact that LLVM provides a compact, target-independent, -representation for code gets a lot of people excited. Unfortunately, -these people are usually thinking about C or a language from the C -family when they are asking questions about language portability. I say -"unfortunately", because there is really no way to make (fully general) -C code portable, other than shipping the source code around (and of -course, C source code is not actually portable in general either - ever -port a really old application from 32- to 64-bits?). - -The problem with C (again, in its full generality) is that it is heavily -laden with target specific assumptions. As one simple example, the -preprocessor often destructively removes target-independence from the -code when it processes the input text: - -.. code-block:: c - - #ifdef __i386__ - int X = 1; - #else - int X = 42; - #endif - -While it is possible to engineer more and more complex solutions to -problems like this, it cannot be solved in full generality in a way that -is better than shipping the actual source code. - -That said, there are interesting subsets of C that can be made portable. -If you are willing to fix primitive types to a fixed size (say int = -32-bits, and long = 64-bits), don't care about ABI compatibility with -existing binaries, and are willing to give up some other minor features, -you can have portable code. This can make sense for specialized domains -such as an in-kernel language. - -Safety Guarantees ------------------ - -Many of the languages above are also "safe" languages: it is impossible -for a program written in Java to corrupt its address space and crash the -process (assuming the JVM has no bugs). Safety is an interesting -property that requires a combination of language design, runtime -support, and often operating system support. - -It is certainly possible to implement a safe language in LLVM, but LLVM -IR does not itself guarantee safety. The LLVM IR allows unsafe pointer -casts, use after free bugs, buffer over-runs, and a variety of other -problems. Safety needs to be implemented as a layer on top of LLVM and, -conveniently, several groups have investigated this. Ask on the `llvmdev -mailing list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_ if -you are interested in more details. - -Language-Specific Optimizations -------------------------------- - -One thing about LLVM that turns off many people is that it does not -solve all the world's problems in one system (sorry 'world hunger', -someone else will have to solve you some other day). 
One specific -complaint is that people perceive LLVM as being incapable of performing -high-level language-specific optimization: LLVM "loses too much -information". - -Unfortunately, this is really not the place to give you a full and -unified version of "Chris Lattner's theory of compiler design". Instead, -I'll make a few observations: - -First, you're right that LLVM does lose information. For example, as of -this writing, there is no way to distinguish in the LLVM IR whether an -SSA-value came from a C "int" or a C "long" on an ILP32 machine (other -than debug info). Both get compiled down to an 'i32' value and the -information about what it came from is lost. The more general issue -here, is that the LLVM type system uses "structural equivalence" instead -of "name equivalence". Another place this surprises people is if you -have two types in a high-level language that have the same structure -(e.g. two different structs that have a single int field): these types -will compile down into a single LLVM type and it will be impossible to -tell what it came from. - -Second, while LLVM does lose information, LLVM is not a fixed target: we -continue to enhance and improve it in many different ways. In addition -to adding new features (LLVM did not always support exceptions or debug -info), we also extend the IR to capture important information for -optimization (e.g. whether an argument is sign or zero extended, -information about pointers aliasing, etc). Many of the enhancements are -user-driven: people want LLVM to include some specific feature, so they -go ahead and extend it. - -Third, it is *possible and easy* to add language-specific optimizations, -and you have a number of choices in how to do it. As one trivial -example, it is easy to add language-specific optimization passes that -"know" things about code compiled for a language. In the case of the C -family, there is an optimization pass that "knows" about the standard C -library functions. If you call "exit(0)" in main(), it knows that it is -safe to optimize that into "return 0;" because C specifies what the -'exit' function does. - -In addition to simple library knowledge, it is possible to embed a -variety of other language-specific information into the LLVM IR. If you -have a specific need and run into a wall, please bring the topic up on -the llvmdev list. At the very worst, you can always treat LLVM as if it -were a "dumb code generator" and implement the high-level optimizations -you desire in your front-end, on the language-specific AST. - -Tips and Tricks -=============== - -There is a variety of useful tips and tricks that you come to know after -working on/with LLVM that aren't obvious at first glance. Instead of -letting everyone rediscover them, this section talks about some of these -issues. - -Implementing portable offsetof/sizeof -------------------------------------- - -One interesting thing that comes up, if you are trying to keep the code -generated by your compiler "target independent", is that you often need -to know the size of some LLVM type or the offset of some field in an -llvm structure. For example, you might need to pass the size of a type -into a function that allocates memory. - -Unfortunately, this can vary widely across targets: for example the -width of a pointer is trivially target-specific. However, there is a -`clever way to use the getelementptr -instruction <http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_ -that allows you to compute this in a portable way. 
- -Garbage Collected Stack Frames ------------------------------- - -Some languages want to explicitly manage their stack frames, often so -that they are garbage collected or to allow easy implementation of -closures. There are often better ways to implement these features than -explicit stack frames, but `LLVM does support -them, <http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_ -if you want. It requires your front-end to convert the code into -`Continuation Passing -Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_ and -the use of tail calls (which LLVM also supports). +Chapter 8 Introduction +====================== + +Welcome to Chapter 8 of the "`Implementing a language with +LLVM <index.html>`_" tutorial. In chapters 1 through 7, we've built a +decent little programming language with functions and variables. +What happens if something goes wrong though, how do you debug your +program? + +Source level debugging uses formatted data that helps a debugger +translate from binary and the state of the machine back to the +source that the programmer wrote. In LLVM we generally use a format +called `DWARF <http://dwarfstd.org>`_. DWARF is a compact encoding +that represents types, source locations, and variable locations. + +The short summary of this chapter is that we'll go through the +various things you have to add to a programming language to +support debug info, and how you translate that into DWARF. + +Caveat: For now we can't debug via the JIT, so we'll need to compile +our program down to something small and standalone. As part of this +we'll make a few modifications to the running of the language and +how programs are compiled. This means that we'll have a source file +with a simple program written in Kaleidoscope rather than the +interactive JIT. It does involve a limitation that we can only +have one "top level" command at a time to reduce the number of +changes necessary. + +Here's the sample program we'll be compiling: + +.. code-block:: python + + def fib(x) + if x < 3 then + 1 + else + fib(x-1)+fib(x-2); + + fib(10) + + +Why is this a hard problem? +=========================== + +Debug information is a hard problem for a few different reasons - mostly +centered around optimized code. First, optimization makes keeping source +locations more difficult. In LLVM IR we keep the original source location +for each IR level instruction on the instruction. Optimization passes +should keep the source locations for newly created instructions, but merged +instructions only get to keep a single location - this can cause jumping +around when stepping through optimized programs. Secondly, optimization +can move variables in ways that are either optimized out, shared in memory +with other variables, or difficult to track. For the purposes of this +tutorial we're going to avoid optimization (as you'll see with one of the +next sets of patches). + +Ahead-of-Time Compilation Mode +============================== + +To highlight only the aspects of adding debug information to a source +language without needing to worry about the complexities of JIT debugging +we're going to make a few changes to Kaleidoscope to support compiling +the IR emitted by the front end into a simple standalone program that +you can execute, debug, and see results. + +First we make our anonymous function that contains our top level +statement be our "main": + +.. 
code-block:: udiff + + - PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); + + PrototypeAST *Proto = new PrototypeAST("main", std::vector<std::string>()); + +just with the simple change of giving it a name. + +Then we're going to remove the command line code wherever it exists: + +.. code-block:: udiff + + @@ -1129,7 +1129,6 @@ static void HandleTopLevelExpression() { + /// top ::= definition | external | expression | ';' + static void MainLoop() { + while (1) { + - fprintf(stderr, "ready> "); + switch (CurTok) { + case tok_eof: + return; + @@ -1184,7 +1183,6 @@ int main() { + BinopPrecedence['*'] = 40; // highest. + + // Prime the first token. + - fprintf(stderr, "ready> "); + getNextToken(); + +Lastly we're going to disable all of the optimization passes and the JIT so +that the only thing that happens after we're done parsing and generating +code is that the llvm IR goes to standard error: + +.. code-block:: udiff + + @@ -1108,17 +1108,8 @@ static void HandleExtern() { + static void HandleTopLevelExpression() { + // Evaluate a top-level expression into an anonymous function. + if (FunctionAST *F = ParseTopLevelExpr()) { + - if (Function *LF = F->Codegen()) { + - // We're just doing this to make sure it executes. + - TheExecutionEngine->finalizeObject(); + - // JIT the function, returning a function pointer. + - void *FPtr = TheExecutionEngine->getPointerToFunction(LF); + - + - // Cast it to the right type (takes no arguments, returns a double) so we + - // can call it as a native function. + - double (*FP)() = (double (*)())(intptr_t)FPtr; + - // Ignore the return value for this. + - (void)FP; + + if (!F->Codegen()) { + + fprintf(stderr, "Error generating code for top level expr"); + } + } else { + // Skip token for error recovery. + @@ -1439,11 +1459,11 @@ int main() { + // target lays out data structures. + TheModule->setDataLayout(TheExecutionEngine->getDataLayout()); + OurFPM.add(new DataLayoutPass()); + +#if 0 + OurFPM.add(createBasicAliasAnalysisPass()); + // Promote allocas to registers. + OurFPM.add(createPromoteMemoryToRegisterPass()); + @@ -1218,7 +1210,7 @@ int main() { + OurFPM.add(createGVNPass()); + // Simplify the control flow graph (deleting unreachable blocks, etc). + OurFPM.add(createCFGSimplificationPass()); + - + + #endif + OurFPM.doInitialization(); + + // Set the global so the code gen can use this. + +This relatively small set of changes get us to the point that we can compile +our piece of Kaleidoscope language down to an executable program via this +command line: + +.. code-block:: bash + + Kaleidoscope-Ch8 < fib.ks | & clang -x ir - + +which gives an a.out/a.exe in the current working directory. + +Compile Unit +============ + +The top level container for a section of code in DWARF is a compile unit. +This contains the type and function data for an individual translation unit +(read: one file of source code). So the first thing we need to do is +construct one for our fib.ks file. + +DWARF Emission Setup +==================== + +Similar to the ``IRBuilder`` class we have a +```DIBuilder`` <http://llvm.org/doxygen/classllvm_1_1DIBuilder.html>`_ class +that helps in constructing debug metadata for an llvm IR file. It +corresponds 1:1 similarly to ``IRBuilder`` and llvm IR, but with nicer names. 
+Using it does require that you be more familiar with DWARF terminology than +you needed to be with ``IRBuilder`` and ``Instruction`` names, but if you +read through the general documentation on the +```Metadata Format`` <http://llvm.org/docs/SourceLevelDebugging.html>`_ it +should be a little more clear. We'll be using this class to construct all +of our IR level descriptions. Construction for it takes a module so we +need to construct it shortly after we construct our module. We've left it +as a global static variable to make it a bit easier to use. + +Next we're going to create a small container to cache some of our frequent +data. The first will be our compile unit, but we'll also write a bit of +code for our one type since we won't have to worry about multiple typed +expressions: + +.. code-block:: c++ + + static DIBuilder *DBuilder; + + struct DebugInfo { + DICompileUnit TheCU; + DIType DblTy; + + DIType getDoubleTy(); + } KSDbgInfo; + + DIType DebugInfo::getDoubleTy() { + if (DblTy.isValid()) + return DblTy; + + DblTy = DBuilder->createBasicType("double", 64, 64, dwarf::DW_ATE_float); + return DblTy; + } + +And then later on in ``main`` when we're constructing our module: + +.. code-block:: c++ + + DBuilder = new DIBuilder(*TheModule); + + KSDbgInfo.TheCU = DBuilder->createCompileUnit( + dwarf::DW_LANG_C, "fib.ks", ".", "Kaleidoscope Compiler", 0, "", 0); + +There are a couple of things to note here. First, while we're producing a +compile unit for a language called Kaleidoscope we used the language +constant for C. This is because a debugger wouldn't necessarily understand +the calling conventions or default ABI for a language it doesn't recognize +and we follow the C ABI in our llvm code generation so it's the closest +thing to accurate. This ensures we can actually call functions from the +debugger and have them execute. Secondly, you'll see the "fib.ks" in the +call to ``createCompileUnit``. This is a default hard coded value since +we're using shell redirection to put our source into the Kaleidoscope +compiler. In a usual front end you'd have an input file name and it would +go there. + +One last thing as part of emitting debug information via DIBuilder is that +we need to "finalize" the debug information. The reasons are part of the +underlying API for DIBuilder, but make sure you do this near the end of +main: + +.. code-block:: c++ + + DBuilder->finalize(); + +before you dump out the module. + +Functions +========= + +Now that we have our ``Compile Unit`` and our source locations, we can add +function definitions to the debug info. So in ``PrototypeAST::Codegen`` we +add a few lines of code to describe a context for our subprogram, in this +case the "File", and the actual definition of the function itself. + +So the context: + +.. code-block:: c++ + + DIFile Unit = DBuilder->createFile(KSDbgInfo.TheCU.getFilename(), + KSDbgInfo.TheCU.getDirectory()); + +giving us a DIFile and asking the ``Compile Unit`` we created above for the +directory and filename where we are currently. Then, for now, we use some +source locations of 0 (since our AST doesn't currently have source location +information) and construct our function definition: + +.. 
code-block:: c++ + + DIDescriptor FContext(Unit); + unsigned LineNo = 0; + unsigned ScopeLine = 0; + DISubprogram SP = DBuilder->createFunction( + FContext, Name, StringRef(), Unit, LineNo, + CreateFunctionType(Args.size(), Unit), false /* internal linkage */, + true /* definition */, ScopeLine, DIDescriptor::FlagPrototyped, false, F); + +and we now have a DISubprogram that contains a reference to all of our metadata +for the function. + +Source Locations +================ + +The most important thing for debug information is accurate source location - +this makes it possible to map your source code back. We have a problem though, +Kaleidoscope really doesn't have any source location information in the lexer +or parser so we'll need to add it. + +.. code-block:: c++ + + struct SourceLocation { + int Line; + int Col; + }; + static SourceLocation CurLoc; + static SourceLocation LexLoc = {1, 0}; + + static int advance() { + int LastChar = getchar(); + + if (LastChar == '\n' || LastChar == '\r') { + LexLoc.Line++; + LexLoc.Col = 0; + } else + LexLoc.Col++; + return LastChar; + } + +In this set of code we've added some functionality on how to keep track of the +line and column of the "source file". As we lex every token we set our current +current "lexical location" to the assorted line and column for the beginning +of the token. We do this by overriding all of the previous calls to +``getchar()`` with our new ``advance()`` that keeps track of the information +and then we have added to all of our AST classes a source location: + +.. code-block:: c++ + + class ExprAST { + SourceLocation Loc; + + public: + int getLine() const { return Loc.Line; } + int getCol() const { return Loc.Col; } + ExprAST(SourceLocation Loc = CurLoc) : Loc(Loc) {} + virtual std::ostream &dump(std::ostream &out, int ind) { + return out << ':' << getLine() << ':' << getCol() << '\n'; + } + +that we pass down through when we create a new expression: + +.. code-block:: c++ + + LHS = new BinaryExprAST(BinLoc, BinOp, LHS, RHS); + +giving us locations for each of our expressions and variables. + +From this we can make sure to tell ``DIBuilder`` when we're at a new source +location so it can use that when we generate the rest of our code and make +sure that each instruction has source location information. We do this +by constructing another small function: + +.. code-block:: c++ + + void DebugInfo::emitLocation(ExprAST *AST) { + DIScope *Scope; + if (LexicalBlocks.empty()) + Scope = &TheCU; + else + Scope = LexicalBlocks.back(); + Builder.SetCurrentDebugLocation( + DebugLoc::get(AST->getLine(), AST->getCol(), DIScope(*Scope))); + } + +that both tells the main ``IRBuilder`` where we are, but also what scope +we're in. Since we've just created a function above we can either be in +the main file scope (like when we created our function), or now we can be +in the function scope we just created. To represent this we create a stack +of scopes: + +.. code-block:: c++ + + std::vector<DIScope *> LexicalBlocks; + std::map<const PrototypeAST *, DIScope> FnScopeMap; + +and keep a map of each function to the scope that it represents (a DISubprogram +is also a DIScope). + +Then we make sure to: + +.. code-block:: c++ + + KSDbgInfo.emitLocation(this); + +emit the location every time we start to generate code for a new AST, and +also: + +.. 
code-block:: c++ + + KSDbgInfo.FnScopeMap[this] = SP; + +store the scope (function) when we create it and use it: + + KSDbgInfo.LexicalBlocks.push_back(&KSDbgInfo.FnScopeMap[Proto]); + +when we start generating the code for each function. + +also, don't forget to pop the scope back off of your scope stack at the +end of the code generation for the function: + +.. code-block:: c++ + + // Pop off the lexical block for the function since we added it + // unconditionally. + KSDbgInfo.LexicalBlocks.pop_back(); + +Variables +========= + +Now that we have functions, we need to be able to print out the variables +we have in scope. Let's get our function arguments set up so we can get +decent backtraces and see how our functions are being called. It isn't +a lot of code, and we generally handle it when we're creating the +argument allocas in ``PrototypeAST::CreateArgumentAllocas``. + +.. code-block:: c++ + + DIScope *Scope = KSDbgInfo.LexicalBlocks.back(); + DIFile Unit = DBuilder->createFile(KSDbgInfo.TheCU.getFilename(), + KSDbgInfo.TheCU.getDirectory()); + DIVariable D = DBuilder->createLocalVariable(dwarf::DW_TAG_arg_variable, + *Scope, Args[Idx], Unit, Line, + KSDbgInfo.getDoubleTy(), Idx); + + Instruction *Call = DBuilder->insertDeclare( + Alloca, D, DBuilder->createExpression(), Builder.GetInsertBlock()); + Call->setDebugLoc(DebugLoc::get(Line, 0, *Scope)); + +Here we're doing a few things. First, we're grabbing our current scope +for the variable so we can say what range of code our variable is valid +through. Second, we're creating the variable, giving it the scope, +the name, source location, type, and since it's an argument, the argument +index. Third, we create an ``lvm.dbg.declare`` call to indicate at the IR +level that we've got a variable in an alloca (and it gives a starting +location for the variable). Lastly, we set a source location for the +beginning of the scope on the declare. + +One interesting thing to note at this point is that various debuggers have +assumptions based on how code and debug information was generated for them +in the past. In this case we need to do a little bit of a hack to avoid +generating line information for the function prologue so that the debugger +knows to skip over those instructions when setting a breakpoint. So in +``FunctionAST::CodeGen`` we add a couple of lines: + +.. code-block:: c++ + + // Unset the location for the prologue emission (leading instructions with no + // location in a function are considered part of the prologue and the debugger + // will run past them when breaking on a function) + KSDbgInfo.emitLocation(nullptr); + +and then emit a new location when we actually start generating code for the +body of the function: + +.. code-block:: c++ + + KSDbgInfo.emitLocation(Body); + +With this we have enough debug information to set breakpoints in functions, +print out argument variables, and call functions. Not too bad for just a +few simple lines of code! + +Full Code Listing +================= + +Here is the complete code listing for our running example, enhanced with +debug information. To build this example, use: + +.. code-block:: bash + + # Compile + clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy + # Run + ./toy + +Here is the code: + +.. 
literalinclude:: ../../examples/Kaleidoscope/Chapter8/toy.cpp + :language: c++ + +`Next: Conclusion and other useful LLVM tidbits <LangImpl9.html>`_ diff --git a/docs/tutorial/LangImpl9.rst b/docs/tutorial/LangImpl9.rst new file mode 100644 index 0000000..3398768 --- /dev/null +++ b/docs/tutorial/LangImpl9.rst @@ -0,0 +1,262 @@ +====================================================== +Kaleidoscope: Conclusion and other useful LLVM tidbits +====================================================== + +.. contents:: + :local: + +Tutorial Conclusion +=================== + +Welcome to the final chapter of the "`Implementing a language with +LLVM <index.html>`_" tutorial. In the course of this tutorial, we have +grown our little Kaleidoscope language from being a useless toy, to +being a semi-interesting (but probably still useless) toy. :) + +It is interesting to see how far we've come, and how little code it has +taken. We built the entire lexer, parser, AST, code generator, an +interactive run-loop (with a JIT!), and emitted debug information in +standalone executables - all in under 1000 lines of (non-comment/non-blank) +code. + +Our little language supports a couple of interesting features: it +supports user defined binary and unary operators, it uses JIT +compilation for immediate evaluation, and it supports a few control flow +constructs with SSA construction. + +Part of the idea of this tutorial was to show you how easy and fun it +can be to define, build, and play with languages. Building a compiler +need not be a scary or mystical process! Now that you've seen some of +the basics, I strongly encourage you to take the code and hack on it. +For example, try adding: + +- **global variables** - While global variables have questional value + in modern software engineering, they are often useful when putting + together quick little hacks like the Kaleidoscope compiler itself. + Fortunately, our current setup makes it very easy to add global + variables: just have value lookup check to see if an unresolved + variable is in the global variable symbol table before rejecting it. + To create a new global variable, make an instance of the LLVM + ``GlobalVariable`` class. +- **typed variables** - Kaleidoscope currently only supports variables + of type double. This gives the language a very nice elegance, because + only supporting one type means that you never have to specify types. + Different languages have different ways of handling this. The easiest + way is to require the user to specify types for every variable + definition, and record the type of the variable in the symbol table + along with its Value\*. +- **arrays, structs, vectors, etc** - Once you add types, you can start + extending the type system in all sorts of interesting ways. Simple + arrays are very easy and are quite useful for many different + applications. Adding them is mostly an exercise in learning how the + LLVM `getelementptr <../LangRef.html#i_getelementptr>`_ instruction + works: it is so nifty/unconventional, it `has its own + FAQ <../GetElementPtr.html>`_! If you add support for recursive types + (e.g. linked lists), make sure to read the `section in the LLVM + Programmer's Manual <../ProgrammersManual.html#TypeResolve>`_ that + describes how to construct them. +- **standard runtime** - Our current language allows the user to access + arbitrary external functions, and we use it for things like "printd" + and "putchard". 
As you extend the language to add higher-level + constructs, often these constructs make the most sense if they are + lowered to calls into a language-supplied runtime. For example, if + you add hash tables to the language, it would probably make sense to + add the routines to a runtime, instead of inlining them all the way. +- **memory management** - Currently we can only access the stack in + Kaleidoscope. It would also be useful to be able to allocate heap + memory, either with calls to the standard libc malloc/free interface + or with a garbage collector. If you would like to use garbage + collection, note that LLVM fully supports `Accurate Garbage + Collection <../GarbageCollection.html>`_ including algorithms that + move objects and need to scan/update the stack. +- **exception handling support** - LLVM supports generation of `zero + cost exceptions <../ExceptionHandling.html>`_ which interoperate with + code compiled in other languages. You could also generate code by + implicitly making every function return an error value and checking + it. You could also make explicit use of setjmp/longjmp. There are + many different ways to go here. +- **object orientation, generics, database access, complex numbers, + geometric programming, ...** - Really, there is no end of crazy + features that you can add to the language. +- **unusual domains** - We've been talking about applying LLVM to a + domain that many people are interested in: building a compiler for a + specific language. However, there are many other domains that can use + compiler technology that are not typically considered. For example, + LLVM has been used to implement OpenGL graphics acceleration, + translate C++ code to ActionScript, and many other cute and clever + things. Maybe you will be the first to JIT compile a regular + expression interpreter into native code with LLVM? + +Have fun - try doing something crazy and unusual. Building a language +like everyone else always has, is much less fun than trying something a +little crazy or off the wall and seeing how it turns out. If you get +stuck or want to talk about it, feel free to email the `llvmdev mailing +list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_: it has lots +of people who are interested in languages and are often willing to help +out. + +Before we end this tutorial, I want to talk about some "tips and tricks" +for generating LLVM IR. These are some of the more subtle things that +may not be obvious, but are very useful if you want to take advantage of +LLVM's capabilities. + +Properties of the LLVM IR +========================= + +We have a couple common questions about code in the LLVM IR form - lets +just get these out of the way right now, shall we? + +Target Independence +------------------- + +Kaleidoscope is an example of a "portable language": any program written +in Kaleidoscope will work the same way on any target that it runs on. +Many other languages have this property, e.g. lisp, java, haskell, +javascript, python, etc (note that while these languages are portable, +not all their libraries are). + +One nice aspect of LLVM is that it is often capable of preserving target +independence in the IR: you can take the LLVM IR for a +Kaleidoscope-compiled program and run it on any target that LLVM +supports, even emitting C code and compiling that on targets that LLVM +doesn't support natively. 
You can trivially tell that the Kaleidoscope +compiler generates target-independent code because it never queries for +any target-specific information when generating code. + +The fact that LLVM provides a compact, target-independent, +representation for code gets a lot of people excited. Unfortunately, +these people are usually thinking about C or a language from the C +family when they are asking questions about language portability. I say +"unfortunately", because there is really no way to make (fully general) +C code portable, other than shipping the source code around (and of +course, C source code is not actually portable in general either - ever +port a really old application from 32- to 64-bits?). + +The problem with C (again, in its full generality) is that it is heavily +laden with target specific assumptions. As one simple example, the +preprocessor often destructively removes target-independence from the +code when it processes the input text: + +.. code-block:: c + + #ifdef __i386__ + int X = 1; + #else + int X = 42; + #endif + +While it is possible to engineer more and more complex solutions to +problems like this, it cannot be solved in full generality in a way that +is better than shipping the actual source code. + +That said, there are interesting subsets of C that can be made portable. +If you are willing to fix primitive types to a fixed size (say int = +32-bits, and long = 64-bits), don't care about ABI compatibility with +existing binaries, and are willing to give up some other minor features, +you can have portable code. This can make sense for specialized domains +such as an in-kernel language. + +Safety Guarantees +----------------- + +Many of the languages above are also "safe" languages: it is impossible +for a program written in Java to corrupt its address space and crash the +process (assuming the JVM has no bugs). Safety is an interesting +property that requires a combination of language design, runtime +support, and often operating system support. + +It is certainly possible to implement a safe language in LLVM, but LLVM +IR does not itself guarantee safety. The LLVM IR allows unsafe pointer +casts, use after free bugs, buffer over-runs, and a variety of other +problems. Safety needs to be implemented as a layer on top of LLVM and, +conveniently, several groups have investigated this. Ask on the `llvmdev +mailing list <http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_ if +you are interested in more details. + +Language-Specific Optimizations +------------------------------- + +One thing about LLVM that turns off many people is that it does not +solve all the world's problems in one system (sorry 'world hunger', +someone else will have to solve you some other day). One specific +complaint is that people perceive LLVM as being incapable of performing +high-level language-specific optimization: LLVM "loses too much +information". + +Unfortunately, this is really not the place to give you a full and +unified version of "Chris Lattner's theory of compiler design". Instead, +I'll make a few observations: + +First, you're right that LLVM does lose information. For example, as of +this writing, there is no way to distinguish in the LLVM IR whether an +SSA-value came from a C "int" or a C "long" on an ILP32 machine (other +than debug info). Both get compiled down to an 'i32' value and the +information about what it came from is lost. The more general issue +here, is that the LLVM type system uses "structural equivalence" instead +of "name equivalence". 
Another place this surprises people is if you +have two types in a high-level language that have the same structure +(e.g. two different structs that have a single int field): these types +will compile down into a single LLVM type and it will be impossible to +tell what it came from. + +Second, while LLVM does lose information, LLVM is not a fixed target: we +continue to enhance and improve it in many different ways. In addition +to adding new features (LLVM did not always support exceptions or debug +info), we also extend the IR to capture important information for +optimization (e.g. whether an argument is sign or zero extended, +information about pointers aliasing, etc). Many of the enhancements are +user-driven: people want LLVM to include some specific feature, so they +go ahead and extend it. + +Third, it is *possible and easy* to add language-specific optimizations, +and you have a number of choices in how to do it. As one trivial +example, it is easy to add language-specific optimization passes that +"know" things about code compiled for a language. In the case of the C +family, there is an optimization pass that "knows" about the standard C +library functions. If you call "exit(0)" in main(), it knows that it is +safe to optimize that into "return 0;" because C specifies what the +'exit' function does. + +In addition to simple library knowledge, it is possible to embed a +variety of other language-specific information into the LLVM IR. If you +have a specific need and run into a wall, please bring the topic up on +the llvmdev list. At the very worst, you can always treat LLVM as if it +were a "dumb code generator" and implement the high-level optimizations +you desire in your front-end, on the language-specific AST. + +Tips and Tricks +=============== + +There is a variety of useful tips and tricks that you come to know after +working on/with LLVM that aren't obvious at first glance. Instead of +letting everyone rediscover them, this section talks about some of these +issues. + +Implementing portable offsetof/sizeof +------------------------------------- + +One interesting thing that comes up, if you are trying to keep the code +generated by your compiler "target independent", is that you often need +to know the size of some LLVM type or the offset of some field in an +llvm structure. For example, you might need to pass the size of a type +into a function that allocates memory. + +Unfortunately, this can vary widely across targets: for example the +width of a pointer is trivially target-specific. However, there is a +`clever way to use the getelementptr +instruction <http://nondot.org/sabre/LLVMNotes/SizeOf-OffsetOf-VariableSizedStructs.txt>`_ +that allows you to compute this in a portable way. + +Garbage Collected Stack Frames +------------------------------ + +Some languages want to explicitly manage their stack frames, often so +that they are garbage collected or to allow easy implementation of +closures. There are often better ways to implement these features than +explicit stack frames, but `LLVM does support +them, <http://nondot.org/sabre/LLVMNotes/ExplicitlyManagedStackFrames.txt>`_ +if you want. It requires your front-end to convert the code into +`Continuation Passing +Style <http://en.wikipedia.org/wiki/Continuation-passing_style>`_ and +the use of tail calls (which LLVM also supports). + |
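A note on the portable offsetof/sizeof discussion above: here is a minimal sketch of the ``getelementptr`` trick, written against the LLVM 3.x C++ API used elsewhere in this tutorial. The helper name ``emitSizeOf`` is invented for illustration and is not part of the patch.

.. code-block:: c++

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/IRBuilder.h"

    // Sketch only: compute sizeof(T) without baking a target-specific constant
    // into the IR at generation time. A getelementptr from a null T* to element
    // 1 produces the address sizeof(T), which the backend folds per target.
    static llvm::Value *emitSizeOf(llvm::IRBuilder<> &Builder, llvm::Type *T) {
      llvm::Value *Null = llvm::Constant::getNullValue(T->getPointerTo());
      llvm::Value *OnePastNull = Builder.CreateConstGEP1_32(Null, 1); // &((T*)0)[1]
      // i64 is an arbitrary result width for the sketch; pick your own size type.
      return Builder.CreatePtrToInt(OnePastNull, Builder.getInt64Ty());
    }

Asking ``DataLayout`` for the size would also work, but it hard codes one target's answer into the IR, which is exactly what the note above is trying to avoid.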