Code Completion Design

From CodeBlocks
Revision as of 15:21, 12 July 2009 by Ollydbg (Talk | contribs) (Add token query method description and many typo fix)


How to build

Get the source code

When you download the svn source code of Code::Blocks (see Installing_Code::Blocks_from_source_on_Windows#Code::Blocks_sources), the source code of the CodeCompletion plugin is already included.

The screenshot below shows this code opened in Code::Blocks under Windows.

Code completion source tree opened in code::blocks

Build the code completion plug-in

Code completion build target option in code::blocks

Note: use "update.bat" to copy the newly generated DLL to the destination and strip its debug information. Here is a modified batch file that updates only codecompletion.dll.

@echo off

setlocal

echo Creating output directory tree

set CB_DEVEL_RESDIR=devel\share\CodeBlocks
set CB_OUTPUT_RESDIR=output\share\CodeBlocks

set ZIPCMD=zip

xcopy /D /y %CB_DEVEL_RESDIR%\plugins\codecompletion.dll %CB_OUTPUT_RESDIR%\plugins\codecompletion.dll

echo Stripping debug info from output tree

strip %CB_OUTPUT_RESDIR%\plugins\codecompletion.dll

See Installing_Code::Blocks_from_source_on_Windows for more information.

A brief description of every project file

ccdebuginfo.cpp A dialog for debugging CC; open it by double-clicking an entry in the code browser tree while holding the Shift and Ctrl keys.
ccoptionsdlg.cpp The code completion options dialog; open it from the menu: Settings -> Editor -> Code completion and symbols browser.
ccoptionsprjdlg.cpp Sets the additional parser search paths.
classbrowser.cpp Displays the symbols tree control (the token tree).
classbrowserbuilderthread.cpp A thread that builds the class browser tree above.
codecompletion.cpp The main file of the code completion plug-in; it maintains all of CC's GUI and the native parser.
insertclassmethoddlg.cpp A dialog for inserting a class method; open it from the context menu in the editor.
nativeparser.cpp The NativeParser class, derived from wxEvtHandler; it has a member variable "Parser m_Parser".
selectincludefile.cpp Lets you select among multiple matching token names before a jump to declaration or jump to implementation.
parser/parser.cpp The Parser class, also derived from wxEvtHandler; it can start a batch parse. Its member variables include cbThreadPool m_Pool (which creates a thread from the thread pool for each file that needs to be parsed) and TokensTree* m_pTokens (which contains the whole Token database).
parser/parserthread.cpp Does the syntax analysis for every file in a project; it has a Tokenizer as a member variable.
parser/token.cpp Definitions of the "Token" class and of TokensTree (the Token database).
parser/tokenizer.cpp The Tokenizer returns every wxString it regards as a symbol through GetToken(), and applies replacements before returning.
parser/searchtree.cpp Implements the Patricia search tree used by TokensTree.
Header files No description needed.

Low level parser(Lexical analysis)

If you have not heard of the terms "Token" and "Tokenize", read the wikibooks article A brief explain of what does a parser do and Tokenize on wikipedia. In short, a parser treats your C or C++ code as one large array of characters and divides this big string into small atomic strings. Each of these strings has a single meaning and cannot be divided into sub-strings (symbols, identifiers, keywords, numbers), while whitespace and comments are ignored.

For a simple C++ program like the one below:

int main()
{
    std::cout << "hello world" << std::endl;
    return 0;
}

After tokenizing, it should give these 19 tokens:

1 = string "int"
2 = string "main"
3 = opening parenthesis
4 = closing parenthesis
5 = opening brace
6 = string "std"
7 = namespace operator
8 = string "cout"
9 = << operator
10 = string ""hello world""
11 = << operator
12 = string "std"
13 = namespace operator
14 = string "endl"
15 = semicolon
16 = string "return"
17 = number 0
18 = semicolon
19 = closing brace

Tokenizer class

A class named "Tokenizer" is declared in "tokenizer.h" and implemented in "tokenizer.cpp". Running the Tokenizer involves several steps. Each token it returns is simply a Unicode wxString.

Read a source file

Open the source file and convert the file buffer to Unicode (we all use the Unicode build of Code::Blocks; the ANSI build is outdated and deprecated).

Get or Peek a token

The class keeps a file position indicator (see File Open In C language) that points to the current character position (m_TokenIndex). You can call Get or Peek to obtain the token you want.

   // Get the current token and advance the token index
   wxString GetToken();
   // Peek at the next token WITHOUT advancing the index
   wxString PeekToken();

For example, suppose the Tokenizer is parsing the example code above.

  • After the Tokenizer is initialized, calling GetToken() returns the wxString "int" and advances the token index past "int".
  • Then, calling PeekToken() returns the wxString "main", but the token index does not move: it still points just after "int".
  • If you call GetToken() again, it returns "main" immediately and advances the token index past "main".

Note: Internally, the Tokenizer class uses an "undo and peek cache" to do this trick. Once a token is peeked, it is saved in the m_Peek member, so the next call to GetToken() quickly returns the saved value without running the "DoGetToken()" procedure again.
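
The peek cache can be sketched as follows. This is a simplified illustration, not the plug-in code: the class name PeekingTokenizerSketch, the m_PeekAvailable flag and the ScanCount() counter are assumptions made for the example, and the tokens are pre-split so that DoGetToken() stays trivial.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Tokenizer; tokens are pre-split to keep the sketch short.
class PeekingTokenizerSketch
{
public:
    explicit PeekingTokenizerSketch(std::vector<std::string> tokens)
        : m_Tokens(std::move(tokens)) {}

    // Return the next token and consume it.
    std::string GetToken()
    {
        if (m_PeekAvailable)
        {
            m_PeekAvailable = false;   // hand out the cached token, no new scan
            return m_Peek;
        }
        return DoGetToken();
    }

    // Return the next token without consuming it.
    std::string PeekToken()
    {
        if (!m_PeekAvailable)
        {
            m_Peek = DoGetToken();     // scan once and remember the result
            m_PeekAvailable = true;
        }
        return m_Peek;
    }

    // Number of times the underlying scan ran (only here to make the caching visible).
    std::size_t ScanCount() const { return m_ScanCount; }

private:
    // Stands in for the expensive scan; returns an empty string at the end of input.
    std::string DoGetToken()
    {
        assert(m_Index <= m_Tokens.size());
        ++m_ScanCount;
        return m_Index < m_Tokens.size() ? m_Tokens[m_Index++] : std::string();
    }

    std::vector<std::string> m_Tokens;
    std::size_t m_Index = 0;
    std::size_t m_ScanCount = 0;
    std::string m_Peek;
    bool m_PeekAvailable = false;
};
```

With the tokens "int", "main", "(", the sequence GetToken(), PeekToken(), PeekToken(), GetToken() returns "int", "main", "main", "main" while the scan runs only twice.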

Cb token cache.png

Nested Value

This value is kept to indicate which brace pair you are currently in. If the Tokenizer meets a {, it increases m_NestLevel, and if it meets a }, it decreases m_NestLevel. See the pseudo code from tokenizer.cpp below.

        if (CurrentChar() == '{')
            ++m_NestLevel;
        else if (CurrentChar() == '}')
            --m_NestLevel;

SkipUnwanted tokens

There is a member function in the Tokenizer class that skips comments, assignments, preprocessor directives, etc.

For example, if there is a statement below:

a = b + c;

If SkipUnwanted() meets the "=" symbol, it skips everything until it meets "," or ";" or "}". This means the rest of this statement is omitted by the Tokenizer if m_SkipUnwantedTokens == true.

Sometimes this behavior becomes a nightmare, for example when parsing a default argument in a template:
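
A minimal sketch of this skipping rule, working on a plain std::string. The function name SkipAssignmentSketch is hypothetical and the real SkipUnwanted() does more (comments, preprocessor lines); only the "skip from = to the next , ; or }" step is shown.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: `pos` must be the index of an '=' in `src`.
// Returns the index of the first ',', ';' or '}' after it,
// or src.size() if the input ends first.
std::size_t SkipAssignmentSketch(const std::string& src, std::size_t pos)
{
    assert(pos < src.size() && src[pos] == '=');
    const std::size_t stop = src.find_first_of(",;}", pos + 1);
    return stop == std::string::npos ? src.size() : stop;
}
```

For "a = b + c;" the skip starts at index 2 and stops at the semicolon at index 9, so " b + c" never reaches the high level parser.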

template<class T = int> 
class abc {
 T m_a;
 ......
 ......
}

If the Tokenizer finds a "=", it skips all characters until it meets a "}", so the class declaration is skipped entirely. In such cases we must manually disable this behavior by setting m_SkipUnwantedTokens = false so that these statements are parsed correctly. That is why, in many places, a function saves m_SkipUnwantedTokens and disables it on entry, then manually restores it on exit. (See the function implementations in parserthread.cpp.)
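
The save, disable and restore pattern can also be written as a small scope guard. Note that this is a swapped-in technique for illustration: the text above describes saving and restoring the flag by hand, and the names TokenizerStateSketch, SkipUnwantedGuard and HandleTemplateArgsSketch are hypothetical.

```cpp
#include <cassert>

// Hypothetical miniature of the Tokenizer state discussed above.
struct TokenizerStateSketch
{
    bool m_SkipUnwantedTokens = true;
};

// Disables skipping on construction and restores the previous value on destruction.
class SkipUnwantedGuard
{
public:
    explicit SkipUnwantedGuard(TokenizerStateSketch& tokenizer)
        : m_Tokenizer(tokenizer), m_Saved(tokenizer.m_SkipUnwantedTokens)
    {
        m_Tokenizer.m_SkipUnwantedTokens = false;
    }
    ~SkipUnwantedGuard() { m_Tokenizer.m_SkipUnwantedTokens = m_Saved; }

    SkipUnwantedGuard(const SkipUnwantedGuard&) = delete;
    SkipUnwantedGuard& operator=(const SkipUnwantedGuard&) = delete;

private:
    TokenizerStateSketch& m_Tokenizer;
    bool m_Saved;
};

// Hypothetical handler: template arguments such as "class T = int" need the "=" kept.
// Returns the flag value seen while parsing, to make the effect observable.
bool HandleTemplateArgsSketch(TokenizerStateSketch& tokenizer)
{
    SkipUnwantedGuard guard(tokenizer);
    assert(!tokenizer.m_SkipUnwantedTokens);   // skipping is off while the guard lives
    // ... the template argument list would be read here ...
    return tokenizer.m_SkipUnwantedTokens;
}
```

The guard restores the old value on every exit path, including early returns, which is the part that is easy to forget when the restore is done by hand.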

Return a correct token, Macro replacement

Some special tokens must be replaced for parsing to work correctly. For example, the standard C++ headers (MinGW) contain a string named "_GLIBCXX_STD", which should be replaced with "std". See the dialog below.

Cc std replacement.png

An inline function in the Tokenizer class checks whether a token should be replaced before it is returned.

   //This is a map, check the first string and return the second string
   inline const wxString& ThisOrReplacement(const wxString& str) const
   {
       ConfigManagerContainer::StringToStringMap::const_iterator it = s_Replacements.find(str);
       if (it != s_Replacements.end())
           return it->second;
       return str;
   }



Setting the replacement mapping. Note that before a token is returned, the replacement map is searched for a matching entry, so the bigger this map grows, the slower parsing becomes.

Note: The Code Completion plug-in is not a preprocessor, so it has difficulty with source code that uses many macros or unusual macros. The replacement map is similar to the Ctags option "-I identifier-list"; see the ctags option details or the Code Completion macro FAQ.
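
The lookup can be tried outside Code::Blocks with a standard-library stand-in. This sketch assumes std::unordered_map and std::string in place of ConfigManagerContainer::StringToStringMap and wxString; the names ThisOrReplacementSketch and ReplacementDemo are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using StringToStringMap = std::unordered_map<std::string, std::string>;

// Return the replacement if `token` has an entry in the map, else the token itself.
// Both the map and the token must outlive the returned reference.
const std::string& ThisOrReplacementSketch(const StringToStringMap& replacements,
                                           const std::string& token)
{
    const auto it = replacements.find(token);
    return it != replacements.end() ? it->second : token;
}

// Usage example: one mapping, one token that matches and one that does not.
void ReplacementDemo()
{
    const StringToStringMap replacements = {{"_GLIBCXX_STD", "std"}};
    const std::string macro = "_GLIBCXX_STD";
    const std::string plain = "cout";
    assert(ThisOrReplacementSketch(replacements, macro) == "std");
    assert(ThisOrReplacementSketch(replacements, plain) == "cout");
}
```

Every token passes through this lookup once, which is why the size of the map matters for parsing speed.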

High level parser(Syntax Analysis)

parser thread

Basically, the low level parser (Tokenizer) moves its pointer character by character and returns a wxString (token) to feed the high level parser (syntax analyzer). All the syntax analysis is done in ParserThread. A thread must be created to parse a source file (see parserthread.cpp and parserthread.h); the thread is allocated from the thread pool. For example, suppose a file contains these statements:

    void  f1();
    int   f2(char c);
    float f3(void * p);
    int   f1;
    double f2;

After the ParserThread finishes parsing, it will have recognized five tokens, named "f1", "f2" and "f3". Note that tokens can share the same name while differing in kind (variables, functions...).

Token class

How can a large number of tokens be recorded? A Token class (note: the capital T means the class type) was introduced to record every token. To speed up the allocation of Tokens, the "new" and "delete" operators are overloaded in its base class, BlockAllocated. See the memory pool page on wikipedia for reference.

class Token  : public BlockAllocated<Token, 10000>
{
        ......
        wxString m_Type; // this is the return value (if any): e.g. const wxString&
        wxString m_ActualType; // this is what the parser believes is the actual return value: e.g. wxString
        wxString m_Name;
        wxString m_Args;
        wxString m_AncestorsString; // all ancestors comma-separated list
        unsigned int m_File;
        unsigned int m_Line;
        unsigned int m_ImplFile;
        unsigned int m_ImplLine; // where the token was met
        unsigned int m_ImplLineStart; // if token is impl, opening brace line
        unsigned int m_ImplLineEnd; // if token is impl, closing brace line
        TokenScope m_Scope;
        TokenKind m_TokenKind;
        bool m_IsOperator;
        bool m_IsLocal; // found in a local file?
        bool m_IsTemp; // if true, the tree deletes it in FreeTemporaries()
        bool m_IsConst;    // the member method is const (yes/no)

        int m_ParentIndex;
        TokenIdxSet m_Children;
        TokenIdxSet m_Ancestors;
        TokenIdxSet m_DirectAncestors;
        TokenIdxSet m_Descendants;
        ......

   
};

As you can see, the Token class contains all the information needed to record a token's location, its type, its class hierarchy, and so on.

For example, take the source code in [Low level parser(Lexical analysis)]. The Token for "main" contains its name (obviously, m_Name="main"); m_File records which file this Token is in, m_Line gives the line number of "main" in that source file, and so on.

Memory Pool--BlockAllocated class

The BlockAllocated class has only one static member, "static BlockAllocator<T, pool_size, debug> allocator;", which holds all the pre-allocated memory for the derived class.

The 10000 means that a pool of 10000 Tokens is pre-allocated in the memory pool, so dynamically allocating a Token object is fast and efficient.

Operator new overloading for fast allocation on the heap
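
A minimal sketch of the idea, assuming a free list threaded through fixed-size chunks. It is not the BlockAllocated implementation from the Code::Blocks SDK: the Sketch-suffixed names are hypothetical, the debug template parameter is left out, the allocator is a function-local static instead of a static data member, and the pool size is 4 so the behavior is easy to observe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified block allocator: memory is obtained in chunks of PoolSize
// slots and recycled through a free list.
template <class T, std::size_t PoolSize>
class BlockAllocatorSketch
{
    union Slot
    {
        Slot* next;                                    // used while the slot is free
        alignas(T) unsigned char storage[sizeof(T)];   // used while the slot holds a T
    };

    std::vector<Slot*> m_Chunks;
    Slot* m_FreeList = nullptr;

    void Grow()
    {
        Slot* chunk = new Slot[PoolSize];
        m_Chunks.push_back(chunk);
        for (std::size_t i = 0; i < PoolSize; ++i)    // thread every slot onto the free list
        {
            chunk[i].next = m_FreeList;
            m_FreeList = &chunk[i];
        }
    }

public:
    BlockAllocatorSketch() = default;
    BlockAllocatorSketch(const BlockAllocatorSketch&) = delete;
    BlockAllocatorSketch& operator=(const BlockAllocatorSketch&) = delete;
    ~BlockAllocatorSketch() { for (Slot* chunk : m_Chunks) delete[] chunk; }

    void* Allocate()
    {
        if (!m_FreeList)
            Grow();
        Slot* slot = m_FreeList;
        m_FreeList = slot->next;
        return slot;
    }

    void Deallocate(void* p)
    {
        Slot* slot = static_cast<Slot*>(p);
        slot->next = m_FreeList;                       // the slot becomes reusable
        m_FreeList = slot;
    }

    std::size_t ChunkCount() const { return m_Chunks.size(); }
};

// Base class that routes new/delete of the derived class T to one shared pool.
template <class T, std::size_t PoolSize>
class BlockAllocatedSketch
{
    static BlockAllocatorSketch<T, PoolSize>& Allocator()
    {
        static BlockAllocatorSketch<T, PoolSize> allocator;
        return allocator;
    }

public:
    static void* operator new(std::size_t size)
    {
        assert(size == sizeof(T));                     // only T itself may use the pool
        return Allocator().Allocate();
    }
    static void operator delete(void* p) { if (p) Allocator().Deallocate(p); }
    static std::size_t ChunkCount() { return Allocator().ChunkCount(); }
};

// Hypothetical token type with a tiny pool.
struct TokenSketch : BlockAllocatedSketch<TokenSketch, 4>
{
    int m_Line = 0;
};
```

With this in place, "new TokenSketch" takes a slot from the pool instead of calling the general-purpose heap allocator each time; a new chunk is requested only when all slots are in use, and deleted objects hand their slot back for reuse.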

ParserThread

The function Parse() does most of the work of syntax analysis. See the pseudo code below.

ParserThread::Parse()
{
   ......
   do
    {
        ......
        DoParse();
        ......
 
        
    }while(false);

    return result;
}

DoParse() checks each token from the Tokenizer. For example, if the token is "enum", ParserThread::HandleEnum() does the job of parsing this enum block.
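
The dispatch on a keyword can be sketched as below. This is only an illustration of the control flow: the class KeywordDispatchSketch and the struct EnumSketch are hypothetical, the tokens are pre-split std::strings, only "enum" is dispatched, and enumerator initializers such as "= 1" are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct EnumSketch
{
    std::string name;
    std::vector<std::string> enumerators;
};

class KeywordDispatchSketch
{
public:
    explicit KeywordDispatchSketch(std::vector<std::string> tokens)
        : m_Tokens(std::move(tokens)) {}

    // Walk over all tokens and hand each recognized keyword to its handler.
    std::vector<EnumSketch> DoParse()
    {
        std::vector<EnumSketch> enums;
        while (m_Index < m_Tokens.size())
        {
            const std::string token = m_Tokens[m_Index++];
            if (token == "enum")
                enums.push_back(HandleEnum());
            // a fuller parser would branch on "class", "namespace", "typedef", ... here
        }
        return enums;
    }

private:
    // Called right after "enum" was consumed; expects:  Name { A , B , ... }
    EnumSketch HandleEnum()
    {
        assert(m_Index > 0 && m_Tokens[m_Index - 1] == "enum");
        EnumSketch result;
        if (m_Index < m_Tokens.size() && m_Tokens[m_Index] != "{")
            result.name = m_Tokens[m_Index++];         // named enum
        if (m_Index < m_Tokens.size() && m_Tokens[m_Index] == "{")
            ++m_Index;                                 // opening brace
        while (m_Index < m_Tokens.size() && m_Tokens[m_Index] != "}")
        {
            if (m_Tokens[m_Index] != ",")
                result.enumerators.push_back(m_Tokens[m_Index]);
            ++m_Index;
        }
        if (m_Index < m_Tokens.size())
            ++m_Index;                                 // closing brace
        return result;
    }

    std::vector<std::string> m_Tokens;
    std::size_t m_Index = 0;
};
```

For the token sequence of "int x; enum Color { Red, Green };", DoParse() ignores the first three tokens and returns one EnumSketch named "Color" with the enumerators "Red" and "Green".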


A simple look ahead parser

A few words about how the parser works: the member variable m_Str of class ParserThread acts as a type stack. For example, suppose we want to parse the statement below:

symbolA symbolB symbolC symbolD;

Only symbolD is recognized as a variable, and its type is "symbolA symbolB symbolC". When the parser meets each symbol, it looks ahead to see whether the next token is ";"; if not, the current token is pushed onto m_Str. The iteration ends when the parser looks one step ahead from symbolD and finds that the next token is ";".
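
The look-ahead loop can be sketched in a few lines. This is a simplified illustration with pre-split std::string tokens; the names DeclSketch and ParseDeclarationsSketch are hypothetical, the local variable typeStack plays the role of m_Str, and only plain "type ... name ;" statements are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct DeclSketch
{
    std::string type;   // everything collected before the name
    std::string name;   // the token directly followed by ";"
};

// Recognize variable declarations of the form "tokA tokB ... name ;".
std::vector<DeclSketch> ParseDeclarationsSketch(const std::vector<std::string>& tokens)
{
    std::vector<DeclSketch> decls;
    std::string typeStack;                             // plays the role of m_Str
    for (std::size_t i = 0; i < tokens.size(); ++i)
    {
        const std::string& current = tokens[i];
        if (current == ";")                            // empty statement: reset
        {
            typeStack.clear();
            continue;
        }
        // Look ahead one token without consuming it.
        const bool nextIsSemicolon = i + 1 < tokens.size() && tokens[i + 1] == ";";
        if (nextIsSemicolon)
        {
            decls.push_back(DeclSketch{typeStack, current});
            typeStack.clear();
            ++i;                                       // consume the ";"
            assert(tokens[i] == ";");
        }
        else
        {
            if (!typeStack.empty())
                typeStack += ' ';
            typeStack += current;                      // push onto the type stack
        }
    }
    return decls;
}
```

For the tokens of "symbolA symbolB symbolC symbolD;" the function returns one declaration with the name "symbolD" and the type "symbolA symbolB symbolC".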

TokensTree&SearchTree

You may ask: where are these Tokens stored? The answer is that all the Tokens are recorded in the "TokensTree" class.

When a Token is identified (whether it is a global variable, a class declaration, a class member function, and so on), it is inserted into a database (TokensTree).

Furthermore, for fast Token queries, Tokens are sorted by their wxString m_Name member.

A compact Patricia tree (see the Patricia tree article on wikipedia) is built to hold all their names.

For example, suppose you add three items to the tree:

    mytree->AddItem(_T("physics"),_T("1 - uno"));
    mytree->AddItem(_T("physiology"),_T("2 - dos"));
    mytree->AddItem(_T("psychic"),_T("3 - tres"));

The Patricia tree structure is shown below. Each edge of the tree carries a "label string", and the number in parentheses is a node Id.

- "" (0)
      \- "p" (4)
              +- "hysi" (2)
              |          +- "cs" (1)
              |          \- "ology" (3)
              \- "sychic" (5)

Patricia Search Tree Node Depth

The depth of a search tree node is defined as the length of the string from the root node to that node. See the depth of each node on the search tree above. For example, the node "hysi" has m_Depth = 5 ("" + "p" + "hysi" = 5 characters).

Node Label

For example, the node "hysi" (2) has two children, "cs" (1) and "ology" (3), as shown below.

Cb tree node lable.png

For more information, see the forum discussion here: rickg22's SearchTree development.
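
Edge splitting and node depth can be reproduced with a small stand-alone sketch. It is not the SearchTree implementation from searchtree.cpp: the names RadixNodeSketch and RadixTreeSketch are hypothetical, AddItem() takes only a key (no item value), node Ids are omitted, and children are kept in a std::map keyed by the first character of their label.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct RadixNodeSketch
{
    std::string label;       // label of the edge leading into this node
    std::size_t depth = 0;   // length of the whole string spelled from the root
    bool isItem = false;     // true if an inserted key ends exactly here
    std::map<char, std::unique_ptr<RadixNodeSketch>> children;
};

class RadixTreeSketch
{
public:
    void AddItem(const std::string& key)
    {
        RadixNodeSketch* node = &m_Root;
        std::size_t pos = 0;                           // characters of key matched so far
        while (pos < key.size())
        {
            auto it = node->children.find(key[pos]);
            if (it == node->children.end())            // no matching edge: add a leaf
            {
                auto leaf = std::make_unique<RadixNodeSketch>();
                leaf->label = key.substr(pos);
                leaf->depth = key.size();
                leaf->isItem = true;
                node->children[key[pos]] = std::move(leaf);
                return;
            }
            RadixNodeSketch* child = it->second.get();
            std::size_t common = 0;                    // shared prefix of label and key rest
            while (common < child->label.size() && pos + common < key.size() &&
                   child->label[common] == key[pos + common])
                ++common;
            assert(common > 0);                        // the first character always matches
            if (common < child->label.size())          // split the edge where they diverge
            {
                auto mid = std::make_unique<RadixNodeSketch>();
                mid->label = child->label.substr(0, common);
                mid->depth = node->depth + common;
                child->label.erase(0, common);
                const char childKey = child->label[0];
                mid->children[childKey] = std::move(it->second);
                it->second = std::move(mid);
                child = it->second.get();
            }
            node = child;
            pos += common;
        }
        node->isItem = true;                           // key ends on an existing node
    }

    // Return the node reached by spelling `path` along whole edges, or nullptr.
    const RadixNodeSketch* FindNode(const std::string& path) const
    {
        const RadixNodeSketch* node = &m_Root;
        std::size_t pos = 0;
        while (pos < path.size())
        {
            const auto it = node->children.find(path[pos]);
            if (it == node->children.end())
                return nullptr;
            const std::string& label = it->second->label;
            if (path.compare(pos, label.size(), label) != 0)
                return nullptr;                        // path stops in the middle of an edge
            pos += label.size();
            node = it->second.get();
        }
        return node;
    }

private:
    RadixNodeSketch m_Root;                            // empty label, depth 0
};
```

Adding "physics", "physiology" and "psychic" in that order produces the same shape as the tree above: FindNode("physi") is the node with label "hysi", depth 5 and two children.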

How to query a Token by a keyword

The parser collects all the Token information and stores it in the TokensTree; GUI functions can then query keywords from this database to show function tips or to build the Class Browser tree.

For example, suppose you want to find all the Tokens named "ab". In the picture of the Token database (TokensTree) below, we search the Patricia tree that contains all the Token names and finally find a tree node with the edge "abcd". "ab" is in this node's item list. From it we can find a TokenIdxSet in a vector<TokenIdxSet>; this TokenIdxSet holds the indexes of all Tokens named "ab". The result may therefore contain many Tokens named "ab": a Token "ab" may be a member variable name in a class, a global function name, and so on.

ClassA::ab
ClassB::ab
void ab()
....

CCTokenTree1.png
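
The name to index-set lookup can be sketched with standard containers. This is an assumption-laden stand-in, not the TokensTree code: a std::map replaces the Patricia tree, and the names TokenDatabaseSketch, NamedTokenSketch, TokenKindSketch and TokenIdxSetSketch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class TokenKindSketch { Function, Variable };

struct NamedTokenSketch
{
    std::string name;
    std::string parent;      // enclosing class, empty for globals
    TokenKindSketch kind;
};

using TokenIdxSetSketch = std::set<int>;

class TokenDatabaseSketch
{
public:
    // Store the token and register its index under its name.
    int Insert(const NamedTokenSketch& token)
    {
        const int idx = static_cast<int>(m_Tokens.size());
        m_Tokens.push_back(token);
        m_ByName[token.name].insert(idx);
        return idx;
    }

    // All indexes of tokens called `name`; an empty set if there are none.
    TokenIdxSetSketch FindByName(const std::string& name) const
    {
        const auto it = m_ByName.find(name);
        if (it == m_ByName.end())
            return TokenIdxSetSketch();
        return it->second;
    }

    const NamedTokenSketch& At(int idx) const
    {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < m_Tokens.size());
        return m_Tokens[static_cast<std::size_t>(idx)];
    }

private:
    std::vector<NamedTokenSketch> m_Tokens;
    std::map<std::string, TokenIdxSetSketch> m_ByName;   // stand-in for the Patricia tree
};
```

After inserting ClassA::ab, ClassB::ab and a global function ab(), FindByName("ab") returns three indexes, and At() tells the three Tokens apart by parent and kind.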

Code completion debugging support

Debug Log output

If you want to debug your plug-in, you may need to log debug messages to the "Code::Blocks Debug" panel. Here is some sample code:

Manager::Get()->GetLogManager()->DebugLog(_("XXXXX "));

wxString name;
wxString args;
wxString m_Str;
//.....
Manager::Get()->GetLogManager()->DebugLog(F(_T("Add token name='")+name+_T("', args='")+args+_T("', return type='") + m_Str+ _T("'")));

You also need to start Code::Blocks with the command line argument below. For example, on Windows:

codeblocks.exe --debug-log

Then a Code::Blocks debug panel will be shown to display the log.

Debug Log output panel

Code-Completion debug tool dialog

When you hold the Shift and Ctrl keys and double-click any entry of the navigator tree, a debug tool dialog pops up with more details about the selected token. You can query information such as its member variables, its ancestors and so on.

CcDebugToolDialog.png

Useful Links

  • A discussion on search tree in the forum [1] and [2].
svn co https://vcfbuilder.svn.sourceforge.net/svnroot/vcfbuilder vcfbuilder