Difference between revisions of "Code Completion Design"

From Code::Blocks
Line 108: Line 108:
 
Note: CodeCompletion plugin is not a preprocessor, so it is difficult to deal with the source mixed with many macro, or some strange marcos. This is something like Ctags' replacement options "−I identifier−list" in [http://ctags.sourceforge.net/ctags.html#OPERATIONAL%20DETAILS ctags option detial] or [http://codelite.org/LiteEditor/FAQ CodeCompletion macro FAQ]
 
Note: CodeCompletion plugin is not a preprocessor, so it is difficult to deal with the source mixed with many macro, or some strange marcos. This is something like Ctags' replacement options "−I identifier−list" in [http://ctags.sourceforge.net/ctags.html#OPERATIONAL%20DETAILS ctags option detial] or [http://codelite.org/LiteEditor/FAQ CodeCompletion macro FAQ]
  
==High level parser==
+
==High level parser(Syntax Analysis)==
 
===Token class===
 
===Token class===
For boosting the speed of allocating Tokens, the "new" and "delete" operator were overloaded in it's base class BlockAllocated.
+
For boosting the speed of allocating Tokens, the "new" and "delete" operator were overloaded in it's base class BlockAllocated. See the [http://en.wikipedia.org/wiki/Memory_pool memory pool] page on wikipedia as a reference.
 
<pre>
 
<pre>
 
class Token  : public BlockAllocated<Token, 10000>
 
class Token  : public BlockAllocated<Token, 10000>
Line 148: Line 148:
 
In BlockAllocated class, there is only a static member say "static BlockAllocator<T, pool_size, debug> allocator;" to keep all the pre-allocated memorys for all derived class.
 
In BlockAllocated class, there is only a static member say "static BlockAllocator<T, pool_size, debug> allocator;" to keep all the pre-allocated memorys for all derived class.
  
'''10000''' means 10000 Tokens were pre-allocated.
+
'''10000''' means a pool of 10000 Tokens were allocated, so, dynamically allocate a Token object will be fast and efficient.  
  
 
[[Image:Cb blockalloc.png|frame|none|  Operator new overloading for fast allocate in the heap]]
 
[[Image:Cb blockalloc.png|frame|none|  Operator new overloading for fast allocate in the heap]]
  
 
===ParserThread===
 
===ParserThread===
The function Parse() will do the most job of syntax analysis. See the pesudo code below.
+
The function Parse() will do the most job of syntax analysis. See the pseudo code below.
 
<pre>
 
<pre>
 
ParserThread::Parse()
 
ParserThread::Parse()
Line 176: Line 176:
  
 
===TokenTree&SearchTree===
 
===TokenTree&SearchTree===
Each Token will be recorded in a database(TokenTree), also, for fast string searching, a compact Patricia tree(SearchTree) is build to keep their names.  
+
When a specific Token is identified, it will be recorded in a database(TokenTree), also, for fast string matching, a '''compact Patricia tree'''(see the wikipedia [http://en.wikipedia.org/wiki/Radix_tree Patricia tree on wikipedia]) is build to keep their names.  
  
 
For example, If you add three item to the TokenTree.
 
For example, If you add three item to the TokenTree.
Line 197: Line 197:
  
 
===Token Depth===  
 
===Token Depth===  
Depth of a Token is the string length from the root tree. See a depth of each node on the search tree below. For example, look at the Node of "hysi", it has a m_Depth = 5 means "" + "p" + "hysi" = 5.
+
Depth of Token is defined by the string length from the root tree. See a depth of each node on the search tree below. For example, the Node of "hysi" has a m_Depth = 5 ("" + "p" + "hysi" = 5).
  
 
[[Image:Cb node depth tree.png]]
 
[[Image:Cb node depth tree.png]]
Line 206: Line 206:
 
[[Image:Cb tree node lable.png]]
 
[[Image:Cb tree node lable.png]]
  
For more information, see the forum discussion here. [/index.php/topic,1696.0.html rickg22's SearchTree development] And see [http://en.wikipedia.org/wiki/Radix_tree Patricia tree on wikipedia] as a reference.
+
For more information, see the forum discussion here. [/index.php/topic,1696.0.html rickg22's SearchTree development] as a reference.
  
===Flow Chart of Token===
+
===flow of Token===
This the a belief view of the Token flow chart.
+
This the a belief view of the Token flow chart. The parser collect the token information, and record them in the TokenSearchTree, the GUI function can query words from the database.  
  
 
[[Image:CCTokenTree1.png]]
 
[[Image:CCTokenTree1.png]]

Revision as of 16:48, 7 March 2009

How to build

Get the source code

When you download the svn source code of code::blocks,(see here Installing_Code::Blocks_from_source_on_Windows#Code::Blocks_sources the source code of CodeCompletion plugin was already included.

See a screen shot of these code opened in code::blocks under windows below.

Code completion source tree opened in code::blocks

Build the code completion plug in

Code completion build target option in code::blocks

Note, you should use "update.bat" to copy the new generated dll to the destination and strip the debug information. Here is the modified bat file which only update CodeCompletion.DLL.

@echo off

setlocal

echo Creating output directory tree

set CB_DEVEL_RESDIR=devel\share\CodeBlocks
set CB_OUTPUT_RESDIR=output\share\CodeBlocks

set ZIPCMD=zip

xcopy /D /y %CB_DEVEL_RESDIR%\plugins\codecompletion.dll %CB_OUTPUT_RESDIR%\plugins\codecompletion.dll

echo Stripping debug info from output tree

strip %CB_OUTPUT_RESDIR%\plugins\codecompletion.dll

see Installing_Code::Blocks_from_source_on_Windows for more information.

Low level parser(Lexical analysis)

For someone haven't heard what does "Token" and "Tokenize" means, you should read the wikibooks article A brief explain of what does a parser do and Tokenize on wikipedia. Shortly, a parser treats your C++ or C code as a large array of characters, then this big string was divided to small atom strings(such as symbols, identifiers, keywords, digital numbers), meanwhile "spaces" and "comments" were ignored.

for a simple c++ program like below

int main()
{
    std::cout << "hello world" << std::endl;
    return 0;
}

After tokenizing, it should give these 15 tokens

1 = string "int"
2 = string "main"
3 = opening parenthesis
4 = closing parenthesis
5 = opening brace
6 = string "std"
7 = namespace operator
8 = string "cout"
9 = << operator
10 = string ""hello world""
11 = string "endl"
12 = semicolon
13 = string "return"
14 = number 0
15 = closing brace

Tokenizer class

A class named "Tokenizer" was introduced in "tokenizer.cpp" and "tokenizer.cpp". There are several steps to running the Tokenizer class.

parser thread

A thread must be created to parse a source file. see parserthread.cpp and parserthread.h

Read a source file

Open the source file and convert the file buff to Unicode mode.(since we are all using Unicode build of code::blocks, and ANSI mode is outdated).

Get or Peek a token

The class contains a Pointer to the current position of the character, you can Get or Peek the current character.

   //Get the current Token and increase the Tokenindex
   wxString GetToken();
   //Peak the current and NOT increase the index
   wxString PeekToken();

For example, if the Tokenizer was parsing the example code above.

After initialize the Tokenizer, call the GetToken() function will return a wxString "int" and increase the token index to pointing to "int", at this time, call the PeekToken() will return a wxString "main", but the tokenindex was still pointing to "main". If you call the GetToken() again, then it will return a "main" and increase the file pointer.

Nested Value

This value was keep to indicate your are in the correct brace pair.If the Tokenizer meets a {, it will increase the nestValue, and if it meets a }, it will decrease the nestValue.

Return a correct token

Special token should be replaced for parsing correctly. For example, in the standard c++ header (mingw), there are a string named "_GLIBCXX_STD", this should be replaced to "std". See the dialog below.

Cc std replacement.png

The inline function in the Tokenizer class will check whether a token should be replaced before return.

   //This is a map, check the first string and return the second string
   inline const wxString& ThisOrReplacement(const wxString& str) const
   {
       ConfigManagerContainer::StringToStringMap::const_iterator it = s_Replacements.find(str);
       if (it != s_Replacements.end())
           return it->second;
       return str;
   }


Code completion build target option in code::blocks

Setting the replacement mapping. Note that before return a token, a replacement map was searched to check if it matches any entry in the map, so, the bigger this map goes, the slower it will do parsing.

Note: CodeCompletion plugin is not a preprocessor, so it is difficult to deal with the source mixed with many macro, or some strange marcos. This is something like Ctags' replacement options "−I identifier−list" in ctags option detial or CodeCompletion macro FAQ

High level parser(Syntax Analysis)

Token class

For boosting the speed of allocating Tokens, the "new" and "delete" operator were overloaded in it's base class BlockAllocated. See the memory pool page on wikipedia as a reference.

class Token  : public BlockAllocated<Token, 10000>
{
        ......
        wxString m_Type; // this is the return value (if any): e.g. const wxString&
        wxString m_ActualType; // this is what the parser believes is the actual return value: e.g. wxString
        wxString m_Name;
        wxString m_Args;
        wxString m_AncestorsString; // all ancestors comma-separated list
        unsigned int m_File;
        unsigned int m_Line;
        unsigned int m_ImplFile;
        unsigned int m_ImplLine; // where the token was met
        unsigned int m_ImplLineStart; // if token is impl, opening brace line
        unsigned int m_ImplLineEnd; // if token is impl, closing brace line
        TokenScope m_Scope;
        TokenKind m_TokenKind;
        bool m_IsOperator;
        bool m_IsLocal; // found in a local file?
        bool m_IsTemp; // if true, the tree deletes it in FreeTemporaries()
        bool m_IsConst;    // the member method is const (yes/no)

        int m_ParentIndex;
        TokenIdxSet m_Children;
        TokenIdxSet m_Ancestors;
        TokenIdxSet m_DirectAncestors;
        TokenIdxSet m_Descendants;
        ......

   
};

Basically, you can see the Token class contains the information for locating. For example, in the previous source code. A Token for "main" should contains it's name (obviously , m_Name="main" ), then m_File will record which file dose this Token exist. m_Line will give the line number of "main" in this source file, and so on.

BlockAllocated class

In BlockAllocated class, there is only a static member say "static BlockAllocator<T, pool_size, debug> allocator;" to keep all the pre-allocated memorys for all derived class.

10000 means a pool of 10000 Tokens were allocated, so, dynamically allocate a Token object will be fast and efficient.

Operator new overloading for fast allocate in the heap

ParserThread

The function Parse() will do the most job of syntax analysis. See the pseudo code below.

ParserThread::Parse()
{
   ......
   do
    {
        ......
        DoParse();
        ......
 
        
    }while(false);

    return result;
}

In the DoParse(), it checks the token from Tokenizer. For example, if the token words = "enum", then, the ParserThread::HandleEnum() will do the job to record this enum block.


TokenTree&SearchTree

When a specific Token is identified, it will be recorded in a database(TokenTree), also, for fast string matching, a compact Patricia tree(see the wikipedia Patricia tree on wikipedia) is build to keep their names.

For example, If you add three item to the TokenTree.

    mytree->AddItem(_T("physics"),_T("1 - uno"));
    mytree->AddItem(_T("physiology"),_T("2 - dos"));
    mytree->AddItem(_T("psychic"),_T("3 - tres"));

The Tree structure will show as

- "" (0)
      \- "p" (4)
              +- "hysi" (2)
              |          +- "cs" (1)
              |          \- "ology" (3)
              \- "sychic" (5)

Token Depth

Depth of Token is defined by the string length from the root tree. See a depth of each node on the search tree below. For example, the Node of "hysi" has a m_Depth = 5 ("" + "p" + "hysi" = 5).

Cb node depth tree.png

Token Lable

For example, the node ""hysi" (2) has two children, they are "cs" (1) and "ology" (3), show below.

Cb tree node lable.png

For more information, see the forum discussion here. [/index.php/topic,1696.0.html rickg22's SearchTree development] as a reference.

flow of Token

This the a belief view of the Token flow chart. The parser collect the token information, and record them in the TokenSearchTree, the GUI function can query words from the database.

CCTokenTree1.png

UI issue

Debug Log output

If you want to debug your plug-in, you may need to Logout the debug information. Mostly, here is the code

Manager::Get()->GetLogManager()->DebugLog(_("XXXXX "));

Also, you need start the codeblocks with the command line argument. For example in windows.

codeblocks.exe --debug-log

then a Code::blocks debug panel will be shown to display the log.

Debug Log output panel

Usefull Links

A discussion on search tree in the forum [/index.php/topic,1696.0.html] and [/index.php/topic,1581.0.html].

Another opensource IDE VCF builderorvcfbuilder on sf, By the way , VCF use a CodeStore based on antlr parser to deal with parsing, the whole source code can be check out from

svn co https://vcfbuilder.svn.sourceforge.net/svnroot/vcfbuilder vcfbuilder


online bookData Structures and Algorithms II Course and pdf lectures

A forum discussion on why a standard perprocessor and parser works slowly. [/index.php/topic,2494.msg19801.html#msg19801 parsing iostream header takes 5 seconds]