Unicode Standards

From Code::Blocks

Latest revision as of 16:27, 30 September 2006

This page collects the current Unicode conventions and good practices for developing the Code::Blocks IDE.

Macros

Constants

Macro       Description
__TFILE__   wxWidgets-provided equivalent of __FILE__
__TDATE__   wxWidgets-provided equivalent of __DATE__
__TTIME__   wxWidgets-provided equivalent of __TIME__

_U()

Use _U() to convert non-literal char* strings to wxString, for example when reading attributes from a TiXmlNode. Whenever a function returns a char* string, wrap the result in our _U() macro.

Code:

#if wxUSE_UNICODE  // wxUSE_UNICODE is always defined (as 0 or 1), so test its value
    #define _U(x) wxString((x),wxConvUTF8)
    #define _UU(x,y) wxString((x),(y))
#else
    #define _U(x) (x)
    #define _UU(x,y) (x)
#endif

For example:

const char* incompatible = "This is an incompatible string";
wxString compatible = _U(incompatible);
// wxString conftype = conf->Attribute("ConfigurationType"); // before
wxString conftype = _U(conf->Attribute("ConfigurationType")); // after
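
To see why a conversion is needed at all: a UTF-8 char* string stores one character in one to four bytes, so it cannot simply be reinterpreted as wide characters. The following is a minimal standalone sketch of that decoding step, not the wxWidgets implementation. decode_utf8 is a hypothetical helper, it returns a std::u32string instead of a wxString so that it compiles without wxWidgets, and it handles malformed input only loosely.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decodes a NUL-terminated UTF-8 string into code points.
std::u32string decode_utf8(const char* s) {
    std::u32string out;
    while (*s) {
        unsigned char lead = static_cast<unsigned char>(*s++);
        char32_t cp;
        int extra;  // number of continuation bytes expected after the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else                            { cp = 0xFFFD;      extra = 0; }  // invalid lead byte
        // Each continuation byte contributes its low six bits.
        while (extra-- > 0 && (static_cast<unsigned char>(*s) & 0xC0) == 0x80) {
            cp = (cp << 6) | (static_cast<unsigned char>(*s++) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

void check_decode_utf8() {
    assert(decode_utf8("abc") == U"abc");
    assert(decode_utf8("\xC3\xA9").size() == 1);  // two bytes, one character
}
```

The two-byte input in the check is one character after decoding, which is exactly the difference a plain char* copy would get wrong.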

_C()

_C() yields a multibyte C string; see the wxWidgets help (wxMBConv classes overview). Use it when interacting with APIs that need a char const*, such as saving things to TinyXML.

It is defined in the code as:

#if wxUSE_UNICODE
    #define _UU(x,y) wxString((x),(y))
    #define _CC(x,y) (x).mb_str((y))
#else
    #define _UU(x,y) (x)
    #define _CC(x,y) (x)
#endif
#define _U(x) _UU((x),wxConvUTF8)
#define _C(x) _CC((x),wxConvUTF8)

_T()/wxT()

_T() and wxT() are used for fixed text, such as XRC resource object names. They only add an L before the literal, and only in a Unicode build.

_T() and wxT() are macros which can be used with character and string literals (in other words, 'x' or "foo") to automatically convert them to Unicode in a Unicode build configuration. Please see the wxWidgets Unicode overview for more information.

In an ASCII build these macros simply return the value passed to them unchanged. In fact, the wxT() definition is:

#ifdef UNICODE
    #define wxT(x) L ## x
#else // !Unicode
    #define wxT(x) x
#endif
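
The token-pasting trick can be tried outside wxWidgets. This is only a sketch: DEMO_UNICODE, DEMO_T and demo_char are made-up stand-ins for the real build switch, wxT() and wxChar, so the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for the real Unicode build flag;
// set it to 0 to get the ASCII variant.
#define DEMO_UNICODE 1

#if DEMO_UNICODE
    #define DEMO_T(x) L ## x   // pastes L onto the literal: "foo" becomes L"foo"
    typedef wchar_t demo_char;
#else
    #define DEMO_T(x) x        // leaves the literal untouched
    typedef char demo_char;
#endif

// The same source line yields a wide or a narrow literal depending on the switch.
const demo_char* demo_string() { return DEMO_T("foo"); }
demo_char demo_letter() { return DEMO_T('x'); }

void check_demo_t() {
#if DEMO_UNICODE
    assert(std::wcscmp(demo_string(), L"foo") == 0);
    assert(demo_letter() == L'x');
#else
    assert(std::strcmp(demo_string(), "foo") == 0);
    assert(demo_letter() == 'x');
#endif
}
```

Because the macro works by pasting tokens, it only makes sense on literals; for a char* variable there is nothing to paste onto, which is why _U() exists.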

_T() is exactly the same as wxT() and is defined in wxWidgets simply because it may be more intuitive for Windows programmers as the standard Win32 headers also define it (as well as yet another name for the same macro which is _TEXT()).

Don't confuse this macro with _()!

wxChar _T(char ch)
const wxChar * _T(const char * s)

_()

_() is used for text that may be translated into other user languages.

This macro expands into a call to the wxGetTranslation function, so it marks the message for extraction by xgettext just as wxTRANSLATE does, but it also returns the translation of the string for the current locale during execution.

Don't confuse this macro with _T()!

wxPLURAL = This macro is identical to _(), but uses the plural variant of wxGetTranslation.

const wxChar * wxPLURAL(const char *sing, const char *plur, size_t n)
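
A rough sketch of what the plural variant boils down to when no translation catalog is loaded, assuming the usual English fallback (singular for exactly one item, plural otherwise). plural_fallback and describe_files are hypothetical names, and std::string stands in for the wx string types; with a real catalog the translation's own plural rules would apply instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the untranslated fallback of a plural lookup.
std::string plural_fallback(const std::string& sing, const std::string& plur, std::size_t n) {
    return n == 1 ? sing : plur;
}

// Builds a user-visible message whose noun depends on the count.
std::string describe_files(std::size_t n) {
    return std::to_string(n) + " " + plural_fallback("file", "files", n);
}

void check_plural() {
    assert(describe_files(1) == "1 file");
    assert(describe_files(3) == "3 files");
}
```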

Guidelines

char & wxChar: Do not use wxChar for data that is not a text character, because in a Unicode build a wxChar is a wide character (wchar_t: 16 bits on Windows, typically 32 bits on Linux), not an 8-bit value:

Example for text:

wxChar im_a_character = _T('f');

Example for non-text data (not a character):

char im_a_byte = 254;

but perhaps better would be to use:

byte im_a_byte = 254;

so it's clear that it's a byte and not a character.
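
There is a second reason to prefer a dedicated byte type: plain char is signed on many platforms, so 254 does not fit into it. The sketch below uses unsigned char under the made-up name demo_byte; the actual byte typedef to use in C::B code is not specified here.

```cpp
#include <cassert>
#include <climits>

typedef unsigned char demo_byte;  // hypothetical name for an 8-bit data type

// Round-trips a raw value through the byte type.
int store_byte(int value) {
    demo_byte b = static_cast<demo_byte>(value);
    return b;
}

// True when plain char cannot hold 254 on this platform (that is, char is signed).
bool plain_char_is_too_small() { return CHAR_MAX < 254; }

void check_bytes() {
    assert(store_byte(254) == 254);
    assert(UCHAR_MAX >= 255);
}
```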


Other:

Problem code:

// indent code accordingly
wxString code = it->second;
code.Replace("\n", '\n' + lineIndent);

Solution: If the function expects a const char*, use plain string literals. If it expects a wxChar or wxString, wrap the literals in _T(). For example:

// indent code accordingly
wxString code = it->second;
code.Replace(_T("\n"), _T('\n') + lineIndent);

Some of the strings already converted in C::B use _() where they should use _T().

Example:

WRONG: wxXmlResource::Get()->LoadDialog(this, parent, _("dlgGenericMultiSelect"));

dlgGenericMultiSelect is a reference to a resource and must never be translated, so it must use _T() instead.

RIGHT: wxXmlResource::Get()->LoadDialog(this, parent, _T("dlgGenericMultiSelect"));

And don't forget to check single-character literals, too!

All operations on wxStrings (not char*s) should use _("string") for strings displayed to the user and _T("string") for strings used internally.

For Printf-like functions, pass c_str(). (The examples on wxwidgets.org use different arguments for the Unicode and non-Unicode versions even though the format string is "%s" in both.) For example:

tmpkey.Printf(_T("%s/editor/keywords/%d"), key.c_str(), i);

XRCID and XRCCTRL macros

The string arguments of the XRCID and XRCCTRL macros must not be wrapped in _T()! The macros already do the conversion themselves.

WRONG:   XRCCTRL(*this, _T("lblLabel"), wxStaticText)->SetLabel(label);
RIGHT:   XRCCTRL(*this, "lblLabel", wxStaticText)->SetLabel(label);

Concatenated strings

_() is a macro that calls a wxWidgets function, so all the pieces of a concatenated literal must go inside a single call:

_("string 1" "string2" ... )

The _T() macro simply adds an L before the string literal given as its parameter (in a Unicode build; in a normal build it does nothing to the string), so every piece needs its own _T():

_T("string1") _T("string2") ...
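
The per-piece form can be checked in isolation. In this sketch DEMO_T is a made-up stand-in for _T() in a Unicode build, so no wxWidgets headers are needed.

```cpp
#include <cassert>
#include <cwchar>

#define DEMO_T(x) L ## x  // hypothetical stand-in for _T() in a Unicode build

// Each piece gets its own L prefix; the compiler then joins the adjacent literals.
const wchar_t* joined() { return DEMO_T("string1") DEMO_T("string2"); }

void check_joined() {
    assert(std::wcscmp(joined(), L"string1string2") == 0);
}
```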

Gotchas

_C() can return a proxy

Don't write code like:

char const * psz = _C( str ); // formerly str.mb_str(wxConvUTF8);

_C(), in Unicode mode, returns a buffer object, not a raw pointer. This is a good thing, because the buffer's destructor takes care of freeing the memory used by the string. The buffer is implicitly convertible to char const* so that it can be used immediately in things like strlen( str.mb_str() ), but that opens up the error described here.

What actually happens in the above code? wxString::mb_str() returns a buffer object. The buffer's implicit conversion to char const* kicks in, and the result is stored in psz. The temporary buffer then goes out of scope and its destructor deletes the memory, leaving psz dangling. Windows does not seem to care, but on Linux the memory has often already been reused by the time psz is used again in the code.

Solution:

wxWX2MBbuf psz = str.mb_str(wxConvUTF8);

wxWX2MBbuf takes ownership of the buffer (no, it's not copied; ownership is transferred, with semantics similar to std::auto_ptr). That way you can actually use the memory until psz goes out of scope and deletes it.
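
The lifetime problem can be reproduced without wxWidgets. In this sketch DemoBuffer and to_multibyte are hypothetical stand-ins for the buffer class and for mb_str(); unlike the real buffer, the stand-in copies its data, and it counts live objects only so that the destruction of the temporary becomes visible.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-in for the buffer object returned by mb_str().
class DemoBuffer {
public:
    explicit DemoBuffer(const std::string& s) : data_(s) { ++live; }
    DemoBuffer(const DemoBuffer& other) : data_(other.data_) { ++live; }
    ~DemoBuffer() { --live; }
    operator const char*() const { return data_.c_str(); }  // implicit conversion
    static int live;  // number of buffers currently alive
private:
    std::string data_;
};
int DemoBuffer::live = 0;

DemoBuffer to_multibyte(const std::string& s) { return DemoBuffer(s); }

// Problem: only the raw pointer is kept, so no buffer survives the statement.
int live_after_raw_pointer() {
    const char* psz = to_multibyte("abc");  // the temporary dies at the semicolon
    (void)psz;                              // dangling, must not be dereferenced
    return DemoBuffer::live;
}

// Solution: keep the buffer in a named variable so the pointer stays valid.
std::size_t length_with_named_buffer() {
    DemoBuffer buf = to_multibyte("abc");
    const char* psz = buf;
    return std::strlen(psz);
}

void check_buffer_lifetime() {
    assert(live_after_raw_pointer() == 0);
    assert(length_with_named_buffer() == 3);
}
```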

Printf uses wxChars

When using wxString::Printf, %s wants a wxChar const*, so just use .c_str(). This is important to watch out for because Printf uses varargs, which aren't typesafe, so the compiler doesn't catch the error. If, for example, Mandrav uses .mb_str(), the compiler won't say anything, because in non-Unicode mode mb_str() is the same as c_str() and returns a char const*. However, when me22 runs it in Unicode mode, mb_str() returns a proxy (see above), which can't be passed through a vararg, and the program crashes at runtime.
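
The same rule shows up with the standard library. This sketch uses std::swprintf and std::wstring as stand-ins for wxString::Printf and wxString; keyword_path is a made-up helper. Note that the standard function spells a wide string argument as %ls, whereas the Printf call shown earlier uses %s.

```cpp
#include <cassert>
#include <cwchar>
#include <string>

// Hypothetical helper: formats a key and an index into a wide string.
std::wstring keyword_path(const std::wstring& key, int i) {
    wchar_t buf[128];
    // %ls consumes a const wchar_t*, which is exactly what c_str() returns here.
    // Passing an object instead of a pointer would compile but is undefined.
    int n = std::swprintf(buf, sizeof(buf) / sizeof(buf[0]),
                          L"%ls/editor/keywords/%d", key.c_str(), i);
    return n < 0 ? std::wstring() : std::wstring(buf, static_cast<std::size_t>(n));
}

void check_keyword_path() {
    assert(keyword_path(L"cpp", 3) == L"cpp/editor/keywords/3");
}
```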

Streaming a plain char fails silently

This one was the source of the mysterious bug that replaced all the )'s in the class browser with 41's :P

Problem:

In a Unicode build, the following compiles fine and doesn't crash:

my_stream << '\n';

However, my_stream traffics in wxChars, and '\n' is a plain char. This means that integer promotion is applied to '\n', so the effect is as if the code were:

my_stream << static_cast<int>('\n');

This outputs a number instead of the character.

Solution:

Don't forget the _T()s!

Alternatively, use "\n" instead, which the compiler doesn't silently convert, so an easy-to-spot-and-fix compile error will occur.
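
The overload choice behind this bug can be shown with a tiny stand-in. DemoStream is hypothetical and only assumes what the text above describes: a stream that accepts both wide characters and integers. A plain char reaches the int overload by promotion, which the compiler prefers over converting it to wchar_t.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a wide-character text stream that also accepts integers.
class DemoStream {
public:
    DemoStream& operator<<(wchar_t c) { text_ += c; return *this; }
    DemoStream& operator<<(int n) { text_ += std::to_wstring(n); return *this; }
    const std::wstring& str() const { return text_; }
private:
    std::wstring text_;
};

// A plain char is promoted to int, so its character code is printed.
std::wstring stream_plain_char() { DemoStream s; s << ')'; return s.str(); }

// A wide literal matches the wchar_t overload exactly.
std::wstring stream_wide_char() { DemoStream s; s << L')'; return s.str(); }

void check_streaming() {
    assert(stream_plain_char() == L"41");  // assumes an ASCII execution character set
    assert(stream_wide_char() == L")");
}
```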

See also

wxWidgets Unicode reference: http://www.wxwidgets.org/manuals/2.6.3/wx_unicode.html#unicode