7.2.2 Unicode Objects
These are the basic Unicode object types used for the Unicode
implementation in Python:
- Py_UNICODE
-
This type represents a 16-bit unsigned storage type which Python uses
internally as the basis for holding Unicode ordinals. On platforms
where wchar_t is available and is also 16 bits wide,
Py_UNICODE is a typedef alias for wchar_t to enhance
native platform compatibility. On all other platforms,
Py_UNICODE is a typedef alias for unsigned short.
- PyUnicodeObject
-
This subtype of PyObject represents a Python Unicode object.
- PyTypeObject PyUnicode_Type
-
This instance of PyTypeObject represents the Python Unicode type.
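The choice of typedef described for Py_UNICODE can be sketched in
standalone C. This is only an illustration under stated assumptions:
sketch_unicode and sketch_unicode_bits are hypothetical names, and the
real typedef is selected by Python's build configuration rather than by
a WCHAR_MAX test like the one used here.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* WCHAR_MAX */
#include <wchar.h>    /* wchar_t */

/* sketch_unicode is a hypothetical stand-in for Py_UNICODE. */
#if WCHAR_MAX == 0xFFFF
typedef wchar_t sketch_unicode;         /* wchar_t is an unsigned 16-bit type */
#else
typedef unsigned short sketch_unicode;  /* fallback on all other platforms */
#endif

/* Whichever branch was taken, one unit must hold a 16-bit ordinal. */
static_assert(sizeof(sketch_unicode) * CHAR_BIT >= 16,
              "storage unit too narrow for 16-bit ordinals");

/* Number of bits in one storage unit of the chosen type. */
static size_t sketch_unicode_bits(void)
{
    return sizeof(sketch_unicode) * CHAR_BIT;
}
```

Calling sketch_unicode_bits() reports the width of the type that was
selected on the platform where the sketch is compiled.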
The following APIs are really C macros and can be used to do fast
checks and to access internal read-only data of Unicode objects:
- int PyUnicode_Check (PyObject *o)
-
Returns true if the object o is a Unicode object.
- int PyUnicode_GET_SIZE (PyObject *o)
-
Returns the size of the object, counted in Py_UNICODE storage units.
o has to be a PyUnicodeObject (not checked).
- int PyUnicode_GET_DATA_SIZE (PyObject *o)
-
Returns the size of the object's internal buffer in bytes. o has to be
a PyUnicodeObject (not checked).
- Py_UNICODE* PyUnicode_AS_UNICODE (PyObject *o)
-
Returns a pointer to the internal Py_UNICODE buffer of the object. o
has to be a PyUnicodeObject (not checked).
- const char* PyUnicode_AS_DATA (PyObject *o)
-
Returns a (const char *) pointer to the internal buffer of the object.
o has to be a PyUnicodeObject (not checked).
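The relationship between the size and data-size macros can be sketched
without the Python headers. Everything below is a stand-in: sketch_object
is a hypothetical, simplified view of a Unicode object (not the real
layout), and the sketch assumes the internal buffer holds exactly as many
storage units as the size macro reports.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short sketch_unicode;   /* stand-in for Py_UNICODE */

/* Hypothetical, simplified object: a length in storage units plus a
   pointer to the buffer. This is not the real PyUnicodeObject layout. */
typedef struct {
    int length;
    const sketch_unicode *str;
} sketch_object;

/* Counterpart of PyUnicode_GET_SIZE: length in storage units. */
static int sketch_get_size(const sketch_object *o)
{
    assert(o != NULL);
    return o->length;
}

/* Counterpart of PyUnicode_GET_DATA_SIZE: the same buffer measured in bytes. */
static size_t sketch_get_data_size(const sketch_object *o)
{
    assert(o != NULL);
    return (size_t)o->length * sizeof(sketch_unicode);
}

/* Counterpart of PyUnicode_AS_DATA: the same buffer viewed as bytes. */
static const char *sketch_as_data(const sketch_object *o)
{
    assert(o != NULL);
    return (const char *)o->str;
}
```

Under these assumptions, a three-unit buffer has a size of 3 and a data
size of 3 * sizeof(sketch_unicode) bytes, and both pointer accessors
refer to the same memory.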
Unicode provides many different character properties. The most commonly
needed ones are available through these macros, which are mapped to C
functions depending on the Python configuration.
- int Py_UNICODE_ISSPACE (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is a whitespace character.
- int Py_UNICODE_ISLOWER (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is a lowercase character.
- int Py_UNICODE_ISUPPER (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is an uppercase character.
- int Py_UNICODE_ISTITLE (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is a titlecase character.
- int Py_UNICODE_ISLINEBREAK (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is a linebreak character.
- int Py_UNICODE_ISDECIMAL (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is a decimal character.
- int Py_UNICODE_ISDIGIT (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is a digit character.
- int Py_UNICODE_ISNUMERIC (Py_UNICODE ch)
-
Returns 1/0 depending on whether ch is a numeric character.
These APIs can be used for fast direct character conversions:
- Py_UNICODE Py_UNICODE_TOLOWER (Py_UNICODE ch)
-
Returns the character ch converted to lower case.
- Py_UNICODE Py_UNICODE_TOUPPER (Py_UNICODE ch)
-
Returns the character ch converted to upper case.
- Py_UNICODE Py_UNICODE_TOTITLE (Py_UNICODE ch)
-
Returns the character ch converted to title case.
- int Py_UNICODE_TODECIMAL (Py_UNICODE ch)
-
Returns the character ch converted to a non-negative decimal integer.
Returns -1 in case this is not possible. Does not raise exceptions.
- int Py_UNICODE_TODIGIT (Py_UNICODE ch)
-
Returns the character ch converted to a single digit integer.
Returns -1 in case this is not possible. Does not raise exceptions.
- double Py_UNICODE_TONUMERIC (Py_UNICODE ch)
-
Returns the character ch converted to a (non-negative) double.
Returns -1.0 in case this is not possible. Does not raise exceptions.
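Because the conversion macros signal failure through a return value of -1
instead of an exception, callers have to test the result themselves. The
following standalone sketch shows that calling pattern. sketch_todecimal
is a hypothetical stand-in for Py_UNICODE_TODECIMAL that only recognizes
the ASCII digits, and sketch_parse_decimal is an illustrative helper, not
part of the Python API.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short sketch_unicode;   /* stand-in for Py_UNICODE */

/* Hypothetical stand-in for Py_UNICODE_TODECIMAL. It only knows the
   ASCII digits; the real macro covers other decimal characters too. */
static int sketch_todecimal(sketch_unicode ch)
{
    if (ch >= '0' && ch <= '9')
        return (int)(ch - '0');
    return -1;   /* not a decimal character; no exception is raised */
}

/* Parse the leading decimal digits of buf[0..size-1] into *value.
   Returns the number of units consumed (0 if buf starts with a
   non-digit). Overflow is not handled in this sketch. */
static int sketch_parse_decimal(const sketch_unicode *buf, int size,
                                long *value)
{
    int i;
    long result = 0;

    assert(buf != NULL && value != NULL);
    for (i = 0; i < size; i++) {
        int d = sketch_todecimal(buf[i]);
        if (d < 0)        /* -1 means "stop here", not an error state */
            break;
        result = result * 10 + d;
    }
    *value = result;
    return i;
}
```

Given the four units '4', '2', 'x', '7', the helper consumes two units
and stores 42, stopping at the first unit for which the conversion
reports -1.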
To create Unicode objects and access their basic sequence properties,
use these APIs:
- PyObject* PyUnicode_FromUnicode (const Py_UNICODE *u, int size)
-
Return value:
New reference.
Create a Unicode object from the Py_UNICODE buffer u of the
given size. If u is NULL, the contents of the new object are
undefined and it is the caller's responsibility to fill in the needed
data. Otherwise, the buffer is copied into the new object.
- Py_UNICODE * PyUnicode_AsUnicode (PyObject *unicode)
-
Return a read-only pointer to the Unicode object's internal
Py_UNICODE buffer.
- int PyUnicode_GetSize (PyObject *unicode)
-
Return the length of the Unicode object.
- PyObject* PyUnicode_FromObject (PyObject *obj)
-
Return value:
New reference.
Coerce obj to a Unicode object and return a reference with
incremented refcount.
Coercion is done in the following way:
- Unicode objects are passed back as-is with incremented
refcount.
- String and other char buffer compatible objects are decoded
on the assumption that they contain UTF-8 data. Decoding
is done in "strict" mode.
- All other objects raise an exception.
The API returns NULL in case of an error. The caller is responsible
for calling Py_DECREF on the returned object.
If the platform supports wchar_t and provides a header file
wchar.h, Python can interface directly to this type using the
following functions. Support is optimized if Python's own
Py_UNICODE type is identical to the system's wchar_t.
- PyObject* PyUnicode_FromWideChar (const wchar_t *w, int size)
-
Return value:
New reference.
Create a Unicode object from the wchar_t buffer w of the
given size. Returns NULL on failure.
- int PyUnicode_AsWideChar (PyUnicodeObject *unicode, wchar_t *w, int size)
-
Copies the Unicode object contents into the wchar_t buffer
w. At most size wchar_t characters are copied.
Returns the number of wchar_t characters copied or -1 in case
of an error.
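The "at most size characters" contract of PyUnicode_AsWideChar can be
sketched in standalone C. Both functions below are hypothetical stand-ins
operating on a plain buffer, not on a real Unicode object. The text above
does not say whether a terminating null character is written, so the
second helper assumes it is not and lets the caller reserve one slot and
terminate the result itself.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

typedef unsigned short sketch_unicode;   /* stand-in for Py_UNICODE */

/* Copy at most size units of src[0..length-1] into w.
   Returns the number of units copied, or -1 on invalid arguments. */
static int sketch_as_wide_char(const sketch_unicode *src, int length,
                               wchar_t *w, int size)
{
    int i, n;

    if (src == NULL || w == NULL || length < 0 || size < 0)
        return -1;
    n = length < size ? length : size;   /* truncate to the caller's limit */
    for (i = 0; i < n; i++)
        w[i] = (wchar_t)src[i];
    return n;
}

/* Caller-side pattern: keep one slot of a capacity-sized buffer free
   and always terminate the copied text. */
static int sketch_copy_terminated(const sketch_unicode *src, int length,
                                  wchar_t *w, int capacity)
{
    int n;

    assert(capacity > 0);                /* room for the terminator */
    n = sketch_as_wide_char(src, length, w, capacity - 1);
    if (n < 0)
        return -1;
    w[n] = L'\0';
    return n;
}
```

Copying a five-unit source into a four-slot buffer with
sketch_copy_terminated yields three copied units followed by the
terminator, and the return value tells the caller that truncation
occurred.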