Unicode
Unicode is a character encoding standard that assigns a unique number to every character, no matter the language or computer platform. This is important in a global, networked world, and for computer systems that must accommodate multiple languages and special characters. Unicode unifies all of these into a single standard.
In Unicode, every character is represented by a numeric value. For example, 65 is the letter A, 66 is B, and so on. The lower-case letters start at 97. There are even special system-only characters in the list, such as carriage returns and tabs. Some of these are useful for controlling how text is displayed, while others are leftovers from older systems.
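For example, in Java (the language used throughout the rest of this article), casting a char to an int reveals the number behind each character. A minimal sketch:

char letterA = 'A';
System.out.println((int)letterA);//prints 65
System.out.println((int)'a');//prints 97
System.out.println((int)'\t');//prints 9, the tab control character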
ASCII
ASCII, like Unicode, uses numbers to represent symbols and letters. The main difference between the two is how they encode each character and how many bits they use to do it. ASCII originally used seven bits per character; this was later increased to eight with Extended ASCII because the original 128 values proved too limiting.
As ASCII expanded, many non-standard Extended ASCII character sets began to emerge. To reconcile these incompatible systems, Unicode was adopted.
Advantages of Unicode over ASCII
Unicode uses variable-bit encoding schemes: you can choose between 32-, 16-, and 8-bit encodings. Using more bits per character lets you represent more characters at the expense of larger files, while using fewer bits covers a smaller range but saves a lot of space. Using fewer bits (i.e., UTF-8 or ASCII) is usually best if you are encoding a large document in English.
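One way to see this trade-off is to compare how many bytes the same text occupies under different encodings. The Java sketch below uses the standard getBytes method; the byte counts shown assume the plain-English string "Hello".

import java.nio.charset.StandardCharsets;//supplies the built-in charset constants

String text = "Hello";
System.out.println(text.getBytes(StandardCharsets.UTF_8).length);//prints 5 - one byte per English letter
System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length);//prints 10 - two bytes per character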
Unicode offers many advantages over ASCII. Its variable-bit encodings can accommodate a huge number of characters; as a result, Unicode currently covers most written languages and still has room for more. This includes left-to-right scripts like English as well as right-to-left scripts like Arabic. Chinese, Japanese, and many other scripts are also represented within Unicode. So Unicode won't be replaced anytime soon.
In Java, char is short for character. A char is 16 bits in size, meaning it takes up 16 binary places in memory. As a review, consider the number of binary places needed to represent the number 77.
int places = (int)(Math.log10(77)/Math.log10(2));//returns 6
The actual value of 77 in binary is 1001101
To represent the number 77 in memory therefore requires 6 + 1, or 7, bits.
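You can verify this with the built-in Integer.toBinaryString method, which returns the binary digits of an int as a String. A quick sketch:

String bits = Integer.toBinaryString(77);
System.out.println(bits);//prints 1001101
System.out.println(bits.length());//prints 7, the number of bits needed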
Most of the time, the actual data stored in a char doesn't take up more than 8 bits; the reason Java allots 16 bits is so that characters from virtually every written language can be represented. This representation uses the Unicode format described previously.
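Java exposes this size directly, so the 16-bit figure is easy to verify using the built-in Character constants:

System.out.println(Character.SIZE);//prints 16, the number of bits in a char
System.out.println((int)Character.MAX_VALUE);//prints 65535, the largest unicode value a char can hold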
Declaring char variable types
You'd think a char could simply hold any letter or digit, and that is true, but there is one key detail. You can either declare a char variable with single quotes, e.g., setting a char's value to the letter 'a', or you can omit the quotes and assign the character's Unicode value directly. Take a look at the following code for declaring a char variable equal to 77.
char seventySeven = 77;
System.out.println(seventySeven);//prints M, the unicode equivalent
The char data type can also be assigned the character 'M' directly using single quotes,
char seventySeven = 'M';
System.out.println(seventySeven);//prints M
Compatibility between char and String data types
char and String values cannot be assigned to each other.
In the example above, if we had tried to declare the char as the string "77", with double quotes, there would be an error - incompatible types: String cannot be converted to char
char seventySeven = "77";//Error!
Recall that char variables must be declared with single quotes, but the following code also produces an error, because '77' contains two characters and a char literal can hold only one.
char seventySeven = '77';//Error!
Compatibility between char and int data types
char values can be assigned to int variables. This is possible because every char has a unicode (int) equivalent.
The example below illustrates how the unicode (int) equivalent of any symbol can be found using the char and int data types.
char seventySeven = 'M';
int M = seventySeven;
System.out.println(M);//prints the unicode equivalent, 77
The above illustrates how the unicode equivalent of a symbol or letter can be found by assigning the char to an int. The reverse process, however, is not allowed. Consider the following example, which results in a possible lossy conversion from int to char error,
int unicode = 12345678;
char symbol = unicode;//lossy conversion error!
System.out.println(symbol);
The above is illegal because char variables can only take on Unicode values from 0 to 65535, while int variables can exceed 2 billion. The compiler justly complains about "possible loss of precision" and refuses to compile. However, the conversion can be forced by casting the int to a char,
int unicode = 12345678;
char symbol = (char)unicode;
System.out.println(symbol);//prints 慎
A String can be converted to a char using the charAt method. This is illustrated below,
String wString = "W";
char wChar = wString.charAt(0);//wChar now holds the character 'W'
int wUnicode = wChar;//converts 'W' to the unicode equivalent
System.out.println(wUnicode);//prints 87
The symbols for the upper-case letters in our alphabet begin at number 65 in the unicode system. The symbols for the lower-case letters begin at number 97. The letters of our alphabet and their corresponding unicode values are shown below.
char | unicode | char | unicode
---|---|---|---
A | 65 | a | 97
B | 66 | b | 98
C | 67 | c | 99
D | 68 | d | 100
E | 69 | e | 101
F | 70 | f | 102
G | 71 | g | 103
H | 72 | h | 104
I | 73 | i | 105
J | 74 | j | 106
K | 75 | k | 107
L | 76 | l | 108
M | 77 | m | 109
N | 78 | n | 110
O | 79 | o | 111
P | 80 | p | 112
Q | 81 | q | 113
R | 82 | r | 114
S | 83 | s | 115
T | 84 | t | 116
U | 85 | u | 117
V | 86 | v | 118
W | 87 | w | 119
X | 88 | x | 120
Y | 89 | y | 121
Z | 90 | z | 122
Notice that there is a numerical difference of 32 between each uppercase letter and its lowercase equivalent. This enables easy conversion between uppercase and lowercase char values,
char bigLetter = 'H';
//Adding an int to a char returns an int, so the result must be cast
char smallLetter = (char)(bigLetter + 32);
System.out.println(smallLetter);//prints h
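The +32 trick works for the English alphabet, but note that the standard library also provides Character.toUpperCase and Character.toLowerCase, which handle many other scripts as well:

System.out.println(Character.toLowerCase('H'));//prints h
System.out.println(Character.toUpperCase('h'));//prints H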