Character Sets and Encodings

ASCII #

RFC 20 ASCII format for Network Interchange
ASCII is a character set.
- Codes 0-31 and 127 (DEL) are non-printable control codes, originally used to control peripherals such as printers.
- Codes 32-126 are printable characters (code 32 is the space); the extended variants of the ASCII table keep them unchanged.
- ASCII is a 7-bit code with 128 code points. Modern systems store each character in an 8-bit byte with the high bit set to 0.
HTTP/1.1 uses US-ASCII as the basic character set for the request line in requests, the status line in responses (except the reason phrase), and the field names, but allows any octet in the field values and the message body.
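As a rough illustration of the ASCII ranges, the sketch below classifies single bytes. The helper name classifyASCII is hypothetical, and it assumes the common convention that 127 (DEL) counts as a control code.

```go
package main

import "fmt"

// classifyASCII is a hypothetical helper that reports which
// ASCII range a byte falls into.
func classifyASCII(b byte) string {
	switch {
	case b < 32 || b == 127:
		return "control" // 0-31 and DEL
	case b < 127:
		return "printable" // 32-126, including space
	default:
		return "non-ASCII" // 128-255 is outside 7-bit ASCII
	}
}

func main() {
	for _, b := range []byte{'\n', ' ', 'A', '~', 127, 200} {
		fmt.Printf("%3d -> %s\n", b, classifyASCII(b))
	}
}
```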

EASCII #

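Extended ASCII (EASCII) refers to 8-bit encodings that keep ASCII in codes 0-127 and add further characters in codes 128-255. ISO 8859-1 (Latin-1) is one example.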

UNICODE #

RFC 5198: Unicode Format for Network Interchange
unicode org
The Unicode Standard defines a single character set intended to cover the characters of all natural languages. Its code space holds 1,114,112 code points (U+0000 to U+10FFFF, roughly 1.1 million), which lets it support the world's languages and scripts in one universal standard.
Unicode is ASCII compatible: code points U+0000 to U+007F are identical to ASCII.

UTF-8 #

RFC 3629: UTF-8, a transformation format of ISO 10646
UTF-8 is defined by the Unicode Standard [UNICODE]
In UTF-8, characters from the U+0000..U+10FFFF range are encoded using sequences of 1 to 4 octets.

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
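
The table can be turned into code fairly directly. The following is a minimal sketch, not a full implementation: encodeUTF8 is a hypothetical helper that assumes its input is a valid Unicode scalar value (it does not reject surrogates or out-of-range values). The main function compares it against the standard library's utf8.EncodeRune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 is a hypothetical helper that follows the RFC 3629 table.
// Assumption: r is a valid scalar value in U+0000..U+10FFFF.
func encodeUTF8(r rune) []byte {
	switch {
	case r <= 0x7F: // 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r&0x3F)}
	case r <= 0xFFFF: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte((r>>12)&0x3F),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	}
}

func main() {
	for _, r := range []rune{'A', 'é', '€', '😀'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)
		fmt.Printf("%U  manual=% X  stdlib=% X\n", r, encodeUTF8(r), buf[:n])
	}
}
```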

Example #

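HTTP header: Content-Type: text/plain; charset="UTF-8"
Go: source code is UTF-8 text. A rune is an int32 holding a Unicode code point, and ranging over a string decodes its UTF-8 bytes into runes.
Rust: the struct std::string::String is a UTF-8 encoded, growable string.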
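
To make the Go case concrete, here is a small sketch of the difference between bytes and code points in a Go string. The helper name counts is hypothetical; len and utf8.RuneCountInString are standard.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper returning the length of s
// in bytes and in Unicode code points (runes).
func counts(s string) (byteLen, runeLen int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "héllo" // 'é' takes two bytes in UTF-8
	b, r := counts(s)
	fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)

	// Ranging over a string yields byte offsets and decoded runes.
	for i, c := range s {
		fmt.Printf("byte offset %d: %U %q\n", i, c, c)
	}
}
```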

Base64 #

RFC 4648, Section 4
A 24-bit (3-byte) sequence is represented by four 6-bit Base64 digits.
- 4 chars represent 4 * 6 = 24 bits = 3 bytes (ignoring padding and rounding up).
- A 3-byte string becomes a 4-char string after encoding, which means the size increases by about 33%.
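
The size arithmetic can be checked with a short sketch. paddedLen is a hypothetical helper implementing the "round up to a multiple of 3 bytes, then 4 chars per group" rule for padded Base64; it is printed next to the standard library's EncodedLen for comparison.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// paddedLen is a hypothetical helper: the number of characters in
// padded Base64 output for n input bytes, i.e. 4 * ceil(n / 3).
func paddedLen(n int) int {
	return (n + 2) / 3 * 4
}

func main() {
	for _, n := range []int{1, 2, 3, 32, 3000} {
		enc := paddedLen(n)
		fmt.Printf("%4d bytes -> %4d chars (stdlib: %d), ratio %.2f\n",
			n, enc, base64.StdEncoding.EncodedLen(n), float64(enc)/float64(n))
	}
}
```

For small inputs the padding dominates, so the ratio only approaches 4/3 as the input grows.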
Base64 is used when binary data must be stored in or transferred over media that were primarily designed to handle ASCII text. For example, e-mail attachments are sent as Base64-encoded strings.
Base64 is case sensitive.
On Unix systems, crypt() uses a special Base64-like encoding. It uses the alphabet ./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz to encode the hashed password.

Example: #

In binary, "cat" is 01100011 01100001 01110100 (3 bytes).

Regrouped into four 6-bit values, "cat" encodes to Y2F0 in Base64:

011000 110110 000101 110100
|      |      |      |
Y      2      F      0
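
The same regrouping can be sketched in Go. encodeGroup is a hypothetical helper that handles exactly one 3-byte group (no padding logic), using the standard Base64 alphabet from RFC 4648; the result is compared with the standard library encoder.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// Standard Base64 alphabet (RFC 4648, Section 4).
const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

// encodeGroup is a hypothetical helper that encodes exactly three
// bytes into four Base64 characters. It does not handle padding.
func encodeGroup(b [3]byte) string {
	// Pack the three bytes into one 24-bit value.
	n := uint32(b[0])<<16 | uint32(b[1])<<8 | uint32(b[2])
	out := make([]byte, 4)
	for i := 0; i < 4; i++ {
		shift := uint(18 - 6*i) // 18, 12, 6, 0
		out[i] = alphabet[(n>>shift)&0x3F]
	}
	return string(out)
}

func main() {
	fmt.Println(encodeGroup([3]byte{'c', 'a', 't'}))              // manual
	fmt.Println(base64.StdEncoding.EncodeToString([]byte("cat"))) // stdlib
}
```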

Example - Generate random password #

"Language-specific characters are typically avoided by password generators because they would not be universally available (US keyboards don't have accented characters, for instance). So don't take their omission from these tools as an indication that they might be weak or problematic." (from the Q&A thread "Is it bad to use special characters in passwords?")

package main
import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"log"
)

func main() {
	buf := make([]byte, 32)
	_, err := rand.Read(buf)
	if err != nil {
		log.Fatalf("error while generating random string: %s", err)
	}
	// fmt.Println(string(buf)) // raw random bytes are not printable

	// Base64 maps the 32 random bytes to 44 printable ASCII characters.
	printablePassword := base64.StdEncoding.EncodeToString(buf)
	fmt.Println("generated password", printablePassword)
}

Base64Url #

RFC 4648, Section 5: Base 64 Encoding with URL and Filename Safe Alphabet
Standard Base64 uses + and / for the last 2 characters, and = for padding.
Base64Url uses - and _ for the last 2 characters, and makes padding optional.

Usage #

If the Base64-encoded text needs to be transmitted or saved where +, /, or = have special meaning (for example in URLs, where all three do), then it is better to use Base64Url.
If the Base64-encoded text needs to be transmitted/saved where - or _ have special meaning, then it is better to use Standard Base64.
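
The difference between the alphabets is easiest to see on input chosen to hit the last two alphabet positions. The sketch below uses Go's standard encoders (StdEncoding, URLEncoding, and the unpadded RawURLEncoding); encodeAll is a hypothetical wrapper, and the sample bytes are arbitrary values picked for illustration.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeAll is a hypothetical wrapper that encodes data with
// standard Base64, padded Base64Url, and unpadded Base64Url.
func encodeAll(data []byte) (std, url, rawURL string) {
	std = base64.StdEncoding.EncodeToString(data)
	url = base64.URLEncoding.EncodeToString(data)
	rawURL = base64.RawURLEncoding.EncodeToString(data)
	return std, url, rawURL
}

func main() {
	// Arbitrary bytes chosen so the output contains alphabet
	// positions 62 and 63 and needs padding.
	data := []byte{0xfb, 0xff, 0xfe, 0xff}
	std, url, rawURL := encodeAll(data)
	fmt.Println("standard:        ", std)
	fmt.Println("url-safe:        ", url)
	fmt.Println("url-safe, no pad:", rawURL)
}
```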
