허프만 코드의 알고리즘

허프만 코딩(Huffman Coding) 기술을 사용하여 데이터를 압축하면 정보 손실 없이 더 작아질 수 있습니다. David Huffman 이후 처음에 누가 만들었나요? 자주 반복되는 문자가 포함된 데이터는 일반적으로 허프만 코딩을 사용하여 압축됩니다.

그리디(Greedy) 알고리즘은 허프만 코딩(Huffman Coding)이라는 것이 잘 알려져 있습니다. 문자에 할당된 코드의 크기는 문자의 빈도에 따라 달라지므로 그리디 알고리즘이라고 합니다. 짧은 길이의 가변 코드는 빈도가 가장 높은 문자에 할당되고, 그 반대의 경우 빈도가 낮은 문자에 할당됩니다. 이는 가변 길이 인코딩을 사용합니다. 즉, 제공된 데이터 스트림의 각 문자에 서로 다른 가변 길이 코드를 제공한다는 의미입니다.

접두사 규칙

기본적으로 이 규칙은 문자에 할당된 코드가 다른 코드의 접두사가 되어서는 안 된다는 것을 명시합니다. 이 규칙이 깨지면 생성된 허프만 트리를 디코딩할 때 다양한 모호성이 나타날 수 있습니다.

더 잘 이해하기 위해 이 규칙의 그림을 살펴보겠습니다. 각 문자에 대해 다음과 같은 코드가 제공됩니다.

 a - 0 b - 1 c - 01

생성된 비트스트림을 001이라고 가정하면 코드를 디코딩하면 다음과 같이 표현될 수 있다.

늑대인가 여우인가

 0 0 1 = aab 0 01 = ac

허프만 코딩 프로세스란 무엇입니까?

허프만 코드는 주로 두 단계를 통해 각 고유 문자에 대해 획득됩니다.

제공된 데이터 스트림의 고유 문자만 사용하여 먼저 허프만 트리를 만듭니다.
둘째, 구성된 허프만 트리를 진행하고 문자에 코드를 할당한 다음 해당 코드를 사용하여 제공된 텍스트를 디코딩해야 합니다.

허프만 코딩을 위한 단계

100만은 0이 몇 개야?

제공된 문자를 사용하여 허프만 트리를 구성하는 데 사용되는 단계

 Input: string str = &apos;abbcdbccdaabbeeebeab&apos;

이 경우 데이터 압축을 위해 허프만 코딩을 사용하는 경우 디코딩을 위해 다음 정보를 결정해야 합니다.

각 캐릭터에 대해 허프만 코드
허프만 인코딩된 메시지 길이(비트), 평균 코드 길이
아래에 설명된 공식을 활용하여 마지막 두 공식을 발견했습니다.

입력 문자로부터 허프만 트리를 어떻게 구성할 수 있습니까?

제공된 문자열에 있는 각 문자의 빈도를 먼저 결정해야 합니다.

성격	빈도
ㅏ	4
비	7
씨	삼
디	2
그것은	4

문자를 빈도별로 오름차순으로 정렬합니다. 이는 Q/min-힙 우선순위 큐에 보관됩니다.
데이터 스트림의 각 고유 문자와 해당 빈도에 대해 리프 노드를 만듭니다.
해당 노드에서 빈도가 가장 낮은 두 노드를 제거하면 이 빈도의 합을 사용하여 트리의 새로운 루트가 생성됩니다.
- 최소 힙에서 빈도가 가장 낮은 노드를 추출하면서 첫 번째 추출된 노드를 왼쪽 자식으로, 두 번째 추출된 노드를 오른쪽 자식으로 만듭니다.
- 최소 힙에 이 노드를 추가합니다.
- 루트의 왼쪽에는 항상 최소 주파수가 포함되어야 하기 때문입니다.
힙에 노드가 하나만 남거나 모든 문자가 트리의 노드로 표시될 때까지 3단계와 4단계를 반복합니다. 루트 노드만 남으면 트리가 완성됩니다.

허프만 코딩의 예

알고리즘을 설명하기 위해 그림을 사용하겠습니다.

허프만 코딩 알고리즘

1 단계: 각 노드가 단일 노드가 있는 트리의 루트를 나타내고 5(제공된 데이터 스트림의 고유 문자 수)를 보유하는 최소 힙을 구축합니다.

2 단계: 2단계의 최소 힙에서 두 개의 최소 주파수 노드를 얻습니다. 추출된 두 노드를 결합하여 생성된 세 번째 내부 노드인 빈도 2 + 3 = 5를 추가합니다.

이제 최소 힙에는 4개의 노드가 있습니다. 그 중 3개는 각각 단일 요소가 있는 트리의 루트이고, 그 중 1개는 두 개의 요소가 있는 트리의 루트입니다.

3단계: 3단계에서 비슷한 방식으로 힙에서 두 개의 최소 주파수 노드를 가져옵니다. 추가적으로 추출된 두 개의 노드를 결합하여 형성된 새로운 내부 노드를 추가합니다. 트리에서의 빈도는 4 + 4 = 8이어야 합니다.

knn

이제 최소 힙에는 세 개의 노드가 있으므로 한 노드는 단일 요소가 있는 트리의 루트 역할을 하고 두 개의 힙 노드는 여러 노드가 있는 트리의 루트 역할을 합니다.

4단계: 4단계에서 두 개의 최소 주파수 노드를 가져옵니다. 추가적으로 추출된 두 개의 노드를 결합하여 형성된 새로운 내부 노드를 추가합니다. 트리에서의 빈도는 5 + 7 = 12여야 합니다.

허프만 트리를 만들 때 최소값은 항상 왼쪽에 있고 두 번째 값은 항상 오른쪽에 있는지 확인해야 합니다. 현재 아래 이미지는 형성된 트리를 보여줍니다.

5단계: 5단계에서 다음 두 개의 최소 주파수 노드를 가져옵니다. 또한 추출된 두 개의 노드를 결합하여 형성된 새 내부 노드를 추가합니다. 트리에서의 빈도는 12 + 8 = 20이어야 합니다.

뷰와 테이블

모든 개별 문자가 트리에 추가될 때까지 계속합니다. 지정된 캐릭터 캐스트에 대해 생성된 허프만 트리가 위 이미지에 표시됩니다.

이제 리프가 아닌 각 노드에 대해 왼쪽 가장자리에 0을 할당하고 오른쪽 가장자리에 1을 할당하여 각 문자에 대한 코드를 만듭니다.

간선 가중치를 결정하기 위해 따라야 할 규칙:

왼쪽 가장자리에 가중치를 0으로 주면 오른쪽 가장자리에 가중치를 1로 주어야 합니다.
왼쪽 가장자리에 가중치 1이 지정되면 오른쪽 가장자리에 가중치 0이 지정되어야 합니다.
앞서 언급한 두 가지 규칙 중 하나를 사용할 수 있습니다.
그러나 트리를 디코딩할 때도 동일한 프로토콜을 따르십시오.

가중치를 적용한 후 수정된 트리는 다음과 같이 표시됩니다.

자바를 두 배로 늘리는 정수

코드 이해

결과 허프만 트리에서 각 문자에 대한 허프만 코드를 디코딩하려면 요소가 존재하는 리프 노드에 도달할 때까지 허프만 트리를 통과해야 합니다.
노드 전체의 가중치는 순회 중에 기록되어야 하며 특정 리프 노드에 있는 항목에 할당되어야 합니다.
다음 예는 우리가 의미하는 바를 더 자세히 설명하는 데 도움이 될 것입니다.
위 그림의 각 문자에 대한 코드를 얻으려면 전체 트리를 탐색해야 합니다(모든 리프 노드가 덮일 때까지).
결과적으로 생성된 트리는 각 노드의 코드를 디코딩하는 데 사용됩니다. 다음은 각 캐릭터의 코드 목록입니다.

성격	빈도/횟수	암호
ㅏ	4	01
비	7	열하나
씨	삼	101
디	2	100
그것은	4	00

다음은 C 프로그래밍의 구현입니다.

 // C program for Huffman Coding #include #include // This constant can be avoided by explicitly // calculating height of Huffman Tree #define MAX_TREE_HT 100 // A Huffman tree node struct MinHeapNode { // One of the input characters char data; // Frequency of the character unsigned freq; // Left and right child of this node struct MinHeapNode *left, *right; }; // A Min Heap: Collection of // min-heap (or Huffman tree) nodes struct MinHeap { // Current size of min heap unsigned size; // capacity of min heap unsigned capacity; // Array of minheap node pointers struct MinHeapNode** array; }; // A utility function allocate a new // min heap node with given character // and frequency of the character struct MinHeapNode* newNode(char data, unsigned freq) { struct MinHeapNode* temp = (struct MinHeapNode*)malloc( sizeof(struct MinHeapNode)); temp-&gt;left = temp-&gt;right = NULL; temp-&gt;data = data; temp-&gt;freq = freq; return temp; } // A utility function to create // a min heap of given capacity struct MinHeap* createMinHeap(unsigned capacity) { struct MinHeap* minHeap = (struct MinHeap*)malloc(sizeof(struct MinHeap)); // current size is 0 minHeap-&gt;size = 0; minHeap-&gt;capacity = capacity; minHeap-&gt;array = (struct MinHeapNode**)malloc( minHeap-&gt;capacity * sizeof(struct MinHeapNode*)); return minHeap; } // A utility function to // swap two min heap nodes void swapMinHeapNode(struct MinHeapNode** a, struct MinHeapNode** b) { struct MinHeapNode* t = *a; *a = *b; *b = t; } // The standard minHeapify function. void minHeapify(struct MinHeap* minHeap, int idx) { int smallest = idx; int left = 2 * idx + 1; int right = 2 * idx + 2; if (left size &amp;&amp; minHeap-&gt;array[left]-&gt;freq array[smallest]-&gt;freq) smallest = left; if (right size &amp;&amp; minHeap-&gt;array[right]-&gt;freq array[smallest]-&gt;freq) smallest = right; if (smallest != idx) { swapMinHeapNode(&amp;minHeap-&gt;array[smallest], &amp;minHeap-&gt;array[idx]); minHeapify(minHeap, smallest); } } // A utility function to check // if size of heap is 1 or not int isSizeOne(struct MinHeap* minHeap) { return (minHeap-&gt;size == 1); } // A standard function to extract // minimum value node from heap struct MinHeapNode* extractMin(struct MinHeap* minHeap) { struct MinHeapNode* temp = minHeap-&gt;array[0]; minHeap-&gt;array[0] = minHeap-&gt;array[minHeap-&gt;size - 1]; --minHeap-&gt;size; minHeapify(minHeap, 0); return temp; } // A utility function to insert // a new node to Min Heap void insertMinHeap(struct MinHeap* minHeap, struct MinHeapNode* minHeapNode) { ++minHeap-&gt;size; int i = minHeap-&gt;size - 1; while (i &amp;&amp; minHeapNode-&gt;freq array[(i - 1) / 2]-&gt;freq) { minHeap-&gt;array[i] = minHeap-&gt;array[(i - 1) / 2]; i = (i - 1) / 2; } minHeap-&gt;array[i] = minHeapNode; } // A standard function to build min heap void buildMinHeap(struct MinHeap* minHeap) { int n = minHeap-&gt;size - 1; int i; for (i = (n - 1) / 2; i &gt;= 0; --i) minHeapify(minHeap, i); } // A utility function to print an array of size n void printArr(int arr[], int n) { int i; for (i = 0; i left) &amp;&amp; !(root-&gt;right); } // Creates a min heap of capacity // equal to size and inserts all character of // data[] in min heap. Initially size of // min heap is equal to capacity struct MinHeap* createAndBuildMinHeap(char data[], int freq[], int size) { struct MinHeap* minHeap = createMinHeap(size); for (int i = 0; i array[i] = newNode(data[i], freq[i]); minHeap-&gt;size = size; buildMinHeap(minHeap); return minHeap; } // The main function that builds Huffman tree struct MinHeapNode* buildHuffmanTree(char data[], int freq[], int size) { struct MinHeapNode *left, *right, *top; // Step 1: Create a min heap of capacity // equal to size. Initially, there are // modes equal to size. struct MinHeap* minHeap = createAndBuildMinHeap(data, freq, size); // Iterate while size of heap doesn&apos;t become 1 while (!isSizeOne(minHeap)) { // Step 2: Extract the two minimum // freq items from min heap left = extractMin(minHeap); right = extractMin(minHeap); // Step 3: Create a new internal // node with frequency equal to the // sum of the two nodes frequencies. // Make the two extracted node as // left and right children of this new node. // Add this node to the min heap // &apos;$&apos; is a special value for internal nodes, not // used top = newNode(&apos;$&apos;, left-&gt;freq + right-&gt;freq); top-&gt;left = left; top-&gt;right = right; insertMinHeap(minHeap, top); } // Step 4: The remaining node is the // root node and the tree is complete. return extractMin(minHeap); } // Prints huffman codes from the root of Huffman Tree. // It uses arr[] to store codes void printCodes(struct MinHeapNode* root, int arr[], int top) { // Assign 0 to left edge and recur if (root-&gt;left) { arr[top] = 0; printCodes(root-&gt;left, arr, top + 1); } // Assign 1 to right edge and recur if (root-&gt;right) { arr[top] = 1; printCodes(root-&gt;right, arr, top + 1); } // If this is a leaf node, then // it contains one of the input // characters, print the character // and its code from arr[] if (isLeaf(root)) { printf(&apos;%c: &apos;, root-&gt;data); printArr(arr, top); } } // The main function that builds a // Huffman Tree and print codes by traversing // the built Huffman Tree void HuffmanCodes(char data[], int freq[], int size) { // Construct Huffman Tree struct MinHeapNode* root = buildHuffmanTree(data, freq, size); // Print Huffman codes using // the Huffman tree built above int arr[MAX_TREE_HT], top = 0; printCodes(root, arr, top); } // Driver code int main() { char arr[] = { &apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;, &apos;e&apos;, &apos;f&apos; }; int freq[] = { 5, 9, 12, 13, 16, 45 }; int size = sizeof(arr) / sizeof(arr[0]); HuffmanCodes(arr, freq, size); return 0; }

산출

 f: 0 c: 100 d: 101 a: 1100 b: 1101 e: 111 &#x2026;&#x2026;&#x2026;&#x2026;&#x2026; Process executed in 1.11 seconds Press any key to continue.

위 코드의 Java 구현:

 import java.util.Comparator; import java.util.PriorityQueue; import java.util.Scanner; class Huffman { // recursive function to print the // huffman-code through the tree traversal. // Here s is the huffman - code generated. public static void printCode(HuffmanNode root, String s) { // base case; if the left and right are null // then its a leaf node and we print // the code s generated by traversing the tree. if (root.left == null &amp;&amp; root.right == null &amp;&amp; Character.isLetter(root.c)) { // c is the character in the node System.out.println(root.c + &apos;:&apos; + s); return; } // if we go to left then add &apos;0&apos; to the code. // if we go to the right add&apos;1&apos; to the code. // recursive calls for left and // right sub-tree of the generated tree. printCode(root.left, s + &apos;0&apos;); printCode(root.right, s + &apos;1&apos;); } // main function public static void main(String[] args) { Scanner s = new Scanner(System.in); // number of characters. int n = 6; char[] charArray = { &apos;a&apos;, &apos;b&apos;, &apos;c&apos;, &apos;d&apos;, &apos;e&apos;, &apos;f&apos; }; int[] charfreq = { 5, 9, 12, 13, 16, 45 }; // creating a priority queue q. // makes a min-priority queue(min-heap). PriorityQueue q = new PriorityQueue( n, new MyComparator()); for (int i = 0; i <n; i++) { creating a huffman node object and add it to the priority queue. huffmannode hn="new" huffmannode(); hn.c="charArray[i];" hn.data="charfreq[i];" hn.left="null;" hn.right="null;" functions adds q.add(hn); } create root here we will extract two minimum value from heap each time until its size reduces 1, all nodes are extracted. while (q.size()> 1) { // first min extract. HuffmanNode x = q.peek(); q.poll(); // second min extract. HuffmanNode y = q.peek(); q.poll(); // new node f which is equal HuffmanNode f = new HuffmanNode(); // to the sum of the frequency of the two nodes // assigning values to the f node. f.data = x.data + y.data; f.c = &apos;-&apos;; // first extracted node as left child. f.left = x; // second extracted node as the right child. f.right = y; // marking the f node as the root node. root = f; // add this node to the priority-queue. q.add(f); } // print the codes by traversing the tree printCode(root, &apos;&apos;); } } // node class is the basic structure // of each node present in the Huffman - tree. class HuffmanNode { int data; char c; HuffmanNode left; HuffmanNode right; } // comparator class helps to compare the node // on the basis of one of its attribute. // Here we will be compared // on the basis of data values of the nodes. class MyComparator implements Comparator { public int compare(HuffmanNode x, HuffmanNode y) { return x.data - y.data; } } </n;>

산출

 f: 0 c: 100 d: 101 a: 1100 b: 1101 e: 111 &#x2026;&#x2026;&#x2026;&#x2026;&#x2026;&#x2026;. Process executed in 1.11 seconds Press any key to continue.

설명:

탐색을 통해 허프만 트리가 생성되고 디코딩됩니다. 순회 중에 수집된 값은 리프 노드에 있는 캐릭터에 적용됩니다. 제공된 데이터 스트림의 모든 고유 문자는 이러한 방식으로 허프만 코드를 사용하여 식별될 수 있습니다. O(nlogn)(여기서 n은 총 문자 수)은 시간 복잡도입니다. n개의 노드가 있는 경우 ExtractMin()은 2*(n - 1)번 호출됩니다. extractMin()은 minHeapify()를 호출하므로 실행 시간은 O(logn)이다. 따라서 전체 복잡도는 O(nlogn)입니다. 입력 배열이 정렬된 경우 선형 시간 알고리즘이 있습니다. 이에 대해서는 다음 글에서 더 자세히 다룰 예정입니다.

허프만 코딩 문제

이 부분에서는 허프만 코딩의 단점과 이것이 항상 최선의 선택이 아닌 이유에 대해 이야기해 보겠습니다.

모든 캐릭터의 확률이나 빈도가 2의 음의 거듭제곱이 아닌 경우 이상적인 것으로 간주되지 않습니다.
기호를 그룹화하고 알파벳을 확장하면 이상에 가까워질 수 있지만 차단 방법에서는 더 큰 알파벳을 처리해야 합니다. 따라서 허프만 코딩이 항상 효과적이지는 않을 수 있습니다.
각 기호나 문자의 빈도를 계산하는 효과적인 방법은 많지만 각각에 대해 전체 트리를 재구성하는 데는 시간이 많이 걸릴 수 있습니다. 알파벳이 크고 각 기호에 따라 확률 분포가 빠르게 변하는 경우가 일반적입니다.

탐욕스러운 허프만 코드 구성 알고리즘

Huffman은 입력 데이터 스트림의 각 개별 문자에 대해 이상적인 접두사 코드인 Huffman 코드를 생성하는 그리디 기술을 개발했습니다.
이 접근 방식은 매번 가장 적은 수의 노드를 사용하여 아래에서 위로 허프만 트리를 생성합니다.
각 문자는 주어진 데이터 스트림에 나타나는 빈도에 따라 코드 길이를 수신하므로 이 방법을 탐욕적 접근 방식이라고 합니다. 검색된 코드의 크기가 작을 경우 데이터에서 일반적으로 발생하는 요소입니다.

허프만 코딩의 활용

여기서는 허프만 코딩의 몇 가지 실제 용도에 대해 이야기하겠습니다.
PKZIP, GZIP 등과 같은 기존 압축 형식은 일반적으로 허프만 코딩을 사용합니다.
허프만 코딩은 파일 크기를 최소화하고 전송 속도를 높이기 때문에 팩스 및 텍스트를 통한 데이터 전송에 사용됩니다.
허프만 인코딩(특히 접두사 코드)은 JPEG, PNG 및 MP3를 포함한 여러 멀티미디어 저장 형식에서 파일을 압축하는 데 사용됩니다.
허프만 코딩은 주로 이미지 압축에 사용됩니다.
자주 반복되는 문자열을 전송해야 하는 경우에는 이것이 더 도움이 될 수 있습니다.

결론

일반적으로 허프만 코딩은 자주 발생하는 문자가 포함된 데이터를 압축하는 데 유용합니다.
가장 자주 발생하는 문자의 코드가 가장 짧고, 가장 적게 발생하는 문자의 코드가 가장 크다는 것을 알 수 있습니다.
허프만 코드 압축 기술은 각 문자나 기호에 대해 다양한 양의 비트를 사용하는 가변 길이 코딩을 만드는 데 사용됩니다. 이 방법은 고정 길이 코딩보다 메모리 사용량이 적고 데이터 전송 속도가 빠르기 때문에 우수합니다.
그리디 알고리즘에 대한 더 나은 지식을 얻으려면 이 기사를 읽어보세요.