Package-merge algorithm: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Oravec (talk | contribs)
Added Reference Section
m Disambiguating links to Code word (link changed to Code word (communication)) using DisamAssist.
 
(36 intermediate revisions by 20 users not shown)
Line 1: Line 1:
The '''package-merge algorithm''' is an ''[[Big O notation|O]](nL)''-time algorithm for finding an optimal [[length-limited Huffman code]] for a given distribution on a given alphabet of size ''n'', where no [[Code word (communication)|code word]] is longer than ''L''. It is a [[greedy algorithm]], and a generalization of [[Huffman coding|Huffman's original algorithm]]. Package-merge works by reducing the code construction problem to the binary ''[[Coupon collector's problem|coin collector's problem]]''.<ref name="LaHi">{{cite journal |first1=Lawrence L. |last1=Larmore |author-link1=Lawrence L. Larmore |first2=Daniel S. |last2=Hirschberg |author-link2=Daniel S. Hirschberg |title=A fast algorithm for optimal length-limited Huffman codes |journal=[[Journal of the Association for Computing Machinery]] |volume=37 |number=3 |pages=464–473 |date=1990 |doi=10.1145/79147.79150|s2cid=11696729 |doi-access=free }}</ref>
{{importance}}
{{verify}}


==The coin collector's problem==

Suppose a coin collector has a number of coins of various denominations, each of which has a [[numismatic value]] unrelated to its denomination. The coin collector has run out of money and needs to use some of his coin collection to buy something of cost ''N''. He wishes to select a subset of coins from his collection of minimum numismatic value whose denominations total ''N''.
The '''Package-Merge Algorithm''' is an '''O(nL)'''-time algorithm for finding an
optimal [[Huffman coding#Length-limited_Huffman_coding|length-limited Huffman code]] for a given distribution on a given alphabet of size '''n''', where no code string is permitted to have length more than '''L'''. It is greedy algorithm which is a generalization of [[Huffman coding|Huffman's original algorithm]]. Package-merge works by reducing the problem to the binary '''Coin Collector's problem'''.
==Coin Collector's Problem==
Suppose a coin collector has a number of coins of various denominations, each of which has a [[Numismatics|numismatic value]]. The coin collector has run out
of money and needs to use some of his coin collection to buy something of cost '''N'''. He wishes to select a subset of coins from his collection of minimum numismatic value whose denominations total '''N'''.
The binary version of this problem is that all denominations are powers of 2, that is, 1, 1/2, 1/4, etc. dollars.
The binary version of this problem is that all denominations are powers of 2, that is, 1, 1/2, 1/4, etc. dollars.

==Description of the Algorithm==
==Description of the package-merge algorithm==
Assume that the largest denomination is 1 dollar, and that N is an integer. (The algorithm works even if these assumptions do not hold, by trivial modifications.) The coin collector first separates his coins into lists, one for each denomination, sorted by numismatic value. He then '''packages''' the smallest denomination coins in pairs, starting from the pair of smallest total numismatic value. If there is one coin left over, it will be the coin of highest numismatic value of that denomination, and it is set aside and ignored henceforth. These packages are then '''merged''' into the list of coins of the next smallest denomination, again in order of numismatic value. The items in that list are then '''packaged''' in pairs, and merged into the next smallest list, and so forth.
Assume that the largest denomination is 1 dollar, and that N is an integer. (The algorithm works even if these assumptions do not hold, by trivial modifications.) The coin collector first separates his coins into lists, one for each denomination, sorted by numismatic value. He then '''packages''' the smallest denomination coins in pairs, starting from the pair of smallest total numismatic value. If there is one coin left over, it will be the coin of highest numismatic value of that denomination, and it is set aside and ignored henceforth. These packages are then '''merged''' into the list of coins of the next smallest denomination, again in order of numismatic value. The items in that list are then '''packaged''' in pairs, and merged into the next smallest list, and so forth.
Finally, there is a list of items, each of which is a 1 dollar coin or a package consisting of two or more smaller coins whose denominations total 1 dollar. They are also sorted in order of numismatic value. The coin collector then selects the least value N of them.
Finally, there is a list of items, each of which is a 1 dollar coin or a package consisting of two or more smaller coins whose denominations total 1 dollar. They are also sorted in order of numismatic value. The coin collector then selects the least value N of them.
Note that the time of the algorithm is linear in the number of coins.
Note that the time of the algorithm is linear in the number of coins.

==Reduction of length-limited Huffman coding to the coin collector's problem==
==The Reduction==
Let ''L'' be the maximum length any code word is permitted to have.
Let ''p''<sub>1</sub>,&nbsp;…,&nbsp;''p<sub>n</sub>'' be the frequencies of the
Let '''L''' be the maximum length of any code string is permitted to have.
symbols of the alphabet to be encoded. We first sort the symbols so that ''p''<sub>''i''</sub>&nbsp;≤&nbsp;''p''<sub>''i''+1</sub>. Create ''L'' coins for each symbol, of denominations 2<sup>&minus;1</sup>,&nbsp;…,&nbsp;2<sup>&minus;''L''</sup>, each of numismatic value ''p<sub>i</sub>''. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total ''n''&nbsp;&minus;&nbsp;1. Let ''h<sub>i</sub>'' be the number of coins of numismatic value ''p<sub>i</sub>'' selected.
Let '''p<sub>1</sub>''', ... '''p<sub>n</sub>''' be the frequencies of the
The optimal length-limited Huffman code will encode symbol ''i'' with a bit string of length ''h<sub>i</sub>''. The [[canonical Huffman code]] can easily be constructed by a simple bottom-up greedy method, given that the ''h<sub>i</sub>'' are known, and this can be the basis for fast [[data compression]].<ref name="MoTu">{{cite journal |first1=Alistair |last1=Moffat |author-link1=Alistair Moffat (computer scientist) |first2=Andrew |last2=Turpin |title=On the implementation of minimum redundancy prefix codes |journal=[[IEEE Transactions on Communications]] |volume=45 |number=10 |pages=1200–1207 |date=Oct 1997 |doi=10.1109/26.634683}}</ref>
symbols of the alphabet that needs to be encoded. We first sort the symbols so that '''p<sub>i</sub> <= p<sub>i+1</sub>'''. Create '''L''' coins for each symbol, of denominations '''2<sup>-1</sup> ... 2<sup>-L</sup>''', each of numismatic value '''p<sub>i</sub>'''. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total '''n-1'''. Let '''h<sub>i</sub>''' be the number of coins of numismatic value '''p<sub>i</sub>''' selected.

==Performance improvements and generalizations==
The optimal length-limited Huffman code will encode symbol '''i''' with a bit string of length '''h<sub>i</sub>'''. The actual Huffman tree can easily be constructed by a simple bottom-up greedy method, given that the '''h<sub>i</sub>''' are known.
With this reduction, the algorithm is ''O(nL)''-time and ''O(nL)''-space. However, the original paper, "''A fast algorithm for optimal length-limited Huffman codes''", shows how this can be improved to ''O(nL)''-time and ''O(n)''-space. The idea is to run the algorithm a first time, only keeping enough data to be able to determine two equivalent subproblems that sum to half the size of the original problem. This is done recursively, resulting in an algorithm that takes about twice as long but requires only linear space.<ref name="LaHi"/>

==External Links==
Many other improvements have been made to the package-merge algorithm to reduce the [[Big O notation|multiplicative constant]] and to make it faster in special cases, such as those problems having repeated ''p<sub>i</sub>''s.<ref name="MG">{{cite book |first1=Ian H. |last1=Witten |author-link1=Ian H. Witten | first2=Alistair |last2=Moffat |author-link2=Alistair Moffat (computer scientist) |first3=Timothy Clinton |last3=Bell |author-link3=Timothy Clinton Bell |title=Managing Gigabytes: Compressing and indexing documents and images |edition=2 |publisher=[[Morgan Kaufmann Publishers]] |date=1999 |isbn=978-1-55860-570-1 |id=1558605703}}</ref> The package-merge approach has also been adapted to related problems such as [[alphabetic Huffman coding|alphabetic coding]].<ref name="LaPr">{{cite journal |first1=Lawrence L. |last1=Larmore |author-link1=Lawrence L. Larmore |first2=Teresa M. |last2=Przytycka|author2-link=Teresa Przytycka |title=A Fast Algorithm for Optimal Height-Limited Alphabetic Binary-Trees |journal=[[SIAM Journal on Computing]] |volume=23 |number=6 |pages=1283–1312 |date=1994 |doi=10.1137/s0097539792231167}}</ref>
* Michael B. Baer "[http://132.236.180.11/pdf/cs.IT/0602085 Twenty (or so) Questions: Bounded-Length Huffman Coding.]" (PDF)

* Alistair Moffit and Andrew Turpin "[http://www.cs.mu.oz.au/~alistair/abstracts/dcc95.pm.html Space-Efficient Construction of Optimal Prefix Codes.]"
Methods involving [[graph theory]] have been shown to have better asymptotic space complexity than the package-merge algorithm, but these have not seen as much practical application.


==References==
==References==
<references/>
* [[Lawrence L. Larmore | L. L. Larmore]] and [[Dan Hirschberg | D. S. Hirschberg]]. A fast algorithm for optimal length-limited Huffman codes. Journal of the Association for Computing Machinery, V 37 No. 3:464--473, 1990.


==External links==
* {{cite arXiv |first=Michael B. |last=Baer |eprint=cs.IT/0602085 |title=Twenty (or so) Questions: ''D''-ary Length-Bounded Prefix Coding|year=2006 }}<!-- Please do not add as a reference; it has not yet been peer-reviewed, but it does survey and expand upon the techniques and literature on this subject -->
* {{cite conference |first1=Alistair |last1=Moffat |first2=Andrew |author-link1=Alistair Moffat (computer scientist) |last2=Turpin |first3=Jyrki |last3=Katajainen |title=Space-Efficient Construction of Optimal Prefix Codes |conference=IEEE Data Compression Conference |date=March 1995 |location=Snowbird, Utah, USA |doi=10.1109/DCC.1995.515509}}
* An implementation of the package-merge algorithm "[http://sourceforge.net/p/tpabbrevia/code/26/tree/trunk/source/AbDfPkMg.pas]"
* A fast entropy coder that uses package-merge algorithm [https://github.com/algorithm314/FPC]


[[Category:Lossless compression algorithms]]
{{Stub}}
[[Category:Coding theory]]

Latest revision as of 01:40, 24 October 2023

The package-merge algorithm is an O(nL)-time algorithm for finding an optimal length-limited Huffman code for a given distribution on a given alphabet of size n, where no code word is longer than L. It is a greedy algorithm, and a generalization of Huffman's original algorithm. Package-merge works by reducing the code construction problem to the binary coin collector's problem.[1]

The coin collector's problem[edit]

Suppose a coin collector has a number of coins of various denominations, each of which has a numismatic value unrelated to its denomination. The coin collector has run out of money and needs to use some of his coin collection to buy something of cost N. He wishes to select a subset of coins from his collection of minimum numismatic value whose denominations total N.

The binary version of this problem is that all denominations are powers of 2, that is, 1, 1/2, 1/4, etc. dollars.

Description of the package-merge algorithm[edit]

Assume that the largest denomination is 1 dollar, and that N is an integer. (The algorithm works even if these assumptions do not hold, by trivial modifications.) The coin collector first separates his coins into lists, one for each denomination, sorted by numismatic value. He then packages the smallest denomination coins in pairs, starting from the pair of smallest total numismatic value. If there is one coin left over, it will be the coin of highest numismatic value of that denomination, and it is set aside and ignored henceforth. These packages are then merged into the list of coins of the next smallest denomination, again in order of numismatic value. The items in that list are then packaged in pairs, and merged into the next smallest list, and so forth.

Finally, there is a list of items, each of which is a 1 dollar coin or a package consisting of two or more smaller coins whose denominations total 1 dollar. They are also sorted in order of numismatic value. The coin collector then selects the least value N of them.

Note that the time of the algorithm is linear in the number of coins.

Reduction of length-limited Huffman coding to the coin collector's problem[edit]

Let L be the maximum length any code word is permitted to have. Let p1, …, pn be the frequencies of the symbols of the alphabet to be encoded. We first sort the symbols so that pi ≤ pi+1. Create L coins for each symbol, of denominations 2−1, …, 2L, each of numismatic value pi. Use the package-merge algorithm to select the set of coins of minimum numismatic value whose denominations total n − 1. Let hi be the number of coins of numismatic value pi selected. The optimal length-limited Huffman code will encode symbol i with a bit string of length hi. The canonical Huffman code can easily be constructed by a simple bottom-up greedy method, given that the hi are known, and this can be the basis for fast data compression.[2]

Performance improvements and generalizations[edit]

With this reduction, the algorithm is O(nL)-time and O(nL)-space. However, the original paper, "A fast algorithm for optimal length-limited Huffman codes", shows how this can be improved to O(nL)-time and O(n)-space. The idea is to run the algorithm a first time, only keeping enough data to be able to determine two equivalent subproblems that sum to half the size of the original problem. This is done recursively, resulting in an algorithm that takes about twice as long but requires only linear space.[1]

Many other improvements have been made to the package-merge algorithm to reduce the multiplicative constant and to make it faster in special cases, such as those problems having repeated pis.[3] The package-merge approach has also been adapted to related problems such as alphabetic coding.[4]

Methods involving graph theory have been shown to have better asymptotic space complexity than the package-merge algorithm, but these have not seen as much practical application.

References[edit]

  1. ^ a b Larmore, Lawrence L.; Hirschberg, Daniel S. (1990). "A fast algorithm for optimal length-limited Huffman codes". Journal of the Association for Computing Machinery. 37 (3): 464–473. doi:10.1145/79147.79150. S2CID 11696729.
  2. ^ Moffat, Alistair; Turpin, Andrew (Oct 1997). "On the implementation of minimum redundancy prefix codes". IEEE Transactions on Communications. 45 (10): 1200–1207. doi:10.1109/26.634683.
  3. ^ Witten, Ian H.; Moffat, Alistair; Bell, Timothy Clinton (1999). Managing Gigabytes: Compressing and indexing documents and images (2 ed.). Morgan Kaufmann Publishers. ISBN 978-1-55860-570-1. 1558605703.
  4. ^ Larmore, Lawrence L.; Przytycka, Teresa M. (1994). "A Fast Algorithm for Optimal Height-Limited Alphabetic Binary-Trees". SIAM Journal on Computing. 23 (6): 1283–1312. doi:10.1137/s0097539792231167.

External links[edit]

  • Baer, Michael B. (2006). "Twenty (or so) Questions: D-ary Length-Bounded Prefix Coding". arXiv:cs.IT/0602085.
  • Moffat, Alistair; Turpin, Andrew; Katajainen, Jyrki (March 1995). Space-Efficient Construction of Optimal Prefix Codes. IEEE Data Compression Conference. Snowbird, Utah, USA. doi:10.1109/DCC.1995.515509.
  • An implementation of the package-merge algorithm "[1]"
  • A fast entropy coder that uses package-merge algorithm [2]