reorganize module hierarchy

2025-01-13 20:22:52 +08:00 · 2022-03-29 19:41:01 +08:00 · 2022-03-29 19:41:01 +08:00 · 6fabfeaf2c
commit 6fabfeaf2c
parent b02e076566
29 changed files with 1749 additions and 2133 deletions
--- a/README.md
+++ b/README.md
@@ -1,126 +1,143 @@
-![test](https://img.shields.io/badge/test-passing-green.svg)
-![docs](https://img.shields.io/badge/docs-passing-green.svg)
-![platform](https://img.shields.io/badge/platform-Quartus|Vivado-blue.svg)
-
+![language](https://img.shields.io/badge/语言-systemverilog_(IEEE1800_2005)-CAD09D.svg)![simulation](https://img.shields.io/badge/仿真-iverilog-green.svg)![deployment](https://img.shields.io/badge/部署-quartus-blue.svg)![deployment](https://img.shields.io/badge/部署-vivado-FF1010.svg)

 Hard-PNG
 ===========================
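-An **FPGA**-based streaming **png** image decoder
+An FPGA-based streaming **png** image decoder: it takes a png byte stream as input and outputs raw pixels

+* Supports image widths below 4000 pixels; the height is unlimited.
+* **Supports all color types**: grayscale, grayscale+A, RGB, indexed RGB, and RGB+A.
+* Supports 8-bit depth only (most png images use 8-bit depth).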

-
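-# Features
-* Supports png images up to **4000 pixels** wide, with no limit on image height.
-* **Supports all color types**: grayscale, grayscale with alpha, RGB, indexed RGB, and RGBA.
-* Supports **8-bit depth** only; most png images use **8-bit depth**.
-* Implemented entirely in **SystemVerilog**, which makes it easy to port and simulate.
-
-| ![block diagram](./images/blockdiagram.png) |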
+| ![diagram](./figures/diagram.png) |
 | :----: |
-| **Figure 1**: Hard-PNG block diagram |
+| Figure 1: Hard-PNG block diagram |
+
+

 # Background

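-**png** is the second most common image compression format after **jpg**. Compared with **jpg**, **png** supports a transparency channel and lossless compression. On colorful digital photos, lossless **png** achieves a compression ratio of only **1x to 4x**, while low-distortion lossy **png** achieves **4x to 20x**. On synthetic images with few colors (for example block diagrams and graphic designs), lossless **png** achieves a compression ratio above **10x**. **png** is therefore better suited to synthetic images, and **jpg** to digital photos.
+png is the second most common image compression format after jpg. png supports a transparency channel (the A channel), lossless compression, and indexed RGB (palette-based lossy compression). On colorful digital photos, png achieves a compression ratio of only 1x to 4x. On synthetic images (for example graphic designs), png achieves a compression ratio above 10x.
+
+png image files use the .png extension. Take SIM/test_image/img01.png in this repository as an example: it contains 98 bytes, and these 98 bytes are called the png byte stream. The bytes can be viewed with [WinHex](http://www.x-ways.net/winhex/):

-**png** image files use the **.png** extension. Take the provided file [**test1.png**](./images/test1.png) as an example: it contains **98 bytes**, called the **raw byte stream**. It can be viewed with [**WinHex**](http://www.x-ways.net/winhex/):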
 ```
 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, ...... , 0xAE, 0x42, 0x60, 0x82
 ```
-Decompressed, this image file has only **4 columns and 2 rows**, **8 pixels** in total, shown in hexadecimal in the table below. R, G, B, and A denote the **red**, **green**, **blue**, and **alpha** channels of each pixel.
+Decoding this png byte stream produces the raw pixels. The image is small, only 4 columns by 2 rows (8 pixels in total); the pixels are shown in hexadecimal in the table below, where R, G, B, and A denote the red, green, blue, and alpha channels.

 |          | Column 1 | Column 2 | Column 3 | Column 4 |
 | :---:    | :---: | :---: | :---: | :---: |
-| **Row 1** | R:**FF** G:**F2** B:**00** A:**FF** | R:**ED** G:**1C** B:**24** A:**FF** | R:**00** G:**00** B:**00** A:**FF** | R:**3F** G:**48** B:**CC** A:**FF** |
-| **Row 2** | R:**7F** G:**7F** B:**7F** A:**FF** | R:**ED** G:**1C** B:**24** A:**FF** | R:**FF** G:**FF** B:**FF** A:**FF** | R:**FF** G:**AE** B:**CC** A:**FF** |
+| Row 1 | R:FF G:F2 B:00 A:FF | R:ED G:1C B:24 A:FF | R:00 G:00 B:00 A:FF | R:3F G:48 B:CC A:FF |
+| Row 2 | R:7F G:7F B:7F A:FF | R:ED G:1C B:24 A:FF | R:FF G:FF B:FF A:FF | R:FF G:AE B:C9 A:FF |

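-# Using Hard-PNG

-**Hard-PNG** is a hardware module that takes the **raw byte stream** as input and outputs the **decompressed pixels**. Its code is in [**hard_png.sv**](./hard_png.sv), where **hard_png** is the top-level module; its interface is shown in **Figure 2**.

-| ![interface diagram](./images/interface.png) |
+# Using Hard-PNG
+
+hard_png.sv in the RTL directory is a module that takes a png byte stream as input and outputs the decoded pixels. Its interface is shown in Figure 2.
+
+| ![interface diagram](./figures/interface.png) |
 | :----: |
-| **Figure 2**: **hard_png** interface diagram |
+| Figure 2: hard_png interface diagram |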

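-It is simple to use. First, drive the **clk** signal with a clock (any frequency) and pull **rst** low to release the module from reset.
-Then feed the **raw byte stream** into the **raw byte stream input interface**; the decoded result appears on the **image information output interface** and the **pixel output interface**.
+hard_png is simple to use. Before feeding in the byte stream of a png image, reset the module (hold rstn=0 for at least one clock cycle), then release the reset (set rstn=1). After that, feed in the png byte stream and read the decoded result from the image information output interface and the pixel output interface.

-Taking [**test1.png**](./images/test1.png) as an example, the **raw byte stream** (98 bytes) should be fed into **hard_png** with the timing shown in **Figure 3**.
-The input interface resembles **AXI-stream**: **ivalid=1** means the external logic wants to send a byte to **hard_png**, and **iready=1** means **hard_png** is ready to accept a byte. **ibyte** is accepted by **hard_png** only when **ivalid** and **iready** are both **1**.
+Taking SIM/test_image/img01.png as an example, the 98 bytes of the png byte stream should be fed into hard_png one by one with the timing shown in Figure 3. ivalid and iready form a handshake: ivalid=1 means the external logic wants to send a byte to hard_png, and iready=1 means hard_png is ready to accept a byte. ibyte is accepted by hard_png only in a clock cycle where ivalid and iready are both 1.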

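-| ![input timing diagram](./images/wave1.png) |
+| ![input timing diagram](./figures/wave1.png) |
 | :----: |
-| **Figure 3**: **hard_png** input timing diagram, using **test1.png** as an example |
+| Figure 3: hard_png input waveform |

-While the input is being fed in, the decoded result is output from the module, as shown in **Figure 4**. Before a new frame is output, **newframe** pulses high for one clock cycle, and **colortype, width, height** remain valid until all pixels of that image have been output. **width** and **height** are the image width and height, and the meaning of **colortype** is listed in the table below. In addition, **ovalid=1** means a pixel is output in that clock cycle, with its R, G, B, and A channels on **opixelr, opixelg, opixelb, opixela** respectively.
+While the input is being fed in, the decoded result is output from the module, as shown in Figure 4. Before a frame is output, newframe pulses high for one clock cycle, and colortype, width, and height are valid at the same time, where: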

-| colortype | 2'd0 | 2'd1 | 2'd2 | 2'd3 |
-| :-------: | :--: | :--: | :--: | :--: |
-| **Color type** | grayscale | grayscale + alpha | RGB / indexed RGB | RGBA |
-| **Meaning** | R, G, B equal; A=0xFF | R, G, B equal | R, G, B differ; A=0xFF | R, G, B, A all differ |
+- width and height are the image width and height.
+- colortype is the color type of the png image; its meaning is listed in the table below.

-| ![output timing diagram](./images/wave2.png) |
+| colortype | 3'd0 | 3'd1 | 3'd2 | 3'd3 | 3'd4 |
+| :-------: | :--: | :--: | :--: | :--: | :--: |
+| Color type | grayscale | grayscale+A | RGB | RGB+A | indexed RGB |
+| Note | R=G=B, A=0xFF | R=G=B≠A | R≠G≠B, A=0xFF | R≠G≠B≠A | R≠G≠B, A=0xFF |
+
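+After that, ovalid=1 means a pixel is output in that clock cycle, with its R, G, B, and A channels on opixelr, opixelg, opixelb, and opixela respectively.
+
+| ![output timing diagram](./figures/wave2.png) |
 | :----: |
-| **Figure 4**: **hard_png** output timing diagram, using **test1.png** as an example |
+| Figure 4: hard_png output waveform |
+
+Once a png image has been fully fed in, the next image can be fed in immediately or later (reset -> release reset -> feed in the byte stream).

-Once an image has been fully fed in, the next image can be fed in right away for decoding. To abort the current decoding partway through an image and start the next one, pull **rst** high for at least one clock cycle to reset the module.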


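 # Simulation

-[**tb_hard_png.sv**](./tb_hard_png.sv) is the simulation top level. It reads the **raw byte stream** from a given **.png** file, feeds it into [**hard_png**](./hard_png.sv), then receives the **decompressed pixels** and writes them to a **.txt** file.
+Everything related to simulation is in the SIM folder:

-Before simulating, set the **PNG_FILE macro** in [**tb_hard_png.sv**](./tb_hard_png.sv) to the path of the **.png** file and the **OUT_FILE macro** to the path of the **.txt** file, then run the simulation. The larger the **.png** file, the longer the simulation takes. The simulation is complete when a falling edge appears on **ivalid**. The decoded result can then be inspected in the **.txt** file.
+- test_image provides 14 png image files of different sizes and color types.
+- tb_hard_png.sv is the testbench. It decodes these images one after another and writes the results (raw pixels) to txt files.
+- tb_hard_png_run_iverilog.bat contains the commands that run the iverilog simulation.
+- validation.py (Python) compares the simulation output with the result of a software png decoder to verify correctness.
+
+iverilog must be installed before running the simulation; see [iverilog_usage](https://github.com/WangXuan95/WangXuan95/blob/main/iverilog_usage/iverilog_usage.md)
+
+Then double-click tb_hard_png_run_iverilog.bat to run the simulation. It takes about half an hour (it can be terminated early, but the resulting waveform will then be incomplete).
+
+When the simulation finishes, open the generated dump.vcd file to view the waveforms.
+
+In addition, each png image produces a corresponding txt file containing the decoded result. For example, img01.png produces out01.txt, which contains the values of the 8 decoded pixels:

-The [**images folder**](./images) provides several **.png** files of different sizes and color types that can be used for simulation. Taking [**test3.png**](./images/test3.png) as an example, the **.txt** file produced by the simulation is: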
 ```
-frame  type:2  width:83  height:74 
-f4d8c3ff f4d8c3ff f4d8c3ff f4d8c3ff f4d8c3ff f4d9c3ff ......
+decode result:  colortype:3  width:4  height:2
+fff200ff ed1c24ff 000000ff 3f48ccff 7f7f7fff ed1c24ff ffffffff ffaec9ff 
 ```
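-This means the image size is **83x74**, **colortype** is 2 (RGB), the pixel at row 1, column 1 is RGBA=(0xf4, 0xd8, 0xc3, 0xff), the pixel at row 1, column 2 is RGBA=(0xf4, 0xd8, 0xc3, 0xff), ......

-# Correctness verification
+## Correctness verification
+
+To check that the decoded result is correct, I provide the Python program validation.py. It decodes a .png file in software and compares it, pixel by pixel, with the .txt file produced by the simulation; validation passes if they match.
+
+To run validation.py, install Python3 together with the [numpy](https://pypi.org/project/numpy/) and [PIL](https://pypi.org/project/Pillow/) libraries.
+
+Once they are installed, run it from a CMD prompt, for example:

-To check that the decoded result is correct, we provide the **Python** program [**validation.py**](./validation.py). It decodes a **.png** file in software and compares it with the **.txt** file produced by the simulation; validation passes if they match. To prepare the runtime environment, install **Python3** together with the [**numpy**](https://pypi.org/project/numpy/) and [**PIL**](https://pypi.org/project/Pillow/) libraries. Then open [**validation.py**](./validation.py), set the variable **PNG_FILE** to the path of the **.png** file to verify and **TXT_FILE** to the path of the **.txt** file produced by the simulation, and run it with: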
 ```
-python validation.py
+python validation.py test_image/img03.png out03.txt
 ```
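-If validation passes, it prints **"validation successful!!"**. So far we have tested dozens of different **.png** images, and all of them pass.
+This command checks whether every pixel in out03.txt matches test_image/img03.png.

-# Performance test
+It prints the following, which indicates that validation passed:

-* **Test platform**: **Hard-PNG** running on an Altera Cyclone IV EP4CE40F23C6 for **png** decoding, clock frequency = **50 MHz** (timing closure is just met).
-* **Reference platform**: the [**upng library**](https://github.com/elanthis/upng) compiled with the **MSVC++ compiler** at the **O3 optimization level**, decoding **png** on a laptop (**Intel Core I7 8750H**).
-
-The results are listed in the table below. The performance of **Hard-PNG** is close to that of the reference platform, from which it can be inferred that **Hard-PNG** outperforms most **embedded ARM processors**.
-
-| **png file** | **Color type** | **Image size** | **Reference platform time** | **Hard-PNG time** |
-| :-----------: | :----------: | :----------: | :--------------: | :---------------: |
-|   test9.png   |     RGB      |   631x742    |      83 ms       |      204 ms       |
-|  test10.png   | indexed RGB  |   631x742    |  not supported   |       48 ms       |
-|  test11.png   |     RGBA     |  1920x1080   |      402 ms      |      993 ms       |
-|  test12.png   | indexed RGB  |  1920x1080   |  not supported   |      204 ms       |
-|  test13.png   |     RGB      |  1819x1011   |      321 ms      |      655 ms       |
-|  test14.png   | black and white |  1819x1011   |      135 ms      |      227 ms       |
-|   wave2.png   | indexed RGB  |   1427x691   |  not supported   |       27 ms       |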
+```
+size1= (400, 4)
+size2= (400, 4)
+total 400 pixels validation successful!!
+```


-# FPGA resource usage

-The table below lists the FPGA resources used by the **hard_png module** after synthesis.
+# Deployment

-|           **FPGA device**            | LUT  | LUT(%) |  FF  | FF(%) | Logic | Logic(%) |  BRAM   | BRAM(%) |
-| :--------------------------------: | :--: | :----: | :--: | :---: | :---: | :------: | :-----: | :-----: |
-|     **Xilinx Artix-7 XC7A35T**     | 2581 |  13%   | 2253 |  5%   |   -   |    -     | 792kbit |   44%   |
-| **Altera Cyclone IV EP4CE40F23C6** |  -   |   -    |  -   |   -   | 4551  |   11%    | 427kbit |   37%   |
+## FPGA resource usage
+
+|           FPGA device            | LUT  | LUT(%) |  FF  | FF(%) | Logic | Logic(%) |  BRAM   | BRAM(%) |
+| :----------------------------: | :--: | :----: | :--: | :---: | :---: | :------: | :-----: | :-----: |
+|     Xilinx Artix-7 XC7A35T     | 2581 |  13%   | 2253 |  5%   |   -   |    -     | 792kbit |   44%   |
+| Altera Cyclone IV EP4CE40F23C6 |  -   |   -    |  -   |   -   | 4551  |   11%    | 427kbit |   37%   |
+
+## Performance
+
+hard_png was deployed on an Altera Cyclone IV EP4CE40F23C6 at a clock frequency of 50 MHz (timing closure is just met). The decoding performance can be computed from the number of clock cycles each image consumes in simulation; examples are listed in the table below.
+
+| png file | Color type | Image size (width x height) | png byte stream size (bytes) | Clock cycles | Time |
+| :-----------: | :----------: | :----------: | :--------------: | :---------------: | :---------------: |
+| img05.png | RGB | 300x256 | 96536 | 1105702 | 23 ms |
+| img06.png | grayscale | 300x263 | 37283 | 395335 | 8 ms |
+| img10.png | indexed RGB | 631x742 | 193489 | 2374224 | 48 ms |
+| img14.png | indexed RGB | 1920x1080 | 818885 | 10177644 | 204 ms |




 # References

-Thanks to the following links for providing reference material.
-
-* [**upng**](https://github.com/elanthis/upng): a lightweight **png** decoding library written in C
-* [**TinyPNG**](https://tinypng.com/): a tool that uses indexed RGB to compress **png** images lossily
-* [**PNG Specification**](https://www.w3.org/TR/REC-png.pdf): the **png** standard
+* [upng](https://github.com/elanthis/upng): a lightweight png decoding library written in C
+* [TinyPNG](https://tinypng.com/): a tool that uses indexed RGB to compress png images lossily
+* [PNG Specification](https://www.w3.org/TR/REC-png.pdf): the png standard
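Review note on the README "Performance" table: the times follow from the cycle counts and the 50 MHz clock. The sketch below reproduces that arithmetic. It assumes the README rounds each time up to the next whole millisecond (this rounding rule is inferred from the table, not stated in the README), and the helper name `decode_time_ms` is hypothetical.

```python
CLOCK_HZ = 50_000_000  # 50 MHz, the clock frequency quoted in the README

def decode_time_ms(cycles, clock_hz=CLOCK_HZ):
    """Return the decode time in whole milliseconds, rounded up (assumed rounding)."""
    # Integer ceiling division: negate, floor-divide, negate again.
    return -(-cycles * 1000 // clock_hz)

# (file name, png byte stream size in bytes, clock cycles), copied from the README table
ROWS = [
    ("img05.png", 96536, 1105702),
    ("img06.png", 37283, 395335),
    ("img10.png", 193489, 2374224),
    ("img14.png", 818885, 10177644),
]

for name, nbytes, cycles in ROWS:
    # Input throughput in bytes per clock cycle, the same ratio the testbench prints.
    print(f"{name}: {decode_time_ms(cycles)} ms, {nbytes / cycles:.3f} input bytes per cycle")
```

Under that rounding assumption, all four rows of the table are reproduced from the listed cycle counts.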
--- a/RTL/hard_png.sv
+++ b/RTL/hard_png.sv
--- a/RTL/huffman_builder.sv
+++ b/RTL/huffman_builder.sv
@@ -0,0 +1,210 @@
+
+//--------------------------------------------------------------------------------------------------------
+// Module  : huffman_builder
+// Type    : synthesizable, IP's sub module
+// Standard: SystemVerilog 2005 (IEEE1800-2005)
+//--------------------------------------------------------------------------------------------------------
+
+module huffman_builder #(
+    parameter NUMCODES = 288,
+    parameter CODEBITS = 5,
+    parameter BITLENGTH= 15,
+    parameter OUTWIDTH = 10
+) (
+    rstn, clk,
+    wren, wraddr, wrdata,
+    run , done,
+    rdaddr, rddata
+);
+
+function automatic integer clogb2(input integer val);
+    integer valtmp;
+    valtmp = val;
+    for(clogb2=0; valtmp>0; clogb2=clogb2+1) valtmp = valtmp>>1;
+endfunction
+
+input                               rstn;
+input                               clk;
+input                               wren;
+input  [  clogb2(NUMCODES-1)-1:0]   wraddr;
+input  [           CODEBITS -1:0]   wrdata;
+input                               run;
+output                              done;
+input  [clogb2(2*NUMCODES-1)-1:0]   rdaddr;
+output [            OUTWIDTH-1:0]   rddata;
+
+wire                              rstn;
+wire                              clk;
+wire                              wren;
+wire [  clogb2(NUMCODES-1)-1:0]   wraddr;
+wire [           CODEBITS -1:0]   wrdata;
+wire                              run;
+wire                              done;
+wire [clogb2(2*NUMCODES-1)-1:0]   rdaddr;
+reg  [            OUTWIDTH-1:0]   rddata;
+
+reg  [clogb2(NUMCODES)-1:0] blcount  [BITLENGTH];
+reg  [   (1<<CODEBITS)-1:0] nextcode [BITLENGTH+1];
+
+initial for(int i=0; i< BITLENGTH; i++)  blcount[i] = '0;
+initial for(int i=0; i<=BITLENGTH; i++) nextcode[i] = '0;
+
+reg  clear_tree2d = 1'b0;
+reg  build_tree2d = 1'b0;
+reg  [clogb2(BITLENGTH)-1:0] idx = '0;
+reg  [clogb2(2*NUMCODES-1)-1:0] clearidx = '0;
+reg  [ clogb2(NUMCODES)-1:0] nn='0, nnn, lnn='0;
+reg  [CODEBITS-1:0] ii='0, lii='0;
+reg  [CODEBITS-1:0] blenn, blen = '0;
+wire [(1<<CODEBITS)-1:0] tree1d = nextcode[blen];
+wire                     islast = (blen==0 || ii==0);
+reg  [clogb2(2*NUMCODES-1)-1:0] nodefilled = '0;
+reg  [clogb2(2*NUMCODES-1)-1:0] ntreepos, treepos='0;
+wire [clogb2(2*NUMCODES-1)-1:0] ntpos= {ntreepos[clogb2(2*NUMCODES-1)-2:0], tree1d[ii]};
+reg  [clogb2(2*NUMCODES-1)-1:0] tpos = '0;
+reg         rdfilled;
+reg         valid = 1'b0;
+wire [OUTWIDTH-1:0] wrtree2d = (lii==0) ? lnn : nodefilled + (clogb2(2*NUMCODES-1))'(NUMCODES);
+reg  alldone = 1'b0;
+
+assign done = alldone & run;
+
+always @ (posedge clk or negedge rstn)
+    if(~rstn) begin
+        valid <= '0;
+        treepos <= '0;
+        tpos <= '0;
+        lii <= '0;
+        lnn <= '0;
+    end else begin
+        valid <= build_tree2d & nn<NUMCODES & blen>0;
+        treepos <= ntreepos;
+        tpos <= ntpos;
+        lii <= ii;
+        lnn <= nn;
+    end
+
+always @ (posedge clk or negedge rstn)
+    if(~rstn)
+        blen <= '0;
+    else begin
+        if(islast) blen <= blenn;
+    end
+
+always @ (posedge clk or negedge rstn)
+    if(~rstn) begin
+        for(int i=0; i<BITLENGTH; i++)
+            blcount[i] <= '0;
+    end else begin
+        if(done) begin
+            for(int i=0; i<BITLENGTH; i++)
+                blcount[i] <= '0;
+        end else begin
+            if(wren && wrdata<BITLENGTH)
+                blcount[wrdata] <= blcount[wrdata] + (clogb2(NUMCODES))'(1);
+        end
+    end
+
+always_comb
+    if(build_tree2d)
+        nnn = (nn<NUMCODES && islast) ? nn + (clogb2(NUMCODES))'(1) : nn;
+    else
+        nnn = (idx<BITLENGTH) ? '1 : '0;
+        
+always @ (posedge clk or negedge rstn)
+    if(~rstn)
+        nn <= '0;
+    else
+        nn <= nnn;
+
+always @ (posedge clk or negedge rstn)
+    if(~rstn) begin
+        for(int i=0; i<=BITLENGTH; i++) nextcode[i] <= '0;
+        alldone <= 1'b0;
+        ii <= '0;
+        idx <= '0;
+        build_tree2d <= 1'b0;
+        clearidx <= '0;
+        clear_tree2d <= 1'b0;
+    end else begin
+        nextcode[0] <= '0;
+        alldone <= 1'b0;
+        if(run) begin
+            if(~clear_tree2d) begin
+                if( clearidx >= (clogb2(2*NUMCODES-1))'(2*NUMCODES-1) )
+                    clear_tree2d <= 1'b1;
+                clearidx <= clearidx + (clogb2(2*NUMCODES-1))'(1);
+            end else if(build_tree2d) begin
+                if(nn < NUMCODES) begin
+                    if(islast) begin
+                        ii <= blenn - (CODEBITS)'(1);
+                        if(blen>0)
+                            nextcode[blen] <= tree1d + (1<<CODEBITS)'(1);
+                    end else
+                        ii <= ii - (CODEBITS)'(1);
+                end else
+                    alldone <= 1'b1;
+            end else begin
+                if(idx<BITLENGTH) begin
+                    idx <= idx + (clogb2(BITLENGTH))'(1);
+                    nextcode[idx+1] <= ( ( nextcode[idx] + ((1<<CODEBITS)'(blcount[idx])) ) << 1 );
+                end else begin
+                    ii <= blen - (CODEBITS)'(1);
+                    build_tree2d <= 1'b1;
+                end
+            end
+        end else begin
+            ii <= '0;
+            idx <= '0;
+            build_tree2d <= 1'b0;
+            clearidx <= '0;
+            clear_tree2d <= 1'b0;
+        end
+    end
+
+always_comb
+    if(~run)
+        ntreepos = 0;
+    else if(valid) begin
+        if(~rdfilled)
+            ntreepos = (clogb2(2*NUMCODES-1))'(rddata) - (clogb2(2*NUMCODES-1))'(NUMCODES);
+        else
+            ntreepos = (lii==0) ? '0 : nodefilled;
+    end else
+        ntreepos = treepos;
+    
+always @ (posedge clk or negedge rstn)
+    if(~rstn) begin
+        nodefilled <= '0;
+    end else begin
+        if(~run)
+            nodefilled <=              (clogb2(2*NUMCODES-1))'(1);
+        else if(valid & rdfilled & lii>0)
+            nodefilled <= nodefilled + (clogb2(2*NUMCODES-1))'(1);
+    end
+
+
+
+reg [CODEBITS-1:0] mem_huffman_bitlens [NUMCODES];
+
+always @ (posedge clk)
+    if(wren)
+        mem_huffman_bitlens[wraddr] <= wrdata;
+
+wire [clogb2(NUMCODES-1)-1:0] mem_rdaddr = (clogb2(NUMCODES-1))'(nnn) + (clogb2(NUMCODES-1))'(1);
+
+always @ (posedge clk)
+    blenn <= mem_huffman_bitlens[mem_rdaddr];
+
+
+
+reg [OUTWIDTH:0] mem_tree2d [2*NUMCODES];
+
+always @ (posedge clk)
+    if( ~clear_tree2d | (valid & rdfilled) )
+        mem_tree2d[ (clogb2(2*NUMCODES-1))'(~clear_tree2d ? clearidx : tpos ) ] <= ~clear_tree2d ? {1'b1, (OUTWIDTH)'(0)} : {1'b0, wrtree2d};
+
+always @ (posedge clk)
+    {rdfilled, rddata} <= mem_tree2d[ (clogb2(2*NUMCODES-1))'(alldone ? rdaddr : ntpos ) ];
+
+endmodule
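Review note: the `blcount` and `nextcode` registers in huffman_builder, and the update `nextcode[idx+1] <= (nextcode[idx] + blcount[idx]) << 1`, suggest the module follows the canonical Huffman code construction described in RFC 1951 (section 3.2.2). The Python sketch below is a software model of that textbook procedure only. It is not derived from the RTL, it does not model the `mem_tree2d` table construction, and the function name `canonical_codes` is hypothetical.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from per-symbol bit lengths (RFC 1951, 3.2.2).

    Returns a dict mapping symbol -> code as a bit string. Symbols with length 0 are unused.
    """
    max_len = max(lengths)

    # Step 1: count how many symbols use each bit length (length 0 is not counted).
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1

    # Step 2: compute the smallest code value for each bit length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code

    # Step 3: hand out consecutive code values to symbols in symbol order.
    codes = {}
    for sym, n in enumerate(lengths):
        if n:
            codes[sym] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes

# Example bit lengths from RFC 1951: symbols A..H with lengths (3, 3, 3, 3, 3, 2, 4, 4)
print(canonical_codes([3, 3, 3, 3, 3, 2, 4, 4]))
```

For the RFC's example lengths this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111 for symbols A through H.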
--- a/RTL/huffman_decoder.sv
+++ b/RTL/huffman_decoder.sv
@@ -0,0 +1,67 @@
+
+//--------------------------------------------------------------------------------------------------------
+// Module  : huffman_decoder
+// Type    : synthesizable, IP's sub module
+// Standard: SystemVerilog 2005 (IEEE1800-2005)
+//--------------------------------------------------------------------------------------------------------
+
+module huffman_decoder #(
+    parameter    NUMCODES = 288,
+    parameter    OUTWIDTH = 10
+)(
+    rstn, clk,
+    inew, ien, ibit,
+    oen, ocode,
+    rdaddr, rddata
+);
+
+function automatic integer clogb2(input integer val);
+    integer valtmp;
+    valtmp = val;
+    for(clogb2=0; valtmp>0; clogb2=clogb2+1) valtmp = valtmp>>1;
+endfunction
+
+input                               rstn, clk;
+input                               inew, ien, ibit;
+output                              oen;
+output  [            OUTWIDTH-1:0]  ocode;
+output  [clogb2(2*NUMCODES-1)-1:0]  rdaddr;
+input   [            OUTWIDTH-1:0]  rddata;
+
+wire                              rstn, clk;
+wire                              inew, ien, ibit;
+reg                               oen = 1'b0;
+reg  [            OUTWIDTH-1:0]   ocode = '0;
+wire [clogb2(2*NUMCODES-1)-1:0]   rdaddr;
+wire [            OUTWIDTH-1:0]   rddata;
+
+reg  [clogb2(2*NUMCODES-1)-2:0]   tpos = '0;
+wire [clogb2(2*NUMCODES-1)-2:0]   ntpos;
+reg                               ienl = 1'b0;
+
+assign rdaddr = {ntpos, ibit};
+
+assign ntpos = ienl ? (clogb2(2*NUMCODES-1)-1)'(rddata<(OUTWIDTH)'(NUMCODES) ? '0 : rddata-(OUTWIDTH)'(NUMCODES)) : tpos;
+
+always @ (posedge clk or negedge rstn)
+    if(~rstn)
+        ienl <= '0;
+    else
+        ienl <= inew ? '0 : ien;
+
+always @ (posedge clk or negedge rstn)
+    if(~rstn)
+        tpos <= '0;
+    else
+        tpos <= inew ? '0 : ntpos;
+
+always_comb
+    if(ienl && rddata<NUMCODES) begin
+        oen   = 1'b1;
+        ocode = rddata;
+    end else begin
+        oen   = 1'b0;
+        ocode = '0;
+    end
+
+endmodule
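Review note: huffman_decoder walks a tree table one bit at a time. The read address is `{ntpos, ibit}`, an entry below `NUMCODES` is emitted as a decoded symbol, and any other entry has `NUMCODES` subtracted to give the next node position. The Python sketch below models that walk in software. The tiny table `TREE2D` is a hand-made, hypothetical example (codes 0, 10, 11 for symbols 0, 1, 2), not data taken from the design, and it ignores the one-cycle memory latency of the hardware.

```python
NUMCODES = 3

# Table indexed by 2*node + bit, mirroring the {ntpos, ibit} read address.
# Entries < NUMCODES are leaf symbols; entries >= NUMCODES point to node (entry - NUMCODES).
TREE2D = [
    0,             # node 0, bit 0 -> symbol 0   (code "0")
    NUMCODES + 1,  # node 0, bit 1 -> go to node 1
    1,             # node 1, bit 0 -> symbol 1   (code "10")
    2,             # node 1, bit 1 -> symbol 2   (code "11")
    None, None,    # node 2 is unused in this small example
]

def decode_bits(tree, numcodes, bits):
    """Decode a sequence of bits into symbols by walking the tree table."""
    symbols = []
    pos = 0  # current node; 0 is the root
    for bit in bits:
        entry = tree[2 * pos + bit]
        if entry is None:
            raise ValueError("bit sequence walks into an unused table entry")
        if entry < numcodes:
            symbols.append(entry)  # leaf: emit the symbol and restart at the root
            pos = 0
        else:
            pos = entry - numcodes  # internal node: descend
    return symbols

print(decode_bits(TREE2D, NUMCODES, [0, 1, 0, 1, 1]))
```

The bit sequence 0, 10, 11 decodes to the symbols 0, 1, 2 in this example.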
--- a/SIM/tb_hard_png.sv
+++ b/SIM/tb_hard_png.sv
@@ -0,0 +1,144 @@
+
+//--------------------------------------------------------------------------------------------------------
+// Module  : tb_hard_png
+// Type    : simulation, top
+// Standard: SystemVerilog 2005 (IEEE1800-2005)
+// Function: testbench for hard_png
+//--------------------------------------------------------------------------------------------------------
+
+`timescale 1ps/1ps
+
+
+`define START_NO  1      // first png file number to decode
+`define FINAL_NO  14     // last png file number to decode
+
+`define IN_PNG_FILE_FOMRAT    "test_image/img%02d.png"
+`define OUT_TXT_FILE_FORMAT   "out%02d.txt"
+
+
+module tb_hard_png ();
+
+initial $dumpvars(1, tb_hard_png);
+
+
+reg rstn = 1'b0;
+reg clk  = 1'b1;
+always  #10000 clk = ~clk;    // 50MHz
+
+
+reg  [ 7:0]  ibyte = '0;
+reg          ivalid = 1'b0;
+wire         iready;
+
+wire         newframe;
+wire [ 2:0]  colortype;
+wire [13:0]  width;
+wire [31:0]  height;
+
+wire         ovalid;
+wire [ 7:0]  opixelr, opixelg, opixelb, opixela;
+
+
+
+hard_png hard_png_i (
+    .rstn      ( rstn      ),
+    .clk       ( clk       ),
+    // data input
+    .ivalid    ( ivalid    ),
+    .iready    ( iready    ),
+    .ibyte     ( ibyte     ),
+    // image size output
+    .newframe  ( newframe  ),
+    .colortype ( colortype ),
+    .width     ( width     ),
+    .height    ( height    ),
+    // data output
+    .ovalid    ( ovalid    ),
+    .opixelr   ( opixelr   ),
+    .opixelg   ( opixelg   ),
+    .opixelb   ( opixelb   ),
+    .opixela   ( opixela   )
+);
+
+
+
+int fptxt = 0, fppng = 0;
+reg [256*8:1] fname_png;
+reg [256*8:1] fname_txt;
+int png_no = 0;
+int txt_no = 0;
+
+int cyccnt = 0;
+int bytecnt = 1;
+
+initial begin
+    fork
+        // thread: input png file
+        for(png_no=`START_NO; png_no<=`FINAL_NO; png_no=png_no+1) begin
+            @ (posedge clk);
+            rstn <= 1'b1;
+            
+            $sformat(fname_png, `IN_PNG_FILE_FOMRAT , png_no);
+            
+            fppng = $fopen(fname_png, "rb");
+            if(fppng == 0) begin
+                $error("input file %s open failed", fname_png);
+                $finish;
+            end
+            cyccnt = 0;
+            bytecnt = 1;
+            
+            $display("start to decode %30s", fname_png );
+            
+            ibyte <= $fgetc(fppng);
+            while( !$feof(fppng) ) @(posedge clk) begin
+                if(~ivalid | iready ) begin
+                    ivalid <= 1'b1;                   // use this to always try to input a byte to hard_png (no bubble, will get maximum throughput)
+                    //ivalid <= ($random % 3) == 0;     // use this to add random bubbles to the input stream of hard_png. (Although the maximum throughput cannot be achieved, it allows input with mismatched rate, which is more common in the actual engineering scenarios)
+                end
+                if( ivalid & iready ) begin
+                    ibyte <= $fgetc(fppng);
+                    bytecnt++;
+                end
+                cyccnt++;
+            end
+            ivalid <= 1'b0;
+            rstn <= 1'b0;
+            
+            $fclose(fppng);
+            $display("image %30s decode done, input %d bytes in %d cycles, throughput=%f byte/cycle", fname_png, bytecnt, cyccnt, (1.0*bytecnt)/cyccnt );
+        end
+        
+        
+        // thread: output txt file
+        for(txt_no=`START_NO; txt_no<=`FINAL_NO; txt_no=txt_no+1) begin
+            $sformat(fname_txt, `OUT_TXT_FILE_FORMAT , txt_no);
+        
+            while(~newframe) @ (posedge clk);
+            $display("decode result:  colortype:%1d  width:%1d  height:%1d", colortype, width, height);
+            
+            fptxt = $fopen(fname_txt, "w");
+            if(fptxt != 0)
+                $fwrite(fptxt, "decode result:  colortype:%1d  width:%1d  height:%1d\n", colortype, width, height);
+            else begin
+                $error("output txt file %30s open failed", fname_txt);
+                $finish;
+            end
+            
+            for(int ii=0; ii<width*height; ii++) begin
+                @ (posedge clk);
+                while(~ovalid) @ (posedge clk);
+                $fwrite(fptxt, "%02x%02x%02x%02x ", opixelr, opixelg, opixelb, opixela);
+                if( (ii % (width*height/10)) == 0 ) $display("%d/%d", ii, width*height);
+            end
+            
+            $fclose(fptxt);
+        end
+    join
+    
+    repeat(100) @ (posedge clk);
+    $finish;
+end
+
+
+endmodule
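Review note: the input thread of the testbench advances to the next byte only in a cycle where `ivalid` and `iready` are both 1, and it reports throughput as bytes per cycle. The Python sketch below is a simplified cycle-by-cycle model of that handshake rule. The function name `run_handshake` and the valid/ready patterns are hypothetical, and the model does not reproduce the exact register timing of the SystemVerilog code.

```python
def run_handshake(data, valid_seq, ready_seq):
    """Cycle-by-cycle model of a valid/ready byte stream.

    A byte is transferred only in a cycle where valid and ready are both 1.
    Returns (received_bytes, cycles_used).
    """
    received = []
    index = 0   # next byte of `data` waiting to be accepted
    cycles = 0
    for valid, ready in zip(valid_seq, ready_seq):
        if index == len(data):
            break  # every byte has been accepted
        if valid and ready:
            received.append(data[index])
            index += 1
        cycles += 1
    return received, cycles

# First four bytes of the png byte stream quoted in the README
stream = bytes([0x89, 0x50, 0x4E, 0x47])
valid = [1, 1, 0, 1, 1, 1]  # one bubble on the sender side
ready = [1, 0, 1, 1, 1, 1]  # one stall on the receiver side
received, cycles = run_handshake(stream, valid, ready)
print(received, cycles, len(received) / cycles)
```

Bubbles (valid low) and stalls (ready low) cost cycles but never drop or reorder bytes, which is why the commented-out random `ivalid` line in the testbench lowers throughput without changing the decoded result.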
--- a/SIM/tb_hard_png_run_iverilog.bat
+++ b/SIM/tb_hard_png_run_iverilog.bat
@@ -0,0 +1,5 @@
+del sim.out dump.vcd
+iverilog  -g2005-sv  -o sim.out  tb_hard_png.sv  ../RTL/hard_png.sv  ../RTL/huffman_builder.sv  ../RTL/huffman_decoder.sv
+vvp -n sim.out
+del sim.out
+pause
--- a/SIM/test_image/img01.png
+++ b/SIM/test_image/img01.png
--- a/SIM/test_image/img02.png
+++ b/SIM/test_image/img02.png
--- a/SIM/test_image/img03.png
+++ b/SIM/test_image/img03.png
--- a/SIM/test_image/img04.png
+++ b/SIM/test_image/img04.png
--- a/SIM/test_image/img05.png
+++ b/SIM/test_image/img05.png
--- a/SIM/test_image/img06.png
+++ b/SIM/test_image/img06.png
--- a/SIM/test_image/img07.png
+++ b/SIM/test_image/img07.png
--- a/SIM/test_image/img08.png
+++ b/SIM/test_image/img08.png
--- a/SIM/test_image/img09.png
+++ b/SIM/test_image/img09.png
--- a/SIM/test_image/img10.png
+++ b/SIM/test_image/img10.png
--- a/SIM/test_image/img11.png
+++ b/SIM/test_image/img11.png
--- a/SIM/test_image/img12.png
+++ b/SIM/test_image/img12.png
--- a/SIM/test_image/img13.png
+++ b/SIM/test_image/img13.png
--- a/SIM/test_image/img14.png
+++ b/SIM/test_image/img14.png
--- a/SIM/validation.py
+++ b/SIM/validation.py
@@ -1,6 +1,5 @@
-PNG_FILE = "E:/FPGAcommon/Hard-PNG/images/test15.png"
-TXT_FILE = "E:/FPGAcommon/Hard-PNG/result/test15.txt"

+import sys
 import numpy as np
 from PIL import Image

@@ -14,7 +13,7 @@ def read_txt(fname):
                    rgba = [int(value[0:2],16), int(value[2:4],16), int(value[4:6],16), int(value[6:8],16)]
                    arr[idx] = rgba
                return height, width, arr
-            if line.startswith("frame"):
+            if line.startswith("decode result"):
                height, width = 0, 0
                for item in line.split():
                    pair = item.split(':')
@@ -46,13 +45,24 @@ def read_png(fname):
        return 0, 0, np.zeros([0], dtype=np.uint8)


+# usage: python validation.py <png_file>.png <hardware_result>.txt
+PNG_FILE = sys.argv[1]
+TXT_FILE = sys.argv[2]

 h_hw, w_hw, arr_hw = read_txt(TXT_FILE)
 h_sw, w_sw, arr_sw = read_png(PNG_FILE)

-for idx, (pix_hw, pix_sw) in enumerate(zip(arr_hw, arr_sw)):
-    if pix_hw[0]!=pix_sw[0] or pix_hw[1]!=pix_sw[1] or pix_hw[2]!=pix_sw[2]:
-        print("  ** mismatch at %d   " % (idx,), pix_hw, pix_sw)
-        break
+
+if h_hw != h_sw or w_hw != w_sw:
+    print("** size mismatch,  size1=%dx%d, size2=%dx%d" % (w_hw, h_hw, w_sw, h_sw) )
 else:
-    print("  validation successful!!")
+    print("size1=", arr_hw.shape )
+    print("size2=", arr_sw.shape )
+    idx = 0
+    for (pix_hw, pix_sw) in zip(arr_hw, arr_sw):
+        if pix_hw[0]!=pix_sw[0] or pix_hw[1]!=pix_sw[1] or pix_hw[2]!=pix_sw[2] or pix_hw[3]!=pix_sw[3]:
+            print("** mismatch at %d   " % (idx,) , pix_hw, pix_sw)
+            break
+        idx += 1
+    else:
+        print("total %d pixels validation successful!!" % idx)
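Review note: the txt format written by the new testbench is a `decode result:` header line followed by one 8-digit hex word (RRGGBBAA) per pixel. The sketch below parses that format with the standard library only, using the out01.txt content quoted in the README as sample input. It is an independent illustration, not the `read_txt` function from validation.py, and the name `parse_result_text` is hypothetical.

```python
SAMPLE = """decode result:  colortype:3  width:4  height:2
fff200ff ed1c24ff 000000ff 3f48ccff 7f7f7fff ed1c24ff ffffffff ffaec9ff
"""

def parse_result_text(text):
    """Parse a 'decode result' header plus hex pixel words into (info, pixels)."""
    info, pixels = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("decode result"):
            # Header fields look like key:value, e.g. width:4
            for item in line.split():
                key, sep, value = item.partition(":")
                if sep and value.isdigit():
                    info[key] = int(value)
        else:
            # Each word is RRGGBBAA in hexadecimal
            for word in line.split():
                pixels.append(tuple(int(word[i:i + 2], 16) for i in (0, 2, 4, 6)))
    return info, pixels

info, pixels = parse_result_text(SAMPLE)
print(info)
print(pixels[0], pixels[-1])
```

A quick consistency check on such a file is that the number of parsed pixels equals width times height from the header.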
--- a/figures/diagram.png
+++ b/figures/diagram.png
--- a/figures/interface.png
+++ b/figures/interface.png
--- a/figures/wave1.png
+++ b/figures/wave1.png
--- a/figures/wave2.png
+++ b/figures/wave2.png
--- a/hard_png.sv
+++ b/hard_png.sv
--- a/images/wave1.png
+++ b/images/wave1.png
--- a/result/test0.txt
+++ b/result/test0.txt
@@ -1,3 +0,0 @@
-
-frame  type:2  width:1  height:154
-ebe6afff f5f0b9ff f1ebb7ff dbd5a3ff d1c392ff c6a673ff 8b622eff 78501dff a37d4eff a68155ff 6a441dff 482200ff 583110ff 663f20ff 673e20ff 66391cff 5e2f11ff 5b280bff 642f10ff 774022ff 864f30ff 633110ff 401200ff 5b2f14ff 552c16ff 2e0900ff 391404ff 4f2515ff 5a2f1eff 582917ff 5e2f1bff 5c3120ff 452113ff 240d07ff 180c0cff 0c0b11ff 000107ff 000104ff 04080bff 090e12ff 000107ff 1e292fff 2a373fff 000811ff 202d36ff 07141dff 010e17ff 000a12ff 010c12ff 030a10ff 000509ff 000205ff 010204ff 06050aff 030002ff 060000ff 20110eff 281006ff 59351fff a36e4cff 9e5c2cff b36328ff a44b09ff bb5e1bff b35a18ff 934107ff 8f4613ff 8f5024ff a16c42ff 9a6e3fff a58151ff ddbb8dff efd0a4ff e9cda6ff e9d1adff ddc9a8ff e5d4b6ff d4c5a8ff f1e3c8ff e3d5baff e5d7bdff c8ba9fff bbad92ff ded0b3ff baaa90ff c8b6a2ff d8c5b4ff e0cfbdff dbccb7ff d9cdb7ff dfd4beff d9d1baff cdc5aeff ded6bfff ddd3baff dad0b7ff d2c6b0ff cebda9ff d8c6b2ff dec9b6ff d2bda8ff c3b199ff d0c0a7ff dacab1ff d5c5acff cdbca2ff ccb89fff cfba9fff d1bca1ff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff 
--- a/tb_hard_png.sv
+++ b/tb_hard_png.sv
@@ -1,95 +0,0 @@
-`timescale 1 ns/1 ns
-
-`define PNG_FILE "E:/FPGAcommon/Hard-PNG/images/test14.png"   // the png file to decode
-`define OUT_FILE "E:/FPGAcommon/Hard-PNG/result/test14.txt"   // decode result txt file
-`define OUT_ENABLE 1                                          // whether to write result to the decode result txt file
-
-module tb_hard_png();
-
-integer fppng, fptxt;
-reg [7:0] rbyte;
-
-reg rst = 1'b1;
-reg clk = 1'b1;
-always #5 clk = ~clk;
-
-reg          ivalid = 1'b0;
-wire         iready;
-reg  [ 7:0]  ibyte = '0;
-
-wire         newframe;
-wire [ 1:0]  colortype;
-wire [13:0]  width;
-wire [31:0]  height;
-
-wire         ovalid;
-wire [ 7:0]  opixelr, opixelg, opixelb, opixela;
-
-
-initial begin
-    fppng = $fopen(`PNG_FILE, "rb");
-    if(`OUT_ENABLE) fptxt = $fopen(`OUT_FILE, "w");
-    rbyte = $fgetc(fppng);
-    
-    @(posedge clk) rst = 1'b1;
-    @(posedge clk) rst = 1'b0;
-        
-    @(posedge clk) #1
-    ivalid <= 1'b0;
-    ibyte  <= 1'b0;
-
-    while(!$feof(fppng)) begin
-        @(posedge clk) #1
-        ivalid <= 1'b1;
-        ibyte  <= rbyte;
-        #1 if(iready) begin
-            rbyte = $fgetc(fppng);
-            //@(posedge clk) #1
-            //ivalid <= 1'b0;
-            //ibyte  <= '0;
-        end
-    end
-    
-    @(posedge clk) #1
-    ivalid <= 1'b0;
-    ibyte  <= 1'b0;
-
-    $fclose(fppng);
-    if(`OUT_ENABLE) $fclose(fptxt);
-end
-
-hard_png hard_png_i(
-    .rst       ( rst       ),
-    .clk       ( clk       ),
-    // data input
-    .ivalid    ( ivalid    ),
-    .iready    ( iready    ),
-    .ibyte     ( ibyte     ),
-    // image size output
-    .newframe  ( newframe  ),
-    .colortype ( colortype ),
-    .width     ( width     ),
-    .height    ( height    ),
-    // data output
-    .ovalid    ( ovalid    ),
-    .opixelr   ( opixelr   ),
-    .opixelg   ( opixelg   ),
-    .opixelb   ( opixelb   ),
-    .opixela   ( opixela   )
-);
-
-reg [31:0] pixcnt = 0;
-
-always @ (posedge clk)
-    if(newframe) begin
-        pixcnt <= 0;
-        if(`OUT_ENABLE)
-            $fwrite(fptxt, "\nframe  type:%1d  width:%1d  height:%1d\n", colortype, width, height);
-        else
-            $write("\nframe  type:%1d  width:%1d  height:%1d\n", colortype, width, height);
-    end else if(ovalid) begin
-        pixcnt <= pixcnt + 1;
-        if(`OUT_ENABLE) $fwrite(fptxt, "%02x%02x%02x%02x ", opixelr, opixelg, opixelb, opixela);
-    end
-
-endmodule