Category Archives: Computer Graphics

On Traversing a std::vector

June 30, 2009 11:40 pm / 1 Comment / Benny Chen

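The screenshot shows 10,000 low-poly models (fewer than 200 faces each) rendered with instancing. The characters have skinned animation but no AI. On the lab's GeForce 8800 GT, the scene runs at 80 fps.

Next, I gave the crowd some simple AI: each character moves toward a target point. To do this I added the code below to the per-frame update. MeshInstance is the instance class, one object per character, and Move is the simple AI function that moves a character instance. All instancing data is stored in a vector, m_vpInstancingData. The code walks over every instance with a vector iterator and calls Move on each one.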

for( vector< MeshInstance* >::iterator i = m_vpInstancingData.begin(); i != m_vpInstancingData.end(); i ++ )
{
    ( *i )->Move();
}

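To my surprise, performance dropped sharply once this code was in: as the next screenshot shows, the frame rate fell to 44 fps. This puzzled me. Before the change the CPU was basically idle, because skinning and rendering are handled entirely by the GPU, yet a single 10,000-iteration for loop on the CPU cut overall performance almost in half. Could it really do that much damage? A CPU should not be that weak.

Then I commented out the (*i)->Move() line, and the frame rate stayed in the 40s. In other words, an empty for loop of only 10,000 iterations was still the bottleneck; the 10,000 calls to Move were not the problem at all.

Could the iterator be what hurts performance? I changed the code as follows, dropping the iterator and traversing the vector with array-style indexing instead.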

for( int i = 0; i < NUM_INSTANCE; i ++ )
{
    m_vpInstancingData[i]->Move();
}

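Running it again, the frame rate went back to 80 fps!

I have always traversed vectors with iterators, and only today did I discover that they can hurt performance this much on a large vector. But the STL was designed to be both convenient and efficient, so iterators should not cost this much; perhaps Debug mode was to blame. So I ran a small experiment with the code below.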

#include <ctime>
#include <iostream>
#include <vector>

using std::vector;
using std::cout;
using std::endl;

#define MAX_NUM 1000000

int main()
{
    // Fill the vector with MAX_NUM integers.
    vector< int > vIntList;
    for ( int i = 0; i < MAX_NUM; i ++ )
    {
        vIntList.push_back( i );
    }

    clock_t start, end;
    double duration;

    // Test 1: traverse with an iterator.
    start = clock();
    for ( vector< int >::iterator i = vIntList.begin(); i != vIntList.end(); i ++ )
    {
        ( *i ) ++;
    }
    end = clock();
    duration = ( double )( end - start ) / CLOCKS_PER_SEC;
    cout << duration << " seconds" << endl;

    // Test 2: traverse with array-style indexing.
    start = clock();
    for ( int i = 0; i < MAX_NUM; i ++ )
    {
        vIntList[i]++;
    }
    end = clock();
    duration = ( double )( end - start ) / CLOCKS_PER_SEC;
    cout << duration << " seconds" << endl;

    return 0;
}

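Result in Debug mode (screenshot)

Result in Release mode (screenshot)

As the results show, the two loops are almost equally fast in the Release build. In the Debug build, the iterator version is far slower than array-style access, probably because the iterator performs many extra checks. So if the program will ultimately ship as a Release build, there is no need to worry about which way of traversing a vector is faster.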
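A likely contributor to the Debug-mode gap is that the loop condition calls end() on every pass and the post-increment copies the iterator each step; an unoptimized build with checked iterators (the default for Visual C++ Debug builds) pays for all of that. The sketch below is an illustration, not code from the project: it shows three equivalent traversals, and all function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Iterator loop with end() hoisted out of the condition and pre-increment,
// which avoids creating a temporary iterator copy on every step.
inline void incrementWithIterator(std::vector<int>& v)
{
    for (std::vector<int>::iterator it = v.begin(), last = v.end(); it != last; ++it)
    {
        ++(*it);
    }
}

// Index loop, equivalent to the array-style access used in the post.
inline void incrementWithIndex(std::vector<int>& v)
{
    for (std::size_t i = 0, n = v.size(); i < n; ++i)
    {
        ++v[i];
    }
}

// Raw-pointer loop over the vector's contiguous storage; no iterator objects involved.
inline void incrementWithPointer(std::vector<int>& v)
{
    if (v.empty()) return;
    int* p = &v[0];
    int* const last = p + v.size();
    for (; p != last; ++p)
    {
        ++(*p);
    }
}

// Usage check: all three traversals must leave the data in the same state.
inline void checkTraversalsAgree()
{
    std::vector<int> a(1000, 0), b(1000, 0), c(1000, 0);
    incrementWithIterator(a);
    incrementWithIndex(b);
    incrementWithPointer(c);
    assert(a == b && b == c);
}
```

In an optimized build these three loops should compile to much the same code, which is consistent with the Release-mode result above; which one is fastest in a Debug build depends on the compiler and its iterator-checking settings.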

Posted in: Computer Graphics, Some Experiences / Tagged: vector, array, iterator, traversal

On the RowPitch of D3D10_MAPPED_TEXTURE2D

June 30, 2009 5:02 pm / Leave a Comment / Benny Chen

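When you Map an ID3D10Texture2D, you get back a D3D10_MAPPED_TEXTURE2D structure. It has a member UINT RowPitch, and if you misunderstand what this member means, the result of the Map operation will very likely be wrong.

This is how the DX10 SDK documentation explains RowPitch: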

The pitch, or width, or physical size (in bytes), of one row of an uncompressed texture.

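RowPitch is the total number of bytes in one row of an ordinary texture. Note, however, that RowPitch is not necessarily equal to the Texture2D's width multiplied by the number of bytes per texel. In general:

RowPitch ≠ width * sizeof(pixelFormat)

RowPitch is always greater than or equal to the latter, because each row may be padded. It is often a power of two, although the exact padding is up to the driver and should not be assumed. Note also that the pitch is measured in bytes, while width is measured in texels.

An example:

Take an ID3D10Texture2D created from a D3D10_TEXTURE2D_DESC whose Format is DXGI_FORMAT_R32G32B32A32_FLOAT, so that each texel occupies 16 (4 × 4) bytes, and whose Width is 400, so that each row holds 400 texels. The row data is then 16 * 400 = 6400 bytes. After mapping the texture, however, the RowPitch of the returned D3D10_MAPPED_TEXTURE2D turns out to be 8192, the smallest power of two greater than 6400.

So when working with mapped memory, step from row to row by RowPitch and do not rely on the width given when the texture was defined.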

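When sampling the texture in an fx file, however, the calculations are based on width, as in the fx code below. Here offset is the offset from the start of the data, and g_TexWidth is the width of a 2D texture. To obtain the (u, v) coordinates of offset within the texture, everything is computed relative to width, and the pitch does not need to be considered.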

	
uint baseU = offset % g_TexWidth;
uint baseV = offset / g_TexWidth;
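The same arithmetic can be mirrored on the CPU side, for example when deciding where a given element belongs before writing it into the texture. This is only a sketch; TexelCoord, offsetToTexel and texelToOffset are names invented here, not part of the original project.

```cpp
#include <cassert>

struct TexelCoord
{
    unsigned u; // column, in texels
    unsigned v; // row, in texels
};

// Same arithmetic as the fx snippet: column = offset % width, row = offset / width.
inline TexelCoord offsetToTexel(unsigned offset, unsigned texWidth)
{
    assert(texWidth > 0);
    TexelCoord c = { offset % texWidth, offset / texWidth };
    return c;
}

// Inverse mapping from a texel coordinate back to the linear offset.
inline unsigned texelToOffset(TexelCoord c, unsigned texWidth)
{
    assert(c.u < texWidth);
    return c.v * texWidth + c.u;
}
```

For a texture 400 texels wide, offset 401 maps to column 1 of row 1. Both functions work purely in texels; the byte-level pitch only matters when addressing mapped memory directly.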

Posted in: Computer Graphics / Tagged: D3D10_MAPPED_TEXTURE2D, ID3D10Texture2D, RowPitch, width

Revisiting Frustum Culling

June 25, 2009 6:17 pm / Leave a Comment / Benny Chen

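My last close encounter with frustum culling was two years ago, in a game demo that implemented quad-tree terrain and used frustum culling to significantly reduce the number of triangles rendered.

The game demo from two years ago: The Catcher in the Rye (麦田里的守望者)

I then left it alone for two years. My current large-scale crowd rendering project has pushed me to call on frustum culling again. Working from vague memories, I reviewed the principles behind the frustum, and in less than an hour I had recovered the core concepts and techniques and gained some new understanding as well.

Here is a passage from Chad Vernon (www.chadvernon.com), whose work I have admired for the past two years: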

When we tell DirectX to render geometry, it will perform a culling test to see if a vertex is within the view frustum before rendering the vertex. DirectX will only render a vertex if it is within the view frustum. However, this test occurs after the geometry has been transformed and lit by the world and view matrices. In more complex scenes, thousands of vertices would be transformed just to be rejected by the frustum test in DirectX. We can speed up performance quite a bit if we implement a view frustum culling test prior to our render calls.

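DirectX itself performs a culling test on vertices in its pipeline, but only after the vertices have gone through transform and lighting (T&L). Vernon was presumably writing in the DX9 era. The DX10 documentation says much the same about the Rasterizer Stage: "the stage clips vertices to the view frustum", which happens only after stages such as the VS and GS.

The benefit of implementing frustum culling by hand is that a large amount of invisible geometry can be rejected before it is ever submitted to the rendering pipeline.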
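A common way to do this on the CPU is to test each object's bounding sphere against the six frustum planes. The sketch below assumes the planes have already been extracted, that their normals are unit length and point into the frustum, and that each object has a bounding sphere; the Vec3 and Plane types and the function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0, with the unit normal n pointing into the frustum.
struct Plane { Vec3 n; float d; };

// Signed distance from point p to the plane (positive on the inside).
inline float signedDistance(const Plane& pl, const Vec3& p)
{
    return pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d;
}

// Returns false only when the sphere lies entirely behind at least one plane.
// The test is conservative: a sphere just outside a frustum corner may still be kept.
inline bool sphereInFrustum(const Plane (&planes)[6], const Vec3& center, float radius)
{
    assert(radius >= 0.0f);
    for (std::size_t i = 0; i < 6; ++i)
    {
        if (signedDistance(planes[i], center) < -radius)
        {
            return false; // completely outside this plane: cull
        }
    }
    return true; // inside or intersecting: submit for rendering
}
```

For a crowd of instances, one such test per character decides whether its instance data is uploaded at all, so culled characters cost neither vertex processing nor skinning on the GPU.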

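The link below gives a basic but detailed introduction to frustum culling (the author swears quite a lot) and walks through a series of optimizations, which deepened my understanding of the technique. It links to an article on how to construct the frustum; as soon as I opened that article again, I pieced together from fragments of memory that I had read it two years ago. Memories tend to be reactivated suddenly by a setting that feels familiar.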

http://www.flipcode.com/archives/Frustum_Culling.shtml

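The DX10 era has long since arrived. In AMD's March of the Froblins paper, frustum culling and LOD selection are both done entirely on the GPU, with the help of the geometry shader. These days it seems that almost any computation can be moved to the GPU.

I plan to implement frustum culling for my current crowd rendering project on both the CPU and the GPU and compare their performance. In my project, I suspect the GPU version will not necessarily run faster than the CPU version, because the GPU already carries the heavy crowd rendering workload while the CPU has so far been mostly idle.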

Posted in: Computer Graphics / Tagged: frustum culling

Indexed Triangle Lists Are the Fastest

June 16, 2009 5:30 pm / Leave a Comment / Benny Chen

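I had always assumed that triangle strips were the most efficient primitive, because a strip reduces the cost of processing and transferring m triangles from 3m vertices to (m + 2) vertices. Today I learned from a thread on Gamedev that indexed triangle lists are actually the fastest.

It is true that triangle strips reduce the number of vertices sent to the graphics card, but on today's cards bandwidth has long since stopped being the problem!

As for why indexed triangle lists are the fastest: when processing vertices, they can maximize the hit ratio of the graphics card's vertex cache.

The passage below is from Tom Forsyth's paper Linear-Speed Vertex Cache Optimisation, which describes how to order the triangles so that the cache is used as well as possible. The algorithm is greedy and runs in O(N) time; it is worth studying if you are interested.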

Indexed primitives have replaced strips and fans as the most common rendering primitive in graphics hardware today. When rendering each triangle, the vertices it uses must be processed (for example, transformed to screen space and lit in various ways) before rendering. Indexed rendering hardware typically uses a small cache of vertices to avoid re-processing vertices that are shared between recently-rendered triangles. Using a moderately-sized cache (anywhere between 12 and 24 vertices is common) reduces the load on the vertex processing pipeline far more than the older non-indexed strips and fans model – often halving the amount of work required per triangle.
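To get a feel for what the vertex cache buys, one can simulate it and count how many vertices actually need processing for a given index order. The sketch below assumes a simple FIFO cache, which is only a model (real hardware replacement policies vary), and it is not Forsyth's optimisation algorithm, just a way to measure an ordering. The function names are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Counts the vertices that must be processed (cache misses) when an indexed
// triangle list is drawn through a FIFO vertex cache holding cacheSize entries.
inline std::size_t countCacheMisses(const std::vector<unsigned>& indices, std::size_t cacheSize)
{
    assert(cacheSize > 0);
    std::deque<unsigned> cache;
    std::size_t misses = 0;
    for (std::size_t i = 0; i < indices.size(); ++i)
    {
        if (std::find(cache.begin(), cache.end(), indices[i]) != cache.end())
        {
            continue; // hit: the processed vertex is reused
        }
        ++misses;
        cache.push_back(indices[i]);
        if (cache.size() > cacheSize)
        {
            cache.pop_front(); // evict the oldest entry
        }
    }
    return misses;
}

// Average cache miss ratio: vertices processed per triangle (3.0 means no reuse at all).
inline double averageCacheMissRatio(const std::vector<unsigned>& indices, std::size_t cacheSize)
{
    assert(!indices.empty() && indices.size() % 3 == 0);
    return static_cast<double>(countCacheMisses(indices, cacheSize)) / (indices.size() / 3);
}
```

Two triangles that share an edge, indexed as 0,1,2 and 2,1,3, need only four vertices processed instead of six when the cache is large enough. On a regular mesh, a good triangle order can push the ratio well below what a non-indexed strip achieves, which is the point the quoted passage makes.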

Posted in: Computer Graphics / Tagged: triangle list, triangle strip

ID3D10EffectPass::Apply

June 15, 2009 5:13 pm / Leave a Comment / Benny Chen

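I was stuck on a bug for half a day because I had not properly understood what this function does.

The bug was this: I was rendering a character model and a floor model, each with its own texture, but in the rendered result the character's texture appeared on the floor and the floor's texture appeared on the character. The textures were swapped!

Before rendering each model I had clearly set its texture view (ID3D10ShaderResourceView) on the shader's texture variable (ID3D10EffectShaderResourceVariable), so how could something this strange happen?

Debugging this was freaking me out. I chewed over the source of the bug for a long time before it occurred to me to challenge an assumption I had been taking for granted: what actually starts shader execution is the Draw call, so any render state only has to be set before Draw to take effect.

The first half is right. The second half is not!

I had overlooked what Apply does. It does more than just select a pass.

This is how the DirectX SDK documentation describes ID3D10EffectPass::Apply: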

Set the state contained in a pass to the device.

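Apply commits the pass's state to the device. It dawned on me at once: my texture-setting statements ran after Apply. I moved them before the Apply call, and the textures returned to their rightful owners.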
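The ordering problem can be reproduced with a tiny mock. To be clear, the classes below are hypothetical and are NOT the D3D10 API; they only model the idea that the effect holds pending state, Apply copies it to the device, and Draw reads whatever the device currently has.

```cpp
#include <cassert>
#include <string>

// Hypothetical mock of a device: Draw() reports which texture the draw call used.
struct MockDevice
{
    std::string boundTexture;
    std::string Draw() const { return boundTexture; }
};

// Hypothetical mock of an effect pass: SetTexture() only changes pending state,
// Apply() is what copies that state to the device.
struct MockPass
{
    std::string pendingTexture;
    void SetTexture(const std::string& t) { pendingTexture = t; }
    void Apply(MockDevice& dev) const { dev.boundTexture = pendingTexture; }
};

// Buggy order from the post: Apply first, then set the texture, then draw.
inline std::string drawSetAfterApply(MockPass& pass, MockDevice& dev, const std::string& tex)
{
    pass.Apply(dev);
    pass.SetTexture(tex); // too late for this draw; it takes effect on the next Apply
    return dev.Draw();
}

// Fixed order: set the texture, then Apply, then draw.
inline std::string drawSetBeforeApply(MockPass& pass, MockDevice& dev, const std::string& tex)
{
    pass.SetTexture(tex);
    pass.Apply(dev);
    return dev.Draw();
}

// Usage check: with the buggy order, each model is drawn with the other one's texture.
inline void demoSwappedTextures()
{
    MockPass pass;
    MockDevice dev;
    drawSetAfterApply(pass, dev, "character");                     // first frame, nothing bound yet
    assert(drawSetAfterApply(pass, dev, "floor") == "character");  // floor gets the character texture
    assert(drawSetAfterApply(pass, dev, "character") == "floor");  // character gets the floor texture
}
```

In the mock, every draw uses the texture that was set for the previous draw, which is the swapped result described above. In real D3D10 code the corresponding fix is to update the effect variables for an object before calling Apply on the pass, and only then issue the draw call.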

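Debugging takes more than good logical reasoning and troubleshooting skills. You also need the courage to question your own fixed assumptions (though there is no need to doubt everything). Sometimes the thing you believe most firmly is the root of the error, and once you break that belief the bug is solved.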

before bug was fixed

after bug was fixed

Posted in: Computer Graphics / Tagged: Apply, DX10, ID3D10EffectPass
