Getting rendered graphics (quickly) from the card

I had a thread about this some days back, but for some reason I cannot write to it now.

I am faced with the task of rendering some graphics with DirectX and then copying those graphics to another card, which generates a video signal. I need to feed it a new frame 25 times a second. The frame size is 720*576.

The rendering does not have to be any faster than this.

I have tried two methods of getting the pixels, and both are much too slow. One takes 500 ms per frame and the other 1800 ms. I need a time below 40 ms for this to work.

I post the two methods below. Perhaps someone can provide input as to what I need to do. I might add that others have accomplished this using OpenGL, so it should not be a hardware limitation, as far as I can tell. If I have to do things unsafe, then that is fine. Some of the other pixel manipulations are already unsafe for speed.

If this is something a certain hardware upgrade can help with, then I am also interested in hearing about that. It should be possible in software though... or so I believe.

Method which takes 500 ms

byte[] buffer = new byte[720];
Surface surface = device.GetBackBuffer(0, 0, BackBufferType.Left);
GraphicsStream graphicsStream = surface.LockRectangle(LockFlags.ReadOnly);

int timer = Environment.TickCount;
// read the locked back buffer one line at a time
for (int line = 0; line < 576; line++)
    graphicsStream.Read(buffer, 0, 720);
timer = Environment.TickCount - timer;

MessageBox.Show(timer.ToString());

Method which takes 1800ms

Surface surface = device.GetBackBuffer(0, 0, BackBufferType.Left);

int timer = Environment.TickCount;
// lock the whole back buffer and have it returned as an array of VideoColor
Array data = surface.LockRectangle(typeof(VideoColor), LockFlags.DoNotWait, new int[] { 720, 576 });
timer = Environment.TickCount - timer;

MessageBox.Show(timer.ToString());




  • Linto Poulose E

    I've been working on a video capture library for Managed DirectX myself and I can confirm that this indeed is very hardware dependent. I'm using the approach below on a rig with a P4 dual core and a Radeon X850, which yields ~1200 fps including AVI encoding. On my centrino laptop with a Radeon Mobility though, it tops out at about 10 fps. I guess dual core is always a nice thing to have to guarantee stable framerates when doing this kind of processing on the side.

    Anyway, my approach to this is to copy the backbuffer data into an (offscreen plain) buffer surface in the SystemMemory pool using device.GetRenderTargetData, and then to read the data from this buffer surface. I found that the performance increased drastically compared to reading from the backbuffer directly (from 10 fps to about 400 fps).

    To increase performance some more, I read the buffer surface into a Bitmap object each frame, so the processing on this data can be done from another thread without having to lock any device-bound resources. Copying to a Bitmap can be done efficiently in system memory using SurfaceLoader.SaveToStream and the Bitmap.FromStream method. This sounds quite expensive, but I found that the overhead from this is minimal.

    Anyway, here's the current beta for our screen capture library.

    Hope this helps



  • avner ben-zvi

    Thomas,

    What kind of bus do you use? Is it PCI-X or a standard PCI (32-bit/33 MHz) bus? Does the bus matter in any way, or is the speed of the standard PCI bus sufficient?

    Regards,
    Florian

  • hriverag93

    This is the method, which I call after EndScene() but before Present().

    VideoColor is a structure of 4 bytes: RGBA.

    private unsafe void copyGraphicsToAddress(VideoColor* buffer)
    {
        //now grab the data from the backbuffer
        Surface backBuffer = device.GetBackBuffer(0, 0, BackBufferType.Mono);

        if (bufferSurface == null) //bufferSurface is declared outside the method and is just a Surface
            bufferSurface = device.CreateOffscreenPlainSurface(backBuffer.Description.Width,
                backBuffer.Description.Height, backBuffer.Description.Format, Pool.SystemMemory);

        //copy the render target into the system memory surface and lock it for reading
        device.GetRenderTargetData(backBuffer, bufferSurface);
        GraphicsStream graphicsStream = bufferSurface.LockRectangle(LockFlags.ReadOnly);
        VideoColor* videoData = (VideoColor*)graphicsStream.InternalDataPointer;

        //do the actual copy into the external render target
        int numPixels = backBuffer.Description.Width * backBuffer.Description.Height;
        for (int i = 0; i < numPixels; i++)
        {
            buffer[i] = videoData[i];
        }

        //unlock the buffer
        bufferSurface.UnlockRectangle();
    }
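    The loop above treats the locked surface as tightly packed rows. Direct3D reports a pitch (bytes from one row to the next) when a surface is locked, and that pitch can be larger than width * bytes per pixel. Below is a minimal sketch of a pitch-aware copy, assuming the pitch is known from the lock call. The class and method names (PitchCopy, CopyRows) are hypothetical, and it copies into a managed array rather than to the video card's buffer.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PitchCopy
{
    // Copies 'height' rows of 'rowBytes' bytes each from a source whose rows
    // start 'pitch' bytes apart into a tightly packed managed array.
    // 'source' stands in for the pointer to the locked surface data (assumption).
    public static byte[] CopyRows(IntPtr source, int pitch, int rowBytes, int height)
    {
        if (pitch < rowBytes)
            throw new ArgumentException("pitch must be at least rowBytes");

        byte[] packed = new byte[rowBytes * height];
        for (int y = 0; y < height; y++)
        {
            // Start of row y in the source, skipping any padding at the end of each row.
            IntPtr row = IntPtr.Add(source, y * pitch);
            Marshal.Copy(row, packed, y * rowBytes, rowBytes);
        }
        return packed;
    }
}
```

    When the pitch equals rowBytes there is no padding, and a single copy of rowBytes * height bytes would do the same job.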


  • cathalconnolly

    I am trying to solve a similar problem. Would you mind posting the short bit of code you used to get the data from the GPU so quickly? Thanks.
  • Dlloyd

    I didn't mean to say that it is not hardware dependent. My point was that the OpenGL method can work at 25 fps on the same hardware where the other DX methods that I tried run at 2 fps.

    That was why I concluded that 25 fps was within the reach of the hardware.

    Thanks for the reply. It sounds as if you have found a method which can give me the measly 25 fps I need. Ok, 30 fps to have a little something to spend elsewhere ;-)

    I will take a look at it and see if I can duplicate the results. It will also be a big help to see the actual calls to create surfaces, set render targets, etc. I am still new to DX and that is something which often fails for me.

     ----------

    Ok, I tried your method and it is much faster than what I did. Now it can be done in 16 ms + 47 ms: 16 ms for getting the backbuffer and 47 ms for reading the 720*576*4 bytes into a bitmap. It is still a little too slow though.

    Do you have any further suggestions for optimization?

    Another thing is that I need the data written to a specific address. I can get a graphics stream and read out single bytes, which is slow, or I can get the SurfaceLoader to return a bitmap, which is better, but then I have to read out the data and write it to the buffer myself.

    Is there a way to get DirectX to write the data to a specific location?

    Is there something I can do to optimize, given that I do not need to write to a visible surface? If I can simply render, get the data and feed it to the video out, then that is fine.

     


  • GSGIMD

    Just as a general thought... OpenGL != Direct3D. They may seem very similar, but they do have different architectures/methodologies, so it can be difficult to compare features.

    "Do you have any further suggestions for optimization?"

    Have you considered any buffered reading from the GPU? I helped someone with this a while back and they said it worked really well. If you create a number of render targets in a ring-buffer fashion, you can allow yourself more time to read back the data (e.g. with 10 elements in the ring, you can spend up to 9 frames pulling back the first frame's data). It's more difficult to program, but the basic idea is to make use of cooperative programming techniques and avoid stalling the GPU/CPU.
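    The ring idea can be sketched without any Direct3D calls. This is only an illustration under stated assumptions: each slot is a plain byte array standing in for a render target, and the names (FrameRing, AcquireWriteSlot, CommitWrite, OldestFrame) are hypothetical. In a real application each slot would be a render target the card draws into while an older one is read back.

```csharp
using System;

// Hypothetical stand-in for a ring of render targets; each slot is a byte[] here.
public sealed class FrameRing
{
    private readonly byte[][] slots;
    private long framesWritten;

    public FrameRing(int slotCount, int bytesPerFrame)
    {
        if (slotCount < 2)
            throw new ArgumentOutOfRangeException("slotCount", "need at least two slots");
        slots = new byte[slotCount][];
        for (int i = 0; i < slotCount; i++)
            slots[i] = new byte[bytesPerFrame];
    }

    // Slot the renderer should draw the next frame into.
    public byte[] AcquireWriteSlot()
    {
        return slots[(int)(framesWritten % slots.Length)];
    }

    // Call once the frame has been rendered into the write slot.
    public void CommitWrite()
    {
        framesWritten++;
    }

    // Oldest completed frame that is not the current write slot, or null if
    // nothing has been written yet. Once the ring has filled, this frame is
    // slotCount - 1 frames behind the frame being rendered.
    public byte[] OldestFrame()
    {
        if (framesWritten == 0)
            return null;
        long oldest = Math.Max(0L, framesWritten - (slots.Length - 1));
        return slots[(int)(oldest % slots.Length)];
    }
}
```

    With slotCount = 2 this reduces to plain double buffering. More slots trade extra latency for more time to pull each frame back.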

    Some more profiling (dig into PIX with some IHV-specific plugins) will probably be useful. You've got a number of specific stages involved in what you're doing, so it'll be useful to know which one is slowing things down so that you can focus your efforts on it.

    "Is there a way to get directx to write the data to a specific location?"

    Not that I'm aware of. When reading these sorts of things, you basically get a handle to its existing (internal) location.

    hth
    Jack


  • dannback

    Currently I use a PCI bus on a slow motherboard.

    If my memory serves me correctly, a realistic throughput for a PCI bus on a system with some other activity is 80 MB/s.

    In my (and your) case, with 576*720*4 bytes per frame at 25 fps, you use 41 MB/s, which is half the bus capability. If you want to go to 50 fps then you are at the limit, and if the gfx card and the video card are both on the same bus, then you are in trouble.
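    The arithmetic can be checked with a few lines of C#. This is just a sketch; the class and method names (FrameBandwidth, BytesPerSecond, FrameBudgetMs) are made up for illustration.

```csharp
using System;

public static class FrameBandwidth
{
    // Bytes per second needed to move uncompressed frames across the bus.
    public static long BytesPerSecond(int width, int height, int bytesPerPixel, int framesPerSecond)
    {
        return (long)width * height * bytesPerPixel * framesPerSecond;
    }

    // Time available per frame, in milliseconds.
    public static double FrameBudgetMs(int framesPerSecond)
    {
        return 1000.0 / framesPerSecond;
    }
}
```

    FrameBandwidth.BytesPerSecond(720, 576, 4, 25) gives 41,472,000 bytes/s (about 41 MB/s), and the same call with 50 fps gives 82,944,000 bytes/s, which is right at the 80 MB/s estimate. FrameBudgetMs(25) gives the 40 ms per frame mentioned in the question.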

    It does matter, but for TV framerates and resolutions, and on more up-to-date systems, you should have plenty of time to spare.

    It is working here now, by the way. I can do this on my slow system, with an AGP gfx card and a PCI video card. It runs at 33 fps if I do not sync to the TV framerate. I am sending out frames and not fields, by the way. It looks ok, and I avoid throwing half my data away in the interlacing.


  • Jeff Menninger

    My data is currently a standard Bitmap object but could just as easily be a byte[] etc. (let me know what would be fastest). The textures I'm loading are not always the same size, but all of the textures I load at one time are the same size. I don't need mipmaps. I just need a way to quickly (video rate would be nice) load images in the size range of 1024x1024 into the effect so that I can run the pixel shader and then render to screen. Thanks.
  • Mazzel

    Thank you Tom for your post, this is much faster than the method I was using before. Have you found an equally fast method for uploading textures to the graphics card? The 2 methods I have found to create textures to give to the Effect are:

    texture = TextureLoader.FromFile(device, @"textures_All_9.bmp");

    texture = Texture.FromBitmap(device,this.bmp,(Usage)0,Pool.Managed);

    The first method of reading from a file is very fast, while the second method of reading from memory is, oddly enough, very slow (~100 ms compared to ~1500 ms). Do you know of a similar technique to take an image or stream in system memory and quickly convert it to a texture on the device? Thanks again for your help.


  • vinayshetty

    I might add that I have now tried using the InternalDataPointer in an unsafe context to move through the data one pixel (32 bits) at a time. It was a tiny bit faster: around 450 ms versus the previous 500 ms.

    The OpenGL code which does this 25 times a second simply uses

    glReadBuffer(GL_BACK);
    glReadPixels(0, 0, 720, 576, GL_BGRA_EXT, GL_UNSIGNED_BYTE, distinationBuffer);

    What I really need is to move the rendered pixels from the graphics card into a predefined buffer. I do not need to look at them one by one.


  • Andrew Pardoe

    You can use the MemoryStream class from the framework and the FromStream method of TextureLoader.
    Depending on your specific use case, there are even faster methods to update the content of a texture.

    Do you have raw bitmap data or data stored in a common file format (BMP, JPG)?

    Does your texture always have the same size?

    Do you need mipmaps?



  • LCasselle

    "Do you have any further suggestions for optimization?"

    Well, if you know the format of your backbuffer, you don't need the (potentially expensive) conversion to the Bitmap. If you have a 32-bit ARGB color backbuffer for example, you could simply read the ints into an array from the GraphicsStream, which should be faster than constructing a Bitmap (at least you'll skip any format conversions).

    Using the buffered Read(System.ValueType, int[] ranks) method from the GraphicsStream should allow you to read this data very efficiently, especially if you read the entire stream into the buffer at once (i.e. into a buffer array with a length of 720x576). This can be done by supplying this number as the first and only ranks parameter.

    I haven't got a clue about the device you are writing to, but I think it makes sense to match your backbuffer format to the input format your device expects. This way, you can just write the binary data you read from the backbuffer directly to the device without worrying about any conversions. It's both easier and probably also the fastest approach.

    Writing to a specific address can be done by locking a surface and either writing to the buffer array or by writing to the GraphicsStream returned from this method. You can use the Seek(long pos) method on the GraphicsStream to position the write cursor where you need it and use the buffered Write method to write your array data. If you use the buffer array, you can use Array.Copy to efficiently copy parts of the backbuffer data onto your surface. The choice between the buffer array and the GraphicsStream depends on what data you need to write. For random access writes, the buffer array may be faster.

    But exactly how does the video-out card work anyway? Is it a normal D3D9 device, or what is it? If it is a specialized card with a specialized API, it may expose methods to write to it more efficiently.



  • Pikker1981

    For some odd reason I missed GraphicsStream.InternalDataPointer. I used ReadByte, which was stupidly slow. After all... there is no point in reading out single bytes over a bus that is 32 bits wide.

    Thanks a lot for the feedback. Currently the locking, reading and unlocking takes 15 ms on this machine. I can render a full frame and then lock, read and unlock in just about the same time, which is probably due to the potent graphics card.

    Using the pointer to the internal data and reading out 32 bits at a time (using the correct data format) did the trick, it seems. Reading into a fixed array is just as fast.

    "Writing to a specific address can be done by locking a surface and either writing to the buffer array or by writing to the GraphicsStream returned from this method. "

    Why would I want to write into the buffer? I do not need to change its data, just feed it to the video card.

    An odd thing is that the surface I get from the lock does not have the same size as my display. It is 716*572 and not 720*576, for some undefined reason.

    The ring buffer you suggest seems to make sense. You mean that while I read from one render target, the card can draw to another. I cannot see why a ring is needed though. Would two targets not be enough? After all, the two operations are generally equal in speed, so a buffer as such should not be needed. Do you suggest this because I can keep a constant rate of reads even if the gfx card hits a bump? I doubt that the gfx card will ever have speed problems compared to the read.

    To answer the question about the video card... It is a Bluefish card which can do only one thing: present an internal buffer as a video signal. You render graphics on a normal graphics card, then you take the data from there and give it to the video card. Out comes (in my case) a 50 Hz interlaced PAL TV signal. DMA is possible, but only between a specific buffer the card allocates in system memory and one on the card itself.

    It looks like you solved my problem. Now I am under the magic 40 ms limit with a fair margin. Unless I missed something critical in my testing just now, I am happy and owe you a big sign saying "thanks" :-)


  • Amethyste

    Glad to see it worked out. With the writing to the buffer part, I assumed you had to write to a target surface on that Bluefish card, but I guess you can safely ignore that completely.


