System.IO.Compression not as good as compressed folder

I'm getting much better compression when I make a compressed folder (Windows XP) than I do using DeflateStream or GZipStream.  I thought these were the same algorithms used in PKZIP and for compressing folders.  Why such bad compression?

DeflateStream: 3544 KB -> 1261 KB
GZipStream: 3544 KB -> 1261 KB
Windows XP: 3544 KB -> 804 KB

So how can I get the same compression ratio as Windows XP?

Thanks,

Jeremy



  • Raudhah

    This is very curious, by the way. I've been using GZipStream for months to compress XML files, and that works very well. The compression rate is about 10:1 or better -- nothing to complain about.
     
    Possibly Microsoft neglected to add detection code for poorly compressible data that should simply be stored. What data did you put in your file? Your WinZip/XP compression rates don't look so great, either -- was that random test data, too, or perhaps some binary data that's close to random data?

  • njahamaca

    Holy moly, you're right!  I could reproduce the effect with a file containing random bytes, as listed below.
     
    Random bytes never compress well, of course, but both .NET compression streams actually expanded the file size by a whopping 50%. The stand-alone WinRAR program only adds a few header bytes which is what should happen in this case.
     
    Original file -- 100,000 bytes
    DeflateStream -- 153,829 bytes
    GZipStream -- 153,847 bytes
    WinRAR Zip -- 100,142 bytes (regardless of quality setting)
     
    I've used several runs to get different random values, always deleting the files in-between just to be sure.  The results were virtually identical.
     
    using System;
    using System.IO;
    using System.IO.Compression;
     
    namespace CompressionTest {
     
        public class MainClass {
     
            public static void Main() {
     
                // Save original file with random bytes
                byte[] file = new byte[100000];
                Random random = new Random();
                random.NextBytes(file);
                File.WriteAllBytes("data_txt.txt", file);
     
                // Save Deflate-compressed version of file.
                // File.Create truncates an existing file; File.OpenWrite would not.
                using (DeflateStream deflate = new DeflateStream(
                    File.Create("data_zip.txt"),
                    CompressionMode.Compress)) {
                    deflate.Write(file, 0, file.Length);
                }
     
                // Save GZip-compressed version of file
                using (GZipStream gzip = new GZipStream(
                    File.Create("data_gzip.txt"),
                    CompressionMode.Compress)) {
                    gzip.Write(file, 0, file.Length);
                }
            }
        }
    }

  • JimIT

    GZipStream is just a wrapper around DeflateStream, so you'll always see similar performance from these (with GZipStream being slightly less efficient in some cases due to the compatibility bits it adds).

    Standalone compression utilities, like the PKZIP, perform file-based compression, which is subtly different from stream compression. When compressing a file, more analysis is possible (since all bits are known at the start), and memory allocation (mainly dictionary size) can be optimized. File-based utilities can even pick the most efficient algorithm based on analysis of the input bits.

    Stream compression, on the other hand, has to 'take each bit as it comes' and is more restricted memory-wise, mostly because the working set of the stream compressor has to be predictable (and small, especially for general-purposes classes like DeflateStream).

    So, the short answer to your question is, possibly a bit disappointing, "don't use stream compression if you want the smallest possible output". Fortunately, there are several third-party compression toolkits (just Google "ZIP toolkit .NET"), many of which offer much better-performing compression algorithms than Deflate, which helps even when compressing streams.

    '//mdb

    P.S. It's perfectly normal for the data size to increase after compression if the input is random or otherwise unsuitable for the algorithm used. File-based compression utilities will opt to just store the original file in this case: pure stream compressors can't do that for obvious reasons. Of course, you can always look at the original data size and the compressed stream size, and decide which one to persist (setting a flag somewhere to indicate the format, of course...) yourself.
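
    A minimal sketch of that store-or-compress decision, assuming a made-up one-byte flag in front of the payload (0 = stored, 1 = deflated). The class name, method names, and flag layout are hypothetical and only illustrate the idea; DeflateStream and MemoryStream are the only framework types involved.

    ```csharp
    using System;
    using System.IO;
    using System.IO.Compression;

    // Hypothetical helper: the flag byte and all names here are illustrative only.
    public static class StoreOrDeflate
    {
        private const byte Stored = 0;    // payload is the raw input
        private const byte Deflated = 1;  // payload is DeflateStream output

        public static byte[] Pack(byte[] input)
        {
            using (MemoryStream sink = new MemoryStream())
            {
                // leaveOpen = true keeps the MemoryStream usable after the deflater is disposed.
                using (DeflateStream deflate = new DeflateStream(sink, CompressionMode.Compress, true))
                {
                    deflate.Write(input, 0, input.Length);
                }

                byte[] compressed = sink.ToArray();
                bool useDeflate = compressed.Length < input.Length;
                byte[] payload = useDeflate ? compressed : input;

                // Prepend the flag so the reader knows how to interpret the payload.
                byte[] result = new byte[payload.Length + 1];
                result[0] = useDeflate ? Deflated : Stored;
                Buffer.BlockCopy(payload, 0, result, 1, payload.Length);
                return result;
            }
        }

        public static byte[] Unpack(byte[] packed)
        {
            if (packed[0] == Stored)
            {
                byte[] raw = new byte[packed.Length - 1];
                Buffer.BlockCopy(packed, 1, raw, 0, raw.Length);
                return raw;
            }

            using (MemoryStream source = new MemoryStream(packed, 1, packed.Length - 1))
            using (DeflateStream inflate = new DeflateStream(source, CompressionMode.Decompress))
            using (MemoryStream output = new MemoryStream())
            {
                byte[] chunk = new byte[4096];
                int read;
                while ((read = inflate.Read(chunk, 0, chunk.Length)) > 0)
                {
                    output.Write(chunk, 0, read);
                }
                return output.ToArray();
            }
        }
    }
    ```

    With this, random input costs exactly one extra byte instead of growing by half, at the price of holding the whole compressed result in memory before deciding.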


  • TootPeep

    Chris, thanks for filing the bug report.  Looks like Microsoft will fix this in the future.  For now, I'll investigate using a 3rd party solution.

    -Jeremy

  • k-ichiro

    It looks like the .NET compression routines are broken.  Here are some new results:

    Original: 372K
    Deflate: 540K  <--- Expanded!
    Gzip: 540K
    Windows XP: 353K
    WinZip: 353K

    On the off chance I'm doing something wrong, here's the code to save all three versions (file is a byte[]):

                File.WriteAllBytes("c:\\data_txt.txt", file);  // Save original file

                // Save compressed version of file
                DeflateStream deflate = new DeflateStream(File.OpenWrite("c:\\data_zip.txt"),
                                                          CompressionMode.Compress, false);
                deflate.Write(file, 0, file.Length);
                deflate.Close();

                // Save compressed version of file
                GZipStream gzip = new GZipStream(File.OpenWrite("c:\\data_gzip.txt"),
                                                 CompressionMode.Compress, false);
                gzip.Write(file, 0, file.Length);
                gzip.Close();

    -Jeremy

  • Nels P. Olsen

    Does the problem still exist in the .NET Framework 3.0?

  • Kiryl Hakhovich

    I'm interested in compressing a folder to a file as well and would like to know if this is possible in .NET 3 and if the compression ratio issue has been fixed.

    However, if XP's compression algorithm is better and seeing as how there doesn't (yet) seem to be a simple folder-compression command in the .NET API, couldn't we just call a Shell command and have XP itself compress a source folder to a zip file? Mind you, I don't actually know what Shell command would do this for us, but I would think it's worth looking into at least.

    Ideas?


  • Gary W

    Apparently you can't.  What you see in the Deflate/GZipStream classes is all there is as far as the .NET Framework is concerned.
     
    However, this result is curious.  I too thought XP was using simple ZIP compression.  The two .NET compression streams don't expose compression quality parameters like stand-alone GZip/Zip applications, but even so the difference looks very big to me.  It's more like ZIP vs RAR than different ZIP settings.

  • mazen44

    > like the PKZIP, perform file-based compression, which is subtly different from stream compression.

    I don't buy it.  The file was written with one function call, so it should compress the same as PkZip since it's supposed to be same algorithm.  Even if it were broken in chunks of (say 256 bytes), the stream would only be expanded by 1% (about 3 bytes per chunk), and not a whopping 50%.
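
    As a rough check on that estimate: the Deflate format (RFC 1951) has a "stored" block type that carries up to 65,535 raw bytes behind about five bytes of header (three header bits padded to a byte, then a two-byte length and its two-byte complement). The sketch below only does that arithmetic, assuming an encoder that falls back to stored blocks for every chunk; the class and method names are made up for illustration.

    ```csharp
    using System;

    // Hypothetical helper: computes the size of input written purely as RFC 1951 stored blocks.
    public static class StoredBlockOverhead
    {
        // One byte for the 3 header bits plus padding, 2 bytes LEN, 2 bytes NLEN.
        private const int HeaderBytes = 5;
        private const int MaxBlockPayload = 65535;

        public static long WorstCaseSize(long inputLength, int chunkSize)
        {
            if (chunkSize < 1 || chunkSize > MaxBlockPayload)
                throw new ArgumentOutOfRangeException("chunkSize");

            // Round up; even empty input needs one (final) block.
            long blocks = Math.Max(1, (inputLength + chunkSize - 1) / chunkSize);
            return inputLength + blocks * HeaderBytes;
        }
    }
    ```

    For the 100,000-byte random file, WorstCaseSize(100000, 65535) gives 100,010 bytes, and even with 256-byte chunks WorstCaseSize(100000, 256) gives 101,955 bytes, roughly 2% growth. Either way that is nowhere near 50%.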

    Re: System.IO.Compression broken.

    -Jeremy


  • Najmunnisha

    Well, it looks like all the Microsofties who would know about this issue are on holiday so I filed a bug report:
     
     
    Happy new year everyone!

  • Bluesky_Jon

    Now you're quibbling over semantics. Yes, as a naive implementation of a zip algorithm DeflateStream is correct. But such a naive implementation that neglects to check for incompressible data is not what a user of this class expects when literally every other available implementation does perform this check -- including Microsoft's very own Windows XP folder compression!
     
    Nor can I see how using a third-party library would be "better" than having this functionality built into the standard library -- why then have a standard library in the first place?  You say that a stream-based compressor cannot use "too much" memory to perform look-ahead optimization -- but is it not correct that DeflateStream performs no look-ahead optimization whatsoever since it does not check for incompressible data at all? Or that only a few kilobytes of buffer space would be required to perform this check?
     
    If this functionality is too difficult to implement in the stream itself, then Microsoft should provide a wrapper stream that performs after-the-fact checking when the stream has been closed, and add a warning to the documentation that a temporary file will be created for the original data. That would be fine with me, too.
     
    The present state of System.IO.Compression reminds me of the original version of Math.Round with its insane "banker's rounding". That was technically "correct", too, but it wasn't what (almost) everyone expected -- so Microsoft had to fix it in version 2.0.

  • Pierre-Yves Troel

    That sounds like a reasonable explanation for the current state but it's no excuse why it couldn't be better. I don't see a reason why compression streams can't be buffered. The analysis buffer shouldn't have to be bigger than a few kilobytes to determine whether compression makes sense or straight storage should be used.

  • SqlShaun

    > it should compress the same as PkZip since it's supposed to be same algorithm

    PKZIP and other file-based compression utilities can and will store files using wildly different algorithms or compression parameters. For example, here's a header dump (with the CRC and Attribute columns removed to save space) of a test ZIP file I just created using WinZip:

     Length  Method   Size  Ratio    Date     Time   Name
     ------  ------   ----- -----    ----     ----   ----
     156000  DeflatN  81904  48%  12/17/2005  15:45  newcodes.txt
      82026  Stored   82026   0%  12/18/2005  14:47  newcodes.zip
     156000  DeflatF  83125  47%  12/17/2005  15:45  newcodes2.txt

    Newcodes.txt and newcodes2.txt are the exact same plaintext file, both compressed using the Deflate algorithm. Still, there is a noticeable difference in compression ratio between the default N(ormal) Deflate configuration and the F(ast) version I forced via the command line.

    Results for a simple stream-based compressor (such as the one included in the .NET framework) will typically be in the 'DeflatF' range. This is a fact of life for stream compressors: to keep memory usage predictable and acceptable, they can't buffer too much of the stream, making look-ahead optimizations less effective.

    You'll also see that the ZIP file I added was 'Stored' instead of compressed. In this case, WinZip noticed that the file expanded after running it through the compression algorithm, and decided to discard the compressed version and store the original file. Note that this is not a function of the Deflate (or any other) algorithm, but an explicit check the programmer of the ZIP utility put in place.

    You can do the same with the System.IO.Compression streams: wrap them up in a class of your own, and persist either the plaintext or the compressed stream based on the final result. The fact that the (very basic) .NET stream compressor doesn't implement this functionality itself isn't a defect: you would need to do the exact same thing when using, say, zlib.

    Of course you're free to petition Microsoft for more full-featured compression (even though just going the third-party route sounds a lot better to me...), but the behavior of the current System.IO.Compression classes has always been as expected for me (including being able to supply streams to other Deflate implementations...).

    To prove there is a bug in DeflateStream, you would need to demonstrate significant differences in the output, for the same input file, of DeflateStream and another RFC 1951 (Deflate) implementation, e.g. zlib. However, since such a bug would also cause major interoperability issues, and MS most likely used an RFC 1951 reference implementation for DeflateStream anyway, I doubt there are any issues here.

    '//mdb

  • Chris Rogeski

    The problem does not exist in NetFX 3.0 because it didn't exist in version 2.0 either. The DeflateStream object applies the Deflate algorithm to data on a stream. The decision whether or not to use the output of the Deflate algorithm has to be made outside the DeflateStream object. For example, if you write code to read and write ZIP files (you can find an example of this at http://blogs.msdn.com/dotnetinterop/archive/2006/04/05/567402.aspx), it is up to you to determine what compression algorithm you will use for each stream inside the archive, including no compression at all (which the example code from the link above does NOT do). I repeat: DeflateStream is doing its job, and anyone complaining that it gives worse results than a ZIP utility doesn't understand the difference between the ZIP format itself, and the compression of data streams inside a ZIP file.
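
    To illustrate the per-entry decision, here is a hedged sketch using the ZipArchive class. Note the assumption: ZipArchive and the CompressionLevel enum were only added to System.IO.Compression in .NET Framework 4.5, so this does not apply to 2.0 or 3.0. The PerEntryZip class and its methods are hypothetical names, and the trial compression in memory is just one possible way to decide.

    ```csharp
    using System;
    using System.IO;
    using System.IO.Compression;

    // Hypothetical helper: picks a compression level per ZIP entry (requires .NET Framework 4.5 or later).
    public static class PerEntryZip
    {
        public static void AddEntry(ZipArchive archive, string name, byte[] data)
        {
            // Try Deflate in memory first; only ask for compression if it actually shrinks the data.
            CompressionLevel level = DeflatedLength(data) < data.Length
                ? CompressionLevel.Optimal
                : CompressionLevel.NoCompression;

            ZipArchiveEntry entry = archive.CreateEntry(name, level);
            using (Stream target = entry.Open())
            {
                target.Write(data, 0, data.Length);
            }
        }

        private static long DeflatedLength(byte[] data)
        {
            using (MemoryStream sink = new MemoryStream())
            {
                // leaveOpen = true so the length can be read after the deflater is flushed and disposed.
                using (DeflateStream deflate = new DeflateStream(sink, CompressionLevel.Optimal, true))
                {
                    deflate.Write(data, 0, data.Length);
                }
                return sink.Length;
            }
        }
    }
    ```

    The archive would be opened by the caller, for example with new ZipArchive(stream, ZipArchiveMode.Create), and AddEntry called once per file.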

    As it happens, the compression algorithm in DeflateStream does appear to have a bug, but it is something entirely different. Please don't make MS staff waste their time simply because you do not understand what you're talking about.

