I'm processing large binary files. These are PCL files, and I'm looking for page boundaries. I want to store the position of each Form Feed, which in PCL is decimal 12, hex 0C. However, that byte can also exist as part of a raster or other binary structure.
So I loop through the file, and when I find a "12", I read ahead 14 bytes to compare them to a known string. If I get a match, I know the 12 was a real Form Feed, and I store its position in an ArrayList.
This works fine using a FileStream object and its ReadByte() method and .Position property. The problem is it is very slow. I'd like to use a StreamReader to take advantage of buffering. However, when I use a StreamReader, the FileStream's Position property points to the amount that's been buffered, not the actual file position.
So my question is, how can I have the speed of StreamReader, but still maintain an accurate file position?
Sample code, the StreamReader Version. Hopefully, someone can suggest a change that would report the "virtual" file position of the "current byte", rather than the current file position reached through buffering.
using System;
using System.IO;
using System.Text;
using System.Collections;
namespace pcl_proc
string bgn_of_page = " &l8c1E *p0x0Y";
long curr_pos;
int pcl_char;
char[] test;
string filename = @"C:\Statements-05-03-05.pcl";
// need to initialize header and position of first page.
test = new char[1024];
input.Read(test, 0, test.Length);
header = asciiString.Substring(0, asciiString.IndexOf("*b0M") + 4);
while (input.Peek() >= 0)
// this next line doesn't record the accurate position
input.Read(test, 0, test.Length);
asciiString = new string(test);
if (asciiString == bgn_of_page)
infile.Close();
Note: the "bgn_of_page" string is actually 14 bytes, the forum stripped out the two "escape" characters. I mention this in case anyone wonders why I'm reading 14 bytes and comparing it to a 12 byte string.
Note: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemiostreamreaderclassbasestreamtopic.asp contains this enigmatic statement:
"StreamReader might buffer input such that the position of the underlying stream will not match the StreamReader position." Yes, that's right. But they offer no method or example to deal with that situation. Also, they refer to "the StreamReader position". Well, what is the StreamReader position? How do I find it? What method or property returns it?

StreamReader and File Position
angelasw
I have had a similar problem in a couple of instances as well. I used a two pass solution. In Pass one I mark all the interesting bytes. In pass two, I start from known positions and read however much I need to.
Here are two classes that encapsulate that behavior. IndexedFile scans an input file for markers you specify (as regular expressions or just a string) and creates a list of locations in the file where they are located. You can iterate over the IndexedFile positions, or you can access them directly by index. (Say you want to go to the third paragraph).
In my example Main, I am calling DiscardBufferedData(). This is the only way to get the StreamReader to sync back up to the underlying file stream. In this contrived example, you pay a lot for discarding the buffer before each read. However, in practice you won't be doing this very often. You will seek to an index in the file, then get lines from it for a while.
If you never wanted to call DiscardBufferedData(), you could instead read from the current location to the next location minus one. If your data has markers often enough, you won't consume too much memory and you will have just the data you wanted to work with.
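The "read from one location up to the next" idea can be sketched roughly as follows. This is only an illustration: SectionReader and ReadSection are hypothetical names that are not part of the classes in this post, and the sketch assumes a seekable stream and sections small enough to hold in memory.

```csharp
using System;
using System.IO;

public static class SectionReader
{
    // Hypothetical helper: returns the bytes from `start` (inclusive) to
    // `end` (exclusive), where both are absolute offsets such as two
    // consecutive bookmarks. Assumes the stream supports seeking.
    public static byte[] ReadSection(Stream input, long start, long end)
    {
        input.Seek(start, SeekOrigin.Begin);
        byte[] section = new byte[end - start];
        int total = 0;
        while (total < section.Length)
        {
            // Read may return fewer bytes than requested, so keep going.
            int n = input.Read(section, total, section.Length - total);
            if (n == 0) break; // end of stream reached early
            total += n;
        }
        if (total < section.Length) Array.Resize(ref section, total);
        return section;
    }
}
```

Because the read goes straight to the stream, there is no reader buffer to discard; the section can be decoded afterwards with whatever Encoding the file uses.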
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class Program // example code for using the classes below.
{
    static void Main(string[] args)
    {
        string sampleDataFileName = @"somefile";
        Regex SectionHeadingRe = new Regex(@"Chapter:(.*)");
        IndexedFile file = new IndexedFile(sampleDataFileName, SectionHeadingRe);
        using (FileStream fs = new FileStream(sampleDataFileName, FileMode.Open, FileAccess.Read))
        using (StreamReader sr = new StreamReader(fs))
        {
            foreach (long position in file)
            {
                Console.WriteLine(position);
                fs.Seek(position, SeekOrigin.Begin);
                sr.DiscardBufferedData(); // resync the reader with the stream
                Console.WriteLine(sr.ReadLine());
            }
        }
    }
}

public class IndexedFile : System.Collections.IEnumerable
{
    #region private properties
    private FileStream fs;
    private StreamReader sr;
    private List<long> bookmarks = new List<long>();
    #endregion

    #region constructors
    private IndexedFile()
    {
        // no default constructor. This means filename and pattern are required.
    }

    /// <summary>
    /// Opens a file and parses it for the pattern string. Constructs a list of locations where the string is located.
    /// </summary>
    /// <param name="filename"></param>
    /// <param name="pattern">string to mark indexes for.</param>
    public IndexedFile(string filename, string pattern)
    {
        init(filename);
        scanFile(pattern);
    }

    /// <summary>
    /// Opens a file and parses it using the supplied regular expression. Constructs a list of locations where the string is located.
    /// </summary>
    /// <param name="filename"></param>
    /// <param name="patternRegex"></param>
    public IndexedFile(string filename, Regex patternRegex)
    {
        init(filename);
        scanFile(patternRegex);
    }
    #endregion

    #region public accessors
    public int Count { get { return bookmarks.Count; } }
    public long this[int index] { get { return bookmarks[index]; } }
    #endregion

    #region public methods
    public void Close()
    {
        // scanFile calls this when the scan completes, so no finalizer is needed.
        sr.Close();
        fs.Close();
    }
    #endregion

    #region private methods
    private void init(string filename)
    {
        fs = new FileStream(filename, FileMode.Open, FileAccess.Read);
        sr = new StreamReader(fs);
    }

    private void scanFile(string pattern)
    {
        string p = Regex.Escape(pattern);
        Regex patternAsRe = new Regex(p);
        scanFile(patternAsRe);
    }

    private void scanFile(Regex pattern)
    {
        // The arithmetic below assumes one byte per character, no byte order
        // mark, and CR/LF line endings; otherwise the recorded positions
        // will not match byte offsets in the file.
        long seekPos = 0;
        string line = string.Empty;
        while (sr.Peek() != -1)
        {
            line = sr.ReadLine();
            MatchCollection matches = pattern.Matches(line);
            foreach (Match m in matches)
            {
                if (m.Success)
                {
                    bookmarks.Add(m.Index + seekPos);
                }
            }
            seekPos = seekPos + line.Length + 2; // add two for the CR/LF ReadLine strips.
        }
        Close();
    }
    #endregion

    #region IEnumerable Members
    public System.Collections.IEnumerator GetEnumerator()
    {
        return new IndexedFileEnumerator(this);
    }
    #endregion
}

public class IndexedFileEnumerator : System.Collections.IEnumerator
{
    private IndexedFile iFile = null;
    private int index = -1;

    public IndexedFileEnumerator(IndexedFile indexedFile)
    {
        this.iFile = indexedFile;
    }

    #region IEnumerator Members
    public object Current
    {
        get
        {
            try
            {
                return iFile[index];
            }
            catch (ArgumentOutOfRangeException) // thrown by the List<long> indexer
            {
                throw new InvalidOperationException();
            }
        }
    }

    public bool MoveNext()
    {
        index++;
        return (index < iFile.Count);
    }

    public void Reset()
    {
        index = -1;
    }
    #endregion
}
Scott Berry
Note that you should check the result of input.Read(...) to make sure you are getting all of the data you are expecting, as there isn't a requirement for it to return all of the available data.
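A minimal sketch of that check for a Stream, looping until the requested count has arrived or the stream ends. ReadHelper and ReadFully are made-up names for illustration, not framework APIs.

```csharp
using System;
using System.IO;

public static class ReadHelper
{
    // Hypothetical helper: keeps calling Read until `count` bytes have been
    // stored in `buffer` or the stream ends. Returns the number of bytes
    // actually read, which is less than `count` only at end of stream.
    public static int ReadFully(Stream input, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int n = input.Read(buffer, offset + total, count - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }
}
```

For a StreamReader reading into a char[], as in the original sample, TextReader.ReadBlock already behaves this way, so it may be enough to call ReadBlock instead of Read and compare its return value with the requested length.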
dydoria
Electro808
Thanks. I need help now with implementing the code. How, in an overridden method, do you access private base members? If I create a new class with the following code, I get errors that "stream", "charPos", and "ReadBuffer" are inaccessible due to their protection level. This is because, in the StreamReader class, they are private members.
using System;
using System.Text;
using System.Runtime.InteropServices;
using System.IO;
namespace streamOR
{
public class StreamReader2 : System.IO.StreamReader
{
private int _lineLength;
public int LineLength
{
get{return _lineLength;}
}
public StreamReader2(String path) : base(path)
{
}
public override String ReadLine()
{
_lineLength = 0; /* added dac */
if (stream == null)
throw new NullReferenceException("Reader is closed");
if (charPos == charLen)
{
if (ReadBuffer() == 0) return null;
}
StringBuilder sb = null;
do
{
int i = charPos;
do
{
char ch = charBuffer[i];
// Note the following common line feed chars:
// \n - UNIX, \r\n - DOS, \r - Mac
int EolChars = 0; /* added dac */
if (ch == '\r' || ch == '\n')
{
EolChars = 1; /* added dac */
String s;
if (sb != null)
{
sb.Append(charBuffer, charPos, i - charPos);
s = sb.ToString();
}
else
{
s = new String(charBuffer, charPos, i - charPos);
}
charPos = i + 1;
if (ch == '\r' && (charPos < charLen || ReadBuffer() > 0))
{
if (charBuffer[charPos] == '\n')
{
charPos++;
EolChars = 2; /* added dac */
}
}
_lineLength = s.Length + EolChars; /* added dac */
return s;
}
i++;
} while (i < charLen);
i = charLen - charPos;
if (sb == null) sb = new StringBuilder(i + 80);
sb.Append(charBuffer, charPos, i);
} while (ReadBuffer() > 0);
string ss = sb.ToString();
_lineLength = ss.Length; /* added dac */
return ss;
}
}
}
Dominic Baines
Kendal
In the latest incarnation of this problem, I'm processing a large TEXT file. I want to use a StreamReader. However, as I process the file, I need to "note" certain records. Imagine the file to be a large "document" consisting of many "pages". I may want to extract a "chapter". I know how to recognize when a "chapter" begins, and when it ends.
Once I encounter the record that ends a "chapter", I want to go back to the start of the chapter, and capture all intervening records to a second file.
What I really need is a "Position" property, to know that a certain record BEGINS at a specific byte-position in the underlying file. However, the base stream's Position property refers to how many BUFFERS have been read, not the actual Position of the CURRENT RECORD.
Is there an elegant way to process a TEXT file, using StreamReader.ReadLine(), and yet still have an accurate "Position" property?
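One way to get such a property without touching StreamReader's private members is to read lines from the byte stream yourself and count the bytes consumed. The sketch below is an assumption-laden illustration, not a drop-in replacement for StreamReader: PositionedLineReader is a hypothetical class, it assumes an encoding in which CR and LF are always single bytes (ASCII, Latin-1, UTF-8), and it does not skip a byte order mark.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public sealed class PositionedLineReader
{
    private readonly Stream _stream;
    private readonly Encoding _encoding;
    private long _position;    // bytes consumed so far
    private int _pending = -1; // one byte of lookahead pushed back after a lone CR

    public PositionedLineReader(Stream stream, Encoding encoding)
    {
        _stream = stream;
        _encoding = encoding;
    }

    // Byte offset at which the next ReadLine call will start.
    public long Position { get { return _position; } }

    private int NextByte()
    {
        if (_pending != -1)
        {
            int p = _pending;
            _pending = -1;
            return p;
        }
        return _stream.ReadByte();
    }

    // Returns the next line without its terminator, or null at end of stream.
    public string ReadLine()
    {
        List<byte> bytes = new List<byte>();
        bool sawAny = false;
        int b;
        while ((b = NextByte()) != -1)
        {
            sawAny = true;
            _position++;
            if (b == '\n') break;
            if (b == '\r')
            {
                int next = NextByte();
                if (next == '\n') _position++; // CR/LF pair
                else _pending = next;          // lone CR: keep the byte for later
                break;
            }
            bytes.Add((byte)b);
        }
        if (!sawAny) return null;
        return _encoding.GetString(bytes.ToArray());
    }
}
```

Reading Position before each ReadLine call gives the offset where that record begins, so the start of a "chapter" can be saved and later passed to Seek on the underlying FileStream. Since this reads one byte at a time, the stream handed in should itself be buffered, for example a BufferedStream or a FileStream opened with a large buffer size.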
Dror_h
Here is the snippet of changes that I made to the ReadLine() method.
For my original use, I provided a read-only property that gave the byte length of the line that was just read by ReadLine(). You can easily modify it to keep track of the file position.
Lines that I changed from the original code are indicated by /*--mod--*/
private int _lineLength; /*--mod--*/
public int LineLength { /*--mod--*/
get{return _lineLength;} /*--mod--*/
}
public override String ReadLine() {
_lineLength = 0; /*--mod--*/
if (stream == null)
__Error.ReaderClosed();
if (charPos == charLen) {
if (ReadBuffer() == 0) return null;
}
StringBuilder sb = null;
do {
int i = charPos;
do {
char ch = charBuffer[ i ];
int EolChars = 0; /*--mod--*/
if (ch == '\r' || ch == '\n') {
EolChars = 1; /*--mod--*/
String s;
if (sb != null) {
sb.Append(charBuffer, charPos, i - charPos);
s = sb.ToString();
}
else {
s = new String(charBuffer, charPos, i - charPos);
}
charPos = i + 1;
if (ch=='\r' && (charPos<charLen || ReadBuffer()>0)) {
if (charBuffer[charPos] == '\n') {
charPos++;
EolChars = 2; /*--mod--*/
}
}
_lineLength = s.Length + EolChars; /*--mod--*/
return s;
}
i++;
} while (i < charLen);
i = charLen - charPos;
if (sb == null) sb = new StringBuilder(i + 80);
sb.Append(charBuffer, charPos, i);
} while (ReadBuffer() > 0);
string ss = sb.ToString();
_lineLength = ss.Length; /*--mod--*/
return ss;
}
Shreveport
I'm reading in large chunks to a byte array (local buffer). I'm looping through the bytes, looking for my target byte.
If I want to know the file position of the "current byte", I must multiply the size of the buffer/byte array by the number of full buffers I've already read, then add the index of the current byte within the current buffer... plus I have to handle situations where the target is too close to the start and/or end of the buffer.
I shouldn't have to do all of this, is my point.
Bruce Sandeman
iHEARTmicrosoft_BUT
I did manage to get high performance by effectively writing my own "buffering" system, by using the FileStream.Read() method to read in 8k of data, and then loop through the resulting byte array.
However, I still would like an answer to the basic question:
When using StreamReader, since it "may" buffer, how do you get the "calculated" file position of the underlying stream? If I've read three 1k "buffers" automatically while using StreamReader.Read(), but the character I want is the 5th character in the 3rd buffer, I have a file position of 2048 + 5.
I shouldn't have to make that calculation, or keep track of how many buffers have been read, etc. There should be a "Position" property and/or method for the StreamReader object that will return the position, in the underlying stream, of the current "read" byte, regardless of buffering. I want buffering to be transparent.
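For the binary case, the bookkeeping can at least be confined to one place. The following is a rough sketch of the "read 8k and loop through the array" approach, written so that a marker byte and the bytes that must follow it are still found when they straddle two reads. MarkerScanner, FindMarkers and their parameters are hypothetical names chosen for illustration; the signature bytes stand in for the 14-byte PCL sequence, which is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class MarkerScanner
{
    // Hypothetical helper: returns the absolute offset of every `marker`
    // byte that is immediately followed by `signature`.
    public static List<long> FindMarkers(Stream input, byte marker, byte[] signature, int chunkSize = 8192)
    {
        List<long> positions = new List<long>();
        int need = 1 + signature.Length; // marker plus the bytes that must follow
        byte[] buffer = new byte[Math.Max(chunkSize, need * 2)];
        long bufferStart = 0;            // absolute file offset of buffer[0]
        int filled = 0;                  // number of valid bytes in buffer
        bool eof = false;

        while (true)
        {
            // Top up the buffer; Read may return fewer bytes than requested.
            while (!eof && filled < buffer.Length)
            {
                int n = input.Read(buffer, filled, buffer.Length - filled);
                if (n == 0) eof = true; else filled += n;
            }

            // Before end of file, stop early enough that a full signature
            // is always available after the byte being tested.
            int limit = eof ? filled : filled - need + 1;
            for (int i = 0; i < limit; i++)
            {
                if (buffer[i] == marker && Matches(buffer, i + 1, filled, signature))
                    positions.Add(bufferStart + i); // absolute position
            }
            if (eof) break;

            // Carry the untested tail to the front and keep the offset in step.
            int keep = filled - limit;
            Array.Copy(buffer, limit, buffer, 0, keep);
            bufferStart += limit;
            filled = keep;
        }
        return positions;
    }

    private static bool Matches(byte[] buffer, int start, int filled, byte[] signature)
    {
        if (start + signature.Length > filled) return false;
        for (int j = 0; j < signature.Length; j++)
        {
            if (buffer[start + j] != signature[j]) return false;
        }
        return true;
    }
}
```

The absolute position is always bufferStart plus the index within the buffer, so the caller never multiplies buffer counts by hand, and the carried-over tail handles markers that fall near the end of a read.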
JarodB172890
Bryan00000
I found source code for the StreamReader class at
http://www.123aspx.com/rotor/rotorsrc.aspx?rot=42055