Remove everything except the last 500,000 bytes from a file using the STL

Our logging class, when initialized, truncates the log file to its last 500,000 bytes; log statements are then appended to it.

We do this to keep disk usage low, since this is a commercial end product.

Obviously, keeping the first 500,000 bytes is not useful, so we keep the last 500,000 bytes.

Our solution has serious performance issues. What is an efficient way to do this?

+1




8 answers


"I would probably create a new file, look in the old file, do buffered read / write from the old file to the new file, rename the new file over the old one."

I think you would be better off:

#include <fstream>
#include <vector>

std::ifstream ifs("logfile", std::ios::binary);  // One call to start it all. . .
ifs.seekg(-512000, std::ios_base::end);  // One call to find it. . . (fails if the file is smaller than 512K)
std::vector<char> tmpBuffer(512000);  // heap buffer; 512K on the stack risks overflow
ifs.read(&tmpBuffer[0], 512000);  // One call to read it all. . .
std::streamsize count = ifs.gcount();  // bytes actually read
ifs.close();
std::ofstream ofs("logfile", std::ios::binary | std::ios::trunc);
ofs.write(&tmpBuffer[0], count);  // And to the FS bind it.

This avoids renaming the file: it simply copies the last 512K into a buffer, opens the log file in truncate mode (clearing its contents), and writes that same 512K back at the beginning of the file.

Please note that the above code has not been tested, but I think the idea should be sound.

You can load the 512K into an in-memory buffer, close the input stream, and then open the output stream; this way you don't need two files, since you read, close, open, write the 512K, and are done. You avoid the file renaming/moving maneuver.



If you have no aversion to mixing C with C++ to some extent, you can also:

(Note: pseudocode; I can't remember the exact mmap call off the top of my head)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string>

int myfd = open("mylog", O_RDONLY);  // Grab a file descriptor
// (filesize obtained elsewhere, e.g. via fstat; note that mmap offsets must be page-aligned)
char *myptr = (char *) mmap(0, 512000, PROT_READ, MAP_PRIVATE, myfd, filesize - 512000);  // mmap the last 512K
std::string mystr(myptr, 512000);  // pull 512K from our mmap'd buffer directly into the std::string
munmap(myptr, 512000);  // Unmap the file
close(myfd);  // Close the file descriptor

Depending on a lot of things, mmap can be faster than seeking. Googling 'fseek vs mmap' gives some interesting insight into this, if you're interested.

HTH

+6




I would probably:

  • create a new file.
  • seek in the old file.
  • do a buffered read/write from the old file to the new file.
  • rename the new file over the old one.

To cover the first three steps (error checking omitted; for example, I can't remember what seekg does if the file is less than 500KB):

#include <fstream>

std::ifstream ifs("logfile", std::ios::binary);
ifs.seekg(-500*1000, std::ios_base::end);  // seek to 500KB before the end
std::ofstream ofs("logfile.new", std::ios::binary);
ofs << ifs.rdbuf();  // buffered copy of everything from there on

Then you have to do something platform-specific to rename the file over the old one: std::rename from <cstdio> exists, but its behavior when the target file already exists differs between platforms.
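For illustration, a minimal untested sketch of that rename step, using the file names above:

#include <cstdio>

// POSIX rename() atomically replaces an existing target; the Windows CRT
// version fails instead, so there the old file must be removed first
// (leaving a brief window in which no log file exists at all).
#ifdef _WIN32
std::remove("logfile");
#endif
std::rename("logfile.new", "logfile");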

Obviously, you need 500KB of free space for this to work, so if the reason you truncate the log file is that it has just filled up the disk, this is no good.

I'm not sure why the seek is slow, so I might be missing something; I would not expect seek time to depend on the file size. What may depend on the file is whether these functions handle 2GB+ files on 32-bit systems.



If the copy itself is slow, then depending on the platform you may be able to speed it up by using a larger buffer, since this reduces the number of system calls and, more importantly, the number of times the disk head has to seek between the read point and the write point. To do this:

#include <vector>

const int bufsize = 64*1024; // or whatever
std::vector<char> buf(bufsize);
...
// note: pubsetbuf generally only takes effect if called before the first read
ifs.rdbuf()->pubsetbuf(&buf[0], bufsize);

Test it with different values and see. You can also try enlarging the buffer on the output stream; I'm not sure whether that will make a difference.

Note that using my approach on a "live" log file is hairy. For example, if a log entry is appended between the copy and the rename, you lose it forever, and any open handles to the file you are replacing may cause problems (on Windows the rename will fail, while on Linux it will replace the file, but the old one will still take up space and still receive writes until the handle is closed).

If the truncation is done from the same thread that does all the logging, then there is no problem and you can keep it simple. Otherwise you will need locking, or a different approach.

Whether this is completely reliable depends on the platform and filesystem: move-and-replace may or may not be an atomic operation, and usually it isn't, so you might have to rename the old file aside, then rename the new file into place, then delete the old one, and add recovery code that on startup detects whether a renamed old file was left behind and, if so, puts it back and restarts the truncation. The STL can't help you sort out the platform differences, but there is boost::filesystem.
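A hedged sketch of that rename-aside dance (file names illustrative):

#include <cstdio>

std::rename("logfile", "logfile.old");   // 1. move the old file aside
std::rename("logfile.new", "logfile");   // 2. move the new file into place
std::remove("logfile.old");              // 3. discard the old file

// On startup, a leftover "logfile.old" means a crash mid-replace: if
// "logfile" is missing, put the old file back and redo the truncation.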

Sorry there are so many caveats here, but a lot is platform dependent. If you are on a PC, then I am puzzled that copying half a meg of data takes any noticeable time at all.

+3




If you are on Windows, don't bother copying parts of the file around. Just tell Windows you don't need the first bytes anymore by calling FSCTL_SET_SPARSE and FSCTL_SET_ZERO_DATA.
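An untested sketch of that idea; it assumes hFile was opened with GENERIC_WRITE, and note that the file offsets do not change, only the disk allocation for the zeroed range is released:

#include <windows.h>
#include <winioctl.h>

void DropHead(HANDLE hFile, LONGLONG fileSize)
{
    DWORD bytes = 0;
    // Mark the file sparse so zeroed ranges stop occupying disk space
    DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);

    // Declare everything before the last 500,000 bytes to be zero
    FILE_ZERO_DATA_INFORMATION zdi;
    zdi.FileOffset.QuadPart = 0;
    zdi.BeyondFinalZero.QuadPart = fileSize - 500000;
    DeviceIoControl(hFile, FSCTL_SET_ZERO_DATA, &zdi, sizeof(zdi),
                    NULL, 0, &bytes, NULL);
}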

+3




If the log file can grow to multiple GB between reinitializations, it seems that truncating it only at initialization won't help.

I would try to come up with a special text file format in which the content is always replaced in place, with a pointer to the "current" line. You need a constant line width, so that the disk space is allocated only once, and the pointer can wrap from the last line of the file back to the first.

This way the file never grows or shrinks, and you always have the last N records. (A sketch of the write path follows the illustration below.)

Illustration with N = 6 ("|" marks the fixed line width, up to which each line is padded with spaces):

#myapp logfile, lines = 6, width = 80, pointer = 4 |
[2008-12-01 15:23] foo bakes a cake |
[2008-12-01 16:15] foo has completed baking a cake |
[2008-12-01 16:16] foo eats the cake |
[2008-12-01 16:17] foo tells bar: I have made you a cake, but I have eaten it |
[2008-12-01 13:53] bar would like some cake |
[2008-12-01 14:42] bar tells foo: sudo bake me a cake |
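A rough, untested sketch of the write path for this format, assuming the file is opened with std::ios::in | std::ios::out, a header line of the same width occupies line 0, and the pointer is rewritten into the header after each append:

#include <fstream>
#include <string>

const int WIDTH = 80;
const int LINES = 6;

void append(std::fstream &f, int &pointer, const std::string &msg)
{
    std::string rec = msg.substr(0, WIDTH - 1);
    rec.resize(WIDTH - 1, ' ');      // pad the record to the fixed width
    rec += '\n';
    f.seekp((1 + pointer) * WIDTH);  // skip the header, jump to the slot
    f.write(rec.data(), WIDTH);      // overwrite the oldest record in place
    pointer = (pointer + 1) % LINES; // advance the ring pointer
    // ...then rewrite the header line with the new pointer value
}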
+1




An alternative solution would be for the logging class to detect when the log file exceeds 500k, then close it and open a new log file.

The logging class would then look at the old files and delete the oldest.

The logger would have two configuration parameters:

  • 500k, the threshold at which to start a new log
  • the number of old logs to keep.

This way the management of the log files would be self-sustaining.
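As an illustration, an untested sketch of such a rotation step; the file names and the rotate function are invented for the example:

#include <cstdio>

// "app.log" is the current log, "app.log.1" .. "app.log.N" are older ones.
void rotate(int keep /* number of old logs to store */)
{
    char from[64], to[64];
    std::sprintf(to, "app.log.%d", keep);
    std::remove(to);                        // drop the oldest log
    for (int i = keep - 1; i >= 1; --i) {
        std::sprintf(from, "app.log.%d", i);
        std::sprintf(to, "app.log.%d", i + 1);
        std::rename(from, to);              // shift each old log up by one
    }
    std::rename("app.log", "app.log.1");    // retire the current log
    // the logger then reopens "app.log" and starts fresh
}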

+1




I don't think this is about the computer at all, but about how you wrote your logging class. It seems strange to me that you read back the last 500k on every initialization; why would you do that?

Just append to the log file:

  #include <fstream>

  std::fstream myfile;
  myfile.open("test.txt", std::ios::app);  // append mode: all writes go to the end

0




Widefinder 2 has a lot of discussion about efficient IO access (or, more precisely, the links in the Notes column have a lot of information about efficient IO).

Answering your question:

  • (Your headline) Remove the first 500,000 bytes from the file [using the standard library]

The standard library is somewhat limited when it comes to filesystem operations. Unless you are restricted to the standard library, it is very easy to cut a file off at the end (that is, to say "everything after this point is no longer part of this file"), but very hard to cut it off at the beginning ("everything up to this point is no longer part of this file").

It would be simple to seek to 500,000 bytes before the end of the file and then run a buffered copy into a new file. But once you have done that, the standard library has no ready-made way to rename the new file over the old one (std::rename's behavior when the target exists is platform-dependent). Native OS functions can rename files efficiently, as can Boost.Filesystem or STLSoft.

  1. (Your actual question) Our logging class, when initialized, seeks to 500,000 bytes before the end of the file, copies the rest to a std::string, and then writes that back to the file.

In this case you are cutting off the end of the file, and that is very easy to do outside the standard library. Just use filesystem operations to set the file size to 500,000 bytes (e.g. ftruncate, SetEndOfFile). Anything after that point is discarded.
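A minimal POSIX sketch of that truncation (the Windows equivalent is SetFilePointerEx followed by SetEndOfFile):

#include <fcntl.h>
#include <unistd.h>

int fd = open("logfile", O_WRONLY);
if (fd != -1) {
    ftruncate(fd, 500000);  // everything past byte 500,000 is discarded
    close(fd);
}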

0




So you want the end of the file? Are you copying it into some buffer, and to do what with it? And what do you mean by "write back" to the file: does it overwrite the file, truncating it on init to 500k bytes of the original plus whatever it then appends?

Suggestions:

  • Rethink what you are doing. If it works and does what is needed, what is wrong with it? Why change it? Is there a performance issue? Are you starting to wonder where all your log entries went? With this type of question it helps most to state the actual problem rather than just the existing behavior; no one can fully comment on this without knowing the full problem, because it is subjective.

  • If I were tasked with reworking your logging mechanism, I would build a mechanism to truncate log files by duration or by size.

0








