C++: insert a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record per line) and insert each line/record into a destination text file, specifying the line number where it should be inserted.

Where to insert the line/record in the target file will be determined by comparing the incoming line with the already ordered lines in the target file. (The target file starts out empty, and the data is sorted and inserted into it one line at a time as the program iterates over the lines of the input file.)

An example of an incoming file:

1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data

      

An example of the desired destination file:

2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data

      

I could do this with an in-memory sort using a linked list or similar, but I want this to scale to very large files. (And I'm happy to be wrestling with this problem, since I'm new to C++ :).)

One way to do this could be to open two file streams with fstream (one in and one out, or a single in/out stream), but then I run into trouble seeking to the right place, because the file position seems to depend on the absolute byte offset from the beginning of the file, not on line numbers :).

I'm sure problems like this have been solved before, and I would appreciate advice on how to proceed and on what counts as good practice here.

I am using Visual Studio 2008 Pro C++ and I am just learning C++.

+1




8 answers


The main problem is that, under a typical OS, files are just streams of bytes. There is no concept of lines at the filesystem level; those semantics have to be added as a layer on top of what the OS provides. Although I've never used it, I believe VMS has a record-oriented filesystem that would make what you want to do easier. But on Linux or Windows, you cannot insert data into the middle of a file without rewriting everything that follows it. It is similar to memory: at the lowest level it is just a sequence of bytes, and if you want something more structured, like a linked list, you have to build it on top.
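
As a concrete illustration of the byte-stream point, here is a minimal sketch (the file name and contents are made up): seeking with an fstream and then writing simply overwrites bytes in place, it does not push the rest of the file down.

// Minimal sketch: seeking into a file and writing does NOT insert new data,
// it overwrites whatever bytes were already there.
#include <fstream>
#include <iostream>

int main()
{
    {
        std::ofstream out("demo.txt");
        out << "1 10/01/2008 line1data\n"
               "2 11/01/2008 line2data\n";
    }

    std::fstream f("demo.txt", std::ios::in | std::ios::out);
    f.seekp(0);                      // positions are byte offsets, not line numbers
    f << "X XX/XX/XXXX overwrite";   // overwrites the first 22 bytes in place
    f.close();

    std::ifstream check("demo.txt");
    std::cout << check.rdbuf();      // the first record is clobbered; nothing moved down
    return 0;
}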



+1




If the file is just a text file, I'm afraid the only way to find a specific numbered line is to scan through the file, counting lines as you go.

The usual non-in-memory way of doing what you are trying to do is to copy the original file to a temporary file, inserting the new data at the desired point, and then rename/replace the original file.

Obviously, once you have done your insert, you can copy the rest of the file across in one big chunk, because you no longer need to count lines.
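
A minimal sketch of that copy-insert-rename approach, assuming a hypothetical insertLineAt() helper and made-up file names (for simplicity it keeps reading line by line all the way through instead of switching to a single block copy after the insert):

// Copy 'path' to a temp file, writing 'newLine' just before line 'lineNo'
// (1-based), then replace the original file. Returns false on I/O failure.
#include <cstdio>      // std::remove, std::rename
#include <fstream>
#include <string>

bool insertLineAt(const std::string& path, int lineNo, const std::string& newLine)
{
    std::ifstream in(path.c_str());
    std::ofstream out((path + ".tmp").c_str());
    if (!in || !out)
        return false;

    std::string line;
    int current = 1;
    while (std::getline(in, line))
    {
        if (current == lineNo)
            out << newLine << '\n';
        out << line << '\n';
        ++current;
    }
    if (current <= lineNo)             // insertion point at or past the end of the file
        out << newLine << '\n';

    in.close();
    out.close();
    std::remove(path.c_str());                          // replace the original
    return std::rename((path + ".tmp").c_str(), path.c_str()) == 0;
}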

+1




The [explicitly not-C++] solution would be to use the *nix tool sort, sorting on the second column (the date). It might look something like this:

cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>

      

It's not quite in-place, and it doesn't fulfil the request to use C++, but it works :)

It is tempting to shorten this to:

cat <file> | sort -k 2,2 > <file>

but beware: the shell truncates <file> before cat gets a chance to read it, so that would destroy the data. sort's -o option is the safe way to write the result back to the input file:

sort -k 2,2 -o <file> <file>

I have not tried this route myself.
* http://www.ss64.com/bash/sort.html - sort man page

+1




One way to do this is not to sort the file at all, but to keep a separate index, for example using Berkeley DB. Each index entry holds the sort key and an offset into the main file. The advantage is that you can have multiple sort orders without duplicating the text file. You can also change a line without rewriting the whole file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for text files of multiple GB to which we had to make small changes.

Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files in source/IO.
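
Berkeley DB itself is not shown here, but a minimal sketch of the idea, with a std::multimap standing in for the external index and a made-up file name and key field, could look like this:

// Build an index of (sort key -> byte offset) over the text file, then read
// the lines back in key order. The text file itself is never rewritten.
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::ifstream in("records.txt");
    std::multimap<std::string, std::streampos> index;

    std::string line;
    std::streampos pos = in.tellg();
    while (std::getline(in, line))
    {
        // assume field 2 (the date) is the sort key, compared as a plain string here
        std::string::size_type a = line.find(' ') + 1;
        std::string::size_type b = line.find(' ', a);
        index.insert(std::make_pair(line.substr(a, b - a), pos));
        pos = in.tellg();
    }

    in.clear();   // clear the EOF flag so we can seek again
    for (std::multimap<std::string, std::streampos>::const_iterator it = index.begin();
         it != index.end(); ++it)
    {
        in.seekg(it->second);
        std::getline(in, line);
        std::cout << line << '\n';   // lines come out in key order
    }
    return 0;
}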

+1




I think the question is more about implementation than about a specific algorithm, in particular handling very large datasets.

Suppose the source file has 2^32 lines of data. What would be an efficient way to sort it?

This is how I would do it:

  • Parse the source file and extract the following information for each line: the sort key, the line's offset in the file, and the line's length. Write this information to another file. This produces a dataset of fixed-size items that is easy to index; call it the index file. (A sketch of this step follows the list.)

  • Use a modified merge sort. Recursively split the index file until the number of items to be sorted falls below some minimum size - a true merge sort recurses down to 1 or 0 items, but I suggest stopping at around 1024 entries or so; this will need some tuning. Load each such block from the index file into memory, quicksort it, then write it back to disk.

  • Merge the sorted runs of the index file. This is the tricky part, but it can be done like this: load a block of data (say, 1024 records) from each source run, merge them into a temporary output file, and write it out. When a block runs empty, refill it from its run. When no data is left in the source runs, read the temp file from the beginning and copy it back over the two runs that were just merged - they must be contiguous. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). If you think this step through, it may be possible to arrange a naming convention for the merged index files so that the data never has to overwrite unrelated data (if you see what I mean).

  • Read the sorted index file, and for each entry pull the corresponding line out of the source file and write it to the results file.
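
A rough sketch of the first step, assuming made-up file names and the date field as the sort key:

// Scan the source file once and write one fixed-size index record
// (sort key, offset, length) per line to a separate index file.
#include <cstring>
#include <fstream>
#include <string>

struct IndexRecord                // fixed size, so the index file is easy to
{                                 // split into blocks, sort, and merge later
    char           key[16];       // the date field, '\0'-padded (an assumption)
    std::streamoff offset;        // where the line starts in the source file
    unsigned long  length;        // length of the line in bytes
};

int main()
{
    std::ifstream src("source.txt");
    std::ofstream idx("source.idx", std::ios::binary);

    std::string line;
    std::streampos pos = src.tellg();
    while (std::getline(src, line))
    {
        IndexRecord rec;
        std::memset(&rec, 0, sizeof rec);

        // assume field 2 (the date) is the sort key
        std::string::size_type a = line.find(' ') + 1;
        std::string::size_type b = line.find(' ', a);
        std::strncpy(rec.key, line.substr(a, b - a).c_str(), sizeof rec.key - 1);

        rec.offset = pos;
        rec.length = static_cast<unsigned long>(line.size());
        idx.write(reinterpret_cast<const char*>(&rec), sizeof rec);

        pos = src.tellg();
    }
    return 0;
}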

It certainly won't be fast, with all that reading and writing of files, but it should be reasonably efficient - the real killer is the random seeking into the original file in the final step. Up to that point, disk access is mostly linear and should therefore be reasonably efficient.

0




Try a modified Bucket Sort. Assuming the id values work well for it, you end up with a much more efficient sorting algorithm. You can improve I/O efficiency by actually writing out the buckets (using many small ones) as you scan, thereby potentially reducing the amount of random file access you need. Or not.
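
A rough sketch of that idea, bucketing on the year and month of the date field during a single scan (the file names and the month-as-bucket choice are assumptions for illustration):

// One pass over the input, appending each record to a small bucket file chosen
// from its date, so each bucket can later be sorted in memory on its own and
// the buckets concatenated in key order.
#include <fstream>
#include <map>
#include <string>

int main()
{
    std::ifstream in("source.txt");
    std::map<std::string, std::ofstream*> buckets;   // pointers: streams aren't copyable

    std::string line;
    while (std::getline(in, line))
    {
        // field 2 looks like MM/DD/YYYY; bucket on "YYYY-MM"
        std::string::size_type a = line.find(' ') + 1;
        std::string month = line.substr(a + 6, 4) + "-" + line.substr(a, 2);

        std::ofstream*& out = buckets[month];
        if (out == 0)
            out = new std::ofstream(("bucket_" + month + ".txt").c_str());
        *out << line << '\n';
    }

    // close and free the bucket streams
    for (std::map<std::string, std::ofstream*>::iterator it = buckets.begin();
         it != buckets.end(); ++it)
        delete it->second;
    return 0;
}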

0




Hopefully one of the other answers here gives some good examples of how to insert an entry into the destination file based on line number.

You cannot insert content into the middle of a file (that is, without rewriting what comes after it); I am not aware of any production-level filesystem that supports it.

0




You can read the text file into a vector, after which it is very easy to insert your lines into that vector at the right positions. This article shows how to read a text file line by line into a vector: https://thispointer.com/c-how-to-read-a-file-line-by-line-into-a-vector/
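
A minimal sketch of that approach for files that fit in memory (the file name and the 1-based insert position are made up):

// Read the whole file into a vector of lines, insert a new record at a given
// line number, and write the file back out.
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> lines;
    std::string line;

    std::ifstream in("records.txt");
    while (std::getline(in, line))
        lines.push_back(line);
    in.close();

    std::size_t lineNo = 2;                      // make the new record line 2 (1-based)
    if (lineNo - 1 > lines.size())
        lineNo = lines.size() + 1;               // clamp to "append at the end"
    lines.insert(lines.begin() + (lineNo - 1), "4 12/01/2008 line4data");

    std::ofstream out("records.txt");
    for (std::size_t i = 0; i < lines.size(); ++i)
        out << lines[i] << '\n';
    return 0;
}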

0








