Should I keep the source files in memory when parsing?

I am writing the front end of an interpreter, and at first I didn't like the idea of just dumping all the source files into memory and then referring to that text directly. So the tokenizer reads one character at a time from a buffer and produces a stream of tokens.

However, once I got to the parsing side of things it hit me: I would like to output good errors and warnings that show the offending line of source code. I guess I could put line and column numbers in the tokens, but then the error messages would be like getting directions over the phone: "It's in file X, on line Y, at column Z, next to the curly brace, you know the one. The semicolon? You've gone too far."

I seem to be in a situation where I want to have my cake and eat it too. I want nice messages, but I don't want to keep the source loaded in memory.

Is there something I am missing? Or is loading the source into memory the way to go?

+3




2 answers


When an error message is shown to the user, it hardly matters how many milliseconds it takes to report it.

I would keep your tokenized stream in memory to keep your translator fast. (To really improve execution speed you would need to switch to a streaming interpreter, or even settle for a crude one-pass design.)



If an error occurs, go to disk, fetch the line of interest, and show it to the user. If the user makes no mistakes, this costs you nothing. If the user makes a small number of mistakes, it may be marginally inefficient, but the user won't notice. If the user makes a large number of mistakes, the contents of the files containing the errors will be pulled by the OS into its file cache, which is larger than your program anyway, so repeated access will be faster than if every lookup had to hit the disk.
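A minimal sketch of that approach in Python, under some assumptions: the `Token` fields and the helper names `read_source_line` and `format_diagnostic` are hypothetical, and columns are taken to be 1-based character positions. Tokens carry only their position, and the source file is reopened only when a diagnostic has to be printed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token: records where the text came from, not the source itself."""
    kind: str
    text: str
    path: str
    line: int    # 1-based line number
    column: int  # 1-based column number


def read_source_line(path, line_no):
    """Re-read the file from disk and return line `line_no` (1-based), or None."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for current, text in enumerate(handle, start=1):
            if current == line_no:
                return text.rstrip("\r\n")
    return None


def format_diagnostic(token, message):
    """Build an error message that shows the offending line with a caret marker."""
    header = f"{token.path}:{token.line}:{token.column}: error: {message}"
    source_line = read_source_line(token.path, token.line)
    if source_line is None:
        # The file changed or vanished since tokenizing: report the position only.
        return header
    caret = " " * (token.column - 1) + "^" * max(1, len(token.text))
    return "\n".join([header, "    " + source_line, "    " + caret])
```

The file is scanned from the top on every error, which is the "marginally inefficient" case the answer accepts. The caret will be misaligned on lines that contain tabs, since this sketch counts every character as one column.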

+2




Best idea: mmap your sources first, if you can. Fall back to reading the whole file if you are reading from a pipe or something.



After parsing, you can call madvise(MADV_DONTNEED) (but only if the file was actually mmapped) to tell the kernel it can drop the pages from the cache (while still keeping the file available for error reporting) ... but this is probably not necessary, and might not even be a good idea, depending on your compiler design (e.g. whether identifiers still point into the buffer, or are interned into a single, separate allocation).
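As a rough illustration, here is a sketch using Python's `mmap` module as a stand-in for the underlying system calls. The function names `load_source` and `release_after_parse` are hypothetical. `mmap.madvise` requires Python 3.8 or later and a platform that provides `madvise`, so the sketch checks for it before calling.

```python
import mmap


def load_source(stream):
    """Return (buffer, mapping) for a binary stream.

    `buffer` can be sliced like bytes. `mapping` is the mmap object when the
    stream could be mapped, otherwise None.
    """
    try:
        mapping = mmap.mmap(stream.fileno(), 0, access=mmap.ACCESS_READ)
        return mapping, mapping
    except (ValueError, OSError, AttributeError):
        # Pipes, empty files and in-memory streams cannot be mapped,
        # so fall back to reading everything into memory.
        return stream.read(), None


def release_after_parse(mapping):
    """Hint that the mapped pages are not needed soon. Returns True if the hint was sent."""
    if mapping is None:
        return False  # nothing was mmapped, so there is nothing to advise
    if not (hasattr(mapping, "madvise") and hasattr(mmap, "MADV_DONTNEED")):
        return False  # platform or Python version without madvise support
    mapping.madvise(mmap.MADV_DONTNEED)
    return True
```

After `release_after_parse`, the read-only mapping stays valid: slicing it for an error message simply faults the pages back in from the file. Whether the hint is worth sending depends on the caveats above, for example whether tokens still hold references into the buffer.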

+1

