How do I perform a stream character conversion?

I have data stored on disk in files that are too large to be stored in main memory.

I want to pass this data from disk into a data processing pipeline via iconv, for example:

zcat myfile | iconv -f L1 -t UTF-8 | # rest of the pipeline goes here


Unfortunately, iconv appears to buffer the entire file in memory until the input is exhausted before emitting any output. This means I use all of my main memory on one blocking stage in a pipeline whose memory footprint is otherwise minimal.

I've tried calling iconv like this:

stdbuf -o 0 iconv -f L1 -t UTF-8


But it looks like iconv itself manages the buffering internally - the behavior has nothing to do with the pipe buffering that stdbuf controls.

I see this with the iconv binary packaged with glibc 2.6 and 2.7 on Arch Linux, and I've reproduced it with glibc 2.5 on Debian.

Is there some way around this? I know streaming character conversion is not easy, but I would think that such a widely used Unix tool would work on streams; it's not uncommon to work with files that won't fit into main memory. Should I roll my own binary linked against libiconv?



1 answer


Consider calling iconv(3) together with iconv_open(3): build a simple C program around these two calls that reads from stdin and writes to stdout. Read this example:

http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html




That example is intended to handle exactly what you are describing: it converts the input incrementally and avoids "stateful" waiting for the entire stream.
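
For reference, here is a minimal sketch of that approach (assuming glibc's iconv; the charset names, buffer sizes, and terse error handling are illustrative, not canonical):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>

int main(void)
{
    /* ISO-8859-1 -> UTF-8, matching the pipeline above; adjust as needed. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }

    char inbuf[4096];
    /* Latin-1 expands to at most 2 bytes per character in UTF-8,
       so twice the input buffer can never overflow (no E2BIG here). */
    char outbuf[2 * sizeof inbuf];
    size_t leftover = 0;   /* bytes of an incomplete sequence carried over */
    size_t n;

    while ((n = fread(inbuf + leftover, 1, sizeof inbuf - leftover, stdin)) > 0
           || leftover > 0) {
        char *inptr = inbuf;
        size_t inleft = leftover + n;
        char *outptr = outbuf;
        size_t outleft = sizeof outbuf;

        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1
            && errno != EINVAL) {   /* EINVAL: incomplete sequence at buffer end */
            perror("iconv");
            return 1;
        }

        /* Emit each converted chunk immediately instead of hoarding it. */
        fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);
        fflush(stdout);

        /* Carry any incomplete trailing sequence to the front of inbuf. */
        leftover = inleft;
        memmove(inbuf, inptr, leftover);

        if (n == 0)   /* EOF: no more bytes coming for the leftover */
            break;
    }

    iconv_close(cd);
    return 0;
}

Compile it (for example, gcc -o streamconv streamconv.c) and substitute it for iconv in the pipeline. Memory use stays bounded by the two fixed buffers, since each chunk is written out as soon as it is converted.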
