Subprocess: repeatedly write to STDIN while reading from STDOUT (Windows)

I want to call an external process from Python. The process reads an input string, returns a tokenized result, and then waits for the next input (the binary is the MeCab tokenizer, if that helps).

I need to tokenize thousands of lines of text by calling this process.

The problem is that Popen.communicate() works, but it waits for the process to exit before returning the STDOUT result. I don't want to keep closing and opening new subprocesses thousands of times. (And I don't want to send the whole text at once; it could easily grow to tens of thousands of long lines in the future.)

from subprocess import PIPE, Popen

with Popen("mecab -O wakati".split(), stdin=PIPE,
           stdout=PIPE, stderr=PIPE, close_fds=False,
           universal_newlines=True, bufsize=1) as proc:
    output, errors = proc.communicate("foobarbaz")

print(output)

      

I tried reading proc.stdout.read() instead of using communicate(), but it blocks on stdin and returns no results until proc.stdin.close() is called. That again means I have to create a new process every time.

I tried to implement queues and threads from a similar question, as shown below, but either it returns nothing and stays stuck in the while True loop, or, when I force-fill the stdin buffer by repeatedly sending lines, it outputs all the results at once.

from subprocess import PIPE, Popen
from threading import Thread
from queue import Queue, Empty

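def enqueue_output(out, queue):
    # The pipe is opened in text mode (universal_newlines=True), so
    # readline() returns '' at EOF, not b''.
    for line in iter(out.readline, ''):
        queue.put(line)
    out.close()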

p = Popen('mecab -O wakati'.split(), stdout=PIPE, stdin=PIPE,
          universal_newlines=True, bufsize=1, close_fds=False)
q = Queue()
t = Thread(target=enqueue_output, args=(p.stdout, q))
t.daemon = True
t.start()

p.stdin.write("foobarbaz\n")  # the newline ends the line, so the line-buffered pipe flushes it
while True:
    try:
        line = q.get_nowait()
    except Empty:
        pass
    else:
        print(line)
        break

      

I also looked at the Pexpect route, but its Windows port doesn't support some important modules (the pty-based ones), so I couldn't apply that either.

I know there are many similar answers and I have tried most of them. But nothing I've tried seems to work on Windows.

EDIT: Some info on the binary I am using. When I use it from the command line, it runs and tokenizes the sentences I give it until I am done and forcibly close the program.

(... waits_for_input -> input_received -> output -> waits_for_input ...)

Thanks.



2 answers


If mecab uses C FILE streams with default buffering, then its piped stdout has a 4 KiB buffer. The idea is that a program can make efficient use of small, arbitrarily sized reads and writes to the buffers, while the underlying standard I/O implementation automatically fills and flushes the much larger buffers. This minimizes the number of system calls required and maximizes throughput. Obviously, you don't want this behavior for interactive console or terminal I/O, or when writing to stderr. In those cases the C runtime uses line buffering or no buffering.
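The terminal-versus-pipe distinction can be observed from Python. The following is a minimal sketch that uses a Python child process as a stand-in, since Python's own I/O layer applies a similar policy to the C runtime described above; it does not inspect mecab itself. The names REPORT and stdout_mode_when_piped are made up for this illustration.

```python
import subprocess
import sys

# One-line child program: report whether its stdout is a terminal and
# whether Python enabled line buffering for it.
REPORT = "import sys; print(sys.stdout.isatty(), sys.stdout.line_buffering)"


def stdout_mode_when_piped():
    """Run the child with stdout connected to a pipe and return its report."""
    result = subprocess.run(
        [sys.executable, "-c", REPORT],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # With a pipe, the child sees no terminal and does not line-buffer.
    print(stdout_mode_when_piped())
```

Running the same one-liner directly in a console, without the pipe, should report a terminal with line buffering enabled, which is why a tool can appear responsive when used by hand and yet stall when driven through pipes.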

A program can override this behavior, and some have command-line options to set the buffer size. For example, Python has the -u (unbuffered) option and the PYTHONUNBUFFERED environment variable. If mecab does not have a similar option, then there is no generic workaround on Windows. The C runtime situation is too complicated: a Windows process can link statically or dynamically to one or several CRTs. The situation on Linux is different, since a Linux process usually loads a single system CRT (e.g. GNU libc.so.6) into the global symbol table, which allows an LD_PRELOAD library to configure the C FILE streams. Linux stdbuf uses this trick, for example: stdbuf -o0 mecab -O wakati.
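To show what such an unbuffered option buys you, here is a minimal sketch of a line-by-line conversation over pipes. It assumes a child whose output really is unbuffered, so it uses a Python child started with -u as a stand-in for mecab. CHILD_CODE, start_child, and tokenize are hypothetical names for this illustration, and the "tokenizer" merely separates characters with spaces.

```python
import subprocess
import sys

# Stand-in for an interactive tool (NOT mecab): for every input line,
# write the characters of that line separated by spaces.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(' '.join(line.strip()))\n"
)


def start_child():
    """Start the stand-in child with unbuffered stdout (python -u)."""
    return subprocess.Popen(
        [sys.executable, "-u", "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        bufsize=1,
    )


def tokenize(proc, text):
    """Send one line to the child and read back exactly one line."""
    proc.stdin.write(text + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")


if __name__ == "__main__":
    child = start_child()
    for sentence in ["abc", "xy"]:
        print(tokenize(child, sentence))
    child.stdin.close()  # EOF lets the child leave its loop and exit
    child.wait()
```

The same process serves every request, which is what the question asks for. If the -u flag is dropped, the child's output sits in its buffer and readline() in the parent blocks, which matches the behavior described in the question.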




One option to experiment with is to call CreateConsoleScreenBuffer and get a file descriptor for the handle from msvcrt.open_osfhandle. Then pass that as stdout instead of using a pipe. The child process will see this as a TTY and use line buffering instead of full buffering. However, managing this is not trivial. It would involve reading (i.e. ReadConsoleOutputCharacter) a sliding buffer (calling GetConsoleScreenBufferInfo to keep track of the cursor position) that is being actively written to by another process. This kind of interaction is not something I have ever needed or even experimented with. But I have used a console screen buffer non-interactively, i.e. reading the buffer after the child has exited. This allows reading up to 9,999 lines of output from programs that write directly to the console instead of stdout, e.g. programs that call WriteConsole or open "CON" or "CONOUT$".



Here's a workaround for Windows. It should also be adaptable to other operating systems. Download a console emulator such as ConEmu (https://conemu.github.io/) and run it, instead of mecab, as your subprocess.

p = Popen(['conemu'] , stdout=PIPE, stdin=PIPE,
      universal_newlines=True, bufsize=1, close_fds=False)

      

Then send the following as your first input:

mecab -O wakati & exit

      



You are letting the emulator handle the output buffering issues for you, the way it normally does when you interact with it manually. I am still looking into this, but it already looks promising.

The only problem is that ConEmu is a GUI application, so if there is no other way to connect to its input and output, it may have to be customized and rebuilt from source (it is open source). I have not found another way, but it should work.

I asked a question about running it in some kind of console mode here, so you can check that thread as well. The author, Maximus, is on SO.
