Parse a string into an array based on spaces or "double quote strings"
I'm trying to take a user input string and parse it into an array declared as char *entire_line[100]; where each word is placed at a different index of the array, but any part of the string enclosed in double quotes must be placed at a single index. So far I have:
char buffer[1024]={0,};
fgets(buffer, 1024, stdin);
example input: word filename.txt "this is a string that should occupy one index in the output array"
tokenizer=strtok(buffer," ");//break up by spaces
do{
if(strchr(tokenizer,'"')){//check is a word starts with a "
is_string=YES;
entire_line[i]=tokenizer;// if so, put that word into current index
tokenizer=strtok(NULL,"\""); //should get rest of string until end "
strcat(entire_line[i],tokenizer); //append the two together, ill take care of the missing space once i figure out this issue
}
entire_line[i]=tokenizer;
i++;
}while((tokenizer=strtok(NULL," \n"))!=NULL);
This clearly doesn't work: it only comes close when the double-quoted string is at the end of the input line. But the input could just as well be: word "this is the text to be entered by the user" filename.txt. I've been trying to figure this out for a while and I'm stuck. Thanks.
The strtok function is a terrible way to tokenize in C, except for one (admittedly common) case: simple words separated by whitespace. (Even then it is still not great, because it is neither re-entrant nor usable recursively, which is why we invented strsep for BSD way back when.)
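To make the one case where strtok is adequate concrete, here is a minimal sketch. The helper name split_words and its parameters are hypothetical, chosen only for illustration. It splits a line in place on spaces, tabs and newlines, and it has no notion of quotes at all.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split line in place on whitespace using strtok.
 * Stores up to max token pointers in out and returns how many it stored.
 * strtok keeps hidden static state, so this must not be nested inside
 * another strtok loop. */
size_t split_words(char *line, char *out[], size_t max)
{
    size_t n = 0;
    char *tok = strtok(line, " \t\n");   /* first call passes the buffer */

    while (tok != NULL && n < max) {
        out[n++] = tok;
        tok = strtok(NULL, " \t\n");     /* later calls pass NULL */
    }
    return n;
}
```

Given the line word "two words" file.txt, this yields four tokens (word, "two, words" and file.txt), which is exactly the splitting of quoted text that the question wants to avoid.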
Your best bet in this case is to build your own simple state machine:
char *p, *start_of_word;
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;
for (p = buffer; *p != '\0'; p++) {
c = (unsigned char) *p; /* convert to unsigned char for is* functions */
switch (state) {
case DULL: /* not in a word, not in a double quoted string */
if (isspace(c)) {
/* still not in a word, so ignore this char */
continue;
}
/* not a space -- if it's a double quote we go to IN_STRING, else to IN_WORD */
if (c == '"') {
state = IN_STRING;
start_of_word = p + 1; /* word starts at *next* char, not this one */
continue;
}
state = IN_WORD;
start_of_word = p; /* word starts here */
continue;
case IN_STRING:
/* we're in a double quoted string, so keep going until we hit a close " */
if (c == '"') {
/* word goes from start_of_word to p-1 */
/* ... do something with the word ... */
state = DULL; /* back to "not in word, not in string" state */
}
continue; /* either still IN_STRING or we handled the end above */
case IN_WORD:
/* we're in a word, so keep going until we get to a space */
if (isspace(c)) {
/* word goes from start_of_word to p-1 */
/* ... do something with the word ... */
state = DULL; /* back to "not in word, not in string" state */
}
continue; /* either still IN_WORD or we handled the end above */
}
}
Note that this does not take into account the possibility of a double quote inside a word, for example:
"some text in quotes" plus four simple words p"lus something strange"
Work through the state machine above and you will see that "some text in quotes" turns into a single token (without the double quotes), but p"lus is also a single token (including the quote), something is a single token, and strange" is a token. Whether you want this, and how you want to handle it, is up to you. For more complex but thorough lexical tokenization, you may want to use a code-generating tool like flex.
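As one possible way to handle a quote inside a word, here is a sketch of shell-like behavior, where a double quote anywhere toggles quoting and is dropped from the token. This is an assumption about what you might want, not the only sensible choice, and the function name split_shell_like is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical variant: a double quote anywhere toggles "inside quotes"
 * and is removed from the token, roughly as a shell would do.
 * Tokens are compacted in place inside buf. Returns the token count. */
size_t split_shell_like(char *buf, char *argv[], size_t max)
{
    char *in = buf;     /* read position */
    char *out = buf;    /* write position, never ahead of the read position */
    size_t argc = 0;

    while (argc < max) {
        int in_quotes = 0;

        while (isspace((unsigned char) *in))
            in++;                       /* skip spaces between tokens */
        if (*in == '\0')
            break;
        argv[argc++] = out;             /* token starts at the write position */
        while (*in != '\0' && (in_quotes || !isspace((unsigned char) *in))) {
            if (*in == '"')
                in_quotes = !in_quotes; /* drop the quote, flip the mode */
            else
                *out++ = *in;
            in++;
        }
        if (*in != '\0')
            in++;                       /* step over the separating space */
        *out++ = '\0';
    }
    return argc;
}
```

With this variant, p"lus something strange" becomes the single token plus something strange. An unterminated quote simply runs to the end of the input.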
Also, when the for loop finishes, if state is not DULL, you need to handle the final word (I left this out of the code above) and decide what to do if state is IN_STRING (which means there was no closing double quote).
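The end-of-input handling can be sketched as follows. This is only an illustration under stated assumptions: the helper count_tokens is hypothetical, it just counts tokens instead of storing them, and it treats a missing closing quote as an error by returning -1. Accepting the unterminated string as a final token would be an equally valid decision.

```c
#include <ctype.h>
#include <stddef.h>

enum states { DULL, IN_WORD, IN_STRING };

/* Hypothetical helper: count the tokens in s using the same three states.
 * Returns -1 if the input ends inside a double quoted string. */
long count_tokens(const char *s)
{
    enum states state = DULL;
    long count = 0;
    const char *p;

    for (p = s; *p != '\0'; p++) {
        int c = (unsigned char) *p;

        switch (state) {
        case DULL:
            if (!isspace(c))
                state = (c == '"') ? IN_STRING : IN_WORD;
            break;
        case IN_STRING:
            if (c == '"') {             /* closing quote ends the token */
                count++;
                state = DULL;
            }
            break;
        case IN_WORD:
            if (isspace(c)) {           /* whitespace ends the token */
                count++;
                state = DULL;
            }
            break;
        }
    }
    /* The loop stopped at '\0', so a token may still be open. */
    if (state == IN_WORD)
        count++;                        /* last word ran to the end of input */
    else if (state == IN_STRING)
        return -1;                      /* no closing quote: an error here */
    return count;
}
```

For example, count_tokens("a \"b c\" d") gives 3, while count_tokens("a \"b c") gives -1 because the quote is never closed.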
Torek's parsing code is excellent, but it takes a little more work before it can be used.
For my own purposes, I finished it as a C function.
Here I am sharing my work, based on Torek's code.
#include <stdio.h>
#include <string.h>
#include <ctype.h>
size_t split(char *buffer, char *argv[], size_t argv_size)
{
char *p, *start_of_word;
int c;
enum states { DULL, IN_WORD, IN_STRING } state = DULL;
size_t argc = 0;
for (p = buffer; argc < argv_size && *p != '\0'; p++) {
c = (unsigned char) *p;
switch (state) {
case DULL:
if (isspace(c)) {
continue;
}
if (c == '"') {
state = IN_STRING;
start_of_word = p + 1;
continue;
}
state = IN_WORD;
start_of_word = p;
continue;
case IN_STRING:
if (c == '"') {
*p = 0;
argv[argc++] = start_of_word;
state = DULL;
}
continue;
case IN_WORD:
if (isspace(c)) {
*p = 0;
argv[argc++] = start_of_word;
state = DULL;
}
continue;
}
}
if (state != DULL && argc < argv_size)
argv[argc++] = start_of_word;
return argc;
}
void test_split(const char *s)
{
char buf[1024];
size_t i, argc;
char *argv[20];
strcpy(buf, s);
argc = split(buf, argv, 20);
printf("input: '%s'\n", s);
for (i = 0; i < argc; i++)
printf("[%zu] '%s'\n", i, argv[i]); /* %zu is the conversion for size_t */
}
int main(int ac, char *av[])
{
test_split("\"some text in quotes\" plus four simple words p\"lus something strange\"");
return 0;
}
See the output of the program:
input: '"some text in quotes" plus four simple words p"lus something strange"'
[0] 'some text in quotes'
[1] 'plus'
[2] 'four'
[3] 'simple'
[4] 'words'
[5] 'p"lus'
[6] 'something'
[7] 'strange"'
I wrote a function, qtok, a while ago that reads quoted words from a string. It's not a state machine and it doesn't build an array for you, but it's trivial to put the resulting tokens into one. It also handles escaped quotes and leading and trailing spaces:
#include <stdio.h>
#include <ctype.h>
#include <assert.h>
// Strips backslashes from quotes
char *unescapeToken(char *token)
{
char *in = token;
char *out = token;
while (*in)
{
assert(in >= out);
if ((in[0] == '\\') && (in[1] == '"'))
{
*out = in[1];
out++;
in += 2;
}
else
{
*out = *in;
out++;
in++;
}
}
*out = 0;
return token;
}
// Returns the start of the next token, terminates it in place, and stores the position to resume from in *next.
char *qtok(char *str, char **next)
{
char *current = str;
char *start = str;
int isQuoted = 0;
// Eat beginning whitespace.
while (*current && isspace((unsigned char) *current)) current++;
start = current;
if (*current == '"')
{
isQuoted = 1;
// Quoted token
current++; // Skip the beginning quote.
start = current;
for (;;)
{
// Go till we find a quote or the end of string.
while (*current && (*current != '"')) current++;
if (!*current)
{
// Reached the end of the string.
goto finalize;
}
if (*(current - 1) == '\\')
{
// Escaped quote keep going.
current++;
continue;
}
// Reached the ending quote.
goto finalize;
}
}
// Not quoted so run till we see a space.
while (*current && !isspace((unsigned char) *current)) current++;
finalize:
if (*current)
{
// Close token if not closed already.
*current = 0;
current++;
// Eat trailing whitespace.
while (*current && isspace((unsigned char) *current)) current++;
}
*next = current;
return isQuoted ? unescapeToken(start) : start;
}
int main()
{
char text[] = " \"some text in quotes\" plus four simple words p\"lus something strange\" \"Then some quoted \\\"words\\\", and backslashes: \\ \\ \" Escapes only work insi\\\"de q\\\"uoted strings\\\" ";
char *pText = text;
printf("Original: '%s'\n", text);
while (*pText)
{
printf("'%s'\n", qtok(pText, &pText));
}
}
Outputs:
Original: ' "some text in quotes" plus four simple words p"lus something strange" "Then some quoted \"words\", and backslashes: \ \ " Escapes only work insi\"de q\"uoted strings\" '
'some text in quotes'
'plus'
'four'
'simple'
'words'
'p"lus'
'something'
'strange"'
'Then some quoted "words", and backslashes: \ \ '
'Escapes'
'only'
'work'
'insi\"de'
'q\"uoted'
'strings\"'
I think the answer to your question is actually quite simple, but I am making one assumption where the other answers seem to have made a different one. I am assuming that you want any quoted block of text to be split off on its own, regardless of the spacing around it, with the rest of the text split on spaces.
So, for example:
"some text in quotes" plus four simple words p"lus something strange"
The output would be:
[0] some text in quotes
[1] plus
[2] four
[3] simple
[4] words
[5] p
[6] lus something strange
Given that this is the case, only a simple bit of code is required, not complex machines. First, check whether the first character is a quote and, if so, set a flag and skip that character. Also remove any quote at the end of the string. Then tokenize the string on quotes. Next, tokenize every other one of the strings obtained in that first pass on spaces, starting with the first string if there was no leading quote, or with the second string if there was. Finally, add the remaining strings from the first pass to an array of strings, interleaved with the strings from the second pass, which take the place of the strings they were tokenized from. This way you get the result above. In code, it would look like this:
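#include<stdio.h>
#include<stdlib.h>
#include<string.h>
/* Free a NULL-terminated array returned by parser(). */
void free_tokens(char ** tokens){
    if(tokens == NULL) return;
    for(size_t n = 0; tokens[n] != NULL; n++){
        free(tokens[n]);
    }
    free(tokens);
}
/* Append a copy of the len bytes at s, keeping the array NULL-terminated. */
static int push(char *** out, size_t * count, const char * s, size_t len){
    char ** grown = realloc(*out, (*count + 2) * sizeof(char *));
    if(grown == NULL) return -1;
    *out = grown;
    grown[*count] = malloc(len + 1);
    if(grown[*count] == NULL) return -1;
    memcpy(grown[*count], s, len);
    grown[*count][len] = '\0';
    (*count)++;
    grown[*count] = NULL;
    return 0;
}
/* Split input on delim (the quote). Pieces between quotes are kept whole;
   the other pieces are split again on delim2 (the space). */
char ** parser(const char * input, char delim, char delim2){
    char ** output = malloc(sizeof(char *));
    size_t count = 0;
    int flag = 0; /* 1 while the current piece lies between quotes */
    const char * piece = input;
    if(output == NULL) return NULL;
    output[0] = NULL; /* empty, NULL-terminated array */
    while(*piece != '\0'){
        /* first pass: a piece runs up to the next quote or the end */
        const char * end = strchr(piece, delim);
        if(end == NULL) end = piece + strlen(piece);
        if(flag){
            if(end > piece && push(&output, &count, piece, (size_t)(end - piece)) != 0) goto fail;
        } else {
            /* second pass: split the unquoted piece on delim2 */
            const char * word = piece;
            while(word < end){
                const char * stop = memchr(word, delim2, (size_t)(end - word));
                if(stop == NULL) stop = end;
                if(stop > word && push(&output, &count, word, (size_t)(stop - word)) != 0) goto fail;
                word = (stop < end) ? stop + 1 : end;
            }
        }
        if(*end == '\0') break;
        flag = !flag; /* every quote switches between quoted and unquoted */
        piece = end + 1;
    }
    return output;
fail:
    free_tokens(output);
    return NULL;
}
int main(void){
    const char * input = "\"some text in quotes\" plus four simple words p\"lus something strange\"";
    char ** result = parser(input, '"', ' ');
    if(result == NULL) return 1;
    for(size_t n = 0; result[n] != NULL; n++){
        printf("[%zu] %s\n", n, result[n]);
    }
    free_tokens(result);
    return 0;
}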
(It may not be perfect; I haven't tested it.)