Functional way to split a string into contiguous substrings

I am on rustc 1.0.0-beta (9854143cb 2015-04-02) (built 2015-04-02)

My goal is to split a string of length n into its n-k+1 offset substrings of length k. That is, if you have the string:

ABCDEF

I am trying to get a vector / iterator that contains the offset substrings of an arbitrary length k. For example, k=3 will give

ABC
 BCD
  CDE
   DEF

      

And k=2 will give:

AB
 BC
  CD
   DE
    EF

      

Note that the spaces are included only to align the substrings and show how they relate to each other. The output vector would contain only AB, BC, CD, etc. Also, it is fine to support only ASCII, although I would prefer a safer, more general solution.

As painful as it is to look at, the following procedural code seems to work:

fn offset_slices(s: &str, n: usize) -> Vec<&str> {
    let mut slices: Vec<&str> = Vec::new();
    for (i,_) in s.chars().enumerate() {
        if i > s.len() - n {
            break;
        }
        slices.push(&s[i..(i+n)]);
    }
    slices
}

      

But this is disgusting and I would prefer a more functional solution. I spent a couple of hours trying to find a way, and learned a lot from the process, but I'm stumped on this one.

Any ideas?

PS - I am very surprised that the above slices.push(&s[i..(i+n)]) compiles. Does it just return pointers to various locations in the input?


2 answers


What you really want is a windows iterator, but that only exists for slices, not strings (see the note below). Since you have ASCII data, we can create a type that enforces this constraint and then uses some unsafe code. We, the programmers, can guarantee that the unsafe code is safe because we ensure that the data is ASCII-only.

As huon-dbaupp points out, you should try using the ascii crate. It doesn't seem to have windows right now, but you have my permission to contribute the following code (suitably adapted) to that crate if you like. ^_^

use std::slice;
use std::str;

struct AsciiString {
    bytes: Vec<u8>,
}

impl AsciiString {
    fn new(s: &str) -> AsciiString {
        for b in s.bytes() {
            assert!((b as u8) < 128);
        }
        AsciiString { bytes: s.bytes().collect() }
    }

    fn windows(&self, n: usize) -> Windows {
        Windows { iter: self.bytes.windows(n) }
    }
}

struct Windows<'a> {
    iter: slice::Windows<'a, u8>,
}

impl<'a> Iterator for Windows<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        self.iter.next().map(|bytes| {
            unsafe { str::from_utf8_unchecked(bytes) }
        })
    }
}

fn main() {
    let ascii = AsciiString::new("ABCDEF");
    for i in ascii.windows(3) {
        println!("{}", i);
    }
}

      

I'm really surprised that slices.push(&s[i..(i+n)]) above even compiles. Does it just return pointers to various locations in the input?

It's tricky, but it makes sense once you figure it out (doesn't it always?)



When you use Index, note that it is implemented for str, not &str:

fn index(&'a self, index: Idx) -> &'a Self::Output;

impl Index<Range<usize>> for str { ... }

      

This means that indexing returns a value with the same lifetime as the input. In this case, you start with a &'foo str and end with a &'foo str. Conceptually, yes, a &str is a pointer to a chunk of memory plus a length. When you slice it, you just adjust the pointer and the length, but the underlying storage still lives for the same lifetime.
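To make the pointer-and-length picture concrete, here is a minimal sketch. The byte_offset function is a hypothetical helper written for this illustration (it is not part of the standard library), and it assumes its second argument was produced by slicing the first.

```rust
/// Hypothetical helper: how many bytes into `outer` does `inner` start?
/// Assumes `inner` was produced by slicing `outer`.
fn byte_offset(outer: &str, inner: &str) -> usize {
    inner.as_ptr() as usize - outer.as_ptr() as usize
}

fn main() {
    let s = String::from("ABCDEF");
    let sub = &s[1..4]; // borrows from `s`; no bytes are copied

    // The slice is the same buffer viewed through an adjusted pointer and length.
    println!(
        "{} starts at byte {} and has length {}",
        sub,
        byte_offset(&s, sub),
        sub.len()
    );
}
```

Because sub borrows from s, the compiler will not let s be dropped or mutated while sub is still in use.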

Standard warning about splitting strings

Be aware of byte / character / code point / grapheme issues whenever you start splitting strings. With anything more complex than ASCII characters, one character is not one byte, and string slicing works in bytes! There is also the concept of Unicode code points, but multiple code points can combine to form what a person thinks of as a single character. This stuff is non-trivial.
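For input that is not ASCII-only, one possible approach is to slice on char boundaries rather than at arbitrary byte offsets. The sketch below is an illustration, not something from the ascii crate: char_windows is a hypothetical name, and it counts Unicode code points, so it still does not handle grapheme clusters (combining marks and the like).

```rust
/// Hypothetical helper: windows of `n` chars (code points), not `n` bytes.
/// Every slice starts and ends on a char boundary, so indexing never panics.
fn char_windows(s: &str, n: usize) -> Vec<&str> {
    if n == 0 {
        return Vec::new();
    }
    // Byte offset of the start of every char, plus the end of the string.
    let bounds: Vec<usize> = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
        .collect();
    // n + 1 consecutive boundaries delimit exactly n chars.
    bounds.windows(n + 1).map(|w| &s[w[0]..w[n]]).collect()
}

fn main() {
    // 'é' is two bytes in UTF-8, so byte-based slicing could split it.
    for w in char_windows("héllo", 2) {
        println!("{}", w);
    }
}
```

If n is larger than the number of chars, the result is simply empty. The cost is one extra Vec of offsets, which seemed an acceptable trade for keeping the slicing logic short.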


fn offset_slices(s: &str, n: usize) -> Vec<&str> {
    (0 .. s.len() - n + 1).map(|i| &s[i .. i + n]).collect()
}
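One caveat: s.len() - n underflows when n is greater than s.len(), which panics in a debug build. A guarded variant might look like the sketch below. The name offset_slices_checked is made up for this example, returning an empty vector for n == 0 is a choice made here rather than anything from the question, and the slicing is still byte-based, so it assumes ASCII input.

```rust
/// Hypothetical guarded variant: returns an empty Vec when `n` is zero
/// or longer than the string, instead of underflowing.
/// Still slices by byte index, so it assumes ASCII input.
fn offset_slices_checked(s: &str, n: usize) -> Vec<&str> {
    // checked_sub yields None when n > s.len().
    match s.len().checked_sub(n) {
        Some(last) if n > 0 => (0..=last).map(|i| &s[i..i + n]).collect(),
        _ => Vec::new(),
    }
}

fn main() {
    for w in offset_slices_checked("ABCDEF", 3) {
        println!("{}", w);
    }
}
```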

      


