Divide the line into 3 columns: text, number, text
I am working with a large dataset (~ 1500 rows) and when I built the dataset I didn't think about separating my IDs, so they are concentrated in one long line.
The identification string is in the column labeled "Polygon_Name". I would like to keep this column and split the string values ββin this column into 3 additional columns.
So, for example, if any cell "Polygon_Name" has a number embedded in it, for example Canker14B, I would like to get the following columns: (1) original name Polygon_Name, (2) all text before number, (3) number, ( 4) all text after the number.
A small subset of my data:
df <- structure(list(Bolt_ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = "N1T.3.4.15.0.C", class = "factor"),
Polygon_Name = structure(c(10L, 1L, 9L, 6L, 3L, 7L, 2L, 8L,
4L, 5L), .Label = c("C", "Canker15B", "Canker15Left", "Canker15Right",
"Canker16", "Canker17", "CankS15B", "CankS16", "CankS17",
"S"), class = "factor"), Measure = c(19.342, 25.962, 0.408,
0.008, 0.074, 0.41, 0.011, 0.251, 0.056, 0.034)), .Names = c("Bolt_ID",
"Polygon_Name", "Measure"), row.names = c(1L, 2L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L), class = "data.frame")
Current output:
End result (I built this by hand):
I figured out how to extract the number with the following code:
library(stringr)
regexp <- "[[:digit:]]+"
df$Poly_Num <- str_extract(df$Polygon_Name, regexp)
But I am still struggling to pull the text before and after the number. Any thoughts would be appreciated.
source to share
The idea tidyverse
would be through
library(tidyverse)
df %>%
mutate(Poly_num = gsub('\\D+', '', Polygon_Name)) %>%
separate(Polygon_Name, into = c('Poly_type', 'Poly_letter'), sep = '[0-9]+', remove = FALSE)
# Bolt_ID Polygon_Name Poly_type Poly_letter Measure Poly_num
#1 N1T.3.4.15.0.C S S <NA> 19.342
#2 N1T.3.4.15.0.C C C <NA> 25.962
#3 N1T.3.4.15.0.C CankS17 CankS 0.408 17
#4 N1T.3.4.15.0.C Canker17 Canker 0.008 17
#5 N1T.3.4.15.0.C Canker15Left Canker Left 0.074 15
#6 N1T.3.4.15.0.C CankS15B CankS B 0.410 15
#7 N1T.3.4.15.0.C Canker15B Canker B 0.011 15
#8 N1T.3.4.15.0.C CankS16 CankS 0.251 16
#9 N1T.3.4.15.0.C Canker15Right Canker Right 0.056 15
#10 N1T.3.4.15.0.C Canker16 Canker 0.034 16
One liner would have to use extract
from tidyr
(@docendodiscimus compliments)
tidyr::extract(df, Polygon_Name, c("a","b","c"), "^([^0-9]+)(\\d*)([^0-9]*)$",
remove = FALSE, convert = TRUE)
source to share
Since you are already using stringr, you can get them with str_match
str_match(df$Polygon_Name, "([[:alpha:]]*)([[:digit:]]*)([[:alpha:]]*)")[,2:4]
[,1] [,2] [,3]
[1,] "S" "" ""
[2,] "C" "" ""
[3,] "CankS" "17" ""
[4,] "Canker" "17" ""
[5,] "Canker" "15" "Left"
[6,] "CankS" "15" "B"
[7,] "Canker" "15" "B"
[8,] "CankS" "16" ""
[9,] "Canker" "15" "Right"
[10,] "Canker" "16" ""
To add this to your existing data.frame you can use
PName = str_match(df$Polygon_Name, "([[:alpha:]]*)([[:digit:]]*)([[:alpha:]]*)")[,2:4]
df = data.frame(df, PName)
names(df)[4:6] = c("Poly_Type", "Poly_Num", "Poly_Letter")
source to share