Extracting data before a sign in R

Question

I need to extract all the text before a sign, in this case a dash. I have data like these:

  text1 <- "Médicos-Otros"
  text2 <- "Disturbio-Escándalo"
  text3 <- "Accidente-Choque"

The problem is that the words I am trying to extract don't all have the same length, so I can't use a fixed-width approach such as:

extract <- substring(text1, 1, n)

The desired results are:

extract1 <- "Médicos"
extract2 <- "Disturbio"
extract3 <- "Accidente"

[Remove part of string after “.”](https://stackoverflow.com/questions/10617702/remove-part-of-string-after), [Get the strings before the comma with R](https://stackoverflow.com/questions/19320966/get-the-strings-before-the-comma-with-r), [Extract part of string (till the first semicolon) in R](https://stackoverflow.com/questions/29752250/extract-part-of-string-till-the-first-semicolon-in-r), [How to extract everything until first occurrence of pattern](https://stackoverflow.com/questions/40113963/how-to-extract-everything-until-first-occurrence-of-pattern) — Henrik, Jan 03 '19 at 20:33

Julius Vainora · Accepted Answer · 2019-01-03T20:45:11.363

3

Using sub does the job:

sub("(.*)-.*", "\\1", c(text1, text2, text3))
# [1] "Médicos"   "Disturbio" "Accidente"

Here we split each string into three parts: what goes before the dash (`(.*)`), the dash itself, and what goes after the dash (`.*`). Each string is then replaced by the first part (`\\1`).

Analogously, you may extract the second half:

sub(".*-(.*)", "\\1", c(text1, text2, text3))
# [1] "Otros"     "Escándalo" "Choque"
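
One detail worth checking when a string may contain more than one dash: `(.*)` is greedy, so the pattern above splits at the last dash, not the first. The following is a minimal sketch, assuming a hypothetical input `x` that is not part of the original question:

```r
# Hypothetical input: the second string contains two dashes.
x <- c("Accidente-Choque", "a-b-c")

# Greedy (.*) takes as much as it can, so the split happens at the LAST dash.
before_last <- sub("(.*)-.*", "\\1", x)

# Removing the first dash and everything after it keeps the text
# before the FIRST dash.
before_first <- sub("-.*", "", x)

print(before_last)   # "Accidente" "a-b"
print(before_first)  # "Accidente" "a"
```

For the data in the question, which has exactly one dash per string, both variants give the same result.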

edited Jan 03 '19 at 20:45

answered Jan 03 '19 at 20:27

Julius Vainora


Thank you. One more thing, just to understand how this works: if there were many dashes and I needed one part in particular, how could I get the desired part of the text? – Armando González Díaz Jan 03 '19 at 21:00

@ArmandoGonzálezDíaz, to extract, say, the 5th part (after the 4th dash), use `sub("(.*?-){4}(.*?)($|-.*)", "\\2", txt)`, and so on (only `{4}` needs to change). The pattern is now quite different because the total number of dashes is unknown. If you knew that there were four dashes in total, the fifth part would be `sub(".*-.*-.*-.*-(.*)", "\\1", txt)` if we kept going in the same fashion, but clearly there are more concise ways once the situation gets more complex. In the future, please make sure that your initial question includes everything. – Julius Vainora Jan 03 '19 at 21:10
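
One possible "more concise way" to pick the n-th part is to split on the dash and index into the result. This is only a sketch, not the commenter's own code; the input `txt` and the value of `n` are hypothetical:

```r
# Hypothetical input with five dashes per string.
txt <- c("a-b-c-d-e-f", "p-q-r-s-t-u")

# strsplit() returns one character vector of parts per input string;
# fixed = TRUE treats "-" as a literal string rather than a regex.
parts <- strsplit(txt, "-", fixed = TRUE)

# Pick the n-th part of each string. Indexing past the end yields NA,
# so strings with fewer than n parts give NA instead of an error.
n <- 5
nth_part <- vapply(parts, function(p) p[n], character(1))

print(nth_part)  # "e" "t"
```

Changing `n` selects a different part without touching the pattern.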

score 2 · Answer 2 · answered Jan 03 '19 at 20:26


You can use regular expressions:

text1 <-  "Médicos-Otros"
text2 <-  "Disturbio-Escándalo"
text3 <-  "Accidente-Choque"

extract1 <- gsub("\\-.*", "", text1)
extract2 <- gsub("\\-.*", "", text2)
extract3 <- gsub("\\-.*", "", text3)

This pattern matches the dash ("-") and everything after it, and replaces the match with an empty string ("").
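
Because `sub()` and `gsub()` are vectorized, the three calls can be collapsed into one. A minimal sketch, assuming the strings are collected in a vector named `texts` (a hypothetical name) with one extra dash-free element added for illustration:

```r
# The three strings from the question plus a hypothetical one without a dash.
texts <- c("Médicos-Otros", "Disturbio-Escándalo", "Accidente-Choque", "SinGuion")

# One vectorized call; the dash needs no escaping outside a bracket expression.
# sub() is enough here because the pattern can match at most once per string.
extracts <- sub("-.*", "", texts)

# Elements without a dash are returned unchanged.
print(extracts)
```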

answered Jan 03 '19 at 20:26

Khaynes


Thank you. Now I need to extract the second part of the text. How can I do it? – Armando González Díaz Jan 03 '19 at 20:43

@ArmandoGonzálezDíaz: If you're interested in both parts of each string but want to keep them separate, you're better off with Jilber's `strsplit()` approach. E.g.: `do.call(rbind, strsplit(c(text1, text2, text3), "-"))` – AkselA Jan 03 '19 at 20:50
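
The `do.call(rbind, ...)` idea can be sketched as follows. The column names `before` and `after` are hypothetical labels added for readability, and the sketch assumes every string contains exactly one dash (otherwise `rbind()` would recycle values with a warning):

```r
texts <- c("Médicos-Otros", "Disturbio-Escándalo", "Accidente-Choque")

# Split each string at the literal dash, then stack the pieces row-wise
# into a character matrix with one row per input string.
parts <- do.call(rbind, strsplit(texts, "-", fixed = TRUE))
colnames(parts) <- c("before", "after")  # hypothetical column names

print(parts[, "before"])  # first halves
print(parts[, "after"])   # second halves
```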

score 2 · Answer 3 · answered Jan 03 '19 at 20:31

You can also use strsplit:

> sapply(strsplit(c(text1, text2, text3), "-"), "[[", 1)
[1] "Médicos"   "Disturbio" "Accidente"

Consider str_extract from the stringr package as another alternative:

> library(stringr)
> str_extract(c(text1, text2, text3), "\\w+")
[1] "Médicos"   "Disturbio" "Accidente"
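
Note that `\\w+` matches only the first run of word characters, so it gives the text before the dash only when that text contains no spaces or punctuation. The sketch below illustrates the difference with base R's `regmatches()` and `regexpr()` so that it runs without stringr; the input `x` is hypothetical:

```r
# Hypothetical input: the first part of the first string contains spaces.
x <- c("Accidente de auto-Choque", "Disturbio-Ruido")

# "\\w+" stops at the first non-word character, here the space.
first_word <- regmatches(x, regexpr("\\w+", x))

# "[^-]+" keeps going until the first dash.
before_dash <- regmatches(x, regexpr("[^-]+", x))

print(first_word)   # "Accidente" "Disturbio"
print(before_dash)  # "Accidente de auto" "Disturbio"
```

The same `[^-]+` pattern can be passed to `str_extract()` if you prefer to stay with stringr.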

score 0 · Answer 4 · answered Jan 03 '19 at 20:49

Using a regex with a positive look-ahead:

sapply(c(text1, text2, text3), 
  function(x)
    regmatches(x, regexpr(".*(?=-)", x, perl=TRUE))
)
#      Médicos-Otros Disturbio-Escándalo    Accidente-Choque 
#          "Médicos"         "Disturbio"         "Accidente"
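
`regexpr()` and `regmatches()` are themselves vectorized, so the `sapply()` wrapper is optional. One thing to watch for is that `regmatches()` silently drops elements that have no match. A minimal sketch that keeps the result aligned with the input, assuming a hypothetical third element without a dash:

```r
# Two strings from the question plus a hypothetical one without a dash.
x <- c("Médicos-Otros", "Disturbio-Escándalo", "SinGuion")

# regexpr() returns the match position per element, or -1 when nothing matches.
m <- regexpr(".*(?=-)", x, perl = TRUE)

# Fill a result vector of the same length, leaving NA where there is no match.
res <- rep(NA_character_, length(x))
res[m != -1] <- regmatches(x, m)

print(res)
```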
