1

I need to extract all the text before a separator character, in this case a dash. I have data like this:

  text1 <- "Médicos-Otros"
  text2 <- "Disturbio-Escándalo"
  text3 <- "Accidente-Choque"

The problem is that the words I am trying to extract don't have the same length, so I can't use a fixed end position like this:

extract <- substring(text1, 1, n)

The desired results are:

extract1 <- "Médicos"
extract2 <- "Disturbio"
extract3 <- "Accidente"
Julius Vainora
  • [Remove part of string after “.”](https://stackoverflow.com/questions/10617702/remove-part-of-string-after), [Get the strings before the comma with R](https://stackoverflow.com/questions/19320966/get-the-strings-before-the-comma-with-r), [Extract part of string (till the first semicolon) in R](https://stackoverflow.com/questions/29752250/extract-part-of-string-till-the-first-semicolon-in-r), [How to extract everything until first occurrence of pattern](https://stackoverflow.com/questions/40113963/how-to-extract-everything-until-first-occurrence-of-pattern) – Henrik Jan 03 '19 at 20:33

4 Answers

3

Using sub does the job:

sub("(.*)-.*", "\\1", c(text1, text2, text3))
# [1] "Médicos"   "Disturbio" "Accidente"

Here we split each string into three pieces: what comes before the dash (`(.*)`), the dash itself, and what comes after the dash (`.*`). Each string is then replaced by the first piece (`\\1`).

Analogously you may extract the second half:

sub(".*-(.*)", "\\1", c(text1, text2, text3))
# [1] "Otros"     "Escándalo" "Choque"   
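One detail worth checking before reusing the first pattern: `(.*)` is greedy, so when a string contains more than one dash it keeps everything up to the last dash, not the first. The sketch below illustrates this with a made-up three-part string (not from the question's data) and shows one possible way to cut at the first dash instead.

```r
# Hypothetical input with two dashes (not part of the question's data)
x <- "Accidente-Choque-Leve"

# Greedy (.*) gives back only as much as needed, so the split is at the LAST dash
before_last <- sub("(.*)-.*", "\\1", x)
before_last
# [1] "Accidente-Choque"

# Deleting the first dash and everything after it keeps only the first part
before_first <- sub("-.*", "", x)
before_first
# [1] "Accidente"
```

For strings with exactly one dash, as in the question, both calls return the same result.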
Julius Vainora
  • Thank you. One more thing, just to understand how this works: if there were several dashes and I needed one part in particular, how could I get the desired part of the text? – Armando González Díaz Jan 03 '19 at 21:00
  • 1
    @ArmandoGonzálezDíaz, to extract, say, the 5th part (after the 4th dash), use `sub("(.*?-){4}(.*?)($|-.*)", "\\2", txt)`, and so on (only `{4}` needs to change). The pattern is now quite different because the total number of dashes is unknown. If you knew that there were four dashes in total, the fifth part would be `sub(".*-.*-.*-.*-(.*)", "\\1", txt)`, continuing in the same fashion, but clearly there are more concise ways once the situation gets more complex. In the future, please make sure that your initial question includes everything. – Julius Vainora Jan 03 '19 at 21:10
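For the "n-th part" case raised in the comments, splitting on the dash may be easier to read than a counted regex. This is a minimal sketch in base R; `nth_part` is a hypothetical helper name and the sample strings are invented for illustration.

```r
# Hypothetical helper: return the n-th dash-separated part of each string,
# or NA when a string has fewer than n parts.
nth_part <- function(x, n, sep = "-") {
  pieces <- strsplit(x, sep, fixed = TRUE)  # fixed = TRUE treats sep literally
  vapply(pieces,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

txt <- c("a-b-c-d-e-f", "uno-dos")  # made-up data
nth_part(txt, 5)
# [1] "e" NA
nth_part(txt, 2)
# [1] "b"   "dos"
```

Returning `NA` for strings that are too short is a design choice made here; the regex in the comment above behaves differently on such input, so the two are not drop-in replacements for each other.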
2

You can use regular expressions:

text1 <-  "Médicos-Otros"
text2 <-  "Disturbio-Escándalo"
text3 <-  "Accidente-Choque"

extract1 <- gsub("\\-.*", "", text1)
extract2 <- gsub("\\-.*", "", text2)
extract3 <- gsub("\\-.*", "", text3)

This matches the dash ("-") and everything after it, and replaces the match with the empty string "".

Khaynes
  • Thank you. Now I need to extract the second part of the text. How can I do it? – Armando González Díaz Jan 03 '19 at 20:43
  • 1
    @ArmandoGonzálezDíaz: If you're interested in both parts of each string, but want to keep them separate, you're better off with Jilber's `strsplit()` approach. E.g.: `do.call(rbind, strsplit(c(text1, text2, text3), "-"))` – AkselA Jan 03 '19 at 20:50
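The `do.call(rbind, strsplit(...))` idea from the comment above can be sketched as follows. It assumes every string contains exactly one dash, so each split yields two pieces; the column names `first` and `second` are hypothetical labels added here for readability.

```r
text1 <- "Médicos-Otros"
text2 <- "Disturbio-Escándalo"
text3 <- "Accidente-Choque"

# Split each string at the dash and stack the pieces into a 3 x 2 matrix
parts <- do.call(rbind, strsplit(c(text1, text2, text3), "-", fixed = TRUE))
colnames(parts) <- c("first", "second")  # hypothetical column names

parts[, "first"]   # text before the dash
parts[, "second"]  # text after the dash
```

If some strings have a different number of dashes, `rbind` will recycle values with a warning, so this layout only suits data with a uniform structure.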
2

You can also use strsplit:

> sapply(strsplit(c(text1, text2, text3), "-"), "[[", 1)
[1] "Médicos"   "Disturbio" "Accidente"

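Consider str_extract from the stringr package as another alternative. Note that `\\w+` matches the first run of word characters, so it stops at any non-word character, not only at a dash: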

> library(stringr)
> str_extract(c(text1, text2, text3), "\\w+")
[1] "Médicos"   "Disturbio" "Accidente"
Jilber Urbina
0

Using a regex with a positive look-ahead:

sapply(c(text1, text2, text3), 
  function(x)
    regmatches(x, regexpr(".*(?=-)", x, perl=TRUE))
)
#      Médicos-Otros Disturbio-Escándalo    Accidente-Choque 
#          "Médicos"         "Disturbio"         "Accidente" 
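Since `regexpr()` reports where a match starts, the same idea can also supply the unknown end position for the `substring()` call attempted in the question. This is a rough sketch in base R rather than part of the answer above; the variable names `texts` and `pos` are illustrative.

```r
texts <- c("Médicos-Otros", "Disturbio-Escándalo", "Accidente-Choque")

# Position of the first dash in each string (-1 when there is none);
# as.integer() drops the extra match attributes that regexpr() attaches
pos <- as.integer(regexpr("-", texts, fixed = TRUE))

# Keep everything before the dash; leave dash-free strings unchanged
extract <- ifelse(pos > 0, substring(texts, 1, pos - 1), texts)
extract
# [1] "Médicos"   "Disturbio" "Accidente"
```

Both `regexpr()` and `substring()` are vectorized, so no `sapply()` loop is needed here.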
AkselA