how to use sed, awk, or gawk to print only what is matched?

Question

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.

But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:

Example regular expression:

.*abc([0-9]+)xyz.*

Example input file:

a
b
c
abc12345xyz
a
b
c

As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:

myvalue=$( sed <...something...> input.txt )

Things I've tried include:

sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing

score 45 · Accepted Answer · answered Nov 14 '09 at 08:50

My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:

sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt

For matching at least one numeric character without +, I would use:

sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
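To get the value into a bash variable, as the question asks, the command can be wrapped in a command substitution. A minimal sketch, assuming a throwaway sample file (the temporary file and the variable name `input` are illustrative, not part of the original answer):

```shell
# Build a throwaway copy of the question's sample input (hypothetical temp file).
input=$(mktemp)
printf '%s\n' a b c abc12345xyz a b c > "$input"

# -n suppresses sed's default printing; the trailing p prints only lines
# where the substitution succeeded, so only the captured digits come out.
myvalue=$(sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' "$input")

echo "$myvalue"
rm -f "$input"
```

If several lines match, `myvalue` will hold one extracted value per line.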

answered Nov 14 '09 at 08:50

mouviciel

Thank you, this worked for me as well once I used * instead of +. – Stéphane Nov 14 '09 at 08:59
...and the "p" option to print the match, which I didn't know about either. Thanks again. – Stéphane Nov 14 '09 at 09:05
I had to escape the `+` and then it worked for me: `sed -n 's/^.*abc\([0-9]\+\)xyz.*$/\1/p'` – Dennis Williamson Nov 14 '09 at 09:23
That's because you're not using the modern RE format, so + is an ordinary character and you're supposed to express repetition with the {,} bound syntax. You can use sed's -E option to enable the modern RE format. Check re_format(7), specifically the last paragraph of DESCRIPTION: http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man7/re_format.7.html – anddam Mar 03 '13 at 16:33
As well as the `-E` option, you can use `\{1,\}` (in place of `*` or `+`) to count one or more repeats. You can specify a lower bound or an upper bound or both. – Jonathan Leffler Feb 04 '21 at 00:41
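The two portable spellings mentioned in these comments can be compared side by side. This is only a sketch on a single hard-coded line; the variable names are made up for the illustration:

```shell
line='abc12345xyz'

# Basic regular expressions: escaped parens for the group and an escaped
# interval \{1,\} meaning "one or more" (no + operator needed).
bre=$(printf '%s\n' "$line" | sed -n 's/^.*abc\([0-9]\{1,\}\)xyz.*$/\1/p')

# Extended regular expressions via -E: bare parens and + work directly.
ere=$(printf '%s\n' "$line" | sed -nE 's/^.*abc([0-9]+)xyz.*$/\1/p')

printf 'BRE: %s\nERE: %s\n' "$bre" "$ere"
```

Both forms should extract the same digits; the escaped `\+` shown in an earlier comment is a GNU extension rather than part of the basic syntax.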

Ilia Choly · Answer 2 · 2016-02-03T19:44:36.697

40

You can use sed to do this

 sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'

-n don't print the resulting line
-r this makes it so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result

I wrote a tool for myself that makes this easier

rip 'abc(\d+)xyz' '$1'

edited Feb 03 '16 at 19:44

answered Feb 03 '16 at 19:39

Ilia Choly

This is by far the best, and most well-explained answer so far! – Nik Reiman Aug 18 '16 at 09:02
With some explanation, it's way better to understand what's wrong with our issue. Thank you ! – r4phG Oct 11 '17 at 13:17
1. You don't need both the -n and the /p. You just need one of them. 2. There is no meaning for global, because sed is greedy, so with or without it you will get the same result for multiple occurrences: sed -r 's/.*abc([0-9]+)xyz.*/\1/' <<< abc12345xyzabc777xyz AND sed -r 's/.*abc([0-9]+)xyz.*/\1/g' <<< abc12345xyzabc777xyz Both yield: 777 – Avihai Marchiano Sep 17 '21 at 08:20
@AvihaiMarchiano I just tested and it seems like you're right about the `/g` flag. But removing either `-n` or `/p` results in no output being printed for me. – Ilia Choly Sep 17 '21 at 19:33
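The points raised in these comments are easy to check by hand. The sketch below uses `-E` (accepted by both GNU and BSD sed) in place of `-r`, and the sample text and variable names are assumptions made up for the demonstration:

```shell
sample=$(printf '%s\n' a abc12345xyz b)

# -n plus p: only the line where the substitution happened is printed.
with_both=$(printf '%s\n' "$sample" | sed -nE 's/.*abc([0-9]+)xyz.*/\1/p')

# -n without p: nothing is printed at all.
without_p=$(printf '%s\n' "$sample" | sed -nE 's/.*abc([0-9]+)xyz.*/\1/')

# Neither -n nor p: non-matching lines pass through unchanged.
without_n=$(printf '%s\n' "$sample" | sed -E 's/.*abc([0-9]+)xyz.*/\1/')

# Two matches on one line: the leading .* is greedy, so the last one wins
# and the g flag has nothing left to repeat on.
last=$(printf '%s\n' 'abc12345xyzabc777xyz' | sed -nE 's/.*abc([0-9]+)xyz.*/\1/p')

printf 'with_both=[%s] without_p=[%s] last=[%s]\n' "$with_both" "$without_p" "$last"
```

So `-n` and `p` do different jobs: `-n` hides the lines that did not match, and `p` is what prints the ones that did.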

score 18 · Answer 3 · edited Jun 28 '13 at 13:51

I use perl to make this easier for myself. e.g.

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'

This runs Perl. The -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.

The instruction runs a regexp on the line read and, if it matches, prints out the contents of the first set of brackets ($1).

You can do this with multiple file names on the end also, e.g.

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt

edited Jun 28 '13 at 13:51

fedorqui

answered Nov 14 '09 at 08:44

PP.

Thanks, but we don't have access to perl, which is why I was asking about sed/awk/gawk. – Stéphane Nov 14 '09 at 08:50

score 5 · Answer 4 · answered Nov 14 '09 at 10:56

If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.

If not then here's the best sed I could come up with:

sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)

The problem with something like:

sed -e 's/.*\([0-9]*\).*/&/'

.... or

sed -e 's/.*\([0-9]*\).*/\1/'

... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
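The greedy behaviour and the negated-character-class workaround can be seen on a single line. A rough sketch, with variable names chosen only for the illustration:

```shell
line='abc12345xyz'

# The first .* swallows the whole line, so the group [0-9]* matches the
# empty string and the line is replaced by nothing.
greedy=$(printf '%s\n' "$line" | sed -e 's/.*\([0-9]*\).*/\1/')

# [^0-9]* cannot run past the first digit, so the group gets all the digits.
bounded=$(printf '%s\n' "$line" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p')

printf 'greedy=[%s] bounded=[%s]\n' "$greedy" "$bounded"
```

This only works when the text before the number contains no digits; otherwise a literal anchor such as `abc` is still needed.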

answered Nov 14 '09 at 10:56

Jim Dennis

You can just combine two of your `sed` commands in this way: `sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p'` – Dennis Williamson Nov 15 '09 at 04:10
Previously didn't know about -o option on grep. Nice to know. But it prints the entire match, not the "(...)". So if you are matching on "abc([[:digit:]]+)xyz" then you get the "abc" and "xyz" as well as the digits. – Stéphane Nov 16 '09 at 19:09
Thanks for reminding me of `grep -o`! I was trying to do this with `sed` and struggled with my need to find multiple matches on some lines. My solution is https://stackoverflow.com/a/58308239/117471 – Bruno Bronosky Oct 09 '19 at 16:13

score 5 · Answer 5 · answered Aug 22 '16 at 09:01

You can use awk with match() to access the captured group:

$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345

This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
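The three-argument form of match() is a gawk extension. Where only a POSIX awk is available, a similar result can be sketched with the RSTART and RLENGTH variables that the two-argument match() sets. This version assumes the fixed three-character delimiters `abc` and `xyz` from the question:

```shell
value=$(printf '%s\n' a abc12345xyz b |
  awk 'match($0, /abc[0-9]+xyz/) {
         # RSTART/RLENGTH describe the whole match; skip the 3-char "abc"
         # prefix and drop the 3-char "xyz" suffix to keep only the digits.
         print substr($0, RSTART + 3, RLENGTH - 6)
       }')

echo "$value"
```

The offsets 3 and 6 are tied to the lengths of the delimiters, so they would need adjusting for a different pattern.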

With grep you can use a look-behind and look-ahead:

$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345

$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345

This matches [0-9]+ only when it occurs between abc and xyz, and prints just the digits. The -P option requires a grep built with PCRE support, such as GNU grep.

Mark Lakata · Answer 6 · 2016-08-22T22:19:55.460

2

perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.

gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file

Output for the sample input file will be:

12345

Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after the ([0-9]+) to get rid of text before and after the number in the substitution.

edited Aug 22 '16 at 22:19

answered Apr 29 '13 at 20:21

Mark Lakata

A clever, workable solution if you need to (or want to) use gawk. You noted this, but to be clear: non-GNU awk doesn't have gensub(), and therefore doesn't support this. – cincodenada Jan 09 '14 at 21:56
Nice! However, it may be best to use `match()` to access the captured groups. See [my answer](http://stackoverflow.com/a/39075261/1983854) for this. – fedorqui Aug 22 '16 at 10:31

paxdiablo · Answer 7 · 2009-11-14T08:59:30.843

1

If you want to select lines then strip out the bits you don't want:

egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'

It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.

You can see this in action here:

pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>

Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:

egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

edited Nov 14 '09 at 08:59

answered Nov 14 '09 at 08:46

paxdiablo

Interesting... So there isn't a simple way to apply a complex regular expression and get back just what is in the (...) section? Cause while I see what you did here first with grep then with sed, our real situation is much more complex than dropping "abc" and "xyz". The regular expression is used because lots of different text can appear on either side of the text I'd like to extract. – Stéphane Nov 14 '09 at 08:54
I'm sure there *is* a better way if the REs are really complex. Perhaps if you provided a few more examples or a more detailed description, we could adjust our answers to suit. – paxdiablo Nov 14 '09 at 08:56

score 1 · Answer 8 · answered Oct 09 '19 at 16:11

The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.

Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.

$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT

$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz

$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512

Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
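From a script, the two-pass output can be collected and consumed one value at a time. A small sketch, using short options for portability and made-up variable names:

```shell
text='abc12345xyz
abc23451xyz asdf abc34512xyz'

# Pass 1 isolates each abc...xyz match, pass 2 keeps only the digits.
values=$(printf '%s\n' "$text" | grep -oE 'abc[0-9]+xyz' | grep -oE '[0-9]+')

# $values is left unquoted on purpose so the shell splits it on newlines.
count=0
for n in $values; do
  count=$((count + 1))
  echo "match $count: $n"
done
```

Relying on word splitting is fine here because the extracted values are digits only.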

RARE Kpop Manifesto · Answer 9 · 2021-02-04T00:30:31.643

0

why even need match group

gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'

Let FS collect away both ends of the line.

If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.

If you're extra cautious, confirm length of $1 and $3 both being zero.
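For reference, here is the same idea as a runnable pipeline, with plain `awk` standing in for whichever of gawk/mawk is installed (an assumption about the environment) and a few sample lines fed on stdin:

```shell
fs_value=$(printf '%s\n' a abc12345xyz b |
  awk 'BEGIN { FS = "(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) { print $2 }')

echo "$fs_value"
```

Lines without `abc` or `xyz` are not split at all, so their `$2` is empty and they are skipped.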

** edited answer after realizing zero length $2 will trip up my previous solution

edited Feb 04 '21 at 00:30

answered Feb 04 '21 at 00:16

RARE Kpop Manifesto

score 0 · Answer 10 · answered May 05 '21 at 16:38

there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.

If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :

mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) { 

    alnumstr = sprintf("%s%c", alnumstr , x) 
 }; 
 gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr) 
                       
                    # resulting str should be 44-chars long :
                    # all digits, non-vowels, equal sign =, and underscore _

 x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)

 } while ( --x );   # you can pick any level of precision you need.
                    # 10 chars randomly among the set is approx. 54-bits 
                    #
                    # i prefer this set over all ASCII being these 
                    # just about never require escaping 
                    # feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
                    #
                    # now you've made a random nonce that can be 
                    # inserted right in the middle of just about ANYTHING
                    # -- ASCII, Unicode, binary data -- (1) which will always fully
                    # print out, (2) has extremely low chance of actually
                    # appearing inside any real word data, and (3) even lower chance
                    # it accidentally alters the meaning of the underlying data.
                    # (so intentionally leaving them in there and 
                    # passing it along unix pipes remains quite harmless)
                    #
                    # this is essentially the lazy man's approach to making nonces
                    # that kinda-sorta have some resemblance to base64
                    # encoded, without having to write such a module (unless u have
                    # one for awk handy)


    regex1 = (..);  # build whatever regex you want here

    FS = OFS = nonceFS;

 } $0 ~ regex1 { 

    gsub(regex1, nonceFS "&" nonceFS); $0 = $0;  

                   # now you've essentially replicated what gawk patsplit( ) does,
                   # or gawk's split(..., seps) tracking 2 arrays one for the data
                   # in between, and one for the seps.
                   #
                   # via this method, that can all be done upon the entire $0,
                   # without any of the hassle (and slow downs) of 
                   # reading from associatively-hashed arrays,
                   # 
                   # simply print out all your even numbered columns
                   # those will be the parts of "just the match"

If you also run OFS = ""; $1 = $1; then, instead of needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were, the fields of $0 follow a data1-sep1-data2-sep2-... pattern, while $0 itself looks EXACTLY the same as when you first read in the line: a plain print will be byte-for-byte identical to printing immediately upon reading.

Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
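A much smaller sketch of the same wrap-and-split idea follows. It is not the answer's code: it swaps the random nonce for awk's built-in SUBSEP character (octal 034 by default), on the assumption that this byte never appears in the input, and it uses split() into an array rather than resetting FS:

```shell
text='abc12345xyz
no match here
abc23451xyz asdf abc34512xyz'

matches=$(printf '%s\n' "$text" | awk '{
  # Wrap every match in the separator; gsub returns the number of matches.
  if (gsub(/abc[0-9]+xyz/, SUBSEP "&" SUBSEP) == 0) next
  # Pieces now alternate: text, match, text, match, ...
  n = split($0, parts, SUBSEP)
  for (i = 2; i <= n; i += 2) print parts[i]
}')

echo "$matches"
```

The even-numbered pieces are the matches, which is the property the answer relies on when it says to print the even-numbered columns.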

score -1 · Answer 11 · answered Nov 28 '09 at 01:58

you can do it with the shell

while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* ) 
            t="${line##*abc}"
            echo "num is ${t%%xyz*}";;
    esac
done <"file"

answered Nov 28 '09 at 01:58

ghostdog74

score -3 · Answer 12 · answered Nov 14 '09 at 08:54

For awk. I would use the following script:

/.*abc([0-9]+)xyz.*/ {
            print $0;
            next;
            }
            {
            # default: do nothing
            }

answered Nov 14 '09 at 08:54

Pierre

This does not output the numeric value `([0-9]+)`; it outputs the entire line. – Mark Lakata Apr 29 '13 at 20:03

score -3 · Answer 13 · answered Nov 14 '09 at 09:18

gawk '/.*abc([0-9]+)xyz.*/' file

answered Nov 14 '09 at 09:18

ghostdog74

This doesn't seem to work. It prints the entire line instead of the match. – Stéphane Nov 14 '09 at 09:55
in your sample input file , that pattern is the whole line. right??? if you know the pattern is going to be in a specific field: use $1, $2 etc.. eg gawk '$1 ~ /.*abc([0-9]+)xyz.*/' file – ghostdog74 Nov 14 '09 at 15:43
