Commit 08b3243a authored by gsingh58's avatar gsingh58

lec14&15 updatd

parent f5b5fff2
Pipeline #803957 passed
%% Cell type:markdown id:e60c1c48 tags:

# Regex 1

## Reading

- New text: "Principles and Techniques of Data Science", by Sam Lau, Joey Gonzalez, and Deb Nolan
- Used for Berkeley's DS100 Course.
- Read Chapter 13: https://www.textbook.ds100.org/ch/13/text_regex.html


## Regular expressions aka Regex

- Regex: a small language for describing patterns to search for
- Regex patterns are used in many different programming languages (like how many different languages might use SQL queries)
- https://blog.teamtreehouse.com/regular-expressions-10-languages
- Inventor: Stephen Cole Kleene (UW-Madison mathematician) --- https://en.wikipedia.org/wiki/Stephen_Cole_Kleene

![Stephen_Cole_Kleene.png](attachment:Stephen_Cole_Kleene.png)
   
%% Cell type:markdown id:845cf426 tags:

### Review of `str.find(<search string>)` method

- `str.find(<search string>)` method returns the index of the first matching occurrence of the search string (or -1 if there is no match)
- `str.find` is VERY limited -- what if we want to:
    - find all occurrences of "320"
    - find any 3-digit numbers?
    - find any numbers at all?
    - find a number before the word "projects"?
    - substitute a number for something else?

Regexes can do all these things!
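
As a preview of where this lecture is heading, here is a minimal sketch of how Python's `re` module could handle each item in the list above. The patterns (including the word-boundary marker `\b`, which is not covered in this section) and the variable names are illustrative assumptions, not the course's official solutions.

```python
import re

msg = "In CS 320,\tthere are 28 lectures, 11 quizzes, 3 exams,\t6 projects, and 1000 things to learn. CS 320 is awesome!"

# all occurrences of "320" (start index of each match)
all_320 = [m.start() for m in re.finditer(r"320", msg)]

# any 3-digit numbers (\b keeps part of "1000" from matching)
three_digit = re.findall(r"\b\d{3}\b", msg)

# any numbers at all
numbers = re.findall(r"\d+", msg)

# a number before the word "projects" (parentheses capture just the number)
before_projects = re.findall(r"(\d+) projects", msg)

# substitute every number with something else (here the letter N)
redacted = re.sub(r"\d+", "N", msg)

print(all_320, three_digit, numbers, before_projects, redacted, sep="\n")
```

Each of these is a one-liner with `re`, whereas `str.find` would need a hand-written loop for every case.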
   
%% Cell type:code id:e54596ac tags:

``` python
msg = "In CS 320,\tthere are 28 lectures, 11 quizzes, 3 exams,\t6 projects, and 1000 things to learn. CS 320 is awesome!"

# does the string contain "320"?
has_320 = msg.find("320") >= 0
print(has_320, msg.find("320"))
```

%% Output

True 6
   
%% Cell type:code id:52f4c6e0 tags:

``` python
# prints tab between A and B
print("A\tB")
# what if we want to literally print \t between A and B?
# we need to escape the backslash itself (\\)
print("A\\tB")
```

%% Output

A B
A\tB
   
%% Cell type:markdown id:1f152bcc tags:

### Raw string

- An easier way to make Python keep every backslash literally, instead of remembering to escape each one
- Syntax: `r"<some string>"` ---> add "r" in the front
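
A small sketch of the difference, using hypothetical variable names `normal` and `raw`: the raw string keeps the backslash and the `t` as two separate characters.

```python
# "\t" is one character (a tab); r"\t" is two characters (backslash, t)
normal = "A\tB"
raw = r"A\tB"

print(len(normal), len(raw))   # 3 4
print(raw == "A\\tB")          # True: same string as the escaped version
```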
   
%% Cell type:code id:d277ace9 tags:

``` python
print(r"A\tB")
```

%% Output

A\tB

%% Cell type:code id:0dba68b0 tags:

``` python
# import statements
import re
```
   
%% Cell type:code id:b97c8008-8e39-4a9f-89a1-0d1ddbb1ac01 tags:

``` python
# Example strings
# from DS100 book...
def reg(regex, text):
    """
    Prints the string with the regex match highlighted.
    """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))

# the backslash in the shrug emoticon is written as \\ so that Python
# does not treat \_ as an (invalid) escape sequence
s1 = " ".join(["A DAG is a directed graph without cycles.",
               "A tree is a DAG where every node has one parent (except the root, which has none).",
               "To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\\_(ツ)_/¯"])
print(s1)

s2 = """1-608-123-4567
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
608-123-4567
123-4567
1-123-4567 (not a phone number)
"""
print(s2)

s3 = "In CS 320, there are 11 quizzes, 6 projects, 28 lectures, and 1000 things to learn. CS 320 is awesome!"
print(s3)

s4 = """In CS 320, there are 11 quizzes, 6 projects,
28 lectures, and 1000 things to learn. CS 320 is awesome!"""
print(s4)
```
   
%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
1-608-123-4567
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
608-123-4567
123-4567
1-123-4567 (not a phone number)
In CS 320, there are 11 quizzes, 6 projects, 28 lectures, and 1000 things to learn. CS 320 is awesome!
In CS 320, there are 11 quizzes, 6 projects,
28 lectures, and 1000 things to learn. CS 320 is awesome!

%% Cell type:code id:924069c5-82be-4423-b659-2beee8e226be tags:

``` python
print(s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:ed8bca65 tags:

### Regex: double escaping (use case for raw strings)

- The regex engine does its own level of escape processing for special sequences like `\t`, `\n`, etc., on top of the escaping Python already applies to string literals

![Double_escaping.png](attachment:Double_escaping.png)
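
To make the two layers concrete, here is a minimal sketch. The sample string `text` and the variable names are assumptions chosen for illustration; only standard `re` calls are used.

```python
import re

text = "shrug: ¯\\_(ツ)_/¯"    # this string contains exactly one backslash

# Layer 1 (Python): the literal "\\\\" is a 2-character string of backslashes
# Layer 2 (regex):  that 2-character pattern matches one literal backslash
cumbersome = "\\\\"
raw = r"\\"
print(cumbersome == raw, len(raw))   # True 2
print(re.findall(raw, text))         # one match: a single backslash

# With only one level of escaping, the regex engine receives a lone
# backslash, which is an incomplete escape sequence.
try:
    re.compile("\\")
    compiled = True
except re.error:
    compiled = False
print(compiled)                      # False
```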
   
#### Find the right arm "\".

- `reg(<PATTERN>, <STRING>)`
   
%% Cell type:code id:003b6b2e-285f-4575-bf9f-6ecd0e086851 tags:

``` python
# Python will be unhappy:
# \ works as an escape character here, so it escapes the second ",
# meaning Python treats that " as a literal quote and the string never ends
# reg("\", s1) # uncomment to see error
```

%% Cell type:code id:b935eed3-322d-4b89-ae4a-ec10456e3fb6 tags:

``` python
# Regex will be unhappy:
# Python turns "\\" into a single \, which the regex engine then treats
# as the start of an escape sequence rather than a literal backslash
# reg("\\", s1) # uncomment to see error
```
   
%% Cell type:code id:a3a8ca37-9dba-4782-80f0-e6be5e3d98a1 tags:

``` python
# Correct and cumbersome way to do this
reg("\\\\", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:0636b9b1-159a-4597-a32f-e36510802753 tags:

``` python
# Better way would be to use raw string to avoid double escaping
reg(r"\\", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:6a2eff34 tags:

### Regex is case sensitive
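
A quick way to check case sensitivity without relying on highlighted output is to count matches with `re.findall`. This is only a sketch on a shortened sample string; the `re.IGNORECASE` flag shown at the end is a standard `re` option that this section does not otherwise cover.

```python
import re

text = "A DAG is a directed graph"

lower = re.findall(r"a", text)                         # lowercase only
upper = re.findall(r"A", text)                         # uppercase only
either = re.findall(r"a", text, flags=re.IGNORECASE)   # both cases

print(len(lower), len(upper), len(either))   # 2 2 4
```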
   
#### Find all occurrences of "a".

%% Cell type:code id:20fc2d34 tags:

``` python
reg(r"a", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:markdown id:f39d05a7 tags:

#### Find all occurrences of "A".

%% Cell type:code id:e4604bfc tags:

``` python
reg(r"A", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:10807f2b tags:

### Character classes

- Character classes can be mentioned within `[...]`
- `^` as the first character inside `[...]` means `NOT` of a character class
- `-` enables us to mention range of characters, for example `[A-Z]`
- `|` enables us to perform `OR` between whole patterns (it is used outside `[...]`; inside brackets it is just a literal `|`)
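
The bullets above can be sketched with `re.findall`, which returns the matches as a list instead of highlighting them. The short sample string is an assumption chosen to keep the results easy to read.

```python
import re

text = "A tree is a DAG."

either_a = re.findall(r"[aA]", text)            # ['A', 'a', 'A']
not_lower_or_space = re.findall(r"[^a-z ]", text)  # ['A', 'D', 'A', 'G', '.']
capitals = re.findall(r"[A-Z]", text)           # ['A', 'D', 'A', 'G']
words = re.findall(r"tree|DAG", text)           # ['tree', 'DAG']

# inside brackets, | is an ordinary character, not OR
pipe_literal = re.findall(r"[a|A]", "a|A")      # ['a', '|', 'A']

print(either_a, not_lower_or_space, capitals, words, pipe_literal, sep="\n")
```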
   
#### Find both "a" and "A".

%% Cell type:code id:5c07ad88-3019-4d75-aacf-743b3f01ef31 tags:

``` python
# Doesn't work - because we are trying to match literally for "aA"
reg("aA", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:cc0694e1-474e-44e2-b1e4-a75d1b518858 tags:

``` python
reg("[aA]", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:44eb40b8 tags:

#### Find all the vowels.

%% Cell type:code id:dee65e40-a4ad-4ea7-92c0-4daded4af747 tags:

``` python
reg("[aeiouAEIOU]", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:markdown id:91cfecdd tags:

#### Find everything except vowels.

%% Cell type:code id:d27513a4-0774-4e43-bdaa-d777c5eb85a2 tags:

``` python
reg("[^aeiouAEIOU]", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:0fe5a885 tags:

#### Find all capital letters.

%% Cell type:code id:f1015a5b-47b5-4e70-b6f1-e18842e54c0d tags:

``` python
reg("[A-Z]", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:markdown id:04a88df6 tags:

#### What if we want to find "A", "Z", and "-"?

%% Cell type:code id:7b46d6d1-f230-45ff-adda-ff2a49ddad16 tags:

``` python
# How can we change this to do that?
# escaping the - makes it a literal hyphen instead of a range operator
reg(r"[A\-Z]", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:26a0f064 tags:

#### Invalid ranges don't work. For example: `[Z-A]`.

%% Cell type:code id:91edab5e tags:

``` python
# reg("[Z-A]", s1) # uncomment to see error
```
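
To see the failure without stopping the notebook, the error can be caught. This is a sketch that compiles the pattern directly rather than going through `reg`; `error_message` is a hypothetical variable name, and the exact wording of the message may differ between Python versions.

```python
import re

try:
    re.compile(r"[Z-A]")       # range end comes before range start
    error_message = None
except re.error as exc:        # re raises re.error for invalid patterns
    error_message = str(exc)

print(error_message)
```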
   
%% Cell type:markdown id:5a0d0886 tags:

#### Find all words related to graphs.

%% Cell type:code id:4fef9e57 tags:

``` python
# | means OR
reg(r"tree|directed|undirected|graph|DAG|node|child|parent|root|cycles", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:4c3fb3ee tags:

### Metacharacters

- predefined character classes
    - `\d` => digits
    - `\s` => whitespace (space, tab, newline)
    - `\w` => "word" characters (digits, letters, underscores, etc) --- helpful for variable name matches and whole word matches (as it doesn't match whitespace --- `\s`)
    - `.` => wildcard: anything except newline
- capitalized versions of these character classes mean `NOT`, for example `\D` => everything except digits
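
A minimal sketch of each metacharacter on tiny sample strings (the strings are assumptions picked so that every result is short enough to inspect):

```python
import re

text = "CS 320:\t28 lectures\n"

digits = re.findall(r"\d", text)            # ['3', '2', '0', '2', '8']
whitespace = re.findall(r"\s", text)        # [' ', '\t', ' ', '\n']
word_chars = re.findall(r"\w", "a_1 -!")    # ['a', '_', '1']
non_digits = re.findall(r"\D", "a1b2")      # ['a', 'b']
any_char = re.findall(r".", text)           # every character except the final newline

print(digits, whitespace, word_chars, non_digits, len(any_char), sep="\n")
```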
   
#### Find all digits.

%% Cell type:code id:96923f55-e19c-40c8-ac19-1b14d1ecaae3 tags:

``` python
# v1
reg(r"[0-9]", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:e332472f tags:

``` python
# v2 - with metacharacters
reg(r"\d", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:71967170 tags:

#### Find all whitespaces.

%% Cell type:code id:7438bd72-c477-4b9b-990a-0de55019f02a tags:

``` python
reg(r"\s", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:markdown id:a9dd37a5 tags:

#### Find everything except whitespaces.

%% Cell type:code id:2ecfed39-e744-4c70-b612-afdba56ad48b tags:

``` python
reg(r"\S", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:647f8d70 tags:

#### Find anything except newline.

%% Cell type:code id:51216a97-db8d-4bf6-bbb0-5dccd0699f9a tags:

``` python
reg(r".", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:markdown id:f3c68051 tags:

#### What if we want to actually match "."?

%% Cell type:code id:382732ec-dec2-49ed-8854-8570d8c3f149 tags:

``` python
# How can we change this to do that?
reg(r"\.", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:7dc24ebb-1749-4dc7-b386-d4e10e938068 tags:

### Repetition

- `<character>{<num matches>}` - for example: `w{3}`
- matches cannot overlap
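
A short sketch of fixed-count repetition and the no-overlap rule. The string `url` (with four w's) is a hypothetical input chosen to make the non-overlapping behaviour visible.

```python
import re

url = "wwww.example.com"   # hypothetical string with four w's

three = re.findall(r"w{3}", url)         # ['www']: the leftover single w is not enough
two = re.findall(r"w{2}", url)           # ['ww', 'ww']: two separate, non-overlapping matches
two_in_www = re.findall(r"w{2}", "www")  # ['ww']: the middle w cannot be reused

# {n} applies to whatever precedes it, including metacharacters
phone = re.findall(r"\d{3}-\d{4}", "call 123-4567")   # ['123-4567']

print(three, two, two_in_www, phone, sep="\n")
```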
   
#### Find all "www".

%% Cell type:code id:7a7fb583-c198-494a-abdb-f7ca668798a5 tags:

``` python
# v1
reg(r"www", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:7e20ed80-de3f-45a6-be38-e21e9c6ffd66 tags:

``` python
# v2 - repetition
reg(r"w{3}", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:c48d3d94-19b7-416e-b079-3f78bd3ce834 tags:

``` python
# Lesson: matches cannot overlap
reg(r"w{2}", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:cc2bbdda tags:

### Variable length repetition operators

- `*` => 0 or more (greedy: match as many characters as possible)
- `+` => 1 or more (greedy: match as many characters as possible)
- `?` => 0 or 1
- `*?` => 0 or more (non-greedy: match as few characters as possible)
- `+?` => 1 or more (non-greedy: match as few characters as possible)
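
The operators above can be compared side by side on small inputs. This is an illustrative sketch; the sample strings are assumptions and are not taken from the lecture.

```python
import re

text = "f(x) + g(y)"

greedy = re.findall(r"\(.*\)", text)    # ['(x) + g(y)']: runs to the LAST ")"
lazy = re.findall(r"\(.*?\)", text)     # ['(x)', '(y)']: stops at the FIRST ")"

star = re.findall(r"ab*", "a ab abbb")  # ['a', 'ab', 'abbb']: zero or more b's
plus = re.findall(r"ab+", "a ab abbb")  # ['ab', 'abbb']: at least one b
optional = re.findall(r"colou?r", "color colour")  # ['color', 'colour']

print(greedy, lazy, star, plus, optional, sep="\n")
```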
   
#### Find everything inside of parentheses.

%% Cell type:code id:11f75d92-f215-47f4-8804-5e5727d3be55 tags:

``` python
# this doesn't work
# it captures everything because () have special meaning (coming up)
reg(r"(.*)", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:b488460e tags:

``` python
# How can we change this to not use special meaning of ()?
# * is greedy: match as many characters as possible
reg(r"\(.*\)", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:42155b54-d27b-4418-81a1-c005c508d738 tags:

``` python
# non-greedy: stop at the first possible spot instead of the last possible spot
reg(r"\(.*?\)", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:0fd70cfd tags:

### Anchor characters
- `^` => start of string
    - `^` is overloaded --- what was the other usage?
- `$` => end of string
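
A small sketch of both anchors, plus the other meaning of `^` inside brackets. The sample sentence is the first sentence of `s1`; the variable names are hypothetical.

```python
import re

text = "A DAG is a directed graph without cycles."

first_word = re.findall(r"^\w+", text)    # ['A']: only at the start of the string
last_word = re.findall(r"\S+$", text)     # ['cycles.']: only at the end of the string
not_at_start = re.findall(r"^DAG", text)  # []: "DAG" exists, but not at the start

# inside [...], ^ means NOT instead of "start of string"
non_letters = re.findall(r"[^a-zA-Z]", "a-b")   # ['-']

print(first_word, last_word, not_at_start, non_letters, sep="\n")
```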
   
#### Find everything in the first sentence.

%% Cell type:code id:40fed5db tags:

``` python
# doesn't work: remember that regex finds all possible (non-overlapping) matches,
# so it matches every single sentence
# (even though we are doing a non-greedy match)
reg(r".*?\.", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯

%% Cell type:code id:7e97abd8 tags:

``` python
# ^ anchors the match to the start of the string, so only the first sentence matches
reg(r"^.*?\.", s1)
```

%% Output

A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
   
%% Cell type:markdown id:f66a4651 tags: %% Cell type:markdown id:f66a4651 tags:
   
#### Find everything in the first two sentences. #### Find everything in the first two sentences.
   
%% Cell type:code id:cfa8353e-4402-4f64-b3f8-35e73b2d7a68 tags: %% Cell type:code id:cfa8353e-4402-4f64-b3f8-35e73b2d7a68 tags:
   
``` python ``` python
reg(r"^(.*?\.){2}", s1) reg(r"^(.*?\.){2}", s1)
``` ```
   
%% Output %% Output
   
A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯ A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
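
Since `reg` only highlights matches, the effect of `^` can be easier to see with `re.findall`, which returns the matched text. The following is a minimal sketch on a short sample string (`text` is illustrative, not the lecture's `s1`). It also uses a non-capturing group `(?:...)`, which is an addition to the lecture's pattern, so that `re.findall` returns the whole match rather than the last repetition of the group.

```python
import re

# Illustrative sample text (an assumption, not the lecture's s1).
text = "Regex is fun. It is also terse. Practice helps."

# Without an anchor, findall returns every non-overlapping match,
# so each sentence is matched in turn.
all_sentences = re.findall(r".*?\.", text)

# With ^, a match may only begin at the start of the string,
# so at most one match is possible.
first_sentence = re.findall(r"^.*?\.", text)

# (?:...) groups without capturing, so findall returns the full match.
first_two = re.findall(r"^(?:.*?\.){2}", text)

print(all_sentences)
print(first_sentence)
print(first_two)
```

With the capturing form `^(.*?\.){2}`, `reg` still highlights both sentences, but `re.findall` would return only what the group captured on its final repetition.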
   
%% Cell type:markdown id:76570acd tags: %% Cell type:markdown id:76570acd tags:
   
#### Find last "word" in the sentence. #### Find last "word" in the sentence.
   
%% Cell type:code id:f35a6c81-47b5-48eb-8357-888fe832f85b tags: %% Cell type:code id:f35a6c81-47b5-48eb-8357-888fe832f85b tags:
   
``` python ``` python
reg(r"\S+$", s1) reg(r"\S+$", s1)
``` ```
   
%% Output %% Output
   
A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯ A DAG is a directed graph without cycles. A tree is a DAG where every node has one parent (except the root, which has none). To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯
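
The `$` anchor can be checked the same way. This is a small sketch on an illustrative string (`text` is an assumption, not from the lecture); it shows why the lecture uses `\S+` rather than `\w+` for the last "word".

```python
import re

# Illustrative sample text (an assumption, not the lecture's s1).
text = "Regex is fun. Practice helps."

# \S+$ : a run of non-whitespace characters that ends at the end of the string.
last_word = re.findall(r"\S+$", text)

# ^\S+ : a run of non-whitespace characters that starts at the start of the string.
first_word = re.findall(r"^\S+", text)

# \w+$ finds nothing here: the string ends with ".", which is not a word character.
no_match = re.findall(r"\w+$", text)

print(last_word, first_word, no_match)
```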
   
%% Cell type:markdown id:4b25fb66 tags: %% Cell type:markdown id:4b25fb66 tags:
   
### Case study: find all phone numbers. ### Case study: find all phone numbers.
   
%% Cell type:code id:8ecbeaf0 tags: %% Cell type:code id:8ecbeaf0 tags:
   
``` python ``` python
print(s2) print(s2)
# The country code (1) in the front is optional # The country code (1) in the front is optional
# The area code (608) is also optional # The area code (608) is also optional
# Doesn't make sense to match country code without area code though! # Doesn't make sense to match country code without area code though!
``` ```
   
%% Output %% Output
   
1-608-123-4567 1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
   
%% Cell type:code id:88a01725-790c-4f21-b334-e3d31bed24b5 tags: %% Cell type:code id:88a01725-790c-4f21-b334-e3d31bed24b5 tags:
   
``` python ``` python
# Full US phone numbers # Full US phone numbers
reg(r"\d-\d{3}-\d{3}-\d{4}", s2) reg(r"\d-\d{3}-\d{3}-\d{4}", s2)
``` ```
   
%% Output %% Output
   
1-608-123-4567 1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
   
%% Cell type:code id:ea8ce0be-98f8-4f9c-bcc0-1d8a0bb183a8 tags: %% Cell type:code id:ea8ce0be-98f8-4f9c-bcc0-1d8a0bb183a8 tags:
   
``` python ``` python
# The country code (1) in the front is optional # The country code (1) in the front is optional
reg(r"(\d-)?\d{3}-\d{3}-\d{4}", s2) reg(r"(\d-)?\d{3}-\d{3}-\d{4}", s2)
``` ```
   
%% Output %% Output
   
1-608-123-4567 1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
   
%% Cell type:code id:5befb23a-848f-44d4-aaac-6b21ac0bbf43 tags: %% Cell type:code id:5befb23a-848f-44d4-aaac-6b21ac0bbf43 tags:
   
``` python ``` python
# The area code (608) is also optional # The area code (608) is also optional
# Doesn't make sense to have country code without area code though! # Doesn't make sense to have country code without area code though!
reg(r"(\d-)?(\d{3}-)?\d{3}-\d{4}", s2) reg(r"(\d-)?(\d{3}-)?\d{3}-\d{4}", s2)
``` ```
   
%% Output %% Output
   
1-608-123-4567 1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
   
%% Cell type:code id:34164fd1-a422-45a0-8e11-54199ca77120 tags: %% Cell type:code id:34164fd1-a422-45a0-8e11-54199ca77120 tags:
   
``` python ``` python
# This is good enough for 320 quizzes/tests # This is good enough for 320 quizzes/tests
# But clearly, the last match is not correct # But clearly, the last match is not correct
reg(r"((\d-)?\d{3}-)?\d{3}-\d{4}", s2) reg(r"((\d-)?\d{3}-)?\d{3}-\d{4}", s2)
``` ```
   
%% Output %% Output
   
1-608-123-4567 1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
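
The highlighting from `reg` does not survive in plain-text output, so here is a sketch that lists what the last two patterns actually match. `full_matches` is a hypothetical helper, not part of the lecture code; it uses `re.finditer` because `re.findall` would return the group contents instead of the full match.

```python
import re

# Copy of the lecture's s2, repeated so the block runs on its own.
s2 = """1-608-123-4567
a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number)
608-123-4567
123-4567
1-123-4567 (not a phone number)
"""

def full_matches(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Country code and area code are each optional, independently of one another.
independent = full_matches(r"(\d-)?(\d{3}-)?\d{3}-\d{4}", s2)

# The country code is nested inside the optional area-code group,
# so it can only appear together with an area code.
nested = full_matches(r"((\d-)?\d{3}-)?\d{3}-\d{4}", s2)

print(independent)
print(nested)
```

The independent version accepts `1-123-4567` in full. The nested version no longer does, but it still finds `123-4567` inside that line, which is the incorrect last match mentioned in the comment.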
   
%% Cell type:markdown id:8a2ee4e2 tags: %% Cell type:markdown id:8a2ee4e2 tags:
   
Regex documentation link: https://docs.python.org/3/library/re.html. Regex documentation link: https://docs.python.org/3/library/re.html.
   
%% Cell type:code id:694a585b-b5a7-4a6f-a0f1-60521f7dfc47 tags: %% Cell type:code id:694a585b-b5a7-4a6f-a0f1-60521f7dfc47 tags:
   
``` python ``` python
# BONUS: negative lookbehind (I won't test this) # BONUS: negative lookbehind (I won't test this)
reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", s2) reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", s2)
``` ```
   
%% Output %% Output
   
1-608-123-4567 1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
   
%% Cell type:markdown id:3973350b tags: %% Cell type:markdown id:3973350b tags:
   
There is also a negative lookahead. For example, how to avoid matching "1-608-123-456" in "1-608-123-4569999". You can explore this if you are interested. There is also a negative lookahead. For example, how to avoid matching "1-608-123-456" in "1-608-123-4569999". You can explore this if you are interested.
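
One possible way to explore this, offered as a sketch rather than the course's answer: append the negative lookahead `(?!\d)` to the lookbehind pattern, so a match is rejected when another digit follows it. `full_matches` is a hypothetical helper and the test strings are illustrative.

```python
import re

# The lecture's lookbehind pattern with a negative lookahead appended.
# (?!\d) asserts that the character after the match is not a digit.
pattern = r"(?<!\d-)((\d-)?\d{3}-)?\d{3}-\d{4}(?!\d)"

def full_matches(pattern, text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Too many trailing digits: nothing should match.
too_long = full_matches(pattern, "608-123-4569999")

# A well-formed number surrounded by other text still matches.
exact = full_matches(pattern, "call 608-123-4567 today")

print(too_long, exact)
```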
   
%% Cell type:code id:4988d765 tags: %% Cell type:code id:4988d765 tags:
   
``` python ``` python
reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", "608-123-4569999") reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", "608-123-4569999")
``` ```
   
%% Output %% Output
   
608-123-4569999 608-123-4569999
   
%% Cell type:markdown id:b02ae9e0 tags: %% Cell type:markdown id:b02ae9e0 tags:
   
### Testing your regex ### Testing your regex
- you could use `reg(...)` function - you could use `reg(...)` function
- another useful resource: https://regex101.com/ - another useful resource: https://regex101.com/
   
%% Cell type:markdown id:4a973271 tags: %% Cell type:markdown id:4a973271 tags:
   
### `re` module ### `re` module
- `re.findall(<PATTERN>, <SEARCH STRING>)`: regular expression matches - `re.findall(<PATTERN>, <SEARCH STRING>)`: regular expression matches
- returns a list of strings - returns a list of strings
- `re.sub(<PATTERN>, <REPLACEMENT>, <SEARCH STRING>)`: regular expression match + substitution - `re.sub(<PATTERN>, <REPLACEMENT>, <SEARCH STRING>)`: regular expression match + substitution
- returns a new string with the substitutions (remember strings are immutable) - returns a new string with the substitutions (remember strings are immutable)
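
A minimal sketch of both calls on an illustrative string (`text` is an assumption, not from the lecture):

```python
import re

text = "a1b22c333"  # illustrative input

# findall returns the matched substrings as a list of strings.
numbers = re.findall(r"\d+", text)

# sub returns a new string; the original is left unchanged.
masked = re.sub(r"\d+", "#", text)

print(numbers)
print(masked)
print(text)
```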
   
%% Cell type:code id:73ec525f tags: %% Cell type:code id:73ec525f tags:
   
``` python ``` python
print(msg) print(msg)
``` ```
   
%% Output %% Output
   
In CS 320, there are 28 lectures, 11 quizzes, 3 exams, 6 projects, and 1000 things to learn. CS 320 is awesome! In CS 320, there are 28 lectures, 11 quizzes, 3 exams, 6 projects, and 1000 things to learn. CS 320 is awesome!
   
%% Cell type:markdown id:34998a5e tags: %% Cell type:markdown id:34998a5e tags:
   
#### Find all digits. #### Find all digits.
   
%% Cell type:code id:7f42c25a tags: %% Cell type:code id:7f42c25a tags:
   
``` python ``` python
re.findall(r"\d+", msg) re.findall(r"\d+", msg)
``` ```
   
%% Output %% Output
   
['320', '28', '11', '3', '6', '1000', '320'] ['320', '28', '11', '3', '6', '1000', '320']
   
%% Cell type:markdown id:70b2f488 tags: %% Cell type:markdown id:70b2f488 tags:
   
### Groups ### Groups
- we can capture matches using `()` => this is the special meaning of `()` - we can capture matches using `()` => this is the special meaning of `()`
- with two or more groups, `re.findall` returns a list of tuples with one entry per group - with two or more groups, `re.findall` returns a list of tuples with one entry per group
   
#### Find all digits and the word that comes after that. #### Find all digits and the word that comes after that.
   
%% Cell type:code id:5309adee tags: %% Cell type:code id:5309adee tags:
   
``` python ``` python
re.findall(r"(\d+) (\w+)", msg) re.findall(r"(\d+) (\w+)", msg)
``` ```
   
%% Output %% Output
   
[('28', 'lectures'), [('28', 'lectures'),
('11', 'quizzes'), ('11', 'quizzes'),
('3', 'exams'), ('3', 'exams'),
('6', 'projects'), ('6', 'projects'),
('1000', 'things'), ('1000', 'things'),
('320', 'is')] ('320', 'is')]
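
The shape of the `re.findall` result depends on how many groups the pattern has. A short sketch on an illustrative string (`text` is an assumption, not the lecture's `msg`):

```python
import re

text = "3 exams, 6 projects"  # illustrative input

# No groups: a list of full matches.
no_groups = re.findall(r"\d+ \w+", text)

# One group: a list of strings holding only what the group captured.
one_group = re.findall(r"(\d+) \w+", text)

# Two groups: a list of tuples, one entry per group.
two_groups = re.findall(r"(\d+) (\w+)", text)

print(no_groups)
print(one_group)
print(two_groups)
```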
   
%% Cell type:markdown id:c4b6b505 tags: %% Cell type:markdown id:c4b6b505 tags:
   
### Unlike matches, groups can overlap ### Unlike matches, groups can overlap
   
#### Find and group all digits and the word that comes after that. #### Find and group all digits and the word that comes after that.
   
%% Cell type:code id:491c3460 tags: %% Cell type:code id:491c3460 tags:
   
``` python ``` python
re.findall(r"((\d+) (\w+))", msg) re.findall(r"((\d+) (\w+))", msg)
``` ```
   
%% Output %% Output
   
[('28 lectures', '28', 'lectures'), [('28 lectures', '28', 'lectures'),
('11 quizzes', '11', 'quizzes'), ('11 quizzes', '11', 'quizzes'),
('3 exams', '3', 'exams'), ('3 exams', '3', 'exams'),
('6 projects', '6', 'projects'), ('6 projects', '6', 'projects'),
('1000 things', '1000', 'things'), ('1000 things', '1000', 'things'),
('320 is', '320', 'is')] ('320 is', '320', 'is')]
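
Groups are numbered by the position of their opening parenthesis, which is why the outer group comes first in each tuple. A sketch using `re.search` on an illustrative string:

```python
import re

# Illustrative input (an assumption, not the lecture's msg).
m = re.search(r"((\d+) (\w+))", "there are 28 lectures")

whole = m.group(0)   # the full match
outer = m.group(1)   # first "(" opens the outer group
count = m.group(2)   # second "(" opens the digits group
word = m.group(3)    # third "(" opens the word group

print(m.groups())
```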
   
%% Cell type:markdown id:d2227e69 tags: %% Cell type:markdown id:d2227e69 tags:
   
#### Substitute all digits with "###". #### Substitute all digits with "###".
   
%% Cell type:code id:6d1fede1 tags: %% Cell type:code id:6d1fede1 tags:
   
``` python ``` python
re.sub(r"\d+", "###", msg) re.sub(r"\d+", "###", msg)
``` ```
   
%% Output %% Output
   
'In CS ###,\tthere are ### lectures, ### quizzes, ### exams,\t### projects, and ### things to learn. CS ### is awesome!' 'In CS ###,\tthere are ### lectures, ### quizzes, ### exams,\t### projects, and ### things to learn. CS ### is awesome!'
   
%% Cell type:markdown id:9d531122 tags: %% Cell type:markdown id:9d531122 tags:
   
#### Substitute every run of whitespace with a single space. #### Substitute every run of whitespace with a single space.
   
%% Cell type:code id:4becbe70 tags: %% Cell type:code id:4becbe70 tags:
   
``` python ``` python
print(msg) print(msg)
``` ```
   
%% Output %% Output
   
In CS 320, there are 28 lectures, 11 quizzes, 3 exams, 6 projects, and 1000 things to learn. CS 320 is awesome! In CS 320, there are 28 lectures, 11 quizzes, 3 exams, 6 projects, and 1000 things to learn. CS 320 is awesome!
   
%% Cell type:code id:72a6eb42 tags: %% Cell type:code id:72a6eb42 tags:
   
``` python ``` python
re.sub(r"\s+", " ", msg) re.sub(r"\s+", " ", msg)
``` ```
   
%% Output %% Output
   
'In CS 320, there are 28 lectures, 11 quizzes, 3 exams, 6 projects, and 1000 things to learn. CS 320 is awesome!' 'In CS 320, there are 28 lectures, 11 quizzes, 3 exams, 6 projects, and 1000 things to learn. CS 320 is awesome!'
   
%% Cell type:markdown id:6faf33fd tags: %% Cell type:markdown id:6faf33fd tags:
   
### How to use groups in substitution? ### How to use groups in substitution?
- `\g<N>` gives you the result of the N'th grouping. - `\g<N>` gives you the result of the N'th grouping.
   
#### Substitute all course component counts with HTML bold tags. #### Substitute all course component counts with HTML bold tags.
   
%% Cell type:code id:8df577fd tags: %% Cell type:code id:8df577fd tags:
   
``` python ``` python
print(re.sub(r"(\d+)", r"<b>\g<1></b>", msg)) print(re.sub(r"(\d+)", r"<b>\g<1></b>", msg))
``` ```
   
%% Output %% Output
   
In CS <b>320</b>, there are <b>28</b> lectures, <b>11</b> quizzes, <b>3</b> exams, <b>6</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome! In CS <b>320</b>, there are <b>28</b> lectures, <b>11</b> quizzes, <b>3</b> exams, <b>6</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome!
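
A few more uses of group references in the replacement, as a sketch on an illustrative string (`text` is an assumption). The replacement is written as a raw string so that the backslash reaches `re.sub` unchanged.

```python
import re

text = "6 projects"  # illustrative input

# \g<1> and \1 both refer to the first group.
bold = re.sub(r"(\d+)", r"<b>\g<1></b>", text)
bold_short = re.sub(r"(\d+)", r"<b>\1</b>", text)

# \g<1>0 means "group 1, then a literal 0".
# Written as \10 it would be read as a reference to group 10.
times_ten = re.sub(r"(\d+)", r"\g<1>0", text)

# Groups can be reordered in the replacement.
swapped = re.sub(r"(\d+) (\w+)", r"\g<2>: \g<1>", text)

print(bold, bold_short, times_ten, swapped, sep="\n")
```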
   
%% Cell type:markdown id:35a15a41 tags: %% Cell type:markdown id:35a15a41 tags:
   
In CS <b>320</b>, there are <b>40</b> lectures, <b>10</b> quizzes, <b>3</b> exams, <b>7</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome! In CS <b>320</b>, there are <b>40</b> lectures, <b>10</b> quizzes, <b>3</b> exams, <b>6</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome!
%% Cell type:markdown id:e60c1c48 tags: %% Cell type:markdown id:e60c1c48 tags:
# Regex 2 # Regex 2
%% Cell type:code id:0dba68b0 tags: %% Cell type:code id:0dba68b0 tags:
``` python ``` python
#import statements #import statements
import re import re
from subprocess import check_output from subprocess import check_output
import pandas as pd import pandas as pd
``` ```
%% Cell type:code id:b97c8008-8e39-4a9f-89a1-0d1ddbb1ac01 tags: %% Cell type:code id:b97c8008-8e39-4a9f-89a1-0d1ddbb1ac01 tags:
``` python ``` python
# Example strings # Example strings
# from DS100 book... # from DS100 book...
def reg(regex, text): def reg(regex, text):
""" """
Prints the string with the regex match highlighted. Prints the string with the regex match highlighted.
""" """
print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text)) print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))
s1 = " ".join(["A DAG is a directed graph without cycles.", s1 = " ".join(["A DAG is a directed graph without cycles.",
"A tree is a DAG where every node has one parent (except the root, which has none).", "A tree is a DAG where every node has one parent (except the root, which has none).",
"To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯"]) "To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\_(ツ)_/¯"])
print(s1) print(s1)
s2 = """1-608-123-4567 s2 = """1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
""" """
print(s2) print(s2)
s3 = "In CS 320, there are 10 quizzes, 7 projects, 39 lectures, and 1000 things to learn. CS 320 is awesome!" s3 = "In CS 320, there are 11 quizzes, 6 projects, 28 lectures, and 1000 things to learn. CS 320 is awesome!"
print(s3) print(s3)
s4 = """In CS 320, there are 14 quizzes, 7 projects, s4 = """In CS 320, there are 11 quizzes, 6 projects,
41 lectures, and 1000 things to learn. CS 320 is awesome!""" 28 lectures, and 1000 things to learn. CS 320 is awesome!"""
print(s4) print(s4)
``` ```
%% Cell type:code id:924069c5-82be-4423-b659-2beee8e226be tags: %% Cell type:code id:924069c5-82be-4423-b659-2beee8e226be tags:
``` python ``` python
print(s1) print(s1)
``` ```
%% Cell type:markdown id:6a2eff34 tags: %% Cell type:markdown id:6a2eff34 tags:
### Regex is case sensitive ### Regex is case sensitive
### Character classes ### Character classes
- Character classes can be mentioned within `[...]` - Character classes can be mentioned within `[...]`
- `^` as the first character inside `[...]` means `NOT` of the character class - `^` as the first character inside `[...]` means `NOT` of the character class
- `-` enables us to mention range of characters, for example `[A-Z]` - `-` enables us to mention range of characters, for example `[A-Z]`
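- `|` enables us to perform `OR` between alternatives (outside `[...]`; inside the brackets `|` is a literal character) - `|` enables us to perform `OR` between alternatives (outside `[...]`; inside the brackets `|` is a literal character)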
### Metacharacters ### Metacharacters
- predefined character classes - predefined character classes
- `\d` => digits - `\d` => digits
- `\s` => whitespace (space, tab, newline) - `\s` => whitespace (space, tab, newline)
- `\w` => "word" characters (digits, letters, underscores, etc) --- helpful for variable name matches and whole word matches (as it doesn't match whitespace --- `\s`) - `\w` => "word" characters (digits, letters, underscores, etc) --- helpful for variable name matches and whole word matches (as it doesn't match whitespace --- `\s`)
- `.` => wildcard: anything except newline - `.` => wildcard: anything except newline
- capitalized versions of these metacharacters mean `NOT`, for example `\D` => everything except digits - capitalized versions of these metacharacters mean `NOT`, for example `\D` => everything except digits
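
The character classes and metacharacters above can be tried out with `re.findall`. This is a sketch on an illustrative string (`text` is an assumption, not one of the lecture's example strings):

```python
import re

text = "CS 320 has 6 projects"  # illustrative input

upper = re.findall(r"[A-Z]+", text)          # range inside a character class
not_lower = re.findall(r"[^a-z ]+", text)    # ^ negates the class
digits = re.findall(r"\d+", text)            # predefined class: digits
non_digits = re.findall(r"\D+", text)        # capitalized: everything except digits
words = re.findall(r"\w+", text)             # "word" characters
either = re.findall(r"has|projects", text)   # | is OR between alternatives

print(upper, not_lower, digits, non_digits, words, either, sep="\n")
```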
### REPETITION ### REPETITION
- `<character>{<num matches>}` - for example: `w{3}` - `<character>{<num matches>}` - for example: `w{3}`
- matches cannot overlap - matches cannot overlap
### Variable length repetition operators ### Variable length repetition operators
- `*` => 0 or more (greedy: match as many characters as possible) - `*` => 0 or more (greedy: match as many characters as possible)
- `+` => 1 or more (greedy: match as many characters as possible) - `+` => 1 or more (greedy: match as many characters as possible)
- `?` => 0 or 1 - `?` => 0 or 1
- `*?` => 0 or more (non-greedy: match as few characters as possible) - `*?` => 0 or more (non-greedy: match as few characters as possible)
- `+?` => 1 or more (non-greedy: match as few characters as possible) - `+?` => 1 or more (non-greedy: match as few characters as possible)
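
The difference between greedy and non-greedy repetition is easiest to see side by side. A sketch on an illustrative string (`text` is an assumption, not the lecture's `s1`):

```python
import re

text = "f(x) + g(y)"  # illustrative input

# Greedy: .* runs to the last ")" it can reach.
greedy = re.findall(r"\(.*\)", text)

# Non-greedy: .*? stops at the first ")" that completes a match.
lazy = re.findall(r"\(.*?\)", text)

print(greedy)
print(lazy)
```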
#### Find everything inside of parentheses. #### Find everything inside of parentheses.
%% Cell type:code id:11f75d92-f215-47f4-8804-5e5727d3be55 tags: %% Cell type:code id:11f75d92-f215-47f4-8804-5e5727d3be55 tags:
``` python ``` python
# this doesn't work # this doesn't work
# it captures everything because () have special meaning (coming up) # it captures everything because () have special meaning (coming up)
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:code id:b488460e tags: %% Cell type:code id:b488460e tags:
``` python ``` python
# How can we change this to not use special meaning of ()? # How can we change this to not use special meaning of ()?
# * is greedy: match as many characters as possible # * is greedy: match as many characters as possible
reg(r"(.*)", s1) reg(r"(.*)", s1)
``` ```
%% Cell type:code id:42155b54-d27b-4418-81a1-c005c508d738 tags: %% Cell type:code id:42155b54-d27b-4418-81a1-c005c508d738 tags:
``` python ``` python
# non-greedy: stop at the first possible spot instead of the last possible spot # non-greedy: stop at the first possible spot instead of the last possible spot
reg(r"\(.*\)", s1) reg(r"\(.*\)", s1)
``` ```
%% Cell type:markdown id:0fd70cfd tags: %% Cell type:markdown id:0fd70cfd tags:
### Anchor characters ### Anchor characters
- `^` => start of string - `^` => start of string
- `^` is overloaded --- what was the other usage? - `^` is overloaded --- what was the other usage?
- `$` => end of string - `$` => end of string
#### Find everything in the first sentence. #### Find everything in the first sentence.
%% Cell type:code id:40fed5db tags: %% Cell type:code id:40fed5db tags:
``` python ``` python
# doesn't work: regex finds every non-overlapping match, # doesn't work: regex finds every non-overlapping match,
# so each sentence is matched in turn # so each sentence is matched in turn
# (even though the match is non-greedy) # (even though the match is non-greedy)
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:code id:7e97abd8 tags: %% Cell type:code id:7e97abd8 tags:
``` python ``` python
reg(r".*?\.", s1) reg(r".*?\.", s1)
``` ```
%% Cell type:markdown id:f66a4651 tags: %% Cell type:markdown id:f66a4651 tags:
#### Find everything in the first two sentences. #### Find everything in the first two sentences.
%% Cell type:code id:cfa8353e-4402-4f64-b3f8-35e73b2d7a68 tags: %% Cell type:code id:cfa8353e-4402-4f64-b3f8-35e73b2d7a68 tags:
``` python ``` python
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:markdown id:76570acd tags: %% Cell type:markdown id:76570acd tags:
#### Find last "word" in the sentence. #### Find last "word" in the sentence.
%% Cell type:code id:f35a6c81-47b5-48eb-8357-888fe832f85b tags: %% Cell type:code id:f35a6c81-47b5-48eb-8357-888fe832f85b tags:
``` python ``` python
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:markdown id:4b25fb66 tags: %% Cell type:markdown id:4b25fb66 tags:
### Case study: find all phone numbers. ### Case study: find all phone numbers.
%% Cell type:code id:8ecbeaf0 tags: %% Cell type:code id:8ecbeaf0 tags:
``` python ``` python
print(s2) print(s2)
# The country code (1) in the front is optional # The country code (1) in the front is optional
# The area code (608) is also optional # The area code (608) is also optional
# Doesn't make sense to match country code without area code though! # Doesn't make sense to match country code without area code though!
``` ```
%% Cell type:code id:88a01725-790c-4f21-b334-e3d31bed24b5 tags: %% Cell type:code id:88a01725-790c-4f21-b334-e3d31bed24b5 tags:
``` python ``` python
# Full US phone numbers # Full US phone numbers
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:code id:ea8ce0be-98f8-4f9c-bcc0-1d8a0bb183a8 tags: %% Cell type:code id:ea8ce0be-98f8-4f9c-bcc0-1d8a0bb183a8 tags:
``` python ``` python
# The country code (1) in the front is optional # The country code (1) in the front is optional
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:code id:5befb23a-848f-44d4-aaac-6b21ac0bbf43 tags: %% Cell type:code id:5befb23a-848f-44d4-aaac-6b21ac0bbf43 tags:
``` python ``` python
# The area code (608) is also optional # The area code (608) is also optional
# Doesn't make sense to have country code without area code though! # Doesn't make sense to have country code without area code though!
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:code id:34164fd1-a422-45a0-8e11-54199ca77120 tags: %% Cell type:code id:34164fd1-a422-45a0-8e11-54199ca77120 tags:
``` python ``` python
# This is good enough for 320 quizzes/tests # This is good enough for 320 quizzes/tests
# But clearly, the last match is not correct # But clearly, the last match is not correct
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:markdown id:8a2ee4e2 tags: %% Cell type:markdown id:8a2ee4e2 tags:
Regex documentation link: https://docs.python.org/3/library/re.html. Regex documentation link: https://docs.python.org/3/library/re.html.
%% Cell type:code id:694a585b-b5a7-4a6f-a0f1-60521f7dfc47 tags: %% Cell type:code id:694a585b-b5a7-4a6f-a0f1-60521f7dfc47 tags:
``` python ``` python
# BONUS: negative lookbehind (I won't test this) # BONUS: negative lookbehind (I won't test this)
reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", s2) reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", s2)
``` ```
%% Cell type:markdown id:3973350b tags: %% Cell type:markdown id:3973350b tags:
There is also a negative lookahead. For example, how to avoid matching "1-608-123-456" in "1-608-123-4569999". You can explore this if you are interested. There is also a negative lookahead. For example, how to avoid matching "1-608-123-456" in "1-608-123-4569999". You can explore this if you are interested.
%% Cell type:code id:4988d765 tags: %% Cell type:code id:4988d765 tags:
``` python ``` python
reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", "608-123-4569999") reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", "608-123-4569999")
``` ```
%% Cell type:markdown id:b02ae9e0 tags: %% Cell type:markdown id:b02ae9e0 tags:
### Testing your regex ### Testing your regex
- you could use `reg(...)` function - you could use `reg(...)` function
- another useful resource: https://regex101.com/ - another useful resource: https://regex101.com/
%% Cell type:markdown id:4a973271 tags: %% Cell type:markdown id:4a973271 tags:
### `re` module ### `re` module
- `re.findall(<PATTERN>, <SEARCH STRING>)`: regular expression matches - `re.findall(<PATTERN>, <SEARCH STRING>)`: regular expression matches
- returns a list of strings - returns a list of strings
- `re.sub(<PATTERN>, <REPLACEMENT>, <SEARCH STRING>)`: regular expression match + substitution - `re.sub(<PATTERN>, <REPLACEMENT>, <SEARCH STRING>)`: regular expression match + substitution
- returns a new string with the substitutions (remember strings are immutable) - returns a new string with the substitutions (remember strings are immutable)
%% Cell type:code id:73ec525f tags: %% Cell type:code id:73ec525f tags:
``` python ``` python
msg = "In CS 320,\tthere are 28 lectures, 11 quizzes, 3 exams,\t6 projects, and 1000 things to learn. CS 320 is awesome!" msg = "In CS 320,\tthere are 28 lectures, 11 quizzes, 3 exams,\t6 projects, and 1000 things to learn. CS 320 is awesome!"
print(msg) print(msg)
``` ```
%% Cell type:markdown id:34998a5e tags: %% Cell type:markdown id:34998a5e tags:
#### Find all digits. #### Find all digits.
%% Cell type:code id:7f42c25a tags: %% Cell type:code id:7f42c25a tags:
``` python ``` python
``` ```
%% Cell type:markdown id:70b2f488 tags: %% Cell type:markdown id:70b2f488 tags:
### Groups ### Groups
- we can capture matches using `()` => this is the special meaning of `()` - we can capture matches using `()` => this is the special meaning of `()`
- with two or more groups, `re.findall` returns a list of tuples with one entry per group - with two or more groups, `re.findall` returns a list of tuples with one entry per group
#### Find all digits and the word that comes after that. #### Find all digits and the word that comes after that.
%% Cell type:code id:5309adee tags: %% Cell type:code id:5309adee tags:
``` python ``` python
matches = re.findall(r"", msg) matches = re.findall(r"", msg)
matches matches
``` ```
%% Cell type:markdown id:bc6a982c tags: %% Cell type:markdown id:bc6a982c tags:
#### Goal: make a dict (course component => count, like "projects" => 7) #### Goal: make a dict (course component => count, like "projects" => 6)
%% Cell type:code id:c7f1a028 tags: %% Cell type:code id:c7f1a028 tags:
``` python ``` python
course_dict = {} course_dict = {}
for count, component in matches: for count, component in matches:
course_dict[component] = int(count) course_dict[component] = int(count)
course_dict course_dict
``` ```
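For reference, a self-contained sketch of the same idea with the pattern filled in, using the `(\d+) (\w+)` pattern from the Regex 1 notebook and a dict comprehension in place of the loop. The copy of `msg` is repeated only so the block runs on its own.

```python
import re

# Copy of the lecture's msg, repeated so the block runs on its own.
msg = "In CS 320,\tthere are 28 lectures, 11 quizzes, 3 exams,\t6 projects, and 1000 things to learn. CS 320 is awesome!"

# Each match is a (count, component) tuple because the pattern has two groups.
matches = re.findall(r"(\d+) (\w+)", msg)

# Map component name to its count, converting the count to int.
course_dict = {component: int(count) for count, component in matches}

print(course_dict)
```

The entry for `"is"` comes from "320 is awesome"; the pattern has no way to tell that `is` is not a course component.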
%% Cell type:markdown id:c4b6b505 tags: %% Cell type:markdown id:c4b6b505 tags:
### Unlike matches, groups can overlap ### Unlike matches, groups can overlap
#### Find and group all digits and the word that comes after that. #### Find and group all digits and the word that comes after that.
%% Cell type:code id:491c3460 tags: %% Cell type:code id:491c3460 tags:
``` python ``` python
re.findall(r"(\d+) (\w+)", msg) re.findall(r"(\d+) (\w+)", msg)
``` ```
%% Cell type:markdown id:d2227e69 tags: %% Cell type:markdown id:d2227e69 tags:
#### Substitute all digits with "###". #### Substitute all digits with "###".
%% Cell type:code id:6d1fede1 tags: %% Cell type:code id:6d1fede1 tags:
``` python ``` python
re.sub(r"", "", msg) re.sub(r"", "", msg)
``` ```
%% Cell type:markdown id:9d531122 tags: %% Cell type:markdown id:9d531122 tags:
#### Goal: normalize whitespace (everything will be a single space) #### Goal: normalize whitespace (everything will be a single space)
%% Cell type:code id:4becbe70 tags: %% Cell type:code id:4becbe70 tags:
``` python ``` python
print(msg) print(msg)
``` ```
%% Cell type:code id:72a6eb42 tags: %% Cell type:code id:72a6eb42 tags:
``` python ``` python
re.sub(r"", "", msg) re.sub(r"", "", msg)
``` ```
%% Cell type:markdown id:6faf33fd tags: %% Cell type:markdown id:6faf33fd tags:
### How to use groups in substitution? ### How to use groups in substitution?
- `\g<N>` gives you the result of the N'th grouping. - `\g<N>` gives you the result of the N'th grouping.
#### Substitute all course component counts with HTML bold tags. #### Substitute all course component counts with HTML bold tags.
%% Cell type:code id:8df577fd tags: %% Cell type:code id:8df577fd tags:
``` python ``` python
print(re.sub(r"(\d+)", "<b></b>", msg)) print(re.sub(r"(\d+)", "<b></b>", msg))
``` ```
%% Cell type:markdown id:35a15a41 tags: %% Cell type:markdown id:35a15a41 tags:
In CS <b>320</b>, there are <b>28</b> lectures, <b>11</b> quizzes, <b>3</b> exams, <b>6</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome! In CS <b>320</b>, there are <b>28</b> lectures, <b>11</b> quizzes, <b>3</b> exams, <b>6</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome!
%% Cell type:markdown id:6b299526 tags: %% Cell type:markdown id:6b299526 tags:
### Git log example ### Git log example
%% Cell type:markdown id:a9b7261c tags: %% Cell type:markdown id:a9b7261c tags:
#### Run `git log` as a shell command #### Run `git log` as a shell command
%% Cell type:code id:10e459b4 tags: %% Cell type:code id:10e459b4 tags:
``` python ``` python
!git log !git log
``` ```
%% Cell type:code id:ef440fed tags: %% Cell type:code id:ef440fed tags:
``` python ``` python
git_log_output = str(check_output(["git", "log"]), encoding="utf-8") git_log_output = str(check_output(["git", "log"]), encoding="utf-8")
print(git_log_output[:500]) print(git_log_output[:500])
``` ```
%% Cell type:markdown id:5c154b46 tags: %% Cell type:markdown id:5c154b46 tags:
#### GOAL: find all the commit numbers #### GOAL: find all the commit numbers
%% Cell type:code id:5ea954bf tags: %% Cell type:code id:5ea954bf tags:
``` python ``` python
commits = re.findall(r"", git_log_output) commits = re.findall(r"", git_log_output)
# recent 10 commit numbers # recent 10 commit numbers
commits[:10] commits[:10]
``` ```
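One possible pattern for this goal, shown as a sketch on a made-up log excerpt. The hashes, names, and dates below are hypothetical, and the pattern assumes the default `git log` layout with 40-character hexadecimal commit IDs.

```python
import re

# Hypothetical commit IDs: 40 hexadecimal characters each.
hash_a = "1a2b3c4d5e" * 4
hash_b = "0f9e8d7c6b" * 4

# Hypothetical excerpt in the default `git log` layout (not real output).
sample_log = f"""commit {hash_a}
Author: Ada Example <ada@example.com>
Date:   Mon Mar 4 10:15:00 2024 -0600

    a commit message that mentions the word commit

commit {hash_b}
Author: Bo Sample <bo@example.com>
Date:   Fri Mar 1 16:40:12 2024 -0600

    fix typo
"""

# re.MULTILINE lets ^ match at the start of every line, not just the string.
# Message lines are indented, so they cannot match "^commit".
commits = re.findall(r"^commit ([0-9a-f]{40})", sample_log, flags=re.MULTILINE)

print(commits)
```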
%% Cell type:markdown id:bc485b5f tags: %% Cell type:markdown id:bc485b5f tags:
#### What days of the week does the team push things into this repo? #### What days of the week does the team push things into this repo?
%% Cell type:code id:57353c44 tags: %% Cell type:code id:57353c44 tags:
``` python ``` python
print(git_log_output[:500]) print(git_log_output[:500])
``` ```
%% Cell type:code id:d1ea6f59 tags: %% Cell type:code id:d1ea6f59 tags:
``` python ``` python
days = re.findall(r"", git_log_output) days = re.findall(r"", git_log_output)
days days
``` ```
%% Cell type:markdown id:2c7efb55 tags: %% Cell type:markdown id:2c7efb55 tags:
#### Count unique days #### Count unique days
%% Cell type:code id:3c2d7207 tags: %% Cell type:code id:3c2d7207 tags:
``` python ``` python
day_counts = pd.Series(days).value_counts() day_counts = pd.Series(days).value_counts()
day_counts day_counts
``` ```
%% Cell type:markdown id:7317ca35 tags: %% Cell type:markdown id:7317ca35 tags:
#### Sort by day of the week #### Sort by day of the week
%% Cell type:code id:d5c5c58c tags: %% Cell type:code id:d5c5c58c tags:
``` python ``` python
sorted_day_counts = day_counts.loc[["Mon", "Tue", "Wed", "Thu", "Fri", "Sun"]] sorted_day_counts = day_counts.loc[["Mon", "Tue", "Wed", "Thu", "Fri", "Sun"]]
sorted_day_counts sorted_day_counts
``` ```
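The same counting and ordering can be sketched without pandas, using `collections.Counter` from the standard library. The `Date:` lines below are a hypothetical excerpt in the default `git log` layout, and the pattern is one possible choice rather than the lecture's answer. A `Counter` reports 0 for days that never occur, whereas selecting a missing label with `.loc` raises a `KeyError`, which may be why one day is absent from the list above.

```python
import re
from collections import Counter

# Hypothetical Date lines in the default `git log` layout (not real output).
sample_log = (
    "Date:   Mon Mar 4 10:15:00 2024 -0600\n"
    "Date:   Fri Mar 1 16:40:12 2024 -0600\n"
    "Date:   Mon Feb 26 09:05:44 2024 -0600\n"
)

# Capture the three-letter day name that follows "Date:".
days = re.findall(r"^Date:\s+(\w{3})", sample_log, flags=re.MULTILINE)

counts = Counter(days)

# Counter returns 0 for missing keys, so every weekday can be listed.
week_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ordered = [(day, counts[day]) for day in week_order]

print(ordered)
```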
%% Cell type:markdown id:745afee5 tags: %% Cell type:markdown id:745afee5 tags:
#### Create a bar plot #### Create a bar plot
%% Cell type:code id:c7bb8f6f tags: %% Cell type:code id:c7bb8f6f tags:
``` python ``` python
ax = sorted_day_counts.plot.bar() ax = sorted_day_counts.plot.bar()
ax.set_ylabel("Commit counts") ax.set_ylabel("Commit counts")
ax.set_xlabel("Days of the week") ax.set_xlabel("Days of the week")
``` ```
%% Cell type:markdown id:ecfc71e6 tags: %% Cell type:markdown id:ecfc71e6 tags:
#### Find all commit author names. #### Find all commit author names.
%% Cell type:code id:6153035a tags: %% Cell type:code id:6153035a tags:
``` python ``` python
authors = re.findall(r"", git_log_output) authors = re.findall(r"", git_log_output)
authors[0] authors[0]
``` ```
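One possible pattern for author lines, again as a sketch on hypothetical data. It assumes the default `Author: Name <email>` layout and captures the name and the email as two groups.

```python
import re

# Hypothetical Author lines in the default `git log` layout (not real output).
sample_log = (
    "Author: Ada Example <ada@example.com>\n"
    "Date:   Mon Mar 4 10:15:00 2024 -0600\n"
    "Author: Bo Sample <bo@example.com>\n"
)

# Group 1: the name (non-greedy, up to the whitespace before "<").
# Group 2: the email between the angle brackets.
authors = re.findall(r"^Author:\s+(.+?)\s+<(.+?)>", sample_log, flags=re.MULTILINE)

names = [name for name, email in authors]

print(authors)
print(names)
```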
%% Cell type:markdown id:3fa201fb tags: %% Cell type:markdown id:3fa201fb tags:
#### `git log` from projects repo #### `git log` from projects repo
%% Cell type:code id:e200a8b0 tags: %% Cell type:code id:e200a8b0 tags:
``` python ``` python
git_log_output = str(check_output(["git", "log"], cwd="../projects-and-labs"), encoding="utf-8") git_log_output = str(check_output(["git", "log"], cwd="../../projects-and-labs"), encoding="utf-8")
print(git_log_output[:1000]) print(git_log_output[:1000])
``` ```
%% Cell type:code id:053b2607 tags: %% Cell type:code id:053b2607 tags:
``` python ``` python
re.findall(r"", git_log_output) re.findall(r"", git_log_output)
``` ```
%% Cell type:markdown id:3ce53c79 tags: %% Cell type:markdown id:3ce53c79 tags:
### Emails example ### Emails example
%% Cell type:code id:1968c0ff tags: %% Cell type:code id:1968c0ff tags:
``` python ``` python
s = """ s = """
Gurmail [Instructor] - gsingh58(AT) cs.wisc.edu Gurmail [Instructor] - gsingh58(AT) cs.wisc.edu
Jinlang [Head TA] - jwang2775 (AT) wisc.edu Jinlang [Head TA] - jwang2775 (AT) wisc.edu
Elliot [TA] - eepickens (AT) cs.wisc.edu Elliot [TA] - eepickens (AT) cs.wisc.edu
Alex [TA] - aclinton (AT) wisc.edu Alex [TA] - aclinton (AT) wisc.edu
Bowman [TA] - bnbrown3 (AT) wisc.edu Bowman [TA] - bnbrown3 (AT) wisc.edu
Hafeez [TA] - aneesali (AT) wisc.edu Hafeez [TA] - aneesali (AT) wisc.edu
William [TA] - wycong (AT) wisc.edu William [TA] - wycong (AT) wisc.edu
""" """
print(s) print(s)
``` ```
%% Cell type:code id:5fbfdf12 tags: %% Cell type:code id:5fbfdf12 tags:
``` python ``` python
name = r"\w+" name = r"\w+"
at = r"@|([\(\[]?[Aa][Tt][\)\]]?)" at = r"@|([\(\[]?[Aa][Tt][\)\]]?)"
domain = r"\w+\.(\w+\.)?(edu|com|org|net|io|gov)" domain = r"\w+\.(\w+\.)?(edu|com|org|net|io|gov)"
full_regex = rf"(({name})\s*({at})\s*({domain}))" full_regex = rf"(({name})\s*({at})\s*({domain}))"
re.findall(full_regex, s) re.findall(full_regex, s)
``` ```
%% Cell type:code id:2257dbf1 tags: %% Cell type:code id:2257dbf1 tags:
``` python ``` python
print("REGEX:", full_regex) print("REGEX:", full_regex)
for match in re.findall(full_regex, s): for match in re.findall(full_regex, s):
    print(match[1] + "@" + match[4])     print(match[1] + "@" + match[4])
``` ```
%% Cell type:markdown id:16c6c169 tags: %% Cell type:markdown id:16c6c169 tags:
### Self-practice ### Self-practice
Q1: Which regex will NOT match "123" Q1: Which regex will NOT match "123"
1. r"\d\d\d" 1. r"\d\d\d"
2. r"\d{3}" 2. r"\d{3}"
3. r"\D\D\D" 3. r"\D\D\D"
4. r"..." 4. r"..."
Q2: What will r"^A" match? Q2: What will r"^A" match?
1. "A" 1. "A"
2. "^A" 2. "^A"
3. "BA" 3. "BA"
4. "B" 4. "B"
5. "BB" 5. "BB"
Q3: Which one can match "HH"? Q3: Which one can match "HH"?
1. r"HA+H" 1. r"HA+H"
2. r"HA+?H" 2. r"HA+?H"
3. r"H(A+)?H" 3. r"H(A+)?H"
Q4: Which string(s) will match r"^(ha)*$" Q4: Which string(s) will match r"^(ha)*$"
1. "" 1. ""
2. "hahah" 2. "hahah"
3. "that" 3. "that"
4. "HAHA" 4. "HAHA"
Q5: What is the type of the following? re.findall(r"(\d) (\w+)", some_str)[0] Q5: What is the type of the following? re.findall(r"(\d) (\w+)", some_str)[0]
1. list 1. list
2. tuple 2. tuple
3. string 3. string
Q6: What will it do? Q6: What will it do?
```python ```python
re.sub(r"(\d{3})-(\d{3}-\d{4})", re.sub(r"(\d{3})-(\d{3}-\d{4})",
r"(\g<1>) \g<2>", r"(\g<1>) \g<2>",
"608-123-4567") "608-123-4567")
``` ```
%% Cell type:markdown id:f1184ba1 tags: %% Cell type:markdown id:f1184ba1 tags:
The answers to these questions can be found in self_practice.ipynb. You may want to try to answer these questions yourself and then verify your answers. The answers to these questions can be found in self_practice.ipynb. You may want to try to answer these questions yourself and then verify your answers.
......
%% Cell type:markdown id:e60c1c48 tags: %% Cell type:markdown id:e60c1c48 tags:
# Regex 2 # Regex 2
%% Cell type:code id:0dba68b0 tags: %% Cell type:code id:0dba68b0 tags:
``` python ``` python
#import statements #import statements
import re import re
from subprocess import check_output from subprocess import check_output
import pandas as pd import pandas as pd
``` ```
%% Cell type:code id:b97c8008-8e39-4a9f-89a1-0d1ddbb1ac01 tags: %% Cell type:code id:b97c8008-8e39-4a9f-89a1-0d1ddbb1ac01 tags:
``` python ``` python
# Example strings # Example strings
# from DS100 book... # from DS100 book...
def reg(regex, text): def reg(regex, text):
    """     """
    Prints the string with the regex match highlighted.     Prints the string with the regex match highlighted.
    """     """
    print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))     print(re.sub(f'({regex})', r'\033[1;30;43m\1\033[m', text))
s1 = " ".join(["A DAG is a directed graph without cycles.", s1 = " ".join(["A DAG is a directed graph without cycles.",
"A tree is a DAG where every node has one parent (except the root, which has none).", "A tree is a DAG where every node has one parent (except the root, which has none).",
"To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\\_(ツ)_/¯"]) "To learn more, visit www.example.com or call 1-608-123-4567. :) ¯\\_(ツ)_/¯"])
print(s1) print(s1)
s2 = """1-608-123-4567 s2 = """1-608-123-4567
a-bcd-efg-hijg (not a phone number) a-bcd-efg-hijg (not a phone number)
1-608-123-456 (not a phone number) 1-608-123-456 (not a phone number)
608-123-4567 608-123-4567
123-4567 123-4567
1-123-4567 (not a phone number) 1-123-4567 (not a phone number)
""" """
print(s2) print(s2)
s3 = "In CS 320, there are 10 quizzes, 7 projects, 39 lectures, and 1000 things to learn. CS 320 is awesome!" s3 = "In CS 320, there are 11 quizzes, 6 projects, 28 lectures, and 1000 things to learn. CS 320 is awesome!"
print(s3) print(s3)
s4 = """In CS 320, there are 14 quizzes, 7 projects, s4 = """In CS 320, there are 11 quizzes, 6 projects,
41 lectures, and 1000 things to learn. CS 320 is awesome!""" 28 lectures, and 1000 things to learn. CS 320 is awesome!"""
print(s4) print(s4)
``` ```
%% Cell type:code id:924069c5-82be-4423-b659-2beee8e226be tags: %% Cell type:code id:924069c5-82be-4423-b659-2beee8e226be tags:
``` python ``` python
print(s1) print(s1)
``` ```
%% Cell type:markdown id:6a2eff34 tags: %% Cell type:markdown id:6a2eff34 tags:
### Regex is case sensitive ### Regex is case sensitive
### Character classes ### Character classes
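- A character class is written inside `[...]` and matches one character from the set - A character class is written inside `[...]` and matches one character from the set
- `^` as the first character inside `[...]` means `NOT` of the character class - `^` as the first character inside `[...]` means `NOT` of the character class
- `-` lets us specify a range of characters, for example `[A-Z]` - `-` lets us specify a range of characters, for example `[A-Z]`
- `|` lets us perform `OR` between alternatives (outside `[...]`; inside it is a literal `|`) - `|` lets us perform `OR` between alternatives (outside `[...]`; inside it is a literal `|`)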
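
The bullets above can be tried out on a short string. This is a minimal sketch; the sample sentence and the variable names are made up for illustration.

```python
import re

# Made-up sample sentence (an assumption, not from the lecture strings)
text = "CS 320 meets in Room B240, not room b240."

# [A-Z] matches one uppercase letter; matching is case sensitive
uppercase = re.findall(r"[A-Z]", text)

# [^A-Za-z ] matches one character that is NOT a letter or a space
non_letters = re.findall(r"[^A-Za-z ]", text)

# | picks between whole alternatives and is written outside [...]
rooms = re.findall(r"Room|room", text)

print(uppercase)    # ['C', 'S', 'R', 'B']
print(non_letters)
print(rooms)        # ['Room', 'room']
```

Note that `[Rr]oom` would find the same two words as `Room|room`; a character class is often the shorter choice when the alternatives differ by a single character.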
### Metacharacters ### Metacharacters
- predefined character classes - predefined character classes
- `\d` => digits - `\d` => digits
- `\s` => whitespace (space, tab, newline) - `\s` => whitespace (space, tab, newline)
- `\w` => "word" characters (digits, letters, underscores, etc) --- helpful for variable name matches and whole word matches (as it doesn't match whitespace --- `\s`) - `\w` => "word" characters (digits, letters, underscores, etc) --- helpful for variable name matches and whole word matches (as it doesn't match whitespace --- `\s`)
- `.` => wildcard: anything except newline - `.` => wildcard: anything except newline
- the capitalized version of a predefined character class means `NOT`, for example `\D` => everything except digits - the capitalized version of a predefined character class means `NOT`, for example `\D` => everything except digits
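
A quick way to see what each predefined class covers is to run it against a string that mixes digits, letters, underscores, and different kinds of whitespace. The string below is a made-up example.

```python
import re

# Made-up sample containing a tab (\t) and a newline (\n)
text = "Quiz_1 is due\ton 3/14.\nDon't be late!"

digits = re.findall(r"\d", text)          # every digit, one at a time
words = re.findall(r"\w+", text)          # runs of letters, digits, underscores
whitespace = re.findall(r"\s", text)      # spaces, the tab, and the newline
non_digits = re.findall(r"\D+", text)     # runs of anything except digits
first_line = re.findall(r".+", text)[0]   # . matches the tab but stops at the newline

print(digits)             # ['1', '3', '1', '4']
print(words)
print(repr(first_line))   # 'Quiz_1 is due\ton 3/14.'
```

`\w+` splits `Don't` into `Don` and `t` because the apostrophe is not a word character.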
### REPETITION ### REPETITION
- `<character>{<num matches>}` - for example: `w{3}` - `<character>{<num matches>}` - for example: `w{3}`
- matches cannot overlap - matches cannot overlap
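
A small sketch of both points, using made-up inputs: `{3}` asks for exactly three repetitions, and each character of the input can belong to at most one match.

```python
import re

# \d{3} means exactly three digits in a row; the trailing "7" is left over
three_digits = re.findall(r"\d{3}", "608-123-4567")

# Matches cannot overlap: after "aa" is consumed, the scan resumes after it,
# so five a's give two matches, not four
pairs = re.findall(r"a{2}", "aaaaa")

print(three_digits)   # ['608', '123', '456']
print(pairs)          # ['aa', 'aa']
```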
### Variable length repetition operators ### Variable length repetition operators
- `*` => 0 or more (greedy: match as many characters as possible) - `*` => 0 or more (greedy: match as many characters as possible)
- `+` => 1 or more (greedy: match as many characters as possible) - `+` => 1 or more (greedy: match as many characters as possible)
- `?` => 0 or 1 - `?` => 0 or 1
- `*?` => 0 or more (non-greedy: match as few characters as possible) - `*?` => 0 or more (non-greedy: match as few characters as possible)
- `+?` => 1 or more (non-greedy: match as few characters as possible) - `+?` => 1 or more (non-greedy: match as few characters as possible)
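
The difference between greedy and non-greedy is easiest to see when the closing character appears more than once. A minimal sketch on a made-up string:

```python
import re

text = "f(x) + g(y)"   # made-up sample with two pairs of parentheses

# Greedy: .* runs as far as possible, up to the LAST ")"
greedy = re.findall(r"\(.*\)", text)

# Non-greedy: .*? stops at the FIRST ")" that lets the match succeed
lazy = re.findall(r"\(.*?\)", text)

# ? on its own makes the preceding character optional (0 or 1)
optional = re.findall(r"colou?r", "color colour")

print(greedy)     # ['(x) + g(y)']
print(lazy)       # ['(x)', '(y)']
print(optional)   # ['color', 'colour']
```

The parentheses are escaped as `\(` and `\)` here so that they match literal characters instead of forming a group.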
#### Find everything inside of parentheses. #### Find everything inside of parentheses.
%% Cell type:code id:11f75d92-f215-47f4-8804-5e5727d3be55 tags: %% Cell type:code id:11f75d92-f215-47f4-8804-5e5727d3be55 tags:
``` python ``` python
# this doesn't work # this doesn't work
# it captures everything because () have special meaning (coming up) # it captures everything because () have special meaning (coming up)
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:code id:b488460e tags: %% Cell type:code id:b488460e tags:
``` python ``` python
# How can we change this to not use special meaning of ()? # How can we change this to not use special meaning of ()?
# * is greedy: match as many characters as possible # * is greedy: match as many characters as possible
reg(r"(.*)", s1) reg(r"(.*)", s1)
``` ```
%% Cell type:code id:42155b54-d27b-4418-81a1-c005c508d738 tags: %% Cell type:code id:42155b54-d27b-4418-81a1-c005c508d738 tags:
``` python ``` python
# non-greedy: stop at the first possible spot instead of the last possible spot # non-greedy: stop at the first possible spot instead of the last possible spot
reg(r"\(.*\)", s1) reg(r"\(.*\)", s1)
``` ```
%% Cell type:markdown id:0fd70cfd tags: %% Cell type:markdown id:0fd70cfd tags:
### Anchor characters ### Anchor characters
- `^` => start of string - `^` => start of string
- `^` is overloaded --- what was the other usage? - `^` is overloaded --- what was the other usage?
- `$` => end of string - `$` => end of string
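
As a rough illustration of anchors, the sketch below uses a shortened, made-up version of the sample text. Without `^`, a non-greedy pattern still matches once per sentence; with `^`, only the match at the start of the string survives.

```python
import re

text = "A DAG is a directed graph. A tree is a DAG."   # shortened, made-up sample

every_sentence = re.findall(r".*?\.", text)    # one match per sentence
first_sentence = re.findall(r"^.*?\.", text)   # ^ pins the match to the start
last_word = re.findall(r"\w+\.$", text)        # $ pins the match to the end

# The other meaning of ^: inside [...] it negates the character class
not_a = re.findall(r"[^A]", "ABA")

print(every_sentence)   # ['A DAG is a directed graph.', ' A tree is a DAG.']
print(first_sentence)   # ['A DAG is a directed graph.']
print(last_word)        # ['DAG.']
print(not_a)            # ['B']
```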
#### Find everything in the first sentence. #### Find everything in the first sentence.
%% Cell type:code id:40fed5db tags: %% Cell type:code id:40fed5db tags:
``` python ``` python
# doesn't work because remember regex finds all possible matches # doesn't work because remember regex finds all possible matches
# so it matches every single sentence # so it matches every single sentence
# (even though we are doing non-greedy match) # (even though we are doing non-greedy match)
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:code id:7e97abd8 tags: %% Cell type:code id:7e97abd8 tags:
``` python ``` python
reg(r".*?\.", s1) reg(r".*?\.", s1)
``` ```
%% Cell type:markdown id:f66a4651 tags: %% Cell type:markdown id:f66a4651 tags:
#### Find everything in the first two sentences. #### Find everything in the first two sentences.
%% Cell type:code id:cfa8353e-4402-4f64-b3f8-35e73b2d7a68 tags: %% Cell type:code id:cfa8353e-4402-4f64-b3f8-35e73b2d7a68 tags:
``` python ``` python
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:markdown id:76570acd tags: %% Cell type:markdown id:76570acd tags:
#### Find last "word" in the sentence. #### Find last "word" in the sentence.
%% Cell type:code id:f35a6c81-47b5-48eb-8357-888fe832f85b tags: %% Cell type:code id:f35a6c81-47b5-48eb-8357-888fe832f85b tags:
``` python ``` python
reg(r"", s1) reg(r"", s1)
``` ```
%% Cell type:markdown id:4b25fb66 tags: %% Cell type:markdown id:4b25fb66 tags:
### Case study: find all phone numbers. ### Case study: find all phone numbers.
%% Cell type:code id:8ecbeaf0 tags: %% Cell type:code id:8ecbeaf0 tags:
``` python ``` python
print(s2) print(s2)
# The country code (1) in the front is optional # The country code (1) in the front is optional
# The area code (608) is also optional # The area code (608) is also optional
# Doesn't make sense to match country code without area code though! # Doesn't make sense to match country code without area code though!
``` ```
%% Cell type:code id:88a01725-790c-4f21-b334-e3d31bed24b5 tags: %% Cell type:code id:88a01725-790c-4f21-b334-e3d31bed24b5 tags:
``` python ``` python
# Full US phone numbers # Full US phone numbers
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:code id:ea8ce0be-98f8-4f9c-bcc0-1d8a0bb183a8 tags: %% Cell type:code id:ea8ce0be-98f8-4f9c-bcc0-1d8a0bb183a8 tags:
``` python ``` python
# The country code (1) in the front is optional # The country code (1) in the front is optional
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:code id:5befb23a-848f-44d4-aaac-6b21ac0bbf43 tags: %% Cell type:code id:5befb23a-848f-44d4-aaac-6b21ac0bbf43 tags:
``` python ``` python
# The area code (608) is also optional # The area code (608) is also optional
# Doesn't make sense to have country code without area code though! # Doesn't make sense to have country code without area code though!
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:code id:34164fd1-a422-45a0-8e11-54199ca77120 tags: %% Cell type:code id:34164fd1-a422-45a0-8e11-54199ca77120 tags:
``` python ``` python
# This is good enough for 320 quizzes/tests # This is good enough for 320 quizzes/tests
# But clearly, the last match is not correct # But clearly, the last match is not correct
reg(r"", s2) reg(r"", s2)
``` ```
%% Cell type:markdown id:8a2ee4e2 tags: %% Cell type:markdown id:8a2ee4e2 tags:
Regex documentation link: https://docs.python.org/3/library/re.html. Regex documentation link: https://docs.python.org/3/library/re.html.
%% Cell type:code id:694a585b-b5a7-4a6f-a0f1-60521f7dfc47 tags: %% Cell type:code id:694a585b-b5a7-4a6f-a0f1-60521f7dfc47 tags:
``` python ``` python
# BONUS: negative lookbehind (I won't test this) # BONUS: negative lookbehind (I won't test this)
reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", s2) reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", s2)
``` ```
%% Cell type:markdown id:3973350b tags: %% Cell type:markdown id:3973350b tags:
There is also a negative lookahead. For example, how to avoid matching "1-608-123-456" in "1-608-123-4569999". You can explore this if you are interested. There is also a negative lookahead. For example, how to avoid matching "1-608-123-456" in "1-608-123-4569999". You can explore this if you are interested.
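
One possible way to do this, sketched under the assumption that "too long" means "followed by another digit", is to append `(?!\d)` to the pattern from the bonus cell. The helper `full_matches` is a hypothetical name introduced here; it returns whole matches so that the groups in the pattern do not turn the results into tuples.

```python
import re

# Pattern from the bonus cell above
with_lookbehind = r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}"

# Assumption: a phone number must not be followed directly by another digit
with_lookahead = with_lookbehind + r"(?!\d)"

def full_matches(pattern, text):
    """Return the whole text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

too_long = "1-608-123-4569999"
valid = "call 1-608-123-4567 today"

print(full_matches(with_lookbehind, too_long))   # ['1-608-123-4569']
print(full_matches(with_lookahead, too_long))    # []
print(full_matches(with_lookahead, valid))       # ['1-608-123-4567']
```

`(?!\d)` does not consume any characters; it only checks that the next character is not a digit.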
%% Cell type:code id:4988d765 tags: %% Cell type:code id:4988d765 tags:
``` python ``` python
reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", "608-123-4569999") reg(r"(?<!\d\-)((\d-)?\d{3}-)?\d{3}-\d{4}", "608-123-4569999")
``` ```
%% Cell type:markdown id:b02ae9e0 tags: %% Cell type:markdown id:b02ae9e0 tags:
### Testing your regex ### Testing your regex
- you could use `reg(...)` function - you could use `reg(...)` function
- another useful resource: https://regex101.com/ - another useful resource: https://regex101.com/
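
Besides looking at highlighted output, a regex can be checked against a table of inputs with known answers. The sketch below is one way to do that with `re.fullmatch`, which only succeeds when the entire string matches; the `expected` table is an assumption built from the phone-number examples earlier in this notebook.

```python
import re

phone = r"((\d-)?\d{3}-)?\d{3}-\d{4}"

# Assumed test table: candidate string => should it be accepted as a whole?
expected = {
    "1-608-123-4567": True,
    "608-123-4567": True,
    "123-4567": True,
    "a-bcd-efg-hijg": False,
    "1-608-123-456": False,
    "1-123-4567": False,
}

results = {text: re.fullmatch(phone, text) is not None for text in expected}

for text, matched in results.items():
    status = "ok" if matched == expected[text] else "MISMATCH"
    print(f"{status:8} {text!r}: matched={matched}")
```

`re.fullmatch` is stricter than `re.findall`: `re.findall` can still find `123-4567` inside `1-123-4567`, while `re.fullmatch` rejects the string as a whole.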
%% Cell type:markdown id:4a973271 tags: %% Cell type:markdown id:4a973271 tags:
### `re` module ### `re` module
- `re.findall(<PATTERN>, <SEARCH STRING>)`: regular expression matches - `re.findall(<PATTERN>, <SEARCH STRING>)`: regular expression matches
- returns a list of strings - returns a list of strings
- `re.sub(<PATTERN>, <REPLACEMENT>, <SEARCH STRING>)`: regular expression match + substitution - `re.sub(<PATTERN>, <REPLACEMENT>, <SEARCH STRING>)`: regular expression match + substitution
- returns a new string with the substitutions (remember strings are immutable) - returns a new string with the substitutions (remember strings are immutable)
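
A minimal sketch of both functions on a made-up string:

```python
import re

text = "3 exams, 6 projects"   # made-up sample

numbers = re.findall(r"\d+", text)   # list of the matched strings
masked = re.sub(r"\d+", "N", text)   # new string with every match replaced

print(numbers)   # ['3', '6']
print(masked)    # N exams, N projects
print(text)      # unchanged: 3 exams, 6 projects
```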
%% Cell type:code id:73ec525f tags: %% Cell type:code id:73ec525f tags:
``` python ``` python
msg = "In CS 320,\tthere are 28 lectures, 11 quizzes, 3 exams,\t6 projects, and 1000 things to learn. CS 320 is awesome!" msg = "In CS 320,\tthere are 28 lectures, 11 quizzes, 3 exams,\t6 projects, and 1000 things to learn. CS 320 is awesome!"
print(msg) print(msg)
``` ```
%% Cell type:markdown id:34998a5e tags: %% Cell type:markdown id:34998a5e tags:
#### Find all digits. #### Find all digits.
%% Cell type:code id:7f42c25a tags: %% Cell type:code id:7f42c25a tags:
``` python ``` python
``` ```
%% Cell type:markdown id:70b2f488 tags: %% Cell type:markdown id:70b2f488 tags:
### Groups ### Groups
- we can capture matches using `()` => this is the special meaning of `()` - we can capture matches using `()` => this is the special meaning of `()`
- with two or more groups, `re.findall` returns a list of tuples, where the length of each tuple is the number of groups - with two or more groups, `re.findall` returns a list of tuples, where the length of each tuple is the number of groups
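
The return shape of `re.findall` depends on how many groups the pattern has. A short sketch on a made-up string:

```python
import re

text = "2 apples and 10 pears"   # made-up sample

no_groups = re.findall(r"\d+ \w+", text)        # whole matches as strings
one_group = re.findall(r"(\d+) \w+", text)      # only the group, as strings
two_groups = re.findall(r"(\d+) (\w+)", text)   # one tuple per match

print(no_groups)    # ['2 apples', '10 pears']
print(one_group)    # ['2', '10']
print(two_groups)   # [('2', 'apples'), ('10', 'pears')]
```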
#### Find all digits and the word that comes after that. #### Find all digits and the word that comes after that.
%% Cell type:code id:5309adee tags: %% Cell type:code id:5309adee tags:
``` python ``` python
matches = re.findall(r"", msg) matches = re.findall(r"", msg)
matches matches
``` ```
%% Cell type:markdown id:bc6a982c tags: %% Cell type:markdown id:bc6a982c tags:
#### Goal: make a dict (course component => count, like "projects" => 7) #### Goal: make a dict (course component => count, like "projects" => 6)
%% Cell type:code id:c7f1a028 tags: %% Cell type:code id:c7f1a028 tags:
``` python ``` python
course_dict = {} course_dict = {}
for count, component in matches: for count, component in matches:
    course_dict[component] = int(count)     course_dict[component] = int(count)
course_dict course_dict
``` ```
%% Cell type:markdown id:c4b6b505 tags: %% Cell type:markdown id:c4b6b505 tags:
### Unlike matches, groups can overlap ### Unlike matches, groups can overlap
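
Groups can be nested, so the same characters may be captured by more than one group. In the sketch below (made-up string), the outer group captures the whole phrase and the two inner groups capture its parts. Groups are numbered by the position of their opening parenthesis.

```python
import re

text = "2 apples and 10 pears"   # made-up sample

# Group 1 is the outer (...), groups 2 and 3 are nested inside it
nested = re.findall(r"((\d+) (\w+))", text)

print(nested)   # [('2 apples', '2', 'apples'), ('10 pears', '10', 'pears')]
```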
#### Find and group all digits and the word that comes after that. #### Find and group all digits and the word that comes after that.
%% Cell type:code id:491c3460 tags: %% Cell type:code id:491c3460 tags:
``` python ``` python
re.findall(r"(\d+) (\w+)", msg) re.findall(r"(\d+) (\w+)", msg)
``` ```
%% Cell type:markdown id:d2227e69 tags: %% Cell type:markdown id:d2227e69 tags:
#### Substitute all digits with "###". #### Substitute all digits with "###".
%% Cell type:code id:6d1fede1 tags: %% Cell type:code id:6d1fede1 tags:
``` python ``` python
re.sub(r"", , msg) re.sub(r"", , msg)
``` ```
%% Cell type:markdown id:9d531122 tags: %% Cell type:markdown id:9d531122 tags:
#### Goal: normalize whitespace (everything will be a single space) #### Goal: normalize whitespace (everything will be a single space)
%% Cell type:code id:4becbe70 tags: %% Cell type:code id:4becbe70 tags:
``` python ``` python
print(msg) print(msg)
``` ```
%% Cell type:code id:72a6eb42 tags: %% Cell type:code id:72a6eb42 tags:
``` python ``` python
re.sub(r"", , msg) re.sub(r"", , msg)
``` ```
%% Cell type:markdown id:6faf33fd tags: %% Cell type:markdown id:6faf33fd tags:
### How to use groups in substitution? ### How to use groups in substitution?
- `\g<N>` gives you the result of the N'th grouping. - `\g<N>` gives you the result of the N'th grouping.
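
As a small illustration with made-up inputs, `\g<N>` can reorder captured pieces or place text right next to them:

```python
import re

# Reorder year-month-day into month/day/year using groups 1, 2, 3
us_style = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\g<2>/\g<3>/\g<1>", "2024-03-14")

# \g<1> followed by a literal "0": the angle brackets keep the group number
# separate from the digit that comes after it
times_ten = re.sub(r"(\d)", r"\g<1>0", "a1b2")

print(us_style)    # 03/14/2024
print(times_ten)   # a10b20
```

The replacement is written as a raw string (`r"..."`) so that Python passes the backslash through to `re.sub` unchanged.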
#### Substitute all course component counts with HTML bold tags. #### Substitute all course component counts with HTML bold tags.
%% Cell type:code id:8df577fd tags: %% Cell type:code id:8df577fd tags:
``` python ``` python
print(re.sub(r"(\d+)", "<b></b>", msg)) print(re.sub(r"(\d+)", "<b></b>", msg))
``` ```
%% Cell type:markdown id:35a15a41 tags: %% Cell type:markdown id:35a15a41 tags:
In CS <b>320</b>, there are <b>28</b> lectures, <b>11</b> quizzes, <b>3</b> exams, <b>6</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome! In CS <b>320</b>, there are <b>28</b> lectures, <b>11</b> quizzes, <b>3</b> exams, <b>6</b> projects, and <b>1000</b> things to learn. CS <b>320</b> is awesome!
%% Cell type:markdown id:6b299526 tags: %% Cell type:markdown id:6b299526 tags:
### Git log example ### Git log example
%% Cell type:markdown id:a9b7261c tags: %% Cell type:markdown id:a9b7261c tags:
#### Run `git log` as a shell command #### Run `git log` as a shell command
%% Cell type:code id:10e459b4 tags: %% Cell type:code id:10e459b4 tags:
``` python ``` python
!git log !git log
``` ```
%% Cell type:code id:ef440fed tags: %% Cell type:code id:ef440fed tags:
``` python ``` python
git_log_output = str(check_output(["git", "log"]), encoding="utf-8") git_log_output = str(check_output(["git", "log"]), encoding="utf-8")
print(git_log_output[:500]) print(git_log_output[:500])
``` ```
%% Cell type:markdown id:5c154b46 tags: %% Cell type:markdown id:5c154b46 tags:
#### GOAL: find all the commit numbers #### GOAL: find all the commit numbers
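
The sketch below shows one possible approach on a tiny, invented log, assuming the default `git log` layout (`commit <40 hex characters>`, then `Author:` and `Date:` lines). The hashes, names, and dates in `sample_log` are placeholders, not real commits, and the patterns are only one of several that could work.

```python
import re

# Invented sample in the default `git log` layout (placeholder data)
sample_log = """commit 0123456789abcdef0123456789abcdef01234567
Author: Example Author <author@example.com>
Date:   Mon Oct 2 10:15:00 2023 -0500

    update lecture notes

commit 89abcdef0123456789abcdef0123456789abcdef
Author: Another Person <another@example.com>
Date:   Fri Sep 29 16:40:12 2023 -0500

    add regex lecture
"""

# A group keeps only the hash, not the word "commit"
commits = re.findall(r"commit ([0-9a-f]{40})", sample_log)

# The three-letter day name follows "Date:" and some whitespace
days = re.findall(r"Date:\s+(\w{3})", sample_log)

# The author name sits between "Author:" and the "<" that starts the email
authors = re.findall(r"Author:\s+(.+?)\s+<", sample_log)

print(commits)
print(days)      # ['Mon', 'Fri']
print(authors)   # ['Example Author', 'Another Person']
```

On a real log, a commit message that happens to contain the word "commit" followed by 40 hex characters would also match the first pattern.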
%% Cell type:code id:5ea954bf tags: %% Cell type:code id:5ea954bf tags:
``` python ``` python
commits = re.findall(r"", git_log_output) commits = re.findall(r"", git_log_output)
# recent 10 commit numbers # recent 10 commit numbers
commits[:10] commits[:10]
``` ```
%% Cell type:markdown id:bc485b5f tags: %% Cell type:markdown id:bc485b5f tags:
#### What days of the week does the team push things into this repo? #### What days of the week does the team push things into this repo?
%% Cell type:code id:57353c44 tags: %% Cell type:code id:57353c44 tags:
``` python ``` python
print(git_log_output[:500]) print(git_log_output[:500])
``` ```
%% Cell type:code id:d1ea6f59 tags: %% Cell type:code id:d1ea6f59 tags:
``` python ``` python
days = re.findall(r"", git_log_output) days = re.findall(r"", git_log_output)
days days
``` ```
%% Cell type:markdown id:2c7efb55 tags: %% Cell type:markdown id:2c7efb55 tags:
#### Count unique days #### Count unique days
%% Cell type:code id:3c2d7207 tags: %% Cell type:code id:3c2d7207 tags:
``` python ``` python
day_counts = pd.Series(days).value_counts() day_counts = pd.Series(days).value_counts()
day_counts day_counts
``` ```
%% Cell type:markdown id:7317ca35 tags: %% Cell type:markdown id:7317ca35 tags:
#### Sort by day of the week #### Sort by day of the week
%% Cell type:code id:d5c5c58c tags: %% Cell type:code id:d5c5c58c tags:
``` python ``` python
sorted_day_counts = day_counts.loc[["Mon", "Tue", "Wed", "Thu", "Fri", "Sun"]] sorted_day_counts = day_counts.loc[["Mon", "Tue", "Wed", "Thu", "Fri", "Sun"]]
sorted_day_counts sorted_day_counts
``` ```
%% Cell type:markdown id:745afee5 tags: %% Cell type:markdown id:745afee5 tags:
#### Create a bar plot #### Create a bar plot
%% Cell type:code id:c7bb8f6f tags: %% Cell type:code id:c7bb8f6f tags:
``` python ``` python
ax = sorted_day_counts.plot.bar() ax = sorted_day_counts.plot.bar()
ax.set_ylabel("Commit counts") ax.set_ylabel("Commit counts")
ax.set_xlabel("Days of the week") ax.set_xlabel("Days of the week")
``` ```
%% Cell type:markdown id:ecfc71e6 tags: %% Cell type:markdown id:ecfc71e6 tags:
#### Find all commit author names. #### Find all commit author names.
%% Cell type:code id:6153035a tags: %% Cell type:code id:6153035a tags:
``` python ``` python
authors = re.findall(r"", git_log_output) authors = re.findall(r"", git_log_output)
authors[0] authors[0]
``` ```
%% Cell type:markdown id:3fa201fb tags: %% Cell type:markdown id:3fa201fb tags:
#### `git log` from projects repo #### `git log` from projects repo
%% Cell type:code id:e200a8b0 tags: %% Cell type:code id:e200a8b0 tags:
``` python ``` python
git_log_output = str(check_output(["git", "log"], cwd="../projects-and-labs"), encoding="utf-8") git_log_output = str(check_output(["git", "log"], cwd="../../projects-and-labs"), encoding="utf-8")
print(git_log_output[:1000]) print(git_log_output[:1000])
``` ```
%% Cell type:code id:053b2607 tags: %% Cell type:code id:053b2607 tags:
``` python ``` python
re.findall(r"", git_log_output) re.findall(r"", git_log_output)
``` ```
%% Cell type:markdown id:3ce53c79 tags: %% Cell type:markdown id:3ce53c79 tags:
### Emails example ### Emails example
%% Cell type:code id:1968c0ff tags: %% Cell type:code id:1968c0ff tags:
``` python ``` python
s = """ s = """
Gurmail [Instructor] - gsingh58(AT) cs.wisc.edu Gurmail [Instructor] - gsingh58(AT) cs.wisc.edu
Jinlang [Head TA] - jwang2775 (AT) wisc.edu Jinlang [Head TA] - jwang2775 (AT) wisc.edu
Elliot [TA] - eepickens (AT) cs.wisc.edu Elliot [TA] - eepickens (AT) cs.wisc.edu
Alex [TA] - aclinton (AT) wisc.edu Alex [TA] - aclinton (AT) wisc.edu
Bowman [TA] - bnbrown3 (AT) wisc.edu Bowman [TA] - bnbrown3 (AT) wisc.edu
Hafeez [TA] - aneesali (AT) wisc.edu Hafeez [TA] - aneesali (AT) wisc.edu
William [TA] - wycong (AT) wisc.edu William [TA] - wycong (AT) wisc.edu
""" """
print(s) print(s)
``` ```
%% Cell type:code id:5fbfdf12 tags: %% Cell type:code id:5fbfdf12 tags:
``` python ``` python
name = r"\w+" name = r"\w+"
at = r"@|([\(\[]?[Aa][Tt][\)\]]?)" at = r"@|([\(\[]?[Aa][Tt][\)\]]?)"
domain = r"\w+\.(\w+\.)?(edu|com|org|net|io|gov)" domain = r"\w+\.(\w+\.)?(edu|com|org|net|io|gov)"
full_regex = rf"(({name})\s*({at})\s*({domain}))" full_regex = rf"(({name})\s*({at})\s*({domain}))"
re.findall(full_regex, s) re.findall(full_regex, s)
``` ```
%% Cell type:code id:2257dbf1 tags: %% Cell type:code id:2257dbf1 tags:
``` python ``` python
print("REGEX:", full_regex) print("REGEX:", full_regex)
for match in re.findall(full_regex, s): for match in re.findall(full_regex, s):
    print(match[1] + "@" + match[4])     print(match[1] + "@" + match[4])
``` ```
%% Cell type:markdown id:16c6c169 tags: %% Cell type:markdown id:16c6c169 tags:
### Self-practice ### Self-practice
Q1: Which regex will NOT match "123" Q1: Which regex will NOT match "123"
1. r"\d\d\d" 1. r"\d\d\d"
2. r"\d{3}" 2. r"\d{3}"
3. r"\D\D\D" 3. r"\D\D\D"
4. r"..." 4. r"..."
Q2: What will r"^A" match? Q2: What will r"^A" match?
1. "A" 1. "A"
2. "^A" 2. "^A"
3. "BA" 3. "BA"
4. "B" 4. "B"
5. "BB" 5. "BB"
Q3: Which one can match "HH"? Q3: Which one can match "HH"?
1. r"HA+H" 1. r"HA+H"
2. r"HA+?H" 2. r"HA+?H"
3. r"H(A+)?H" 3. r"H(A+)?H"
Q4: Which string(s) will match r"^(ha)*$" Q4: Which string(s) will match r"^(ha)*$"
1. "" 1. ""
2. "hahah" 2. "hahah"
3. "that" 3. "that"
4. "HAHA" 4. "HAHA"
Q5: What is the type of the following? re.findall(r"(\d) (\w+)", some_str)[0] Q5: What is the type of the following? re.findall(r"(\d) (\w+)", some_str)[0]
1. list 1. list
2. tuple 2. tuple
3. string 3. string
Q6: What will it do? Q6: What will it do?
```python ```python
re.sub(r"(\d{3})-(\d{3}-\d{4})", re.sub(r"(\d{3})-(\d{3}-\d{4})",
r"(\g<1>) \g<2>", r"(\g<1>) \g<2>",
"608-123-4567") "608-123-4567")
``` ```
%% Cell type:markdown id:f1184ba1 tags: %% Cell type:markdown id:f1184ba1 tags:
The answers to these questions can be found in self_practice.ipynb. You may want to try to answer these questions yourself and then verify your answers. The answers to these questions can be found in self_practice.ipynb. You may want to try to answer these questions yourself and then verify your answers.
......