A piece of data may contain letters, numbers, and special characters. If we are interested in extracting only the letters from such a string, we can use several of the options available in Python.
With isalpha

The isalpha method checks whether a given character is alphabetic. We use it inside a for loop that fetches each character from the given string and tests it. The join method then appends only the alphabetic characters to the result.

Example

stringA = "Qwer34^&t%y"
# Given string
print("Given string : ", stringA)
# Find characters
res = ""
for i in stringA:
    if i.isalpha():
        res = "".join([res, i])
# Result
print("Result: ", res)

Output

Running the above code gives us the following result:

Given string :  Qwer34^&t%y
Result:  Qwerty

With Regular expression

We can use the regular expression module and call its findall function with a pattern that matches only letters.

Example

import re
stringA = "Qwer34^&t%y"
# Given string
print("Given string : ", stringA)
# Find characters
res = "".join(re.findall("[a-zA-Z]+", stringA))
# Result
print("Result: ", res)

Output

Running the above code gives us the following result:

Given string :  Qwer34^&t%y
Result:  Qwerty
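The two approaches above can be folded into small helper functions. The following is a minimal sketch, not part of the original examples; the function names letters_only and ascii_letters_only are hypothetical. It also shows one case where the two approaches differ: isalpha accepts any Unicode letter, while the pattern [a-zA-Z] accepts only unaccented ASCII letters.

```python
import re


def letters_only(text):
    """Keep every character for which str.isalpha() is true (hypothetical helper)."""
    return "".join(ch for ch in text if ch.isalpha())


def ascii_letters_only(text):
    """Keep only the ASCII letters a-z and A-Z (hypothetical helper)."""
    return "".join(re.findall(r"[a-zA-Z]+", text))


sample = "Qwer34^&t%y"
print(letters_only(sample))        # Qwerty
print(ascii_letters_only(sample))  # Qwerty

# The accented letter below is alphabetic for isalpha(),
# but it falls outside the ASCII range used by the regular expression.
accented = "na\u00efve42"
print(letters_only(accented))
print(ascii_letters_only(accented))  # nave
```

Which behavior is wanted depends on the data: the isalpha version is usually the better fit for text that may contain non-English letters.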
Updated on 05-May-2020 10:16:10
One place where the Python language really shines is in the manipulation of strings. This section covers some of Python's built-in string methods and formatting operations, before moving on to a quick guide to the extremely useful subject of regular expressions. Such string manipulation patterns come up often in data science work, and they are one big perk of Python in this context.

Strings in Python can be defined using either single or double
quotations (they are functionally equivalent), so 'a string' and "a string" define the same value.

In addition, it is possible to define multi-line strings using a triple-quote syntax:

In [2]:
multiline = """
one
two
three
"""

With this, let's take a quick tour of some of Python's string manipulation tools.

Simple String Manipulation in Python

For basic manipulation of strings, Python's built-in string methods can be extremely convenient. If you have a background working in C or another low-level language, you will likely find the simplicity of Python's methods extremely refreshing. We introduced Python's string type and a few of these methods earlier; here we'll dive a bit deeper.

Formatting strings: Adjusting case

Python makes it quite easy to adjust the case of a string. Here we'll look at the upper(), lower(), capitalize(), title(), and swapcase() methods, using the following messy string as an example:

In [3]: fox = "tHe qUICk bROWn fOx."

To convert the entire string into upper-case or lower-case, you can use the upper() or lower() methods, respectively; fox.upper() gives 'THE QUICK BROWN FOX.' and fox.lower() gives 'the quick brown fox.'.

A common formatting need is to capitalize just the first letter of each word, or perhaps the first letter of each sentence. This can be done with the title() and capitalize() methods: fox.title() gives 'The Quick Brown Fox.' and fox.capitalize() gives 'The quick brown fox.'.

The cases can be swapped using the swapcase() method: fox.swapcase() gives 'ThE QuicK BrowN FoX.'.

Formatting strings: Adding and removing spaces

Another common need is to remove spaces (or other characters) from the beginning or end of the string. The basic method of removing characters is the strip() method, which strips whitespace from the beginning and end of the line:

In [9]:
line = ' this is the content '
line.strip()

To
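remove just space to the right or left, use rstrip() or lstrip(), respectively.

To remove characters other than spaces, you can pass the desired character to the strip() method:

In [12]:
num = "000000000000435"
num.strip('0')

The opposite of this operation, adding spaces or other characters, can be accomplished using the center(), ljust(), and rjust() methods. For example, we can use the center() method to center a given string within a given number of spaces:

In [13]:
line = "this is the content"
line.center(30)

Similarly, ljust() and rjust() will left-justify or right-justify the string within spaces of a given length. All these methods additionally accept any character which will be used to fill the space. For example, '435'.rjust(10, '0') gives '0000000435'.

Because zero-filling is such a common need, Python also provides zfill(), a special method that pads a string on the left with zeros.

Finding and replacing substrings

If you want to find occurrences of a certain character in a string, the find()/rfind(), index()/rindex(), and replace() methods are the best built-in tools. find() and index() are very similar: both search for the first occurrence of a character or substring within a string and return the index at which it starts: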
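In [18]:
line = 'the quick brown fox jumped over a lazy dog'
line.find('fox')

The only difference between find() and index() is their behavior when the search string is not found: find() returns -1, while index() raises a ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-4cbe6ee9b0eb> in <module>()
----> 1 line.index('bear')

ValueError: substring not found

The related rfind() and rindex() methods work similarly, except that they search for the first occurrence from the end rather than the beginning of the string.

For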
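the special case of checking for a substring at the beginning or end of a string, Python provides the startswith() and endswith() methods.

To go one step further and replace a given substring with a new string, you can use the replace() method. Here, let's replace 'brown' with 'red':

In [25]: line.replace('brown', 'red')
Out[25]: 'the quick red fox jumped over a lazy dog'

The replace() method returns a new string and replaces all occurrences of the input. For example, line.replace('o', '--') gives:

Out[26]: 'the quick br--wn f--x jumped --ver a lazy d--g'

Splitting and partitioning strings

If you would like to find a substring and then split the string based on its location, the partition() and split() methods are what you're looking for. Both return a sequence of substrings.

The partition() method returns a tuple with three elements: the substring before the first instance of the split-point, the split-point itself, and the substring after. For example, line.partition('fox') gives:

Out[27]: ('the quick brown ', 'fox', ' jumped over a lazy dog')

The rpartition() method is similar, but searches from the right of the string.

The split() method is perhaps more useful. It finds all instances of the split-point and returns the substrings in between. The default is to split on any whitespace, returning a list of the individual words, so line.split() gives:

Out[28]: ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

A related method is splitlines(), which splits on newline characters:

In [29]:
haiku = """matsushima-ya
aah matsushima-ya
matsushima-ya"""
haiku.splitlines()

Out[29]: ['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']

Note that if you would like to undo a split(), you can use the join() method, which returns a string built from a split-point and an iterable:

In [30]: '--'.join(['1', '2', '3'])

A common pattern is to use the special character "\n" (newline) to join together lines that have been previously split, and recover the input:

In [31]: print("\n".join(['matsushima-ya', 'aah matsushima-ya', 'matsushima-ya']))
matsushima-ya
aah matsushima-ya
matsushima-ya

Format Strings

In the preceding methods, we have learned how to extract values from strings, and to manipulate strings themselves into desired formats. Another use of string methods is to manipulate string representations of values of other types. Of course, string representations can always be found using the str() function. For example, with pi = 3.14159:

In [33]: "The value of pi is " + str(pi)
Out[33]: 'The value of pi is 3.14159'

A more flexible way to do this is to use format strings, which are strings with special markers (noted by curly braces) into which string-formatted values will be inserted. Here is a basic example:

In [34]: "The value of pi is {}".format(pi)
Out[34]: 'The value of pi is 3.14159'

Inside the {} marker you can also include information on exactly what you would like to appear there. If you include a number, it will refer to the index of the argument to insert:

In [35]: """First letter: {0}. Last letter: {1}.""".format('A', 'Z')
Out[35]: 'First letter: A. Last letter: Z.'

If you include a string, it will refer to the key of any keyword argument:

In [36]: """First letter: {first}. Last letter: {last}.""".format(last='Z', first='A')
Out[36]: 'First letter: A. Last letter: Z.'

Finally, for numerical inputs, you can include format codes which control how the value is converted to a string. For example, to print a number as a floating point with three digits after the decimal point, you can use the following:

In [37]: "pi = {0:.3f}".format(pi)

As before, here the "0" refers to the index of the value to be inserted. The ":" marks that format codes will follow. The ".3f" encodes the desired precision: three digits beyond the decimal point, in floating-point format.

This style of format specification is very flexible, and the examples here barely scratch the surface of the formatting options available. For more information on the syntax of these format strings, see the Format Specification section of Python's online documentation.

Flexible Pattern Matching with Regular Expressions

The methods of Python's str type give you a powerful set of tools for formatting, splitting, and manipulating string data, but even more powerful tools are available in Python's built-in regular expression module.

My goal here is to give you an idea of the types of problems that might be addressed using regular expressions, as well as a basic idea of how to use them in Python. I'll suggest some references for learning more in Further Resources on Regular Expressions.

Fundamentally, regular expressions are a means of flexible pattern matching in strings. If you frequently use the command-line, you are probably familiar with this type of flexible matching with the "*" character, which acts as a wildcard. For example, listing files with a pattern such as *Python*.ipynb matches the notebook files that have "Python" in their filename:

01-How-to-Run-Python-Code.ipynb
02-Basic-Python-Syntax.ipynb

Regular expressions generalize this "wildcard" idea to a wide range of flexible string-matching syntaxes. The Python interface to regular expressions is contained in the built-in re module; as a simple example, let's use it to duplicate the functionality of the string split() method:

In [39]:
import re
regex = re.compile(r'\s+')
regex.split(line)

Out[39]: ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

Here we've first compiled a regular expression, then used it to split a string.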
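As a small sketch of the compile-then-split idea, the following compares the compiled pattern with str.split() and with the module-level re.split() function. The variable names whitespace, mixed, and padded are chosen here for illustration only. It also shows one way the two kinds of splitting differ when the string starts with whitespace.

```python
import re

line = 'the quick brown fox jumped over a lazy dog'
whitespace = re.compile(r'\s+')  # raw string, so "\s" reaches the regex engine untouched

# On this input the compiled pattern and str.split() agree.
same_result = whitespace.split(line) == line.split()
print(same_result)  # True

# The module-level function is a one-off shortcut for the compiled call.
mixed = 'the  quick\tbrown\nfox'
print(re.split(r'\s+', mixed))  # ['the', 'quick', 'brown', 'fox']

# One difference: leading whitespace produces an empty first item with the
# regex version, while str.split() with no arguments drops it.
padded = '  abc def'
print(whitespace.split(padded))  # ['', 'abc', 'def']
print(padded.split())            # ['abc', 'def']
```

Compiling is mostly worthwhile when the same pattern is reused many times; for a single call, re.split() reads a little more directly.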
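Just as Python's split() method returns a list of all substrings between whitespace, the regular expression split() method returns a list of all substrings between matches to the input pattern.

In this case, the input is "\s+": "\s" is a special character that matches any whitespace (space, tab, newline, etc.), and the "+" indicates one or more of the entity preceding it. Thus, the regular expression matches any substring consisting of one or more whitespace characters.

The split() method here is basically a convenience routine built upon this pattern matching behavior; more fundamental is the match() method, which tells you whether the beginning of a string matches the pattern:

In [40]:
for s in [" ", "abc ", " abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

' ' matches
'abc ' does not match
' abc' matches

Like split(), there are similar convenience routines to find the first match (like str.index() or str.find()) or to find and replace (like str.replace()). We'll again use the line from before:

In [41]: line = 'the quick brown fox jumped over a lazy dog'

With this, we can see that the regex.search() method operates a lot like str.index() or str.find():

In [43]:
regex = re.compile('fox')
match = regex.search(line)
match.start()

Similarly, the regex.sub() method operates much like str.replace():

In [44]: line.replace('fox', 'BEAR')
Out[44]: 'the quick brown BEAR jumped over a lazy dog'

In [45]: regex.sub('BEAR', line)
Out[45]: 'the quick brown BEAR jumped over a lazy dog'

With a bit of thought, other native string operations can also be cast as regular expressions.

A more sophisticated example

But, you might ask, why would you want to use the more complicated and verbose syntax of regular expressions rather than the more intuitive and simple string methods? The advantage is that regular expressions offer far more flexibility.

Here we'll consider a more complicated example: the common task of matching email addresses. I'll start by simply writing a (somewhat indecipherable) regular expression, and then walk through what is going on. Here it goes:

In [46]: email = re.compile(r'\w+@\w+\.[a-z]{3}')

Using this, if we're given a line from a document, we can quickly extract things that look like email addresses:

In [47]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

Out[47]: ['guido@python.org', 'guido@google.com']

(Note that these addresses are entirely made up; there are probably better ways to get in touch with Guido.)

We can do further operations, like replacing these email addresses with another string, perhaps to hide addresses in the output:

In [48]: email.sub('--@--.--', text)
Out[48]: 'To email Guido, try --@--.-- or the older address --@--.--.'

Finally, note that if you really want to match any email address, the preceding regular expression is far too simple. For example, it only allows addresses made of alphanumeric characters that end in a three-letter lower-case suffix, as several common domain suffixes do.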
So, for example, the period used here means that we only find part of the address:

In [49]: email.findall('barack.obama@whitehouse.gov')
Out[49]: ['obama@whitehouse.gov']

This goes to show how unforgiving regular expressions can be if you're not careful! If you search around online, you can find some suggestions for regular expressions that will match all valid emails, but beware: they are much more involved than the simple expression used here!

Basics of regular expression syntax

The syntax of regular expressions is much too large a topic for this short section. Still, a bit of familiarity can go a long way: I will walk through some of the basic constructs here, and then list some more complete resources from which you can learn more. My hope is that the following quick primer will enable you to use these resources effectively.

Simple strings are matched directly

If you build a regular expression on a simple string of characters or digits, it will match that exact string:

In [50]:
regex = re.compile('ion')
regex.findall('Great Expectations')

Some characters have special meanings

While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:

. ^ $ * + ? { } [ ] \ | ( )
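We will discuss the meaning of some of these momentarily. In the meantime, you should know that if you'd like to match any of these characters directly, you can escape them with a back-slash:

In [51]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

The r preface in r'\$' indicates a raw string. In standard Python strings, the backslash is used to indicate special characters; for example, a tab is indicated by "\t", so print('a\tb\tc') prints the three letters separated by tabs. Such substitutions are not made in a raw string: print(r'a\tb\tc') prints a\tb\tc literally. For this reason, whenever you use backslashes in a regular expression, it is good practice to use a raw string.

Special characters can match character groups

Just as the "\" character within regular expressions can escape special characters, turning them into normal characters, it can also be used to give normal characters special meaning. These special characters match specified groups of characters, and we've seen them before. In the email address pattern from before, we used "\w", a special marker matching any alphanumeric character. Similarly, in the simple split() example, we saw "\s", a special marker indicating any whitespace character.

Putting these together, we can create a regular expression that will match any two letters/digits with whitespace between them:

In [54]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

Out[54]: ['e f', 'x i', 's 9', 's o']

This example begins to hint at the power and flexibility of regular expressions.

The following table lists a few of these characters that are commonly useful:

Character    Description
"\d"         Match any digit
"\D"         Match any non-digit
"\s"         Match any whitespace
"\S"         Match any non-whitespace
"\w"         Match any alphanumeric character (letter, digit, or underscore)
"\W"         Match any non-alphanumeric character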
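This is not a comprehensive list or description; for more details, see Python's regular expression syntax documentation.

Square brackets match custom character groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in. For example, the following will match any lower-case vowel:

In [55]:
regex = re.compile('[aeiou]')
regex.split('consequential')

Out[55]: ['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example, "[a-z]" will match any lower-case letter, and "[1-3]" will match any of "1", "2", or "3". For instance, you may need to extract from a document specific codes that consist of a capital letter followed by a digit. You could do this as follows:

In [56]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

Wildcards match repeated characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, "\w\w\w". Because this is such a common need, there is a specific syntax to match repetitions: curly braces with a number:

In [57]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

Out[57]: ['The', 'qui', 'bro', 'fox']

There are also markers available to match any number of repetitions; for example, the "+" character will match one or more repetitions of what precedes it:

In [58]:
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

Out[58]: ['The', 'quick', 'brown', 'fox']

The following is a table of the repetition markers available for use in regular expressions:

Character    Description
?            Match zero or one repetitions of the preceding item
*            Match zero or more repetitions of the preceding item
+            Match one or more repetitions of the preceding item
{n}          Match exactly n repetitions of the preceding item
{m,n}        Match between m and n repetitions of the preceding item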
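To make the repetition markers ?, *, +, {n}, and {m,n} concrete, here is a minimal sketch that tests each one against the same handful of strings. The sample strings and the names samples and results are invented for this illustration; re.fullmatch() is used so that a pattern only counts as matching when it covers the entire string.

```python
import re

# Invented sample strings: "c", then zero to three "a" characters, then "t".
samples = ['ct', 'cat', 'caat', 'caaat']

patterns = [r'ca?t', r'ca*t', r'ca+t', r'ca{2}t', r'ca{2,3}t']

results = {}
for pattern in patterns:
    # Keep the samples that the pattern matches from start to end.
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, results[pattern])

# ca?t      -> ['ct', 'cat']                  (zero or one "a")
# ca*t      -> ['ct', 'cat', 'caat', 'caaat'] (zero or more)
# ca+t      -> ['cat', 'caat', 'caaat']       (one or more)
# ca{2}t    -> ['caat']                       (exactly two)
# ca{2,3}t  -> ['caat', 'caaat']              (two or three)
```

Each marker applies only to the item immediately before it, which here is the single character "a".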
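With these basics in mind, let's return to our email address matcher:

In [59]: email = re.compile(r'\w+@\w+\.[a-z]{3}')

We can now understand what this means: we want one or more alphanumeric characters ("\w+"), followed by the at sign ("@"), followed by one or more alphanumeric characters ("\w+"), followed by a period ("\."; note the need for a backslash escape), followed by exactly three lower-case letters.

If we want to now modify this so that the Obama email address matches, we can do so using the square-bracket notation:

In [60]:
email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')

Out[60]: ['barack.obama@whitehouse.gov']

We have changed "\w+" to "[\w.]+", so we will match any alphanumeric character or a period. With this more flexible expression, we can match a wider range of email addresses, though still not all of them.

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to group the results:

In [61]: email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [62]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

Out[62]: [('guido', 'python', 'org'), ('guido', 'google', 'com')]

As we see, this grouping actually extracts a list of the sub-components of the email address.

We can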
go a bit further and name the extracted components using the "(?P<name> )" syntax, in which case the groups can be extracted as a Python dictionary:

In [63]:
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@python.org')
match.groupdict()

Out[63]: {'domain': 'python', 'suffix': 'org', 'user': 'guido'}

Combining these ideas (as well as some of the powerful regexp syntax that we have not covered here) allows you to flexibly and quickly extract information from strings in Python.

Further Resources on Regular Expressions

The above discussion is just a quick (and far from complete) treatment of this large topic. If you'd like to learn more, I recommend the following resources:
For some examples of string manipulation and regular expressions in action at a larger scale, see Pandas: Labeled Column-oriented Data, where we look at applying these sorts of expressions across tables of string data within the Pandas package.
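As a small-scale illustration of applying one expression across many strings, the sketch below runs the named-group pattern email4 from earlier over a plain Python list rather than a Pandas table. The rows and the addresses in them are invented for this example, and search() is used instead of match() so that the address may appear anywhere in the row.

```python
import re

# Same pattern as email4 in the text above.
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')

# Hypothetical sample data; these addresses are invented for illustration.
rows = [
    "contact: alice.smith@example.com",
    "no address on this row",
    "bob@example.org (primary)",
]

records = []
for row in rows:
    match = email4.search(row)  # search() scans the whole string for a match
    # Store the named groups as a dictionary, or None when nothing matched.
    records.append(match.groupdict() if match else None)

for row, record in zip(rows, records):
    print(repr(row), "->", record)
```

Rows without an address yield None here, so any later processing needs to handle that case explicitly.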