September 22nd, 2003, 08:03 PM
Python and Regular expressions
I am trying to learn python and regular expressions.
I am trying to figure out a way to pull the following line out of a string and then pull the IP address out of it. I have opened the file and read it into a string; I am just having trouble matching the following line (in Python, with the re module).
IP Address:</td><td><font face=verdana size=2>anyipaddress</td>
Any help would be appreciated.
September 23rd, 2003, 02:10 AM
I'm a lil unclear here: which part of the string did you want? Oh, and does 'IP Address' appear at the beginning of the line, or was this just an example?
Anyway, I made a very small regexp and matched it no problem. Of course there are some improvements you might want to make to this, e.g. replacing the .+? inside the group with something like [0-9\.]+
>>> import re
>>> s = 'IP Address:</td><td><font face=verdana size=2>anyipaddress</td>'
>>> re.findall('[a-zA-Z]:<.*>(.+?)<', s)
['anyipaddress']
Note: since I didn't know which part of the string you wanted, I've just gone and grabbed the 'anyipaddress' part. You can change which part of the match gets returned from findall by moving the () group or adding more groups.
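Here is a minimal sketch of the tightening suggested above, where the group only accepts digits and dots. The sample address and the find_ip name are made up for illustration, and the pattern is only meant for a single line like the one in the question:

```python
import re

# Hypothetical sample line; the address is a placeholder, not real data.
sample = 'IP Address:</td><td><font face=verdana size=2>192.168.0.1</td>'

def find_ip(text):
    """Return the digit-and-dot runs captured by the group (illustrative helper)."""
    # Same shape as the regex above, but the group only accepts digits and dots.
    return re.findall(r'[a-zA-Z]:<.*>([0-9.]+)<', text)

print(find_ip(sample))  # ['192.168.0.1']
```

Because the group can no longer match letters or angle brackets, a stray tag can't end up in the result.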
September 23rd, 2003, 08:25 AM
Hey, I'm kind of confused about this part that you posted:
the first parameter for the findall() function looks very cryptic, which is sort of boggling to me because i didn't think you could code in python like that. i've never seen that kind of notation, or whatever that's called.
could ya tell me what it is, or where I might be able to look that kind of notation up? i mean, when i was reading coscarart's question i was thinking up a way to do it, but the way i was concocting in my head was a hell of a lot more complicated...
thanks for helping a python newbie learn!
September 23rd, 2003, 09:06 AM
In most of the docs on regex you'll see re.compile() a lot. It's a great object in itself, but for such a small task I hardly see the point, especially when Python allows you to pass the pattern straight to the function (most of the functions in re allow this). But if you're going to use the same regex over and over, it's probably a good idea to compile it first.
The re.findall() function is pretty simple itself: you pass it a pattern and it returns all the parts of each match that fall within '()'. Definitely easier to use than match. The regex I used is simple, so I'm guessing you understood that?
Anyway, I hope that this answers your questions. If not, feel free to ask more, always happy to help if I can.
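A quick sketch of both points, assuming a made-up text string (none of these values come from the router page): the module-level call and the compiled pattern give the same result, and what findall() returns depends on how many groups the pattern has.

```python
import re

# Made-up sample text for illustration only.
text = 'gateway 10.0.0.1 dns 10.0.0.2 size=2 width=10'

# Module-level call: the pattern string is passed directly.
direct = re.findall(r'[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+', text)

# Compiled once, then reused as often as needed.
ip_pattern = re.compile(r'[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+')
compiled = ip_pattern.findall(text)

# No group: whole matches. One group: just that group. Several groups: tuples.
one_group = re.findall(r'dns ([0-9.]+)', text)
pairs = re.findall(r'(\w+)=(\d+)', text)

print(direct)     # ['10.0.0.1', '10.0.0.2']
print(compiled)   # ['10.0.0.1', '10.0.0.2']
print(one_group)  # ['10.0.0.2']
print(pairs)      # [('size', '2'), ('width', '10')]
```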
September 23rd, 2003, 09:19 AM
haha... i was more referring to the first parameter of the findall() function
that looks almost Perl-like to me or something... i have no clue what that does!
thanks again. i'm so glad i found this board, it's fun just learning tidbits of python here and there outside of what i'm using it for, ya kno? python is so great... i'm dismayed at the fact that this internship is going to end at some point and then i'll have to go back to school and use C or Java (which I used to love) in those programming courses =/
September 23rd, 2003, 09:27 AM
well, i neglected to say... i know what the function does... you've extracted 'anyipaddress' out of the string s and the function returns that string in a list. i'll assume that the string i was confused about is like a string template for telling the re.findall() function what to strip.
i'd just like to know how that works, or where i can find more information on that.
particularly, does [a-zA-Z] mean all characters lowercase and uppercase btwn a and z?
if i were to do something like [a-gH-K] would it denote all lowercase letters btwn a&g, and all uppercase letters btwn H&K?
also, what's with the '<.*>' and '(.+?)' ? oh, and '[0-9\.]' ?
i'd really love to know where i can learn this from, and how i can use this type of notation in diff ways. python never ceases to tickle my curiosity... so many cool features dude!
September 23rd, 2003, 09:48 AM
lol oops sorry, Python has Perl-style regex, which is prob' the reason it looks very Perl-like. Bear with me people, regex are not the easiest thing to explain!
Match one letter, regardless of case, followed by ':<' and 0 or more chars (not '\n') up to the last '>' that still lets the rest of the pattern match. The brackets around the '.+?' tell findall() to return 1 or more chars of any type up to the first '<'. *breathes*
OK, hope that makes some sense to you. In any case, if you learn how to do regex in Perl or PHP you can carry most of it over to Python (and vice versa) without much trouble!
I know what you mean Cv, Python is a great lang, I haven't really touched much else since I picked it up.
But if you're gonna use Java and you're missing Python you could always try Jython (http://www.jython.org/), just one of the tools in the Python programmer's arsenal.. and I don't think Java has anything on Python anyway!
September 23rd, 2003, 09:51 AM
Note to netytan: \w is close to [a-zA-Z] (it actually matches [a-zA-Z0-9_]) and you should really use <.*?> so that the * isn't greedy
September 23rd, 2003, 09:55 AM
haha ok... so all that is part of regular expressions (or regex) or something... cool, i'll look that up and try to learn. thanks!
September 23rd, 2003, 10:01 AM
I'd suggest you have a look on Google for a good regex tutorial, e.g. http://www.amk.ca/python/howto/regex/
You have the a-zA-Z concept down, and the 0-9 thing works in exactly the same way, so [0-9\.] will match any single digit or '.'
. = any char except a '\n' (unless told otherwise)
+ = 1 or more occurrences of the preceding item, e.g. '.'
* = 0 or more; like +, this will match as many as it can (greedy)
? = after + or *, stops them from being greedy, kinda like a girlfriend
\ = escapes a special char (like " or ' in strings)
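Each of those can be tried out in a few lines. This is just a sketch with made-up test strings, nothing from the router page:

```python
import re

# Character classes: ranges can be mixed freely inside [].
letters = re.findall(r'[a-gH-K]', 'abcxyzHIJXYZ')
digits_dots = re.findall(r'[0-9.]+', 'DNS: 10.0.0.1 and 10.0.0.2')
word_chars = re.findall(r'\w+', 'ip_addr-2')   # \w covers letters, digits and _

# . matches any char except a newline
dot = re.findall(r'a.c', 'abc a\nc')

# * is 0 or more, + is 1 or more
star = re.findall(r'ab*', 'a ab abbb')
plus = re.findall(r'ab+', 'a ab abbb')

# Greedy vs non-greedy
greedy = re.findall(r'<.+>', '<b>hi</b>')
lazy = re.findall(r'<.+?>', '<b>hi</b>')

# \ turns the special '.' into a literal dot
escaped = re.findall(r'1\.5', '1.5 1x5')

print(letters)      # ['a', 'b', 'c', 'H', 'I', 'J']
print(digits_dots)  # ['10.0.0.1', '10.0.0.2']
print(word_chars)   # ['ip_addr', '2']
print(dot)          # ['abc']
print(star, plus)   # ['a', 'ab', 'abbb'] ['ab', 'abbb']
print(greedy)       # ['<b>hi</b>']
print(lazy)         # ['<b>', '</b>']
print(escaped)      # ['1.5']
```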
Last edited by netytan; September 23rd, 2003 at 10:04 AM.
September 23rd, 2003, 10:08 AM
Note to Strike: it needed to be greedy; with <.*?> the group would grab '<td>' instead of the address, so the regex wouldn't work. Thanks for the \w though, that totally slipped my mind.
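The difference is easy to check against the sample line from the first reply. A small sketch (the variable names are just for illustration):

```python
import re

s = 'IP Address:</td><td><font face=verdana size=2>anyipaddress</td>'

# Greedy .* runs to the last '>' that still lets the group match.
greedy_result = re.findall(r'[a-zA-Z]:<.*>(.+?)<', s)

# Non-greedy .*? stops at the first '>', so the group swallows the next tag.
lazy_result = re.findall(r'[a-zA-Z]:<.*?>(.+?)<', s)

print(greedy_result)  # ['anyipaddress']
print(lazy_result)    # ['<td>']
```

So whether greedy is right depends on the rest of the pattern: here the group accepts any character, which is why the non-greedy version latches onto a tag.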
September 23rd, 2003, 12:24 PM
I am sorry if I was not clear and wasted your time, but thank you for trying to help! The thing I am trying to parse is the HTML of a Linksys router page, so the HTML is actually really, really long. Here is a chunk.
somestuff....IP Address:</td><td><font face=verdana size=2>220.127.116.11</td></tr><tr><td bgcolor=6666cc> <font color=white face=verdana size=2>Subnet Mask:</td><td><font face=verdana size=2>18.104.22.168</td></tr><tr><td bgcolor=6666cc> <font color=white face=verdana size=2>Default Gateway:</td><td><font face=verdana size=2>22.214.171.124</td></tr><tr><td bgcolor=6666cc> <font color=white face=verdana size=2>DNS:</td><td><font face=verdana size=2>126.96.36.1994<br>188.8.131.52<br>0.0.0.0</td></tr><tr><td bgcolor=6666cc> <font></th></tr></table></center></body></html> more stuff.....
So this chunk is just part of the larger page. What I want to extract is the IP address right after the words IP Address: (the first address in the chunk above).
Any help would be appreciated. Also thanks for the link to the python regex howto!
September 23rd, 2003, 12:29 PM
I suggest that you just strip all the HTML out first and then use regexes to find the data based on the surrounding text.
To remove all the HTML tags in a string s, you would do re.sub('<.*?>', '', s).
Example (note: string breaks are my edits, weren't actually used in the code - simply done so that the page isn't a mile wide):
Note that in the output below the DNS entries run together (and one is an invalid IP address ..), so you may want to replace every <br> tag with a space as well.
>>> s = 'IP Address:</td><td><font face=verdana size=2>184.108.40.206</td></tr><tr><td bgcolor=6666cc> <font color=white face=verdana size=2>
Subnet Mask:</td><td><font face=verdana size=2>220.127.116.11</td></tr><tr><td bgcolor=6666cc>
<font color=white face=verdana size=2>Default Gateway:</td><td><font face=verdana size=2>
18.104.22.168</td></tr><tr><td bgcolor=6666cc> <font color=white face=verdana size=2>DNS:</td><td><font
face=verdana size=2>22.214.171.1244<br>126.96.36.199<br>0.0.0.0</td></tr><tr><td bgcolor=6666cc> <font></th></tr></table></center></body></html>'
>>> re.sub('<.*?>', '', s)
'IP Address:188.8.131.52 Subnet Mask:184.108.40.206 Default Gateway:220.127.116.11 DNS:18.104.22.168422.214.171.124.0.0.0 '
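One way to handle the <br> tags, as a rough sketch: swap them for spaces first, then strip the remaining tags. The strip_tags name and the sample addresses are made up for illustration.

```python
import re

def strip_tags(html):
    """Replace <br> tags with spaces, then remove every other tag."""
    text = re.sub(r'<br\s*/?>', ' ', html, flags=re.IGNORECASE)
    return re.sub(r'<.*?>', '', text)

# Made-up DNS row in the same style as the router page.
sample = 'DNS:</td><td><font size=2>10.0.0.1<br>10.0.0.2<br>0.0.0.0</td>'

print(strip_tags(sample))  # DNS:10.0.0.1 10.0.0.2 0.0.0.0
```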
Last edited by Strike; September 23rd, 2003 at 12:31 PM.
September 23rd, 2003, 01:50 PM
Thanks! Getting rid of all of the HTML made it easy to get what I wanted! The code I have is below.
I then use the resulting IP to update my dynamic DNS service.
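import os
import re
# Fetch the router status page with wget, read it in, then delete the temp file
os.system("wget -O/tmp/Status.htm --http-pass='nothing' --http-user='nothing' http://192.168.0.1/Status.htm")
status = open('/tmp/Status.htm').read()
os.system('rm /tmp/Status.htm')
# Strip the HTML tags and the '&nbsp' of each '&nbsp;' entity, leaving a lone ';'
ipline = re.sub('<.*?>|&nbsp', '', status)
# Turn each leftover ';' into a line break so every field sits on its own line
ipline = re.sub(';', '\n', ipline)
ipline = re.sub('\n[+]', '\n', ipline)
# findall() returns a list, so take the first match below
ip = re.findall('IP Address:.*', ipline)
# Drop the label (letters and colon) so only the address is left
ipiwant = re.sub('[a-zA-Z:]', '', ip[0]).strip()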
This is the first program I have ever written so I know it probably sucks, but it works! Thanks for the help!
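For comparison, a shorter variant of the same idea: strip the tags, then let a group capture the address directly. This is only a sketch; the extract_ip name and the sample HTML are made up, and it assumes the address follows the label once the tags are gone.

```python
import re

def extract_ip(html):
    """Return the address that follows 'IP Address:', or None if it is missing."""
    text = re.sub(r'<.*?>', '', html)                    # drop every tag
    match = re.search(r'IP Address:\s*([0-9.]+)', text)  # capture digits and dots
    return match.group(1) if match else None

# Made-up chunk in the same style as the router page.
sample = ('IP Address:</td><td><font face=verdana size=2>10.0.0.5</td></tr>'
          '<tr><td bgcolor=6666cc>&nbsp;<font>Subnet Mask:</td><td>255.255.255.0</td>')

print(extract_ip(sample))  # 10.0.0.5
```

Returning None when the label is absent avoids an IndexError if the router page ever changes.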
By the way does anyone know if there is a python module that can pull a password protected file from a server? I looked at urllib and it couldn't do it so I used wget.
Last edited by coscarart; September 23rd, 2003 at 01:52 PM.
September 23rd, 2003, 03:02 PM
urllib2 can do it, what problems were you having? It's just a matter of how you pass the password in. I'm not sure how you do it, honestly, but I imagine it's just a header that you set.
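For what it's worth, here is a rough sketch of how that could look in Python 3, where urllib2's functionality lives in urllib.request. It assumes the router uses HTTP Basic authentication (which the wget --http-user/--http-pass options suggest); the function name and the credentials are placeholders.

```python
import urllib.request

def build_opener_with_basic_auth(url, user, password):
    """Return an opener that answers HTTP Basic auth challenges for url."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None means "use these credentials for any realm at this URL"
    manager.add_password(None, url, user, password)
    handler = urllib.request.HTTPBasicAuthHandler(manager)
    return urllib.request.build_opener(handler)

# Placeholder URL and credentials, taken from the wget command above.
status_url = 'http://192.168.0.1/Status.htm'
opener = build_opener_with_basic_auth(status_url, 'nothing', 'nothing')

# Fetching the page would then be (not run here, since it needs the router):
# status = opener.open(status_url).read().decode('latin-1')
```

That would replace the wget call and the temp file, since the page comes back as a string directly.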