Anyone a regex guru?

Octavo

Hi all,

I am actively working on this, but I thought perhaps someone here can answer me quicker than I can learn, assimilate and produce the required regex.

I need a regular expression that will scan an HTML string and capture the src value of every img tag in a group.

So in other words, the following html:
Code:
<img id="img1" src="1.jpg"></img> <br> <img src='2.jpg' />

should result in 1.jpg and 2.jpg as output.

Many thanks in advance -- I'll post the answer if I get it before any replies here :)

(for those still scratching their heads: Regex)
 
OK, so far I have this

Code:
src\s*=\s*(?:\"(?<1>[^\"]*)\"|(?<1>\S+))

That returns me the contents of all the src= bits in any given string, but I want to further restrict it, to only do this inside an img tag - otherwise I'm going to be getting frame src's as well.
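
To illustrate the problem described here, a minimal Python sketch. It is an adaptation, not the original pattern: Python's re module does not accept the .NET-style numbered names (?<1>...), so two plain groups are assumed instead, and the sample HTML and the name src_values are made up for the example.

```python
import re

# Python adaptation of the pattern above (assumption: two plain groups
# replace the .NET-style (?<1>...) groups, which Python does not accept).
SRC_ANY = re.compile(r'src\s*=\s*(?:"([^"]*)"|(\S+))')

def src_values(html):
    # Only one of the two groups participates in any given match.
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in SRC_ANY.finditer(html)]

html = '<iframe src="page.html"></iframe> <img id="img1" src="1.jpg">'
print(src_values(html))  # ['page.html', '1.jpg']
```

The iframe source comes back along with the image source, which is the reason the match needs to be tied to the img tag.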
 
OK. It seems you need a Perl regular expression, right? Good. Do you need it to be multiline? If you don't need it to be absolutely foolproof, something like
Code:
<img\b[^>]+\bsrc=["']([^"']+)["']
might be enough (untested). For clarity, I removed some escapes that didn't belong there; assuming that you only put them there because of the requirements of the programming language you use. Hm, actually that should probably work even as a multiline regexp. Hm, now that I think of it, the regexp should actually be able to handle most real-world situations, due to the syntax requirements of HTML. It does mishandle src="foo'bar", but I believe this could be dealt with by using a back reference.

<shameless plug>If you're on Linux, you might like to give regexxer a try, since it makes playing around with Perl regexps interactively a bit easier.</shameless plug>

ETA: Oh, and you might want to make it ignore case, too. :)
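
For anyone who wants to try the pattern above, here is a rough Python sketch (Python is only an assumption for the demo; the sample HTML is invented). It applies the regex with case-insensitive matching, as suggested in the ETA, and includes a tag broken across lines.

```python
import re

# The pattern from the post, compiled with IGNORECASE so <IMG ...> also matches.
IMG_SRC = re.compile(r'''<img\b[^>]+\bsrc=["']([^"']+)["']''', re.IGNORECASE)

html = """<img id="img1" src="1.jpg"></img> <br>
<IMG
   src='2.jpg' /> <iframe src="page.html"></iframe>"""

# [^>]+ is a negated character class, so it also matches newlines;
# no special multiline flag is needed for tags that span lines.
print(IMG_SRC.findall(html))  # ['1.jpg', '2.jpg']
```

The iframe source is skipped because the match has to start at an img tag.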
 
What regexp library are you using? It looks like Perl or something similar. Anyway, one thing you'll have to take into account is that there might be line breaks inside a tag, so the regexp has to cope with tags that span several lines. Also, remember that img tags might contain JavaScript code or CSS or other weird stuff.

Things like these are usually better solved by a real parser than trying to find a regular expression that fits. There are lots of HTML parsers, but of course it depends on what language you are using.
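
As a sketch of the parser approach, assuming Python and its standard-library html.parser module (the class name ImgSrcCollector and the sample HTML are made up for illustration):

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)

html = """<img id="img1" src="1.jpg"></img> <br> <IMG src='2.jpg' /> <iframe src="page.html"></iframe>"""

collector = ImgSrcCollector()
collector.feed(html)
collector.close()
print(collector.sources)  # ['1.jpg', '2.jpg']
```

The parser takes care of quoting and case, so there is no pattern to tune, at the cost of a little more code than a one-line regex.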
 
I agree -- if this is a real program. For a one-off task or manually observed maintenance stuff the regex solution is fine, in my opinion. Parsing HTML for real is a pain, unless you restrict yourself to XHTML and use an XML parser; which would of course be great if you have that choice.
 
Why not just scan for the types of images you're trying to extract? Are you expecting anything other than .gif, .jpg, and .png? If not, something like /\w+\.gif|\w+\.jpg|\w+\.png/ should work.
 
http://xkcd.com/208/

No, sorry, I can't be of any real help here.

Depending on what it is you're trying to do it *might* be possible to use just a bit of JavaScript, though, to list all the images of a page? (I realize this would not be a good suggestion in most cases, of course.)
 
Wow, thanks for the great responses from everyone! There's some good stuff here - and here I was thinking that this was probably the wrong forum to post this sort of question!! Next time I have a development issue, I'm coming here first! :)

That's exactly what I was looking for - thanks danielk!

I'm using C#, actually. Your solution is pretty much exactly what I was looking for, and yes, it does handle multiline as it is! Thanks for your help.

skoob said:
Things like these are usually better solved by a real parser than trying to find a regular expression that fits.

That would be overkill for this particular requirement, but I agree with you in principle - if I were going to be doing anything more with the HTML, I would seriously consider it, but for now the Regex will serve my needs nicely :)

bokonon said:
Why not just scan for the types of images you're trying to extract? Are you expecting anything other than .gif, .jpg, and .png? If not, something like /\w+\.gif|\w+\.jpg|\w+\.png/ should work.

That's a cool idea bokonon, but the danger is that there may happen to be .jpg or .gif in other tags (an href to an image for example) or even in the body copy of the page itself. I specifically need the images inside img tags.
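
A quick Python sketch of that danger (the sample HTML is invented for the example):

```python
import re

# The extension-based pattern quoted above.
EXT_SCAN = re.compile(r'\w+\.gif|\w+\.jpg|\w+\.png')

html = '<a href="big.jpg">See photo.jpg</a> <img src="thumb.png">'

# Matches the href target and the body copy as well as the img source.
print(EXT_SCAN.findall(html))  # ['big.jpg', 'photo.jpg', 'thumb.png']
```

Only thumb.png comes from an img tag; the other two hits are the kind of false positives described above.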


Rasmus said:
Depending on what it is you're trying to do it *might* be possible to use just a bit of Javascript, though, to list all the images of a page? (I realize this would not be a good suggestion in most cases, of course.)

Thanks Rasmus -- if I were on the client side, I would have considered JavaScript, but I'm actually writing a server-side component.

chulbert said:
Assuming you're not required to use a regular expression, this problem has been solved by HTML::LinkExtor:

http://search.cpan.org/dist/HTML-Par...L/LinkExtor.pm

Reliably performing this task using regular expressions can be difficult. You have to guard against things like embedded quotes and malicious image names that look like HTML.

Guess I should have mentioned in the OP that I was using C# :)

Thanks to all you guys for responding with some great tips and thanks to danielk for the solution!
 
Thank you for taking the time to respond to everyone who offered a solution. Even though my solution wasn't quite what you needed this time, I appreciate the feedback. Good luck with your development!
 
A reasonable programming language provides patterns in which spaces are ignored.

What will happen with: ... 'src="1.jpg"' ...

~~ Paul
 
Either you have no idea how regex works and what precisely danielk's solution does, or I'm missing your point by quite a margin...

First off, Regex is supported by most good programming languages - Regex is not a technology specific to one language, and getting it to ignore spaces is not at all a problem. More to the point, danielk's solution does ignore spaces. In fact it ignores everything that does not fit the required pattern.

Which is to first look for an opening img tag "<img", then inside that tag to look for EITHER src=" or src=' and then to find the closing " or ' and return whatever was between the opening src=" and the closing "
 
Nah, I doubt it was Paul's intent to stir up a fight, so let's keep it civil. We're not in the Politics or CT subforums after all. :)
 
Octavo said:
First off, Regex is supported by most good programming languages - Regex is not a technology specific to one language, and getting it to ignore spaces is not at all a problem. More to the point, danielk's solution does ignore spaces. In fact it ignores everything that does not fit the required pattern.
Sorry, I didn't make myself clear. I meant that a reasonable programming language ignores spaces in the patterns themselves. It makes the patterns much easier to read.

Which is to first look for an opening img tag "<img", then inside that tag to look for EITHER src=" or src=' and then to find the closing " or ' and return whatever was between the opening src=" and the closing "
I don't think it works correctly on this chunk of HTML:

src="foo'bar.jpg"

danielk said:
Hm? I'm not sure what you're getting at.
Sorry, the first example I gave probably isn't valid HTML. I believe the example just above is valid.

~~ Paul
 
Sorry, I didn't make myself clear. I meant that a reasonable programming language ignores spaces in the patterns themselves. It makes the patterns much easier to read.
Yes, Perl has that option. No idea about C# though.

I don't think it works correctly on this chunk of HTML: src="foo'bar.jpg"
Indeed it doesn't. I actually pointed that out myself already, see above. :)

Sorry, the first example I gave probably isn't valid HTML. I believe the example just above is valid.
Yeah, I think it's valid. I'm sure this particular problem can be dealt with, but I just wanted to come up with something reasonable quickly.
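
A sketch of both points in one place, assuming Python for the demo: re.VERBOSE lets the pattern contain ignorable whitespace and comments, and a back reference (\1) requires the closing quote to be the same character as the opening one. For what it's worth, .NET has a comparable option, RegexOptions.IgnorePatternWhitespace. The name img_sources and the sample HTML are made up.

```python
import re

IMG_SRC = re.compile(r"""
    <img\b        # start of an img tag
    [^>]+         # anything inside the tag before the attribute
    \bsrc=
    (["'])        # group 1: the opening quote, single or double
    ([^>]+?)      # group 2: the value, as short as possible
    \1            # back reference: the same quote that opened the value
""", re.VERBOSE | re.IGNORECASE)

def img_sources(html):
    return [m.group(2) for m in IMG_SRC.finditer(html)]

html = """<img src="foo'bar.jpg"> <img alt="x" src='2.jpg'>"""
print(img_sources(html))  # ["foo'bar.jpg", '2.jpg']
```

The value group stops only at the quote character that opened it, so an apostrophe inside a double-quoted value no longer ends the match early.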
 
danielk said:
Yeah, I think it's valid. I'm sure this particular problem can be dealt with, but I just wanted to come up with something reasonable quickly.
I think this would do the trick:
Code:
<img\b[^>]+\bsrc=(("[^"]+")|('[^']+'))

~~ Paul
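
A rough Python check of the pattern above (Python is an assumption for the demo; img_sources and the sample HTML are made up). Note that group 1 of this pattern includes the surrounding quotes, so the sketch strips them.

```python
import re

# The alternation pattern from the post, unchanged.
IMG_SRC_ALT = re.compile(r'''<img\b[^>]+\bsrc=(("[^"]+")|('[^']+'))''')

def img_sources(html):
    # group(1) is the quoted value; drop the first and last character.
    return [m.group(1)[1:-1] for m in IMG_SRC_ALT.finditer(html)]

html = """<img src="foo'bar.jpg"> <img alt="x" src='2.jpg'>"""
print(img_sources(html))  # ["foo'bar.jpg", '2.jpg']
```

Each alternative only excludes its own quote character, which is why the apostrophe inside the double-quoted value survives.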
 
