Anyone a regex guru?

Octavo · Dec 4, 2007

Hi all,

I am actively working on this, but I thought perhaps someone here can answer me quicker than I can learn, assimilate and produce the required regex.

I need a regular expression that will scan an HTML string and capture the src of every img tag in a group.

So in other words, the following html:

Code:

<img id="img1" src="1.jpg"></img> <br> <img src='2.jpg' />

Should result in 1.jpg and 2.jpg as output.

Many thanks in advance -- I'll post the answer if I get it before any replies here

(for those still scratching their heads: Regex)

Octavo · Dec 4, 2007

OK, so far I have this

Code:

src\s*=\s*(?:\"(?<1>[^\"]*)\"|(?<1>\S+))

That returns me the contents of all the src= bits in any given string, but I want to further restrict it, to only do this inside an img tag - otherwise I'm going to be getting frame src's as well.

danielk · Dec 4, 2007

Octavo said:
OK, so far I have this

Code:

src\s*=\s*(?:\"(?<1>[^\"]*)\"|(?<1>\S+))

That returns me the contents of all the src= bits in any given string, but I want to further restrict it, to only do this inside an img tag - otherwise I'm going to be getting frame src's as well.

OK. It seems you need a Perl regular expression, right? Good. Do you need it to be multiline? If you don't need it to be absolutely foolproof, something like

Code:

<img\b[^>]+\bsrc=["']([^"']+)["']

might be enough (untested). For clarity, I removed some escapes that didn't belong there; assuming that you only put them there because of the requirements of the programming language you use. Hm, actually that should probably work even as a multiline regexp. Hm, now that I think of it, the regexp should actually be able to handle most real-world situations, due to the syntax requirements of HTML. It does mishandle src="foo'bar", but I believe this could be dealt with by using a back reference.

<shameless plug>If you're on Linux, you might like to give regexxer a try, since it makes playing around with Perl regexps interactively a bit easier.</shameless plug>

ETA: Oh, and you might want to make it ignore case, too.
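
A quick way to check the pattern above is to run it against the HTML from the first post. This is only a sketch in Python (the original poster is using C#, so the `re` calls and the variable names here are illustrative assumptions, not what was actually run in the thread). It applies the case-insensitive flag from the ETA and also shows the embedded-quote weak spot mentioned above.

```python
import re

# danielk's pattern, compiled case-insensitively as the ETA suggests.
IMG_SRC = re.compile(r"""<img\b[^>]+\bsrc=["']([^"']+)["']""", re.IGNORECASE)

# The sample HTML from the first post.
html = """<img id="img1" src="1.jpg"></img> <br> <img src='2.jpg' />"""
sources = IMG_SRC.findall(html)
print(sources)

# [^>] also matches newlines, so a tag split across lines still matches
# without any special multiline option.
multiline_sources = IMG_SRC.findall('<IMG\n  alt="x"\n  SRC="3.jpg">')
print(multiline_sources)

# The known weak spot: a quote character inside the value cuts it short.
truncated = IMG_SRC.findall('''<img src="foo'bar.jpg">''')
print(truncated)
```

The last call returns only `foo`, because the character class stops at the first quote of either kind.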

skoob · Dec 4, 2007

What regexp library are you using? It looks like Perl or something similar. Anyway, one thing you'll have to take into account is that there might be line breaks, so you'll need to use multi-line regexps. Also, remember that img tags might contain Javascript code or CSS or other weird stuff.

Things like these are usually better solved by a real parser than trying to find a regular expression that fits. There are lots of HTML parsers, but of course it depends on what language you are using.
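
For comparison, here is roughly what the parser route looks like. This is a minimal sketch using Python's standard-library `html.parser`, chosen only for illustration since the thread has not settled on a language at this point; the class name `ImgSrcCollector` and the helper `img_sources` are made up for the example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        # Self-closing tags (<img ... />) are routed here as well by the
        # default handle_startendtag implementation.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Hypothetical helper: return all img src values found in html."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


sample = (
    '<img id="img1" src="1.jpg"></img> <br> <img src=\'2.jpg\' />'
    '<iframe src="page.html"></iframe>'
    '<IMG SRC="foo\'bar.jpg">'
)
print(img_sources(sample))
```

The iframe is skipped and the value containing an apostrophe comes through intact, which are the two cases a simple regex tends to trip over.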

danielk · Dec 4, 2007

skoob said:
Things like these are usually better solved by a real parser than trying to find a regular expression that fits. There are lots of HTML parsers, but of course it depends on what language you are using.

I agree -- if this is a real program. For a one-off task or manually observed maintenance stuff the regex solution is fine, in my opinion. Parsing HTML for real is a pain, unless you restrict yourself to XHTML and use an XML parser; which would of course be great if you have that choice.

bokonon · Dec 4, 2007

Why not just scan for the types of images you're trying to extract? Are you expecting anything other than .gif, .jpg, and .png? If not, something like /\w+\.gif|\w+\.jpg|\w+\.png/ should work.

Rasmus · Dec 4, 2007

http://xkcd.com/208/

No, sorry, I can't be of any real help here.

Depending on what it is you're trying to do it *might* be possible to use just a bit of Javascript, though, to list all the images of a page? (I realize this would not be a good suggestion in most cases, of course.)

chulbert · Dec 4, 2007

Assuming you're not required to use a regular expression, this problem has been solved by HTML::LinkExtor:

http://search.cpan.org/dist/HTML-Parser/lib/HTML/LinkExtor.pm

Reliably performing this task using regular expressions can be difficult. You have to guard against things like embedded quotes and malicious image names that look like HTML.

Octavo · Dec 5, 2007

Wow, thanks for the great responses from everyone! There's some good stuff here - and here I was thinking that this was probably the wrong forum to post this sort of question!! Next time I have a development issue, I'm coming here first!

danielk said:
OK. It seems you need a Perl regular expression, right? Good. Do you need it to be multiline? If you don't need it to be absolutely foolproof, something like

Code:

<img\b[^>]+\bsrc=["']([^"']+)["']

might be enough (untested). For clarity, I removed some escapes that didn't belong there; assuming that you only put them there because of the requirements of the programming language you use. Hm, actually that should probably work even as a multiline regexp. Hm, now that I think of it, the regexp should actually be able to handle most real-world situations, due to the syntax requirements of HTML. It does mishandle src="foo'bar", but I believe this could be dealt with by using a back reference.

ETA: Oh, and you might want to make it ignore case, too.

That's exactly what I was looking for - thanks danielk!

I'm using C#, actually. Your solution is pretty much exactly what I was looking for, and yes, it does handle multiline as it is! Thanks for your help.

skoob said:
Things like these are usually better solved by a real parser than trying to find a regular expression that fits.

That would be overkill for this particular requirement, but I agree with you in principle - if I were going to be doing anything more with the HTML, I would seriously consider it, but for now the Regex will serve my needs nicely.

bokonon said:
Why not just scan for the types of images you're trying to extract? Are you expecting anything other than .gif, .jpg, and .png? If not, something like /\w+\.gif|\w+\.jpg|\w+\.png/ should work.

That's a cool idea bokonon, but the danger is that there may happen to be .jpg or .gif in other tags (an href to an image for example) or even in the body copy of the page itself. I specifically need the images inside img tags.
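
The concern is easy to reproduce. The sketch below (Python, for illustration only; the sample HTML is invented) runs the extension-based pattern over a snippet that mentions images outside of img tags.

```python
import re

# bokonon's extension-based scan, without the surrounding slashes.
EXT_SCAN = re.compile(r"\w+\.gif|\w+\.jpg|\w+\.png")

# Invented sample: a link to an image, a filename in body text,
# and one real img tag.
html = '<a href="big.png">see photo.jpg</a> <img src="thumb.gif">'
hits = EXT_SCAN.findall(html)
print(hits)

# \w+ also stops at characters such as "/" and "-", so paths get clipped.
clipped = EXT_SCAN.findall('<img src="images/my-pic.jpg">')
print(clipped)
```

All three filenames are reported even though only `thumb.gif` sits in an img tag, and the second call returns just `pic.jpg` rather than the full path.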

Rasmus said:
Depending on what it is you're trying to do it *might* be possible to use just a bit of Javascript, though, to list all the images of a page? (I realize this would not be a good suggestion in most cases, of course.)

Thanks Rasmus -- if I was on the client-side, I would have considered javascript, but I'm actually writing a server-side component.

chulbert said:
Assuming you're not required to use a regular expression, this problem has been solved by HTML::LinkExtor:

http://search.cpan.org/dist/HTML-Par...L/LinkExtor.pm

Reliably performing this task using regular expressions can be difficult. You have to guard against things like embedded quotes and malicious image names that look like HTML.

Guess I should have mentioned in the OP that I was using C#.

Thanks to all you guys for responding with some great tips and thanks to danielk for the solution!

bokonon · Dec 5, 2007

Thank you for taking the time to respond to everyone who offered a solution. Even though my solution wasn't quite what you needed this time, I appreciate the feedback. Good luck with your development!

Octavo · Dec 5, 2007

Hey - you guys took the time to help me out, the least I can do is acknowledge your help

danielk · Dec 5, 2007

Cool. I'm glad it worked out so well.

BenBurch · Dec 5, 2007

http://txt2regex.sourceforge.net/ <--- "English" to regex compiler.

Paul C. Anagnostopoulos · Dec 6, 2007

A reasonable programming language provides patterns in which spaces are ignored.

What will happen with: ... 'src="1.jpg"' ...

~~ Paul

danielk · Dec 6, 2007

Paul C. Anagnostopoulos said:
A reasonable programming language provides patterns in which spaces are ignored.

What will happen with: ... 'src="1.jpg"' ...

Hm? I'm not sure what you're getting at.

Octavo · Dec 7, 2007

Paul C. Anagnostopoulos said:
A reasonable programming language provides patterns in which spaces are ignored.

What will happen with: ... 'src="1.jpg"' ...

~~ Paul

Either you have no idea how regex works and what precisely danielk's solution does, or I'm missing your point by quite a margin...

First off, Regex is supported by most good programming languages - Regex is not a technology specific to one language, getting it to ignore spaces is not at all a problem and, more to the point, danielk's solution does ignore spaces. In fact it ignores everything that does not fit the required pattern.

Which is to first look for an opening img tag "<img", then inside that tag to look for EITHER src=" or src=' and then to find the closing " or ' and return whatever was between the opening src=" and the closing "

danielk · Dec 8, 2007

Nah, I doubt it was Paul's intent to stir up a fight, so let's keep it civil. We're not in the Politics or CT subforums after all.

Paul C. Anagnostopoulos · Dec 9, 2007

Octavo said:
First off, Regex is supported by most good programming languages - Regex is not a technology specific to one language, getting it to ignore spaces is not at all a problem and, more to the point, danielk's solution does ignore spaces. In fact it ignores everything that does not fit the required pattern.

Sorry, I didn't make myself clear. I meant that a reasonable programming language ignores spaces in the patterns themselves. It makes the patterns much easier to read.

Which is to first look for an opening img tag "<img", then inside that tag to look for EITHER src=" or src=' and then to find the closing " or ' and return whatever was between the opening src=" and the closing "

I don't think it works correctly on this chunk of HTML:

src="foo'bar.jpg"

danielk said:
Hm? I'm not sure what you're getting at.

Sorry, the first example I gave probably isn't valid HTML. I believe the example just above is valid.

~~ Paul

danielk · Dec 9, 2007

Paul C. Anagnostopoulos said:
Sorry, I didn't make myself clear. I meant that a reasonable programming language ignores spaces in the patterns themselves. It makes the patterns much easier to read.

Yes, Perl has that option. No idea about C# though.
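
For what it's worth, .NET exposes the same idea as RegexOptions.IgnorePatternWhitespace, and Perl's version is the /x modifier. The sketch below shows the equivalent in Python with `re.VERBOSE`, purely as an illustration of how the earlier pattern reads once whitespace and comments are allowed inside it; the name `IMG_SRC_VERBOSE` is made up.

```python
import re

# The same pattern as before, laid out with whitespace and comments.
# In verbose mode, whitespace outside character classes is ignored and
# "#" starts a comment that runs to the end of the line.
IMG_SRC_VERBOSE = re.compile(r"""
    <img\b          # start of an img tag
    [^>]+           # anything before the attribute, staying inside the tag
    \bsrc=          # the src attribute
    ["']            # opening quote, either kind
    ([^"']+)        # group 1: the attribute value
    ["']            # closing quote
""", re.IGNORECASE | re.VERBOSE)

html = """<img id="img1" src="1.jpg"></img> <br> <img src='2.jpg' />"""
verbose_sources = IMG_SRC_VERBOSE.findall(html)
print(verbose_sources)
```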

Paul C. Anagnostopoulos said:
I don't think it works correctly on this chunk of HTML: src="foo'bar.jpg"

Indeed it doesn't. I actually pointed that out myself already, see above.

Paul C. Anagnostopoulos said:
Sorry, the first example I gave probably isn't valid HTML. I believe the example just above is valid.

Yeah, I think it's valid. I'm sure this particular problem can be dealt with, but I just wanted to come up with something reasonable quickly.

Paul C. Anagnostopoulos · Dec 9, 2007

danielk said:
Yeah, I think it's valid. I'm sure this particular problem can be dealt with, but I just wanted to come up with something reasonable quickly.

I think this would do the trick:

Code:

<img\b[^>]+\bsrc=(("[^"]+")|('[^']+'))

~~ Paul
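
Here is a sketch comparing this alternation with the back-reference idea danielk mentioned earlier. It is written in Python for illustration. The back-reference pattern is an assumption of what such a fix might look like and was not posted by anyone in the thread, and the names `ALTERNATION` and `BACKREF` are made up.

```python
import re

# Paul's alternation. Note that group 1 includes the surrounding quotes.
ALTERNATION = re.compile(
    r"""<img\b[^>]+\bsrc=(("[^"]+")|('[^']+'))""", re.IGNORECASE
)

# Back-reference variant (assumed, not from the thread): group 1 remembers
# which quote opened the value and \1 requires the same one to close it.
BACKREF = re.compile(r"""<img\b[^>]+\bsrc=(["'])(.*?)\1""", re.IGNORECASE)

html = '''<img id="a" src="foo'bar.jpg"> <img src='2.jpg' />'''

# Strip the first and last character to drop the quotes from group 1.
from_alternation = [m.group(1)[1:-1] for m in ALTERNATION.finditer(html)]

# Group 2 already holds the bare value.
from_backref = [m.group(2) for m in BACKREF.finditer(html)]

print(from_alternation)
print(from_backref)
```

Both patterns keep the apostrophe in the first value. The practical difference is that the alternation needs the quotes stripped afterwards, while the back-reference version puts the bare value in its own group.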
