Using Regular Expression To Extract All URLs

allsystems · 30.10.2015

Hi Guys,

I am trying to extract all URLs on a webpage. I have tried several regexes, but none of them seem to work. Is there a good regex for extracting all the URLs on a page?

Thanks

VladZen · 30.10.2015

A regex can easily be created in ZennoPoster. Get the source code of the web page, copy it into the regex constructor, and specify the start and end of the text you want.

allsystems · 30.10.2015

The only issue with this is that I am visiting random URLs, so I don't know the format of each page. That is why I need a super regex that works in Zenno and will get me 99.9% of all URLs.

lokiys · 30.10.2015

Extract all hrefs and you will be fine. You can also look at XPath for that, not only regex, and see what works best for you.
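To illustrate the "extract all hrefs" idea outside ZennoPoster, here is a minimal sketch in Python that uses the standard library HTML parser instead of a regex. The names `HrefCollector` and `extract_hrefs` are made up for this example, and it only looks at `<a>` tags.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return all non-empty href values found in <a> tags, in page order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = '<a href="http://domain.com/">one</a> <a href="/about">two</a> <a name="x">no link</a>'
print(extract_hrefs(sample))
```

Because the parser does not care how each page is formatted, this kind of approach tends to cope better with random pages than a single pattern tuned to one site.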

VladZen · 30.10.2015

allsystems wrote:
The only issue with this is that I am visiting random URLs, so I don't know the format of each page. That is why I need a super regex that works in Zenno and will get me 99.9% of all URLs.

The regex in my screenshot is universal. It parses all hrefs starting with http://

lokiys · 30.10.2015

VladZ wrote:
The regex in my screenshot is universal. It parses all hrefs starting with http://

In most cases, links in the page source do not start with http://
Removing http:// from your regex will most probably work better...
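A rough sketch of that point in Python: match every quoted href regardless of scheme, then resolve relative links against the address of the page they came from. `absolute_links` and `page_url` are hypothetical names for this example, and the list of skipped prefixes is just an assumption about what you would not want to keep.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or href='...' with any content; group 2 is the link itself.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def absolute_links(html, page_url):
    """Return every href in html as an absolute URL, resolved against page_url."""
    links = []
    for match in HREF_RE.finditer(html):
        link = match.group(2).strip()
        # Skip empty values, in-page anchors, and non-web schemes.
        if not link or link.startswith(("#", "javascript:", "mailto:")):
            continue
        links.append(urljoin(page_url, link))
    return links


html = '<a href="/about">a</a> <a href="contact.html">b</a> <a href="https://other.com/">c</a>'
for url in absolute_links(html, "http://domain.com/dir/index.html"):
    print(url)
```

Links that already carry a scheme pass through `urljoin` unchanged, so the same call handles both absolute and relative hrefs.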

shabbysquire · 10.11.2015

I use this one to capture URLs in these formats:

Code:

http://domain.com/
https://domain.com/

http://www.domain.com/
https://www.domain.com/

My regex:

Code:

(?<=https?://(?:www\.)?)(?!www\.).*?(?=['/"]|</a>)

Maybe someone can improve it?
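For anyone testing this pattern outside ZennoPoster: it relies on a variable-length lookbehind, which Python's built-in `re` module rejects. A possible equivalent, sketched here as an assumption rather than a drop-in replacement, moves the scheme and optional `www.` out of the lookbehind and captures the domain in a group. `extract_domains` is a made-up name.

```python
import re

# Same idea as the lookbehind version, but the prefix is matched normally
# and the domain is returned through capture group 1.
DOMAIN_RE = re.compile(r"""https?://(?:www\.)?(?!www\.)(.*?)(?=['/"]|</a>)""")


def extract_domains(html):
    """Return the domain part of each http(s) URL, without scheme or www."""
    return DOMAIN_RE.findall(html)


sample = """
<a href="http://domain.com/">a</a>
<a href="https://domain.com/">b</a>
<a href="http://www.domain.com/">c</a>
<a href='https://www.domain.com/page'>d</a>
"""
print(extract_domains(sample))
```

Note that the match stops at the first slash or quote, so it returns the domain only and not the full URL.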

VladZen · 11.11.2015

shabbysquire wrote:
My regex:

Code:

(?<=https?://(?:www\.)?)(?!www\.).*?(?=['/"]|</a>)

Maybe someone can improve it?

Try this -

Code:

(?<=href=")http.*?\.com

shabbysquire · 11.11.2015

Code:

(?<=href=")http.*?\.com

The only issue is the domain extension; some are .co.uk, .mobi, etc.

VladZen · 11.11.2015

shabbysquire wrote:
Code:

(?<=href=")http.*?\.com

The only issue is the domain extension; some are .co.uk, .mobi, etc.

http.*?\.\w+(?=/)
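A quick way to check how this pattern treats other extensions is to run it against a few sample hrefs. The sketch below uses Python's `re` as a stand-in for the engine in ZennoPoster, which is an assumption; the sample strings and the `first_match` helper are made up.

```python
import re

# The pattern from the post: lazy match up to a ".word" that is followed by "/".
HOST_RE = re.compile(r"http.*?\.\w+(?=/)")


def first_match(text):
    """Return the first match of the pattern in text, or None if there is none."""
    match = HOST_RE.search(text)
    return match.group(0) if match else None


samples = [
    'href="http://domain.com/"',
    'href="https://www.domain.co.uk/page"',
    'href="http://site.mobi/"',
    'href="http://site.mobi"',  # no slash after the host
]
for text in samples:
    print(text, "->", first_match(text))
```

The first three samples give the scheme plus host, whatever the extension. The last one gives no match, because the lookahead requires a slash right after the extension.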

CSS · 11.11.2015

shabbysquire wrote:
I use this one to capture URLs in these formats:

Code:

http://domain.com/
https://domain.com/
http://www.domain.com/
https://www.domain.com/

My regex:

Code:

(?<=https?://(?:www\.)?)(?!www\.).*?(?=['/"]|</a>)

Maybe someone can improve it?

Sure, try this:

Code:

(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

You can also check the Russian section of the forum; there is a list of regexes for typical tasks there.
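Here is a small sketch for trying the full-URL pattern above in Python. It assumes the `&amp;` in the posted pattern stands for a plain `&`, and it collects whole matches with `finditer`, because `findall` would return the three capture groups instead. `find_urls` is a made-up helper name.

```python
import re

# The pattern from the post, split across two lines for readability.
URL_RE = re.compile(
    r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+"
    r"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
)


def find_urls(text):
    """Return every full URL matched by the pattern, in order of appearance."""
    return [match.group(0) for match in URL_RE.finditer(text)]


sample = (
    'See <a href="https://www.domain.co.uk/path/page.html?id=1&x=2">link</a> '
    "and ftp://files.domain.mobi/pub, or /relative/only."
)
for url in find_urls(sample):
    print(url)
```

This pattern only finds absolute URLs with an explicit scheme, so relative links such as `/relative/only` are skipped. The trailing comma after the ftp link is left out because the last character class does not allow it.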
