Using Regular Expression To Extract All URLs

  • Thread starter: allsystems

allsystems

Hi Guys,

I am trying to extract all URLs on a web page. I have tried several regexes, but none of them seem to work. Is there a good regex to extract all URLs on a page?

Thanks
 
A regex can easily be created in ZennoPoster. Get the source code of the web page, copy it into the regex constructor, and specify the start and end of the text you want to match.
[Attachment: Create_regex.png]
 
The only issue with this is that I am visiting random URLs, so I don't know the format of each page. That is why I need a "super regex" that works in Zenno and will get me 99.9% of all URLs.
 
I use this one to capture URLs in the following formats:

Code:
http://domain.com/
https://domain.com/

http://www.domain.com/
https://www.domain.com/

My regex:

Code:
(?<=https?://(?:www\.)?)(?!www\.).*?(?=['/"]|</a>)

Maybe someone can improve it?
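
The lookbehind in the pattern above is variable-length, which the .NET regex engine accepts (assuming that is what ZennoPoster uses) but some other engines, such as Python's `re`, reject. As a rough sketch of the same idea outside ZennoPoster, a capturing group can stand in for the lookbehind. Python is used purely for illustration here, and the name `extract_domains` and the sample markup are hypothetical.

```python
import re
from typing import List

# Capture the host part that follows the scheme and an optional "www." prefix.
# The capturing group replaces the variable-length lookbehind.
DOMAIN_RE = re.compile(r"https?://(?:www\.)?([^'\"/<>\s]+)")

def extract_domains(html: str) -> List[str]:
    """Return the host part of every http(s) URL found in html."""
    # With exactly one group, findall returns the group contents.
    return DOMAIN_RE.findall(html)

sample = (
    '<a href="http://www.domain.com/page">x</a> '
    "<a href='https://example.co.uk/'>y</a>"
)
print(extract_domains(sample))  # ['domain.com', 'example.co.uk']
```

Stopping at a quote, slash, angle bracket, or whitespace plays the role of the original lookahead, so the match ends at the end of the host name.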
 
Code:
(?<=href=")http.*?\.com

The only issue is the domain extension: some sites use .co.uk, .mobi, etc., and this pattern only matches .com.
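
One possible way around the extension problem is to stop at the closing quote of the attribute instead of at a specific TLD. Below is a minimal sketch in Python, assuming the links are written as `href="..."` with double quotes. The name `extract_hrefs` and the sample markup are made up for illustration. Note that it returns the whole URL rather than only the part up to the domain.

```python
import re

# Match everything between href=" and the closing quote,
# so the domain extension no longer matters.
HREF_RE = re.compile(r'(?<=href=")https?://[^"]+')

def extract_hrefs(html):
    """Return every absolute http(s) URL found in a double-quoted href."""
    return HREF_RE.findall(html)

sample = (
    '<a href="http://site.co.uk/a?b=1">one</a>'
    '<a href="https://m.example.mobi/">two</a>'
    '<a href="/relative/path">three</a>'
)
print(extract_hrefs(sample))
# ['http://site.co.uk/a?b=1', 'https://m.example.mobi/']
```

Relative links such as `/relative/path` are skipped because the pattern requires an `http` or `https` scheme.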
 
Sure, try this:
Code:
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
You can also check the Russian section of the forum; there is a list of regexes for typical tasks.
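
To see how the general pattern above behaves, here is a small sketch that runs it over a piece of text. Python is used only for illustration (inside ZennoPoster the pattern would go into the regex action instead), and `extract_urls` and the sample string are hypothetical. The `&` in the character classes is written literally, on the assumption that any `&amp;` seen in the pattern is just HTML escaping.

```python
import re

# The general-purpose pattern from the thread, split over two lines.
URL_RE = re.compile(
    r"(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+"
    r"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
)

def extract_urls(text):
    """Return every full URL matched by URL_RE, in order of appearance."""
    # The pattern contains groups, so findall would return tuples of groups;
    # finditer with group(0) gives the whole match instead.
    return [m.group(0) for m in URL_RE.finditer(text)]

sample = (
    '<a href="https://www.example.co.uk/path?x=1&y=2">link</a> '
    "and ftp://files.example.org/pub."
)
print(extract_urls(sample))
# ['https://www.example.co.uk/path?x=1&y=2', 'ftp://files.example.org/pub']
```

The last character class excludes `.` and `,`, so a sentence-ending period after a URL is left out of the match.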
 