Scrape from sitemap.xml

SeRf*X

Client
Registered
02.04.2011
Posts
35
Thanks
4
Points
8
Scraping my own sitemap.xml is harder than I thought!

I start by recording a webpage "go to URL" action for http://www.mydomain.com/sitemap.xml
Then I select "page text", select "DOM html", and copy the result to the "macros builder". But when I use a regular expression under "Processing results", I always get a blank screen with no output whatsoever.

An example of the DOM source text:
<TD><A href="http://www.mydomain.com/some-text-here/">http://www.mydomain.com/some-text-here/</A> </TD>


For "before the required text there is always" I put: <A href="http://

For "this goes after the required text" I put: //

For "the required text always starts with" I put: http://

For "the required text ends with" I put: /

For "in the center" I tried ticking each button in turn.

I am only trying to scrape out this text: www.mydomain.com/some-text-here/

What did I do wrong? Could anyone point it out, please! :confused:
 

bigcajones

Client
Registered
09.02.2011
Posts
1 216
Thanks
683
Points
113
Try this one on for size:

(?<=\"\>)http:\/\/.*?(?=\<\/A\>)
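As a quick check outside the macros builder, the expression can be tried against the sample row from the first post. This is only a sketch using Python's `re` module, which is an assumption here: the builder's own regex engine may behave differently. The `html` string is the example line from the question, not real sitemap data.

```python
import re

# Sample row from the first post (placeholder domain, not real data).
html = ('<TD><A href="http://www.mydomain.com/some-text-here/">'
        'http://www.mydomain.com/some-text-here/</A> </TD>')

# The expression from the post, unchanged:
#   (?<=\"\>)   lookbehind: the match must be preceded by ">
#   http:\/\/   the match itself starts with http://
#   .*?         as few characters as possible ...
#   (?=\<\/A\>) ... up to, but not including, </A>
pattern = re.compile(r'(?<=\"\>)http:\/\/.*?(?=\<\/A\>)')

urls = pattern.findall(html)
print(urls)
```

Only the link text is matched, because the `href` value is preceded by `"` alone rather than by `">`.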
 

SeRf*X

bigcajones said:
Try this one on for size:

(?<=\"\>)http:\/\/.*?(?=\<\/A\>)

Hi bigcajones, thanks for helping me. Your regex gives me the correct result:
(http://www.mydomain.com/)

My version, which is the same but without the ")", is: (?<=\"\>http:\/\/.*?(?=\<\/A\>)

With it I get this result: (www.mydomain.com/)

The one thing I do not understand is how you got the extra right bracket, which I marked in red. However I manipulate the builder, I only manage to come close to yours, without the bracket just before http...
 

bigcajones

Well, I don't know. Here's another. I scraped http://www.domain.com.sitemap.xml and here's the regular expression I parsed it with:

(?<=\>http:\/\/).*?(?=\<)

This gave me the required www.domain.com/hosting/...etc.
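One way to read the difference between the two expressions in this thread: the closing `)` ends the lookbehind group, so anything placed before it must be present but is left out of the result. The sketch below compares both placements on the sample row from the first post. It uses Python's `re` module as a stand-in, which is an assumption; the macros builder's engine may differ in details.

```python
import re

# Sample row from the first post (placeholder domain, not real data).
html = ('<TD><A href="http://www.mydomain.com/some-text-here/">'
        'http://www.mydomain.com/some-text-here/</A> </TD>')

# Lookbehind closes BEFORE "http://": the scheme is part of the match.
with_scheme = re.findall(r'(?<=\"\>)http:\/\/.*?(?=\<\/A\>)', html)

# Lookbehind closes AFTER "http://": the scheme is required but not returned.
without_scheme = re.findall(r'(?<=\>http:\/\/).*?(?=\<)', html)

print(with_scheme)
print(without_scheme)
```

Moving the `)` to either side of `http:\/\/` is therefore enough to switch between a result with the scheme and one without it.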

The ) that I had in the first example just happened to be on my sitemap that I tried the expression on, which was a Wordpress blog and a sitemap that was created with a plugin. As long as you get the required results, it doesn't matter how you get it. Sometimes you just have to play around with the regex builder to get what you want.
 
