Scraping links from multiple URLs when the domain name is not in the source code

bekk1n

Client
Registered
26.09.2011
Posts
6
Thanks
0
Points
0
Hey peeps, I am posting to one type of CMS and I want to collect the URLs of the pages that have been posted and store them in a text file. I've seen the videos of the link scraper for Google and Bing, so I get the idea. But when I look at the source of the links on these blog pages in the "macros builder" (after adding it from the "Page Test"), the links look like this: <A href="/folder1/folder2/Post-content-here">Page link here</A>

I am not sure how to identify the full URL of those pages, as I obviously need the www.whateverdomain.com part as well.

any ideas?

thanks peeps
 

Hungry Bulldozer

Moderator
Registered
12.01.2011
Posts
3 441
Thanks
834
Points
113
If all the links look like your example, you can use this regexp to get them:
(?<=\<A href\=\"\/folder1\/folder2\/Post-content-here\"\>).*(?=\<\/A\>)
 

bekk1n

Thanks for the help.

The problem is that it's different for almost all of them. For example:
<A href="/someotherfolder/anotherfolder/Post-content-there">Page link here</A>
<A href="/someotherfolder/aotherfolder-/folder2/Post-here">link here</A>
<A href="/afolder/folderhere44/somecontent/somelinkpagefolder/anotherone/morehere">linkpagehere</A>
<A href="/afolder4421/somefolder31/morepostcontentpage">Page link 2</A>
<A href="/folder1/folder2/Post-content-here">Page link here</A>
 

bekk1n

The best way to explain it as a whole, I guess:

I'm posting to a list of domains.
Each domain's URL structure is different.
On each page there are links to other pages on the same domain.
so
Will have
So when looking at the source code in the macros builder on the page:
you see
<A href="/folder1/folder2/Post-content-here/thepostcontentpage1">Page link here</A>
<A href="/folder1/folder2/Post-content-here/thecontentpage2">Page link 2</A>
<A href="/folder1/folder2/Post-content-here/contentpage3">linkpagehere</A>
<A href="/folder1/folder2/Post-content-here/postcontentpage4">link here</A>

So I am not sure how to get a list of the full link(s) from each page
 

Hungry Bulldozer

(?<=\<A href\=\").*(?=\"\>) try this one
 

bekk1n

Unfortunately, that only gives me the link like this: /folder1/folder2/Post-content-here/thepostcontentpage1">linkhere</A>

It doesn't provide the full domain name.

I'd assume scraping is the only way to collect these links?
 

bigcajones

Client
Registered
09.02.2011
Posts
1 216
Thanks
683
Points
113
Try this. Get the page URL (Get=>Webrowser=>Get page URL).
Parse it with the regular expression .*\/ which will give you the root domain, e.g. http://www.thisdomain.com. Then, when you save your file, take the result of that branch and the result of your parsing branch and put them together:
{-FieldData.FieldData-|-Blah-|-BlahBlah-}{-FieldData.FieldData-|-Next Blah-|-OtherBlahBlahBlah-}
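The same "page URL plus parsed link" idea can be sketched in Python. This is only an illustration under assumptions: the `page_url` and `hrefs` values are invented, and `to_absolute` is a hypothetical helper name. Instead of concatenating strings, it uses the standard library's `urljoin`. One reason: a greedy `.*\/` applied to a deeper page URL keeps everything up to the last slash, not just the root, and gluing a root-relative href onto that would duplicate part of the path.

```python
from urllib.parse import urljoin


def to_absolute(page_url, hrefs):
    """Resolve each relative href against the URL of the page it came from."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical inputs: the page URL reported by the browser, plus the
# relative links parsed out of that page's source.
page_url = "http://www.thisdomain.com/folder1/folder2/Post-content-here"
hrefs = [
    "/folder1/folder2/Post-content-here/thepostcontentpage1",
    "/afolder4421/somefolder31/morepostcontentpage",
]

# One full URL per line, ready to be saved to a text file.
for link in to_absolute(page_url, hrefs):
    print(link)
```

Because the hrefs start with a slash, `urljoin` keeps only the scheme and host from the page URL and replaces the path, so it does not matter how deep the page itself sits.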
 
