Links grabber

Hungry Bulldozer

Moderator
Registered
12.01.2011
Posts
3 441
Thanks
834
Points
113
How to scrape links from www.bing.com




Jangoz

Client
Registered
14.01.2011
Posts
11
Thanks
2
Points
3
Awesome, thanks.
 

SPO

Client
Registered
04.02.2011
Posts
16
Thanks
0
Points
1
Another helpful video. Thanks again!
 

risaharada

Newbie
Registered
18.03.2011
Posts
3
Thanks
0
Points
0
Bing has now changed its HTML format. The link is no longer wrapped as %2F%2Furl%2F; it now looks like <A onmousedown="return si_T('&amp;ID=SERP,(a random number).1')" href="url" target=_blank>, so we could use this regular expression:

(?<=\<H3\>\<A onmousedown\=\"return si_T\(\'&amp;ID\=SERP\,\d+\.1\'\)\" href\=\"http:\/\/).*?(?=\" target\=_blank\>)
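For anyone testing this pattern outside ZennoPoster, here is a minimal Python sketch. Python's `re` module rejects variable-width lookbehinds (the `\d+` inside the lookbehind above), so this version uses a capture group instead. The sample HTML is made up to mirror the format described in this post and is not real Bing output; `extract_links` is a hypothetical helper name.

```python
import re

# Hypothetical sample mirroring the markup described above (not real Bing output).
SAMPLE_HTML = (
    '<H3><A onmousedown="return si_T(\'&amp;ID=SERP,5012.1\')" '
    'href="http://example.com/page" target=_blank>Example</A></H3>'
    '<H3><A onmousedown="return si_T(\'&amp;ID=SERP,5013.1\')" '
    'href="http://example.org/" target=_blank>Other</A></H3>'
)

# Same idea as the lookbehind version, but the URL is captured in group 1.
LINK_RE = re.compile(
    r"""<H3><A onmousedown="return si_T\('&amp;ID=SERP,\d+\.1'\)" """
    r"""href="http://(.*?)" target=_blank>"""
)

def extract_links(html):
    """Return the URLs (without the http:// prefix) found in result headings."""
    return LINK_RE.findall(html)

print(extract_links(SAMPLE_HTML))
```

As with the original expression, the match excludes the `http://` prefix, so it has to be added back if full URLs are needed.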
 

Hungry Bulldozer

Moderator
Registered
12.01.2011
Posts
3 441
Thanks
834
Points
113
Sure, it changes often. Just try different regular expressions until you get the results you need.
 

SeRf*X

Client
Registered
02.04.2011
Posts
35
Thanks
4
Points
8
I have a problem here. Whatever expression I put in the "before the required text there is always" and "this goes after the required text" fields, the processing results box is just blank. By the way, I installed the 3.4.5.255 beta Pro on a Vista machine.
 

jp1

Client
Registered
23.01.2011
Posts
234
Thanks
2
Points
0
Hi, thanks for the video, very helpful. Bing seems a lot easier to scrape than, say, Google, though. I took one link from the Google results as a starting point for the regexp. To scrape Google, I put this in 'before the required text there is always':

PHP:
<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')" href="
and after there is:


This gets me one link, the one I copied it directly from. Now, to get the other H3 class links, I am substituting the before text with this:

PHP:
<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','[a-z, A-Z,0-9]','','0CFwQFjAI')" href="
The idea is to start building a regexp that will get me the other links. I figured that it should at least still get me the link above. But it doesn't! That is extremely frustrating and confusing, seeing as all I've done is turn

AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A
into

[a-z, A-Z,0-9]
To have my regexp fail at such a basic level is just weird. Also, for the search word, if I change the tag from input:text to input:name or input:id, I can get it to recognise numbers and not only the letters we type into the search bar. The problem is that it doesn't recognise non-alphanumeric characters, no matter what HTML tag attribute values I use. So scraping for inurl:deportes/futbol is impossible.

And what makes it even weirder is that scraping for marca.com/deportes does work, which may mean it was a regexp issue 'in the center' all along. But I've tried both [\w\W]* and .*? with no luck.

So, very exasperating, and I would greatly appreciate a little push in the right direction.

(I used the PHP code tag here in the forum because the quote and HTML tags mess up the post for some reason.)
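One likely cause, sketched in Python rather than ZennoPoster (the engine differs, but the rule involved is the same): a character class such as `[a-z, A-Z,0-9]` matches exactly one character, so it cannot stand in for the whole token without a quantifier such as `+`. If the 'before' text is treated as a regular expression, the literal `(` and `)` around the `rwt` arguments also need escaping, which `re.escape` handles here. The `href` value is a hypothetical placeholder, and whether ZennoPoster interprets that field as a regex is an assumption.

```python
import re

# The anchor line quoted in the post; the href target is a made-up placeholder.
HTML = ("<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9',"
        "'AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')\" "
        "href=\"http://example.com/\">")

# Escape the fixed text so ( ) and other metacharacters are matched literally.
PREFIX = re.escape("<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9','")
SUFFIX = re.escape("','','0CFwQFjAI')\" href=\"")

# A bare character class matches ONE character, so it cannot span the token.
one_char = re.compile(PREFIX + r"[a-zA-Z0-9]" + SUFFIX)
# Adding + lets the class repeat over the whole token.
many_chars = re.compile(PREFIX + r"[a-zA-Z0-9]+" + SUFFIX)

print(one_char.search(HTML) is not None)
print(many_chars.search(HTML) is not None)
```

The trailing `'0CFwQFjAI'` argument is still hard-coded in this sketch, so if it differs between results it would need the same treatment before other links match.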
 

lazlink

Newbie
Registered
11.07.2011
Posts
11
Thanks
2
Points
0
I found something new. Thanks, guys.
 

roadhog

Client
Registered
29.07.2011
Posts
76
Thanks
4
Points
8
This works for me: (?<=\)\" href\=\").*?(?=\" )
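A quick way to see what this expression picks up is to run it on a small sample. The sketch below uses Python, where this lookbehind is accepted because it has a fixed width. The markup is invented for illustration and only imitates the anchors discussed in this thread.

```python
import re

# Hypothetical snippet shaped like the anchors discussed in this thread.
SAMPLE_HTML = (
    '<A onmousedown="return f(1)" href="http://example.com/a" target=_blank>A</A>'
    '<A onmousedown="return f(2)" href="http://example.com/b" target=_blank>B</A>'
    '<A href="http://example.com/no-handler" target=_blank>C</A>'
)

# Match whatever sits between `)" href="` and the next `" ` (quote plus space).
LINK_RE = re.compile(r'(?<=\)" href=").*?(?=" )')

links = LINK_RE.findall(SAMPLE_HTML)
print(links)
```

Only anchors whose `href` directly follows a closing `)"` are matched, so the third link, which has no `onmousedown` handler, is skipped.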
 

zennopower

Client
Registered
31.10.2011
Posts
6
Thanks
0
Points
1
Could somebody please post a tutorial for scraping data from Google? I'm also having problems scraping data from Google.

I'm using this regex:
(?<=\)\" href\=\")http.*?(?=\"\>)
but it also scrapes lots of unwanted information, including the Google web cache.
 

bartjan

Client
Registered
01.02.2011
Posts
29
Thanks
2
Points
3
Simply grab all the h3 elements first:

PHP:
<h3[\w\W]*?\<\/h3\>
Then look for hrefs in the above results:

PHP:
(?<=href\=\")[\w\W]*?(?=\")
 

albertt

Newbie
Registered
09.07.2011
Posts
2
Thanks
0
Points
0
Hi guys,

What about scraping results from deeper pages of Bing or Google?

For example, if I want to scrape all the results for the keyword "seo", do I need to manually go to every SERP in the template and record the action?

Thanks!
 

drvosjeca

Client
Registered
26.10.2011
Posts
512
Thanks
455
Points
63
In that case you just need to add a click on the next-page button and loop back to the part where the scraping starts.
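The loop described above can be sketched roughly like this in Python. It is not a ZennoPoster template: the browser is replaced by a dictionary of invented pages, and the `class="next"` marker, `FAKE_PAGES`, and `scrape_all_pages` are all assumptions made for the example.

```python
import re

# Stand-in for the browser: page number -> HTML. In a real template the
# "next page" click would load these; the pages here are invented.
FAKE_PAGES = {
    1: '<h3><a href="http://example.com/1">one</a></h3><a class="next">Next</a>',
    2: '<h3><a href="http://example.com/2">two</a></h3><a class="next">Next</a>',
    3: '<h3><a href="http://example.com/3">three</a></h3>',  # no next button
}

HREF_RE = re.compile(r'<h3><a href="(.*?)"')

def scrape_all_pages(pages, max_pages=10):
    """Scrape a page, then 'click next' and loop back until no next button is left."""
    links, page = [], 1
    while page in pages and page <= max_pages:
        html = pages[page]
        links.extend(HREF_RE.findall(html))   # the scraping step
        if 'class="next"' not in html:        # no next-page button: stop
            break
        page += 1                             # simulate clicking "next"
    return links

print(scrape_all_pages(FAKE_PAGES))
```

The `max_pages` cap is there so the loop ends even if the next button never disappears.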
 

flexfanatic

Client
Registered
03.11.2011
Posts
19
Thanks
3
Points
0
When scraping Bing results, how do I exclude paid ads? I only want the organic search results.
 

drvosjeca

Client
Registered
26.10.2011
Posts
512
Thanks
455
Points
63
You just need to do some magic on your regular expression. Ads always have some extra mark around them...
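One way to use such a mark, sketched in Python under heavy assumptions: split the page into result items first, then drop any item that carries the ad marker before extracting links. The `class="ad"` marker and the `<li>` layout are placeholders invented for this example. The real marker has to be found by inspecting the page source.

```python
import re

# Invented markup: the "ad" class is a placeholder for whatever marker the
# real page uses; inspect the page source to find the actual one.
SAMPLE_HTML = (
    '<li class="ad"><h3><a href="http://ads.example.com/">Sponsored</a></h3></li>'
    '<li class="result"><h3><a href="http://example.com/organic">Organic</a></h3></li>'
)

ITEM_RE = re.compile(r"<li[\w\W]*?</li>")   # one match per result item
HREF_RE = re.compile(r'href="(.*?)"')       # href value in group 1

def organic_links(html, ad_marker='class="ad"'):
    """Return hrefs from result items that do not contain the ad marker."""
    links = []
    for item in ITEM_RE.findall(html):
        if ad_marker in item:
            continue  # skip anything flagged as an ad
        links.extend(HREF_RE.findall(item))
    return links

print(organic_links(SAMPLE_HTML))
```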
 

flexfanatic

Client
Registered
03.11.2011
Posts
19
Thanks
3
Points
0
