Links grabber

Hungry Bulldozer

Moderator
How to scrape links from www.bing.com



 
awesome, thanks
 
Another helpful video. Thanks again!
 
Bing has now changed its HTML format from %2F%2Furl%2F to <A onmousedown="return si_T('&amp;ID=SERP,(a random number).1')" href="url" target=_blank>, so we could use this regular expression: "

(?<=\<H3\>\<A onmousedown\=\"return si_T\(\'&amp;ID\=SERP\,\d+\.1\'\)\" href\=\"http:\/\/).*?(?=\" target\=_blank\>)

"
 
Sure, it changes often. You may just need to try different regular expressions until you get the results you need.
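For anyone who wants to try the Bing expression above outside ZennoPoster, here is a minimal Python sketch. Two assumptions: the sample HTML is made up to mirror the markup described in the post (it is not real Bing output), and the lookbehind is swapped for a capture group, because Python's `re` module rejects the variable-width `\d+` inside a lookbehind.

```python
import re

# Hypothetical sample mirroring the markup described above (not real Bing output).
SAMPLE_HTML = (
    '<H3><A onmousedown="return si_T(\'&amp;ID=SERP,5012.1\')" '
    'href="http://example.com/page" target=_blank>Example</A></H3>'
    '<H3><A onmousedown="return si_T(\'&amp;ID=SERP,5013.1\')" '
    'href="http://example.org/" target=_blank>Other</A></H3>'
)

# Same anchors as the posted expression, but the wanted text is a capture group
# instead of sitting between a lookbehind and a lookahead.
BING_LINK = re.compile(
    r'<H3><A onmousedown="return si_T\(\'&amp;ID=SERP,\d+\.1\'\)" '
    r'href="http://(.*?)" target=_blank>'
)


def extract_bing_links(html):
    """Return the part after http:// for every matching result link."""
    return BING_LINK.findall(html)


links = extract_bing_links(SAMPLE_HTML)
print(links)
```

Like the original expression, this returns the link without the leading `http://`, because that prefix is part of the anchor text.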
 
I have a problem here: whatever expression I put in the "before the required text there is always" and "this goes after the required text" fields, the processing results box is just blank. By the way, I installed the 3.4.5.255 beta Pro on a Vista machine.
 
Hi, thanks for the video, very helpful. But Bing seems a lot easier to scrape than, say, Google. I took one example of a link I got in Google to start the regexp parsing. In order to scrape Google I put this in "before the required text there is always":

PHP:
<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')" href="

and after there is:



This gets me one link, the one I copied the text from directly. To get the other H3 class=r links, I am substituting the before text with this:

PHP:
<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','[a-z, A-Z,0-9]','','0CFwQFjAI')" href="

This is meant as a first step toward a regexp that will get me the other links. I figured it should at least still get me the link above, but it doesn't! That is extremely frustrating and confusing, seeing as all I've done is turn

AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A
into

[a-z, A-Z,0-9]
To have my regexp fail at such a basic level is just weird. Also, for the search word, if I change the tag from input:text to input:name or input:id, I can get it to recognise numbers and not only the letters we type into the search bar. The problem is that it doesn't recognise non-alphanumeric characters, no matter what HTML tag attribute values I use. So scraping for inurl:deportes/futbol is impossible.

And what makes it even weirder is that scraping for marca.com/deportes does work, which may mean it was a regexp issue 'in the center' all along. But I've tried both the [\w\W]* and the .*? but no luck.

So, very exasperating, and I would greatly appreciate a little push in the right direction.

(I used the PHP code quote here in the forum because the quote and HTML tags screw up the post for some reason.)
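A possible explanation, sketched in Python and assuming the "before" field is interpreted as a regular expression: a character class such as `[a-z, A-Z,0-9]` without a quantifier stands for exactly one character, and the unescaped parentheses in `rwt(this,...)` are read as a group rather than as literal brackets. The repaired pattern below is only one way to write it, and the `0CFwQFjAI` value is kept literal even though it may differ between results.

```python
import re

# The tag copied in the post above, used here as test input.
TAG = ("<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9',"
       "'AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')\" href=\"")

# As posted: unescaped parentheses form a group, and the class has no quantifier.
AS_POSTED = ("<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9',"
             "'[a-z, A-Z,0-9]','','0CFwQFjAI')\" href=\"")

# Repaired: the literal parts are escaped and the class gets a + quantifier
# so that it can cover the whole token.
REPAIRED = (
    re.escape("<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9','")
    + r"[A-Za-z0-9]+"
    + re.escape("','','0CFwQFjAI')\" href=\"")
)

posted_hit = re.search(AS_POSTED, TAG)      # no match
repaired_hit = re.search(REPAIRED, TAG)     # matches the whole tag
print(posted_hit, repaired_hit is not None)
```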
 
I found something new. Thanks, guys.
 
This works for me - (?<=\)\" href\=\").*?(?=\" )
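A quick way to check this expression against a saved page is a few lines of Python. The sample tag is invented for illustration; the expression itself is used unchanged, which works because its lookbehind has a fixed width.

```python
import re

# Hypothetical tag in the style of the Google markup quoted earlier in the thread.
SAMPLE = ('<A class=l onmousedown="return rwt(this,\'\',\'\')" '
          'href="http://example.com/" target=_blank>Example</A>')

# The expression from the post, unchanged.
PATTERN = re.compile(r'(?<=\)\" href\=\").*?(?=\" )')

found = PATTERN.findall(SAMPLE)
print(found)
```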
 
Could somebody please post a tutorial for scraping data from Google? I'm also having problems scraping it.

I'm using this regex:
(?<=\)\" href\=\")http.*?(?=\"\>)
but it also scrapes lots of unwanted information, including Google web cache links.
 
Simply grab all the h3 elements first:

PHP:
<h3[\w\W]*?\<\/h3\>

Then look for hrefs in the above results:

PHP:
(?<=href\=\")[\w\W]*?(?=\")
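The two steps above can be sketched in Python like this. The sample HTML is hypothetical (including the cache link that sits outside any h3), and the case-insensitive flag is an addition so that both `<H3>` and `<h3>` are caught.

```python
import re

# Hypothetical SERP fragment: two result headings and a cache link outside any h3.
SAMPLE_HTML = """
<h3 class="r"><a href="http://example.com/a">First</a></h3>
<span><a href="http://webcache.example/search?q=cache:xyz">Cached</a></span>
<h3 class="r"><a href="http://example.org/b">Second</a></h3>
"""

# Step 1: every h3 element, matched lazily so each block ends at its own </h3>.
H3_BLOCK = re.compile(r"<h3[\w\W]*?</h3>", re.IGNORECASE)
# Step 2: the href value inside one such block.
HREF_VALUE = re.compile(r'(?<=href=")[\w\W]*?(?=")', re.IGNORECASE)


def links_in_headings(html):
    """Collect href values only from within h3 blocks."""
    links = []
    for block in H3_BLOCK.findall(html):
        links.extend(HREF_VALUE.findall(block))
    return links


result = links_in_headings(SAMPLE_HTML)
print(result)
```

Because the second expression only ever sees the h3 blocks, links elsewhere on the page (such as the cache link in the sample) are never considered.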
 
Hi guys,

What about scraping results from deeper pages of Bing or Google?

For example, if I want to scrape all the results for the keyword "seo", do I need to manually go to every SERP in the template and record the action?

Thanks!
 
In that case you just need to add a click on the next-page button and loop back to the part where the scraping starts.
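Outside of a recorded template, the same loop could look roughly like the Python sketch below. Everything specific here is an assumption for illustration: the in-memory `PAGES` dictionary stands in for real result pages, the `class="next"` marker is invented, and `fetch` is a placeholder for whatever actually loads a page.

```python
import re

# Hypothetical in-memory "site" standing in for real result pages.
PAGES = {
    "/search?page=1": '<h3><a href="http://a.example/">A</a></h3>'
                      '<a class="next" href="/search?page=2">Next</a>',
    "/search?page=2": '<h3><a href="http://b.example/">B</a></h3>'
                      '<a class="next" href="/search?page=3">Next</a>',
    "/search?page=3": '<h3><a href="http://c.example/">C</a></h3>',
}

RESULT_LINK = re.compile(r'<h3><a href="(.*?)"')
NEXT_LINK = re.compile(r'<a class="next" href="(.*?)"')  # invented marker


def scrape_all(start_url, fetch, max_pages=10):
    """Scrape a page, follow the next-page link, and repeat until none is left."""
    links, url = [], start_url
    for _ in range(max_pages):          # hard cap so the loop always terminates
        html = fetch(url)
        links.extend(RESULT_LINK.findall(html))
        next_match = NEXT_LINK.search(html)
        if next_match is None:          # last page: no next button
            break
        url = next_match.group(1)
    return links


all_links = scrape_all("/search?page=1", PAGES.__getitem__)
print(all_links)
```

The `max_pages` cap plays the role of a loop counter in the template, so a page that always offers a next button cannot keep the scrape running forever.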
 
You just need to do some magic with your regular expression; ads always have some extra markup around them...
 
