Links grabber

Hungry Bulldozer · 16.02.2011

How to scrape links from www.bing.com

Jangoz · 02.03.2011

awesome, thanks

SPO · 02.03.2011

Another helpful video. Thanks again!

risaharada · 22.03.2011

Bing has changed its HTML format, from %2F%2Furl%2F to <A onmousedown="return si_T('&ID=SERP,(a random number).1')" href="url" target=_blank>, so we can now use this regular expression:

(?<=\<H3\>\<A onmousedown\=\"return si_T\(\'&ID\=SERP\,\d+\.1\'\)\" href\=\"http:\/\/).*?(?=\" target\=_blank\>)

Hungry Bulldozer · 22.03.2011

Sure, it changes often. Just try different regular expressions until you get the results you need.
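
For anyone testing the Bing expression above outside the program: some engines (.NET, for example) accept a variable-width lookbehind such as the \d+ part, but Python's re module does not. A minimal sketch of the same extraction with a capture group instead, assuming the markup looks exactly as described in the post (the sample HTML and the number 5084 are made up):

```python
import re

# Hypothetical snippet following the markup described above; the SERP id
# (5084 here) is a made-up stand-in for the random number.
html = (
    '<H3><A onmousedown="return si_T(\'&ID=SERP,5084.1\')" '
    'href="http://example.com/page" target=_blank>Example</A></H3>'
)

# Python's re module rejects variable-width lookbehind (the \d+ part),
# so the prefix is matched normally and the URL is taken from group 1.
pattern = re.compile(
    r'<H3><A onmousedown="return si_T\(\'&ID=SERP,\d+\.1\'\)" '
    r'href="http://(.*?)" target=_blank>'
)

links = pattern.findall(html)
print(links)  # ['example.com/page']
```

As in the original expression, the http:// prefix is left out of the captured text.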

SeRf*X · 06.04.2011

I have a problem here. Whatever expressions I put in the "before the required text there is always" and "this goes after the required text" fields, the processing results box stays blank. By the way, I installed the 3.4.5.255 beta Pro on a Vista machine.

jp1 · 19.04.2011

Hi, thanks for the video, very helpful. Bing seems a lot easier to scrape than, say, Google, though. I took one link from the Google results as a starting point for regexp parsing. To scrape Google, I put this in 'before the required text there is always':

PHP:

<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')" href="

and after there is:

''>

This gets me one link, the one I copied the text from directly. To get the other H3 class links, I am substituting the before text with this:

PHP:

<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','[a-z, A-Z,0-9]','','0CFwQFjAI')" href="

This is meant as a first step toward a regexp that will get me the other links. I figured that it should at least still get me the link above. But it doesn't! This is extremely frustrating and confusing, seeing as all I've done is turn

AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A

into

[a-z, A-Z,0-9]

For my regexp to fail at such a basic level is just weird. Also, for the search word, if I change the tag from input:text to input:name or input:id, I can get it to recognise numbers and not only the letters we write into the search bar. The problem is that it doesn't recognise non-alphanumeric characters, no matter what HTML tag attribute values I use. So scraping for inurl:deportes/futbol is impossible.

And what makes it even weirder is that scraping for marca.com/deportes does work, which may mean it was a regexp issue 'in the center' all along. I've tried both [\w\W]* and .*?, but no luck.

So, very exasperating, and I would greatly appreciate a little push in the right direction.

(I used the PHP code tag here in the forum because the quote and HTML tags break the post for some reason.)
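
One likely cause of the failure described above: a character class such as [a-z, A-Z,0-9] matches exactly one character unless it is followed by a quantifier, while the token it replaces is 34 characters long. A minimal Python sketch of the difference, using only a fragment of the copied attribute (whether the program treats the field the same way is an assumption):

```python
import re

# The onmousedown attribute copied from the post above.
sample = (
    "<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9',"
    "'AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')\" href=\""
)

# One character only: [a-z, A-Z,0-9] without a quantifier.
single = re.compile(r"'9','[a-z, A-Z,0-9]',''")
# One or more characters: the same idea with + and a tidier class.
repeated = re.compile(r"'9','[A-Za-z0-9]+',''")

print(bool(single.search(sample)))    # False
print(bool(repeated.search(sample)))  # True
```

If the whole before text is interpreted as a regular expression, the literal parentheses in rwt(this,...) would also need escaping as \( and \). Tokens on other links may contain characters such as - or _, in which case a class like [\w-]+ would be a safer guess.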

lazlink · 11.07.2011

I found something new. Thanks, guys.

roadhog · 01.09.2011

This works for me - (?<=\)\" href\=\").*?(?=\" )

zennopower · 12.11.2011

Could somebody please post a tutorial for scraping data from Google? I'm also having problems scraping data from Google.

I'm using this regex:

(?<=\)\" href\=\")http.*?(?=\"\>)

but it also scrapes lots of unwanted information, including Google web cache links.

bartjan · 14.11.2011

Simply grab all the h3 elements first:

PHP:

<h3[\w\W]*?\<\/h3\>

Then look for hrefs in the above results:

PHP:

(?<=href\=\")[\w\W]*?(?=\")
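
The two steps above can be sketched in Python as follows. The sample HTML is hypothetical, made up only to show that a link outside the h3 blocks (such as a cache link) is skipped:

```python
import re

# Hypothetical results page: two result headings plus an unrelated link.
html = """
<a href="http://example.com/cache">Cached</a>
<h3 class="r"><a href="http://example.com/one">One</a></h3>
<h3 class="r"><a href="http://example.org/two">Two</a></h3>
"""

# Step 1: grab every <h3>...</h3> block.
h3_blocks = re.findall(r"<h3[\w\W]*?</h3>", html)

# Step 2: pull the href value out of each block.
links = []
for block in h3_blocks:
    links.extend(re.findall(r'(?<=href=")[\w\W]*?(?=")', block))

print(links)  # ['http://example.com/one', 'http://example.org/two']
```

If the page uses uppercase tags, as in the earlier posts, passing re.IGNORECASE to both calls should cover that.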

albertt · 14.12.2011

Hi guys,

What about scraping results from deeper pages of Bing or Google?

For example, if I want to scrape all the results for the keyword "seo", do I need to manually go to every SERP in the template and record the action?

Thanks!

drvosjeca · 14.12.2011

In that case you just need to add a click on the next-page button and loop back to the step where scraping starts.
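
The loop described above might look roughly like this in Python. Everything here is a stand-in: FAKE_PAGES replaces the real browser, and the class="next" link is a hypothetical marker for the next-page button, not what Bing or Google actually emit:

```python
import re

# Hypothetical stand-in for the browser: maps a page URL to its HTML.
FAKE_PAGES = {
    "/search?q=seo&page=1": '<h3><a href="http://a.example/">A</a></h3>'
                            '<a class="next" href="/search?q=seo&page=2">Next</a>',
    "/search?q=seo&page=2": '<h3><a href="http://b.example/">B</a></h3>',
}

def scrape_all(start_url, fetch, max_pages=10):
    """Collect result links page by page until there is no next link."""
    links, url = [], start_url
    for _ in range(max_pages):          # hard cap so the loop always ends
        html = fetch(url)
        for block in re.findall(r"<h3[\w\W]*?</h3>", html):
            links.extend(re.findall(r'(?<=href=")[^"]*', block))
        nxt = re.search(r'<a class="next" href="([^"]*)"', html)
        if nxt is None:                 # no next button: last page reached
            break
        url = nxt.group(1)              # loop back to the scraping step
    return links

print(scrape_all("/search?q=seo&page=1", FAKE_PAGES.__getitem__))
# ['http://a.example/', 'http://b.example/']
```

The max_pages cap is a design choice so that a page whose next link never disappears cannot keep the loop running forever.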

flexfanatic · 22.12.2011

When scraping Bing results, how do I exclude paid ads? (I only want the organic search results.)

drvosjeca · 22.12.2011

You just need to do some magic in your regular expression; ads always have some extra markup around them.

flexfanatic · 22.12.2011

drvosjeca wrote:
You just need to do some magic in your regular expression; ads always have some extra markup around them.

Currently this regular expression is working:

(?<=u\=\"http%3A%2F%2F).*?(?=%2F)
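
To show what that expression actually captures, here is a small Python sketch on a made-up attribute in the same percent-encoded form. It returns only the host name; it does not by itself demonstrate that ads are excluded. The second pattern, which captures and decodes the full URL, is an addition for illustration:

```python
import re
from urllib.parse import unquote

# Hypothetical attribute in the percent-encoded form the pattern expects.
html = 'u="http%3A%2F%2Fexample.com%2Fsome%2Fpage"'

# Python version of the pattern above: the text between the encoded
# "http://" and the next encoded "/" (the host name).
hosts = re.findall(r'(?<=u="http%3A%2F%2F).*?(?=%2F)', html)
print(hosts)  # ['example.com']

# For the full URL, capture up to the closing quote and decode it.
full = [unquote(u) for u in re.findall(r'u="(http%3A%2F%2F[^"]*)"', html)]
print(full)   # ['http://example.com/some/page']
```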
