Hi, thanks for the video, very helpful. But Bing seems a lot easier to scrape than, say, Google. I took one example of a link I got from Google to start the regexp parsing. In order to scrape Google, I put this under 'before the required text there is always':
<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')" href="
and after there is:
This gets me one link, the one I copied the text from directly. Now, to get the other H3 class links, I'm starting to do things like this:
I am substituting the before text with this:
<H3 class=r><A class=l onmousedown="return rwt(this,'','','','9','[a-z, A-Z,0-9]','','0CFwQFjAI')" href="
The idea is to start building up a regexp that'll get me the other links. I figure that at the very least it should still get me the link above. But it doesn't! Which is extremely frustrating and confusing, seeing as all I've done is turn
AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A
into
[a-z, A-Z,0-9]
To have my regexp fail at such a basic level is just weird.
Also, for the word to search: if I change the tag from input:text to input:name or input:id, I can get it to recognise numbers as well as the letters we type into the search bar. The problem is that it doesn't recognise non-alphanumeric characters, no matter which HTML tag attribute values I use. So scraping for inurl:deportes/futbol is impossible.
And what makes it even weirder is that scraping for marca.com/deportes does work, which may mean it was a regexp issue 'in the center' all along. But I've tried both [\w\W]* and .*? with no luck.
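For reference, here is a minimal Python sketch of the substitution described above. It assumes ordinary regex semantics; Python's `re` is only a stand-in and may not behave like the scraper's own engine. Under that assumption, a bare character class such as `[a-zA-Z0-9]` matches exactly one character, so it cannot cover the whole token unless a quantifier like `+` follows it, and the literal `(` and `)` in the before text need escaping. The `href` value (`http://example.com/page`) and the helper name `build` are made up for illustration.

```python
import re

# Sample line built from the text in the post. The href target is a
# hypothetical placeholder, since the real URL is not shown.
html = (
    "<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9',"
    "'AFQjCNFEmxJOfk3p4bAvUfQgvKvlQOwb7A','','0CFwQFjAI')\" "
    "href=\"http://example.com/page\">"
)

# Fixed text on either side of the token that varies between links.
prefix = "<H3 class=r><A class=l onmousedown=\"return rwt(this,'','','','9','"
suffix = "','','0CFwQFjAI')\" href=\""

def build(token_pattern):
    # re.escape keeps ( ) and other metacharacters in the fixed text literal.
    # The final group captures everything up to the closing quote of href.
    return re.compile(
        re.escape(prefix) + token_pattern + re.escape(suffix) + r'([^"]*)"'
    )

# A character class with no quantifier matches exactly ONE character.
# (Commas and spaces inside [...] are literal members, not separators.)
one_char = build(r"[a-zA-Z0-9]")

# Adding + lets the class match one or more characters, i.e. the whole token.
many_chars = build(r"[a-zA-Z0-9]+")

no_match = one_char.search(html)
match = many_chars.search(html)

print(no_match)        # None: the single-character class cannot span the token
print(match.group(1))  # the captured href value
```

If the tool treats the before text as a plain string rather than a regexp, none of this applies, which would also explain why the untouched literal version matched.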
So, very exasperating, and I would greatly appreciate a little push in the right direction.
(I used the PHP code quote here in the forum because, for some reason, the quote and HTML tags screw up the post if I use them.)