How to scrape images?

  • Thread starter: jp1

Does anyone have an example template of how to do this?

The problem with things like Facebook profiles or Google Images is that the image URLs seem to be hidden in something known as body onloads, and that's JavaScript I'm not familiar with. The alternative for a beginner like me would be to click through all the links to get to the JPG, but I would really love to learn how to go the short way and scrape directly off the Facebook or Google page, what-I-see-is-what-I-get style.

Helping hand, anyone? thanks :-)
 
If you do not know the image path and you can't download it, try this:
Right-click the image.
Select 'This is a captcha'.
Select the recognition module CaptchaSaver.dll.
Set the name of the picture as the parameter for the recognition module.
 
It's all in the search results of, e.g., a Google image search. Send the DOM text to the regex builder and look for the image URLs. I don't remember what settings I used, but getting the images works reliably. I let Zenno write all the regex matches into a file (a list of image URLs) and passed it to wget (a free external command-line tool that can download images from a URL list you provide); you launch wget from the 'own script' branch.
 
I'm not sure I follow, sorry. When I type 'stuff' into Google Images and try the first image, I get

if I left-click the image on the page. But when I look for that in the DOM or the source HTML, it's nowhere to be found.

Even if I reduce it to


I still can't find it. Clicking on all the images to open them in tabs would be very messy for a template, and it wouldn't work on Facebook, where the profile images lead to a profile and not to the picture. So I'm afraid I'm missing something here.
 

The CaptchaSaver approach would be a good solution for me if it were one image or a few, but I'm afraid there wouldn't be any regexp to parse a number of images on varying pages.
 
To get the images, read the page source text (sorry, it wasn't the DOM), then parse it with the following macro:

In the macro builder go to: Regular Expression -> Parse with regular expression.
In the 'input string' field, put the id of the step that read the page source.
In the 'regular expression' field, put (?<=imgurl\=).*?(?=&amp)
In the '# of match' field, put 0;end (this tells the macro to fetch every match it finds, from the first (0) to the last).

About the expression:
It matches the shortest run of text that has 'imgurl=' before it and '&amp' after it.

<A href="/imgres?imgurl=http:/ landscaping.savvy-cafe.com/wp-content/uploads/2007/03/irish-landscape.jpg&amp;imgrefurl=http://landscaping.savvy-cafe.com/category/landscaping-photos/&amp;usg=__AcFkrCS4tc73b1bLe0rshqUzpRI=&amp;h=375&amp;w=500&amp;sz=114&amp;hl=en&amp;start=7&amp;zoom=1&amp;tbnid=0g_5fmzm73ep4M:&amp;tbnh=98&amp;tbnw=130&amp;ei=QAW9TaKXGJLG8QPt9NTABg&amp;prev=/search%3Fq%3Dlandscape%26hl%3Den%26gbv%3D1%26tbm%3Disch&amp;itbs=1">

(I had to delete a '/' in the code above; otherwise it would get auto-formatted as a link.)

Here's the finished expression btw:
{-RegExp.RegExp-|-YOUR SOURCE HERE-|-(?<=imgurl\=).*?(?=&amp)-|-0;end-}
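
For anyone who wants to check the expression outside the macro builder, here is a minimal Python sketch of the same lookaround pattern. The sample page source and the `extract_image_urls` helper are made up for illustration (example.com is a placeholder); only the regular expression itself comes from the post above.

```python
import re

# Same pattern as in the macro: the shortest text that follows 'imgurl='
# and is followed by '&amp' (the HTML-escaped '&' in the page source).
IMGURL_PATTERN = re.compile(r"(?<=imgurl=).*?(?=&amp)")


def extract_image_urls(page_source):
    """Return every imgurl value in the source, like '0;end' does in the macro."""
    return IMGURL_PATTERN.findall(page_source)


# Hypothetical page source with two result links.
sample_source = (
    '<a href="/imgres?imgurl=http://example.com/a.jpg&amp;imgrefurl=http://example.com/">'
    '<a href="/imgres?imgurl=http://example.com/b.png&amp;h=375&amp;w=500">'
)

image_urls = extract_image_urls(sample_source)
print("\n".join(image_urls))
```

The lazy `.*?` matters here: a greedy `.*` would run on to the last '&amp' on the line and swallow the other parameters along with the URL.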

Afterwards, write everything to a file.
Send this file to wget with this command:
d:\yourApplicationPath\Wget\bin\wget.exe -i "d:\yourInputFilePath\googleImg.txt"

I had to write that to a .bat file, because the 'binary path' field of the 'own program' object gets confused by the " characters in it. So write the line above to a file named fetchimages.bat and start that file with the 'own program' object.

(You have to install wget btw)

The images will download into your zenno folder.
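
If the quoting trouble with the .bat file gets annoying, the same wget call can be made from a small script. This is only a sketch: the helper names are placeholders, and it assumes wget is installed and reachable at the path you pass in. `-i` is the same URL-list option used above, and `-P` is wget's option for choosing the download folder.

```python
import subprocess
from pathlib import Path


def write_url_list(urls, list_path):
    """Write one URL per line, skipping blanks and duplicates."""
    seen = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.append(url)
    Path(list_path).write_text("\n".join(seen) + "\n", encoding="utf-8")
    return seen


def build_wget_command(wget_exe, list_path, out_dir):
    """Build the argument list; paths need no manual quoting this way."""
    return [str(wget_exe), "-i", str(list_path), "-P", str(out_dir)]


def download_all(wget_exe, list_path, out_dir):
    """Run wget and return its exit code (0 means no problems occurred)."""
    command = build_wget_command(wget_exe, list_path, out_dir)
    return subprocess.run(command).returncode
```

With the paths from the post, a call would look like `download_all(r"d:\yourApplicationPath\Wget\bin\wget.exe", r"d:\yourInputFilePath\googleImg.txt", "images")`. Passing the arguments as a list is what avoids the quote problem, since nothing has to be escaped by hand.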

Hope it gets you off the ground.
Once you understand how the expression builder and the logic branch work, it all makes sense and nothing seems impossible :)
 
Works here too. Nice site, bookmarked it :)
 
Thanks guys, that helped very well indeed.

I think I'll go with the CaptchaSaver example for now, because I need the images small anyway and it's simpler :-)
 
I have one problem here: I was trying to download photos using the CaptchaSaver.dll method, but right-clicking a photo shows no captcha option, so I can't select it.
Only when I click through and the photo is enlarged to a single image does the right-click menu show "This is a captcha", but when I run it in debug, the saved .jpeg is just a small black square.

I also tried wget, but on Facebook the regex can't find the photo URLs in the DOM source. Only the URLs of the smaller 150x150 images in the sidebar can be found, whereas the profile pictures can't.

Has anyone tried this on Facebook yet?
 
I am having a hard time with another image-based task. I need to grab data displayed in Flash by running OCR on it. How can I send a notification with some sort of unique code saying that I need a verification code for a specific account? I need all threads to get the right result, in case I buy Zenno one day. I cannot do OCR in ZennoPoster, but there are loads of OCR solutions, including free and online ones. I just can't figure out how to trigger another party to do that and how to get the result back into Zenno. It's brilliant software, but it can cause headaches :).

BR,

Elsa
 
