Scraper sometimes repeats same pages

genetrader

Client
Joined
31.03.2011
Posts
26
Thanks
0
Points
1
Hi, I have a quick question... I have a scraper that is built and runs great, but periodically it seems to rescrape the same pages even though I have a PAUSE of 4 seconds built in after I call the new page. Is there another way to avoid rescraping the same page? Has anyone had a similar problem and found a solution? The routine runs on a couple of embedded counters that seem to be counting the pages correctly, but it seems the new page does not always load. I am also using variables for the counter embedded in the URL, but it never errors, so that doesn't appear to be the issue. Thanks for any advice or help...
 
Joined
28.07.2012
Posts
51
Thanks
9
Points
8
I'm pretty sure I've seen the same bug. It's probably due to the ZP cache bug.
 

rostonix

Well-known member
Joined
23.12.2011
Posts
29 067
Thanks
5 715
Points
113
You can delete duplicates from the result list/table, for example.
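Outside the program, the same cleanup can be sketched in a few lines of Python. This is only an illustration of the idea, not a ZennoPoster feature; the function name `drop_duplicates` and the sample rows are made up for the example.

```python
def drop_duplicates(rows):
    """Return rows with repeats removed, keeping the first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical scraped results where page 2 was scraped twice.
scraped = ["page-1 title", "page-2 title", "page-2 title", "page-3 title"]
print(drop_duplicates(scraped))
```

This cleans up the output but does not fix the cause: a page that never loaded is still missing from the results.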
 
Joined
01.02.2011
Posts
99
Thanks
15
Points
0
1. Pull the page URL before the reload.
2. Pull the page URL after the reload.
3. Add the URL variables to a logic action: {oldurl}=={newurl}
4. If the URLs are equal, pass through a counter loop before retrying the page reload (saves your ass from an infinite loop).
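The four steps above could look roughly like this in Python. It is a sketch under assumptions: `get_current_url` and `reload_page` are hypothetical callables standing in for whatever reads the address and triggers the navigation in your tool, and `FakeBrowser` exists only to make the example runnable.

```python
def reload_until_url_changes(get_current_url, reload_page, max_retries=3):
    """Retry a navigation until the URL differs from the one seen before.

    Returns True once the URL has changed, False when max_retries attempts
    are used up (the counter is what prevents an infinite loop).
    """
    old_url = get_current_url()          # step 1: URL before the reload
    for _attempt in range(max_retries):
        reload_page()
        new_url = get_current_url()      # step 2: URL after the reload
        if new_url != old_url:           # step 3: compare old and new
            return True
    return False                         # step 4: give up after the counter runs out


class FakeBrowser:
    """Stand-in browser whose navigation silently fails a few times."""

    def __init__(self, fail_times):
        self.url = "http://example.com/list-1"
        self.fail_times = fail_times

    def current_url(self):
        return self.url

    def go_next(self):
        if self.fail_times > 0:
            self.fail_times -= 1         # navigation "stuck": URL unchanged
            return
        self.url = "http://example.com/list-2"


browser = FakeBrowser(fail_times=2)
print(reload_until_url_changes(browser.current_url, browser.go_next))
```

When the function returns False, the scraper can log the page number and skip it instead of scraping the old page again.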
 

genetrader

Client
Thanks all for your input... I have noted the responses and will try them. However, I painstakingly walked through the process step by step and noticed that on the query to GO TO {PageVariable}-{PageCounter} it puts the correct URL into the URL field but does not actually call that page, even with a PAUSE of 10 seconds or more to wait for the page to render. It also seems to have problems with LONGER URL strings, though I can't confirm that just yet... I have noticed it is URL specific...

Is this related to the cache bug that I have seen others mention? For me it just doesn't render the page even though I see it in the URL field... Would it help to turn off JavaScript or something else on those pages to scrape them? Perhaps some kind of script is running to prevent page rendering? I have never seen such a thing... just theorizing... but the end result seems to be that BECAUSE the browser does not move to the next page and render the next URL, it repeats the content from the same page, thus creating duplicates...

Has anyone seen such a thing in rendering? I have tried to pull the page URL BEFORE load by pushing the total URL to a single variable and then calling it... I do not know of another way to pull the URLs but would be interested...

Another strange bug I noticed: when I assign URLs to a LIST, I was only able to pull 14 URLs into the list... I swapped around the URLs in the list to see what was causing the truncation, but no matter what I did I could only get 14 loaded... This was very strange but separate from the other issues... I run this on a very, very fast and heavy-duty notebook, so I couldn't see this being a memory limitation...

Let me know... and thanks!
 

rostonix

Well-known member
genetrader said: "Another strange bug I noticed was when I assign URLs to a LIST I was only able to pull 14 URLs into the list..."
I think you mean the previews in settings.
These are just previews.
==
Try to use GET requests instead of opening pages, and parse the body that you get.
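For readers who want to see what a GET request into a variable amounts to, here is a rough Python equivalent using only the standard library. It is not the ZennoPoster action itself; the function name `http_get`, the User-Agent string, and the example URL in the comment are assumptions made for illustration.

```python
import urllib.request


def http_get(url, timeout=10.0):
    """Fetch a URL without rendering it and return the response body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (needs network access), with a placeholder URL:
# body = http_get("http://example.com/list-1")
```

Because no browser tab is involved, there is no render step that can silently fail: each call either returns the body for that exact URL or raises an error.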
 

genetrader

Client
Newb question --- how would I get a webpage? I see GET has a "What to take?" dropdown... The elements in it are page elements, but how do you tell it to take from an actual webpage? I don't see a field where you enter the URL to 'get' from... Can you paste an example here of how I would pull from a webpage?

(And you are correct, Rostonix, they are just previews... sorry... After further inspection, I believe it does pull the full list, on that unrelated point...)
 

drvosjeca

Client
Joined
26.10.2011
Posts
512
Thanks
455
Points
63
You are looking at the wrong GET... You need to scroll down and use the HTTP: Get Request action block.
 

genetrader

Client
:o ------ *face-palm*

Thank you! Talk about newbie... *sigh*
 

genetrader

Client
OK - so I have it pulling the page, but what operators interact with the page? Is there a tutorial anywhere that can walk me through this method? The GET DOM command deals with open tabs, but the GET command you showed me pushes the HTML to a variable... How do you create regular expressions that work within a variable?

I am sorry for such basic questions, but I only saw tutorials on how to work with regular expressions in HTML pages... Any help, and the consideration you have already given, is VERY MUCH appreciated - so thank you!
 

rostonix

Well-known member
Your question is unclear. If you just need to scrape data from a URL, you use a GET request and then parse the result with regex, just like DOM code but via
Text processing - Regex
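To illustrate that a regex runs on a variable exactly as it would on page source, here is a small Python sketch. The HTML snippet, the pattern, and the name `extract_links` are invented for the example; in practice `page_html` would hold the body returned by the GET request.

```python
import re

# Pretend this string came back from a GET request and was saved to a variable.
page_html = """
<ul>
  <li><a href="/item/1">First item</a></li>
  <li><a href="/item/2">Second item</a></li>
</ul>
"""

# Capture the href and the link text of each anchor in the sample markup.
LINK_PATTERN = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def extract_links(html):
    """Return (href, text) pairs found in an HTML string."""
    return LINK_PATTERN.findall(html)


print(extract_links(page_html))
```

The regex never touches a browser; it only sees text, so it makes no difference whether that text came from an open tab or from a variable.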
 

drvosjeca

Client
It is exactly the same...

Even in the video tutorial I made, you will see that all grabbed text is put into a variable, and then you work from there.

A variable is just a shorter name for us to look at; the program sees everything, all the content of the variable, so you don't need to worry about that. Check that scraping video again, and keep in mind that here it is the same, just that you don't see the page at the start, but you put it into a variable the same way.
 

genetrader

Client
OK - but I thought the GET command forces you to put the HTML into a variable... How do I then use a REGEX on a variable? Normally the REGEX is pulling from an opened browser. In this case it needs to pull from a variable... Can you give a very brief example? I think I may be confused as to how REGEX works... I thought it could only work within an opened browser page... How else do you get REGEX to read from a file or from a variable?
 

genetrader

Client
I mean I understand the GET has the dropdown with TEXT... but where is that TEXT coming from? Does it look at the last step before the command and pull from that?
 

drvosjeca

Client
You've got it all wrong...

1. Regex is not pulling anything; it is more like cleaning data (from data you have already pulled, you extract what you need with it).

2. GET is not forcing anything; it is just making a simple request for data, just like when you open a site in a normal browser.

3. Like I said before, a variable is the same as the data, just shortened for your eyes so it can be manipulated with less trouble (like when you put all your cookies in a jar, it is easier to move the jar around with all the cookies than to move the cookies one by one).

4. GET has no dropdown with text; that is the encoding! The text comes from the URL which you add there... Again, it is the same as opening it in a browser, just faster.

5. Please try what was suggested before jumping in with more questions, otherwise we can talk forever and you will still not have anything done. Trying ==> Learning
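Put together, the flow described in the points above (request the data, keep it in a variable, extract with regex) can be sketched as a paging loop in Python. Everything named here is hypothetical: `fetch` is any callable that returns a page body for a URL, the `list-{page}` URL template mirrors the counter-in-URL setup from the first post, and `FAKE_SITE` stands in for a real site so the sketch runs offline. The hash check is an extra assumption of mine to flag a page whose body repeats an earlier one.

```python
import hashlib
import re

TITLE_PATTERN = re.compile(r"<h2>([^<]+)</h2>")


def scrape_pages(fetch, url_template, first_page, last_page):
    """Fetch numbered pages, extract titles, and report pages with repeated bodies."""
    results = []
    seen_digests = set()
    repeated_pages = []
    for page in range(first_page, last_page + 1):
        url = url_template.format(page=page)   # counter embedded in the URL
        body = fetch(url)                      # the "variable" holding the page
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            repeated_pages.append(page)        # same content as an earlier page
            continue
        seen_digests.add(digest)
        results.extend(TITLE_PATTERN.findall(body))
    return results, repeated_pages


# Offline stand-in for a site where page 3 wrongly serves page 2's content.
FAKE_SITE = {
    "http://example.com/list-1": "<h2>Alpha</h2><h2>Beta</h2>",
    "http://example.com/list-2": "<h2>Gamma</h2>",
    "http://example.com/list-3": "<h2>Gamma</h2>",
}

titles, repeats = scrape_pages(
    FAKE_SITE.__getitem__, "http://example.com/list-{page}", 1, 3
)
print(titles, repeats)
```

Comparing whole bodies only catches exact repeats; pages that differ by a timestamp or an ad block would need a comparison on the extracted data instead.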
 

genetrader

Client
Can you show me where there is a specific tutorial, video, or FAQ on using GET versus GOTO PAGE? I have been trying, which is why I am asking for help. There is no downloadable manual for MP, and I haven't found any information in version 3 that has helped me understand this. I have been trying and trying... It is totally unclear how you use GET once you 'GET' a page and assign it to a variable. Then what? Please tell me how I can use a regular expression on the HTML I get from a page that is assigned to a variable. If it was straightforward I would figure it out... but it's not. Yes, I am new to your software, but not to Boolean logic and linear step programming.

PLEASE help me and explain what I am doing. I can't figure it out. If it's written somewhere, show me where to read or watch. It is NOT explained in your scraper video.

Thanks again in advance for the time and effort taken. I guarantee you I am not the only one wondering about this, but I have yet to see it explained in a thread in this forum.

Thanks for your time and help.
 

rostonix

Well-known member

drvosjeca

Client
Hey... I'm sorry if you didn't understand what I was trying to say there...

You can contact me on Skype and I will explain to you how that works :-)

My Skype ID: dejan.jugovic1
 

genetrader

Client
Thank you, I will reach out to you. I totally understand how much you and the Zenno team have put into this product, and having used both this and Ubot, it is very clear that this is by far the superior solution in every way, with every element thought out. I appreciate the time and effort you and Rostonix spend in the forums, as it is enormously helpful.

Thank you both again for everything. I and many others truly appreciate it.
 

Stroks

Client
Joined
09.02.2012
Posts
219
Thanks
14
Points
18
Just to add, I found that scraping Google using GET requests is way faster than the usual way.
 
