After seeing a WSO on how to scrape content from YouTube videos, I decided to build a template for it instead of purchasing the software.
I've also seen a lot of questions on the forum lately about C# code, GAC references and HTTP requests. This template uses all three, so it can serve as a reference.
The template takes your keyword(s) from a file, goes to YouTube and searches for videos that have Closed Captioning. It scrapes all the video IDs and then goes to videos.Google to scrape the CC text. You won't see any of this happening in the browser because the scraping is done with HTTP GET requests.
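For anyone who wants to see the ID-scraping step outside of ZennoPoster, here is a rough Python sketch. It is not the code from the template. It assumes the search page source contains links of the form `watch?v=` followed by an 11-character ID, and `extract_video_ids` is just a name I made up for the example.

```python
import re

# Assumption: watch links appear in the page source as "watch?v=<11-character ID>".
WATCH_ID = re.compile(r"watch\?v=([A-Za-z0-9_-]{11})")


def extract_video_ids(page_html):
    """Return the unique video IDs found in page_html, in order of first appearance."""
    seen = set()
    ids = []
    for match in WATCH_ID.finditer(page_html):
        video_id = match.group(1)
        if video_id not in seen:
            seen.add(video_id)
            ids.append(video_id)
    return ids


# Made-up sample markup; a real search page will look different.
sample = (
    '<a href="/watch?v=abc123DEF_-">one</a> '
    '<a href="/watch?v=abc123DEF_-">same video again</a> '
    '<a href="/watch?v=ZZZZZZZZZZ9">two</a>'
)
print(extract_video_ids(sample))
```

The page source would come from whatever HTTP GET call you use. The sketch only covers pulling the IDs out of the response and dropping repeats.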
The caption text comes back as XML, so I included a C# action and added a reference to the XML library to turn the XML into readable text. I also do a little extra cleanup of the text once the action has finished with it.
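The template does this step in a C# action, but the same idea can be sketched in Python. This assumes the caption XML is a root element holding a series of `<text>` elements, which may not match what you actually get back, and that some entities arrive encoded twice. `captions_to_text` is a hypothetical helper, not something from the template.

```python
import html
import re
import xml.etree.ElementTree as ET


def captions_to_text(xml_payload):
    """Flatten caption XML into one line of readable text."""
    root = ET.fromstring(xml_payload)
    parts = []
    # Assumption: each caption line lives in a <text> element.
    for node in root.iter("text"):
        raw = "".join(node.itertext())
        # The XML parser decodes entities once; unescape again for double-encoded ones.
        parts.append(html.unescape(raw))
    # Collapse newlines and repeated spaces left over from the caption layout.
    return re.sub(r"\s+", " ", " ".join(parts)).strip()


# Made-up sample payload for illustration only.
sample = (
    '<transcript>'
    '<text start="0.0" dur="1.5">Hello &amp;amp; welcome</text>'
    '<text start="1.5" dur="2.0">it&amp;#39;s a\ntest</text>'
    '</transcript>'
)
print(captions_to_text(sample))
```

The whitespace collapse at the end is the kind of "little extra cleanup" meant above. Anything more specific depends on what the captions look like for your keywords.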
A folder named after your keyword is created, so you know which text goes with which keyword. Each file is named after the video's watch ID, so if some of the text doesn't make sense you can go to the video it was pulled from and clean it up if needed.
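The folder and file layout can be sketched like this in Python. The `.txt` extension, the character replacement in `safe_name` and the function names are my assumptions for the example; the template may handle these details differently.

```python
import re
import tempfile
from pathlib import Path


def safe_name(name):
    """Replace characters that Windows does not allow in file or folder names."""
    return re.sub(r'[<>:"/\\|?*]', "_", name).strip() or "untitled"


def save_transcript(base_dir, keyword, video_id, text):
    """Write text to <base_dir>/<keyword>/<video_id>.txt and return the path."""
    folder = Path(base_dir) / safe_name(keyword)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{video_id}.txt"
    target.write_text(text, encoding="utf-8")
    return target


# Demo in a throwaway directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_transcript(tmp, "dog training", "abc123DEF_-", "sample caption text")
    print(saved.relative_to(tmp))
```

Keeping the watch ID as the file name is what lets you trace a strange transcript back to its video later.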
Just a few points here:
This content is probably not good for your main site since a lot of the vids are published on other platforms that include the transcript with the video, so you would have duplicate content on your site.
You will need a proxy.txt file and a keyword.txt file in the project directory. You can take the proxy action out, but we all know what happens if you scrape Google too much from one IP.
The template is open source and free for everyone to use, so I don't want to see it sold on here as your own.
The content is not perfect, but it can be fixed. It would be good for spinning and using for GSA, SENuke or Zenno blasts on whatever tiers you might be building.
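On the proxy.txt and keyword.txt point above, here is a rough Python sketch of reading the two input files and picking a proxy. It assumes one entry per line and proxies written as `host:port`, which is my guess at the format rather than something the template enforces. `parse_lines`, `load_lines` and `pick_proxy` are hypothetical names.

```python
import random
from pathlib import Path


def parse_lines(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_lines(path):
    """Read a file such as keyword.txt or proxy.txt into a list of entries."""
    return parse_lines(Path(path).read_text(encoding="utf-8"))


def pick_proxy(proxies):
    """Pick one proxy at random, or return None when the list is empty."""
    return random.choice(proxies) if proxies else None


# Inline sample contents instead of real files, for illustration only.
keywords = parse_lines("dog training\n\n cat toys \r\n")
proxies = parse_lines("10.0.0.1:8080\n10.0.0.2:3128\n")
print(keywords)
print(pick_proxy(proxies) in proxies)
```

In a real run you would call `load_lines("keyword.txt")` and `load_lines("proxy.txt")` from the project directory instead of the inline strings.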
Good luck with it and if you have problems, just let me know.
View attachment YTCC.xmlz