Article Grabber Questions

Mankush

Client | Registered: 31.10.2011 | Posts: 189 | Thanks: 17 | Points: 28
Hi guys. I'm currently building an article grabber.
Input - keywords.
Action - search on ezine, open the first 10 results (one at a time, with the links found by regex), then grab each article using the "Article Extraction" icon command in Project Maker.
Output - each article is saved to a designated file bearing the keyword name + a sequential number.
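
The output step described above can be sketched in Python. This is only an illustration and not how Project Maker does it: the function name `save_articles`, the `.txt` extension, and the underscore separator are all assumptions.

```python
from pathlib import Path


def save_articles(keyword: str, articles: list[str], out_dir: Path) -> list[Path]:
    """Write each article to '<keyword>_<n>.txt' and return the created paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Replace characters that are awkward in file names with underscores.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in keyword)
    paths = []
    for n, text in enumerate(articles, start=1):  # sequential number starts at 1
        path = out_dir / f"{safe}_{n}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling `save_articles("corporate video", texts, Path("out"))` would produce `corporate_video_1.txt`, `corporate_video_2.txt`, and so on inside `out`.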

I have two main problems:

1. Regex - currently the regex that I have built is:
(?<=<a\ href="/\?)[\w\W]*?(?=">)
Which gives me the wanted result (the name of the article) together with an unwanted one (the name of the writer).

Example:

Corporate-Video-Production---Corporate-Videos-High-Impact-on-Business-Audience&amp;id=439810
expert=Shakir_A.
Corporate-Video-Productions---Need-and-Importance&amp;id=439816
expert=Shakir_A.


How can I omit the writer (named expert on ezine) line in one go?
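
One possible way to drop the author lines in a single pass is a negative lookahead that rejects any link whose query starts with `expert=`. The sketch below uses Python's `re` module for illustration; the sample string is rebuilt from the example output above, and the link text in it is made up.

```python
import re

# Keep '/?...' links, but reject those whose query starts with 'expert='.
TITLE_HREF = re.compile(r'(?<=<a href="/\?)(?!expert=)[^"]*(?=">)')


def title_links(html: str) -> list[str]:
    """Return the href tails of article links, skipping author links."""
    return TITLE_HREF.findall(html)


sample = (
    '<a href="/?Corporate-Video-Productions---Need-and-Importance&amp;id=439816">x</a>\n'
    '<a href="/?expert=Shakir_A.">Shakir A.</a>\n'
)
print(title_links(sample))
```

Using `[^"]*` instead of `[\w\W]*?` also keeps a match from running past the closing quote of the attribute.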

2. Using "Article Extraction" I get a lot of junk at the beginning of the file, and no headline. I did not see any parameters for "Article Extraction".
If I want to create my own regex for grabbing, how can I take the headline together with the content of the article in one go?
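
Grabbing the headline and the body in one go usually means one pattern with two capture groups. The article page markup is not shown in this thread, so the `<h1>` headline and the `<div id="article-content">` container below are purely hypothetical placeholders; the real tags would have to be read from the page source.

```python
import re

# Hypothetical markup: headline in <h1>, body in <div id="article-content">.
ARTICLE = re.compile(
    r'<h1[^>]*>(?P<headline>.*?)</h1>.*?'
    r'<div id="article-content">(?P<body>.*?)</div>',
    re.DOTALL,  # let '.' cross line breaks
)


def extract_article(page: str):
    """Return (headline, body) or None when the pattern does not match."""
    m = ARTICLE.search(page)
    if m is None:
        return None
    return m.group("headline").strip(), m.group("body").strip()
```

The lazy `.*?` stops at the first `</div>`, so a body that contains nested `<div>` elements would be cut short; an HTML parser is the safer tool in that case.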

I want to understand regex better. I even downloaded a program called "RegexMagic", but that didn't do any magic at all; it just got me banging my head on my keyboard.
 

rostonix

Well-known member | Registered: 23.12.2011 | Posts: 29 067 | Thanks: 5 708 | Points: 113

Mankush

About the article extraction: got it : )
About the sites: I've already been there, I just wasn't able to actually understand them (since I'm not an uber programmer).

I have my own workaround for the first problem: just take every second line (thus skipping the unwanted info).
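
For comparison, the every-second-line workaround can be written next to a content-based filter. This is a small Python sketch using the example output from the first post; the variable names are illustrative.

```python
matches = [
    "Corporate-Video-Production---Corporate-Videos-High-Impact-on-Business-Audience&amp;id=439810",
    "expert=Shakir_A.",
    "Corporate-Video-Productions---Need-and-Importance&amp;id=439816",
    "expert=Shakir_A.",
]

# Workaround: keep items 0, 2, 4, ... and rely on strict title/author alternation.
by_position = matches[::2]

# Alternative: keep everything that is not an author link, whatever the order.
by_content = [m for m in matches if not m.startswith("expert=")]
```

Both give the same list here, but the positional version would drift as soon as one result came without an author link.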

I thought about two options for solving it in regex (I just don't know how to implement them):

1. The headlines of the articles have an <h3> tag with a tab before them (that is why I didn't manage to put it in the regex).
2. The word count of the headline will always (or almost always) be longer than the writer's name...

Maybe you could give me a hint?
 

rostonix

I don't really understand what you mean. If you post a piece of the code, I can look and maybe suggest a regular expression for your task.
 

Mankush

<div class="result-title">
<h3>
<a href="/?The-8-Mistakes-Guaranteed-to-Doom-Your-Corporate-Video-Production&amp;id=6993909">The 8 Mistakes Guaranteed to Doom Your Corporate Video Production</a>
<span class="result-author"> by


<a href="/?expert=Jim_Penrose">Jim Penrose</a>


</span>
</h3>
</div>
 

Mankush

Have tried

(?<=title"><h3><a\ href="/\?).*(?=</a>)


But it does not seem to work for me : (

I just want the href of the title link, not the author's...
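
A likely reason the attempt fails, assuming the page is laid out as in the snippet quoted earlier: the pattern expects `title"><h3><a` with nothing in between, while the markup has line breaks between those tags. In most regex flavors `.` also does not match a newline by default, and a greedy `.*` runs to the last `</a>` on the line. Below is a Python sketch that allows whitespace between the tags with `\s*`. It uses a capture group rather than a lookbehind because Python's `re` only accepts fixed-width lookbehinds. The names `TITLE_LINK` and `title_hrefs` are illustrative.

```python
import re
from html import unescape

# Anchor on the result-title div so the nested author link is never reached.
TITLE_LINK = re.compile(r'class="result-title">\s*<h3>\s*<a href="/\?([^"]+)"')


def title_hrefs(page: str) -> list[str]:
    """Return the title link targets with HTML entities such as &amp; decoded."""
    return [unescape(href) for href in TITLE_LINK.findall(page)]


snippet = """<div class="result-title">
<h3>
<a href="/?The-8-Mistakes-Guaranteed-to-Doom-Your-Corporate-Video-Production&amp;id=6993909">The 8 Mistakes Guaranteed to Doom Your Corporate Video Production</a>
<span class="result-author"> by


<a href="/?expert=Jim_Penrose">Jim Penrose</a>


</span>
</h3>
</div>"""
print(title_hrefs(snippet))
```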
 

rostonix

Something like this
 

Attachment: 38.5 KB

Mankush

Cool, works like magic : )
I have to loop it though, to collect all of the lines into a list (instead of grabbing the data once like I did and then skipping every second result).
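
The collecting loop could look roughly like this in Python. The pattern here is an assumption based on the markup posted earlier, not the expression from the attachment, and `collect_links` with its `limit` parameter is a made-up helper.

```python
import re
from typing import Iterable

# Assumed pattern: the title link is the first <a> directly inside <h3>.
TITLE_LINK = re.compile(r'<h3>\s*<a href="/\?([^"]+)"')


def collect_links(pages: Iterable[str], limit: int = 10) -> list[str]:
    """Gather title links from several result pages, up to `limit` unique items."""
    links: list[str] = []
    for page in pages:
        for match in TITLE_LINK.finditer(page):
            href = match.group(1)
            if href not in links:  # skip duplicates across pages
                links.append(href)
            if len(links) == limit:
                return links
    return links
```

The default of 10 mirrors the "first 10 results" goal from the first post.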
 
