A bit of regexp trouble!

jp1

Client
Joined
23.01.2011
Messages
234
Thanks
2
Points
0
This is quite a long and tough post, and I've already posted in two forums to no avail. Either regexp is frustrating or I'm a noob; I think it's a bit of both.

1. I can't seem to catch the question mark in the first line. The expression also inexplicably captures the second string, which doesn't make sense to me, as I thought I had the lookbehind covered. Here are the two lines.



HTML:
<H3><A href="/question/index;_ylt=AuceFBRGAkkNJn5iiu3ZDYYjzKIX;_ylv=3?qid=20070704123624AA9H28e"><STRONG class=highlight>Accountant</STRONG>?</A></H3>
<P>...to do to get into university to be an <STRONG class=highlight>accountant</STRONG> ? what requirement do I need? how about the average...</P>


with:



Code:
(?<=<H3><A href="/question/index;\w+={[a-z, A-Z, 0-9]*_[a-z, A-Z, 0-9]|[a-z, A-Z, 0-9]*}*;_\w+=\d\?\w+=[a-z, A-Z, 0-9]*">){[a-z, A-Z, 0-9]* <STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG> [a-z, A-Z, 0-9]*\?|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>[a-z, A-Z, 0-9]*|[a-z, A-Z, 0-9]*<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>}?(?=\<\</A></H3>)


in order to get only the first line. However, I get both, not as one match but as two. It baffles me why I'm picking up the second line too, because the expression is clearly aimed only at the first. What leads it to think it has found a <P> when there's clearly just an <A> beats me. I can't help thinking the issue is somewhere in the spin, but the more I look, the more I feel I'm going nuts. To my mind, nothing inside the spin should lead to an error. I've put the A and the H3 there, glaringly so, and yet it still matches it all. The key is in what it captures of the second line:

----------------------------------- match # 0 -----------------------------------
<STRONG class=highlight>Accountant</STRONG>
----------------------------------- match # 1 -----------------------------------
to do to get into university to be an <STRONG class=highlight>accountant</STRONG>


It seems to think there's an (?=\<\</A></H3>) after that </STRONG>, but all there is is a space, and besides, when I leave only letters with no spaces I get the same result. There's certainly no H3 to be seen, so I don't know what match 1 is referring to.



All the spin inside is there because I'm matching variations in a bigger file, which I've got covered. I'm also surprised I'm not picking up the question mark at the end of the text I'm looking for. I've tried sticking it all over the place, inside the spin, outside, with and without line breaks, to no avail. I'd appreciate a hand, thanks.


2. I'm having trouble with line breaks, trying to match line breaks of a certain kind. I'd like to match all the strings that are before other strings that have phrases like '0 stars', '1 star' and so on.



An example of this is the following:



Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty?

0 Stars In United States - Asked by monaya - 6 answers - 3 years ago

I want the line in the middle. So I thought about the following:



Code:
(?<=\^)$(?=\^\\d)
or, without the inexplicable double escape in front of the d: (?<=\^)$(?=\^\d)

but it doesn't work. I tried a ? in front of the line break, like this:

Code:
(?<=\?\^)$(?=\^\\d)
and without the escape, but that didn't work either. What am I doing wrong?
 

bigcajones

Client
Joined
09.02.2011
Messages
1 216
Thanks
683
Points
113
Got this: (?<=\<H3\>\<A href\=\"\/question\/index;_ylt\=.*\=\d+\?qid\=.*\"\>).*(?=\<\/A\>\<\/H3\>)

Returns:
-----------------------match#0------------------------
<STRONG class=highlight>Accountant</STRONG>?

For the first problem. Is that what you were looking for?
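
For anyone testing outside the regex builder, here is a minimal sketch of the same idea in Python. It assumes Python's `re` module, which only accepts fixed-width lookbehind, so the variable-length `.*` inside the lookbehind above is swapped for a capture group. The names `html`, `heading_re` and `headings` are made up for illustration.

```python
import re

# The two sample lines from the first post.
html = (
    '<H3><A href="/question/index;_ylt=AuceFBRGAkkNJn5iiu3ZDYYjzKIX;_ylv=3'
    '?qid=20070704123624AA9H28e"><STRONG class=highlight>Accountant</STRONG>?</A></H3>\n'
    '<P>...to do to get into university to be an <STRONG class=highlight>accountant</STRONG>'
    ' ? what requirement do I need? how about the average...</P>'
)

# Consume the opening tags instead of looking behind them, then capture
# everything up to the closing </A></H3>. [^"]* stays inside the href value,
# and the lazy .*? stops at the first closing pair.
heading_re = re.compile(r'<H3><A href="/question/index;_ylt=[^"]*">(.*?)</A></H3>')

headings = heading_re.findall(html)
print(headings)
```

Only the heading line carries the `</A></H3>` pair, so the `<P>` line is never a candidate here.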
 

bigcajones

And the second one I got: (?<=.*\r\n).*

Returns:

----------------------------------- match # 0 -----------------------------------


----------------------------------- match # 1 -----------------------------------
An accountant who handled...it ok to sue the accountant for the penalty?

----------------------------------- match # 2 -----------------------------------


----------------------------------- match # 3 -----------------------------------
0 Stars In United States - Asked by monaya - 6 answers - 3 years ago

Then in your Template editor just choose match #1 for the result
 

jp1

Got this: (?<=\<H3\>\<A href\=\"\/question\/index;_ylt\=.*\=\d+\?qid\=.*\"\>).*(?=\<\/A\>\<\/H3\>)

Returns:
-----------------------match#0------------------------
<STRONG class=highlight>Accountant</STRONG>?

For the first problem. Is that what you were looking for?
Thanks man, that looks great. I do have to ask a couple more questions, though, so I can get on with learning regexp for the next time an issue comes up. I'll use your example for this one, of course, because it neatly uses .* where I'd gone nuts trying to list alphanumeric AND non-alphanumeric characters. I thought the reason the first line is caught by itself was your corrected lookahead, (?=\<\/A\>\<\/H3\>), where the difference from my (?=\<\</A></H3>) is that you escape every /, which I'd forgotten to do, and, what seemed more important, that this lets the A and H3 come into play. However, wrapping your lookbehind and lookahead around my own expression, like this:

(?<=\<H3\>\<A href\=\"\/question\/index;_ylt\=.*\=\d+\?qid\=.*\"\>){[a-z, A-Z, 0-9]* <STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG> [a-z, A-Z, 0-9]*\?|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>[a-z, A-Z, 0-9]*|[a-z, A-Z, 0-9]*<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>\?}(?=\<\/A\>\<\/H3\>)

gave me both matches. It also happened when I pasted only your lookbehind at the beginning of the expression, which leads me to believe the issue lies in the part I was hoping to match:

{[a-z, A-Z, 0-9]* <STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG> [a-z, A-Z, 0-9]*\?|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>[a-z, A-Z, 0-9]*|[a-z, A-Z, 0-9]*<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>\?}

I now realise, though, that the escapes are in there because you escaped the =, which led you to escape all the / ones too. So, long-winded as my statement was, what you actually did was substitute that spintax with your .*.

Now, I thought the first line was covered by the lookahead that stops it, (?=\<\/A\>\<\/H3\>), so how come it thinks the lines we don't want are also stopped by this expression when they apparently aren't? After all, they end with a </P> instead. Cheers.
 

jp1

And the second one I got: (?<=.*\r\n).*

Returns:

----------------------------------- match # 0 -----------------------------------


----------------------------------- match # 1 -----------------------------------
An accountant who handled...it ok to sue the accountant for the penalty?

----------------------------------- match # 2 -----------------------------------


----------------------------------- match # 3 -----------------------------------
0 Stars In United States - Asked by monaya - 6 answers - 3 years ago

Then in your Template editor just choose match #1 for the result
Hi, in this case I was looking for all the lines that come after lines like the one with the '3 years ago' phrase (even though I said I wanted the middle line), like this:

1 Stars In Higher Education (University +) - Asked by Ylz - 3 answers - 3 years ago


Accountant?

...to do to get into university to be an accountant ? what requirement do I need? how about the average...

1 Stars In Higher Education (University +) - Asked by Kimmi N - 1 answer - 4 years ago

Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty?
Running (?<=.*\r\n).* over this matches every line, giving me this:

----------------------------------- match # 61 -----------------------------------
Accountant?


----------------------------------- match # 62 -----------------------------------
...to do to get into university to be an accountant ? what requirement do I need? how about the average...


----------------------------------- match # 63 -----------------------------------
1 Stars In Higher Education (University +) - Asked by Kimmi N - 1 answer - 4 years ago


----------------------------------- match # 64 -----------------------------------
Sue an accountant who filed your taxes incorrectly when penalty is involved?


----------------------------------- match # 65 -----------------------------------
An accountant who handled...it ok to sue the accountant for the penalty?


----------------------------------- match # 66 -----------------------------------
0 Stars In United States - Asked by monaya - 6 answers - 3 years ago
This is why I'm having trouble with ^, as I thought it was a line break. I thought that writing it in, with a few words or escaped terms before it, would be enough to say that the lookahead is the string that comes before the string we wanted. Then I thought a $ would cover the string I want, and that another plain ^ as the lookbehind would mean I want just that one string and nothing more. Thanks for your help, and if you can illuminate this issue too it would be grand. Cheers.
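
A note on the anchors, since they seem to be the sticking point: `^` and `$` are zero-width positions (the start and end of a line when multiline mode is on), not the line-break characters themselves, and `\^` matches a literal caret. The sketch below is in Python rather than the regex builder, and it assumes the text uses plain `\n` line breaks. It picks the line that sits directly above an "N Stars" line, as originally asked. The variable names are made up for illustration.

```python
import re

text = (
    "Sue an accountant who filed your taxes incorrectly when penalty is involved?\n"
    "\n"
    "An accountant who handled...it ok to sue the accountant for the penalty?\n"
    "\n"
    "0 Stars In United States - Asked by monaya - 6 answers - 3 years ago"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
# The lookahead then requires one or more line breaks followed by "N Star(s)".
before_stars_re = re.compile(r"^.+$(?=\n+\d+ Stars?\b)", re.MULTILINE)

lines_before_stars = before_stars_re.findall(text)
print(lines_before_stars)
```

The first line is skipped because the text after its line breaks does not start with a digit.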
 

bigcajones

Code:
http://www.youtube.com/watch?v=oJtR5A4B6aQ
Vid I did with your example so you can see what I did. Remember that the wildcard is .* and not * on its own.

If you use the regex builder in PM it will put this in between your look ahead and look behind if that makes sense.

Let me work on your other example and get back to you.
 

bigcajones

Here's what I got for the second. I can't figure out how to do a conditional on this. It doesn't work, but I guess you could use both statements. Here's what I put in the regex builder.

Code:
1 Stars In Higher Education (University +) - Asked by Ylz - 3 answers - 3 years ago


Accountant?

...to do to get into university to be an accountant ? what requirement do I need? how about the average...

1 Stars In Higher Education (University +) - Asked by Kimmi N - 1 answer - 4 years ago

Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty? 

1 Stars In Higher Education (University +) - Asked by Ylz - 3 answers - 3 years ago


Accountant?

...to do to get into university to be an accountant ? what requirement do I need? how about the average...

1 Stars In Higher Education (University +) - Asked by Kimmi N - 1 answer - 4 years ago

Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty? 

1 Stars In Higher Education (University +) - Asked by Ylz - 3 answers - 3 years ago


Accountant?

...to do to get into university to be an accountant ? what requirement do I need? how about the average...

1 Stars In Higher Education (University +) - Asked by Kimmi N - 1 answer - 4 years ago

Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty?
Here's the first expression:

(?<=years ago\r\n\r\n\r\n).*\r\n\r\n.*

And this is what I got:

Code:
----------------------------------- match # 0 -----------------------------------
Accountant?

...to do to get into university to be an accountant ? what requirement do I need? how about the average...

----------------------------------- match # 1 -----------------------------------
Accountant?

...to do to get into university to be an accountant ? what requirement do I need? how about the average...

----------------------------------- match # 2 -----------------------------------
Accountant?

...to do to get into university to be an accountant ? what requirement do I need? how about the average...
For the second set I put in this:

(?<=years ago\r\n\r\n).*\r\n\r\n.*

And this is what I got:

Code:
----------------------------------- match # 0 -----------------------------------
Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty? 

----------------------------------- match # 1 -----------------------------------
Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty? 

----------------------------------- match # 2 -----------------------------------
Sue an accountant who filed your taxes incorrectly when penalty is involved?

An accountant who handled...it ok to sue the accountant for the penalty?
I had to look this up, but the carriage return plus new line is: \r\n

I'm still trying to figure out if you can do a conditional like or (||) or and (&&). I got results when I put this in, but there were about 1500 matches, and they still only contained the same results I got from the two short expressions above. It would be nice if someone with a little more experience in expression building would put up some tutorials on how to parse text more effectively.
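
On the "or" question: regex alternation is written with a single `|`, but here a quantifier on the line break may be enough, so one expression can cover both the two-break and the three-break case. This is a minimal sketch in Python, assuming `\r\n` line endings as above. Python's `re` does not accept a variable-length lookbehind, so capture groups stand in for it. The names `pair_re` and `pairs` are made up for illustration.

```python
import re

text = (
    "1 Stars In Higher Education (University +) - Asked by Ylz - 3 answers - 3 years ago\r\n"
    "\r\n"
    "\r\n"
    "Accountant?\r\n"
    "\r\n"
    "...to do to get into university to be an accountant ? what requirement do I need? how about the average...\r\n"
    "\r\n"
    "1 Stars In Higher Education (University +) - Asked by Kimmi N - 1 answer - 4 years ago\r\n"
    "\r\n"
    "Sue an accountant who filed your taxes incorrectly when penalty is involved?\r\n"
    "\r\n"
    "An accountant who handled...it ok to sue the accountant for the penalty?"
)

# (?:\r\n)+ accepts any number of line breaks after "years ago".
# [^\r\n]* captures one line without its trailing \r, which a bare . would include.
pair_re = re.compile(r"years ago(?:\r\n)+([^\r\n]*)(?:\r\n)+([^\r\n]*)")

pairs = pair_re.findall(text)
for title, snippet in pairs:
    print(title, "|", snippet)
```

Each match comes back as a (title, snippet) pair, so there is no need to run two expressions and merge the results.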
 

jp1

Code:
http://www.youtube.com/watch?v=oJtR5A4B6aQ
Vid I did with your example so you can see what I did. You have to remember that .* is wildcard not *

If you use the regex builder in PM it will put this in between your look ahead and look behind if that makes sense.

Let me work on your other example and get back to you.
Wow, that was great, very kind of you to do that video. I understand the logic behind some of this regexp now, and it shows how messily I write when I keep going on about the issue like I did. However, that I could have thought that, with this text to process:

<H3><A href="/question/index;_ylt=AuceFBRGAkkNJn5iiu3ZDYYjzKIX;_ylv=3?qid=20070704123624AA9H28e"><STRONG class=highlight>Accountant</STRONG>?</A></H3>
<P>...to do to get into university to be an <STRONG class=highlight>accountant</STRONG> ? what requirement do I need? how about the average...</P>

and using:
{[a-z, A-Z, 0-9]* <STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG> [a-z, A-Z, 0-9]*\?|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>[a-z, A-Z, 0-9]*|[a-z, A-Z, 0-9]*<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>}?
wouldn't cover both lines blows my mind, since it obviously covers them both, in its own complicated and erroneous way. It's a rubbish expression I wrote while I was trying to learn regexp, and it doesn't even match the full second line; clearly you came up with something much better. What I ask now purely concerns the nature of regexp, so I can go on learning it. It concerns the lookahead and why it isn't stopping the second line from being matched. Since

matches

----------------------------------- match # 0 -----------------------------------
<H3><A href="/question/index;_ylt=AuceFBRGAkkNJn5iiu3ZDYYjzKIX;_ylv=3?qid=20070704123624AA9H28e"><STRONG class=highlight>Accountant</STRONG>?
----------------------------------- match # 1 -----------------------------------
Why is there an empty second line anyways? It should be only matching one line. I'd have thought the lookahead would have stopped there from being any second match at all since there's no (?=\<\/A\>\<\/H3\>) in the second line.

Look at this expression, too:

{[a-z, A-Z, 0-9]* <STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG> [a-z, A-Z, 0-9]*\?|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>[a-z, A-Z, 0-9]*|[a-z, A-Z, 0-9]*<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>|<STRONG class=highlight>[a-z, A-Z, 0-9]*</STRONG>}?(?=\<\/A\>\<\/H3\>)
Here, again, I'd have thought that the lookahead would have stopped it from getting to the second line. But it matches:

----------------------------------- match # 0 -----------------------------------
<STRONG class=highlight>Accountant</STRONG>
----------------------------------- match # 1 -----------------------------------
to do to get into university to be an <STRONG class=highlight>accountant</STRONG>
 

bigcajones

Yeah, looking at your expression more closely, I can see that since the expression isn't right, it returns everything, because there is nothing to stop it from going on to the second line. Try to simplify your expressions a little more. I'm no expert, but I've found that it's a lot easier to go with the .* wildcard than to try to match with [a-z, A-Z, 0-9], because you will have to put that behind every ' or / in the text you're parsing. When scraping, especially on Google, you will be killing yourself trying to match something like this: "return clk(this.href,'','','','2','','0CDwQ0gIoADAB') although it can be done. I like simpler, and from what I've seen, simpler is better when it comes to regular expressions.
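
What seems to let the second line through is the braces. In Python's `re`, and as far as I know in most regex flavors, `{` and `}` do not group the way spintax does. When they do not form a quantifier such as `{2,5}` they are literal characters, so the `|` alternation splits the whole pattern at the top level. The lookahead then belongs only to the last alternative (and a leading lookbehind only to the first), and the middle alternatives can match anywhere. Below is a minimal sketch in Python (an assumption; the thread uses a regex builder) that reuses the alternatives from the last post and compares braces with `(?:...)` grouping. The optional `\?` after the group is added here so the trailing question mark is included, and all variable names are made up for illustration.

```python
import re

html = (
    '<H3><A href="/question/index;_ylt=AuceFBRGAkkNJn5iiu3ZDYYjzKIX;_ylv=3'
    '?qid=20070704123624AA9H28e"><STRONG class=highlight>Accountant</STRONG>?</A></H3>\n'
    '<P>...to do to get into university to be an <STRONG class=highlight>accountant</STRONG>'
    ' ? what requirement do I need? how about the average...</P>'
)

CLS = r'[a-z, A-Z, 0-9]*'  # note: this class also matches a literal comma and space
STRONG = r'<STRONG class=highlight>' + CLS + r'</STRONG>'
alternatives = [
    CLS + ' ' + STRONG + ' ' + CLS + r'\?',
    STRONG + CLS,
    CLS + STRONG,
    STRONG,
]
LOOKAHEAD = r'(?=</A></H3>)'

# Spintax-style braces: { and } are literal, so | splits the whole pattern
# and the lookahead is attached to the last alternative only.
braces = '{' + '|'.join(alternatives) + '}?' + LOOKAHEAD

# Non-capturing group: the alternation is scoped, so the lookahead
# applies to whichever alternative matched.
grouped = '(?:' + '|'.join(alternatives) + r')\??' + LOOKAHEAD

brace_matches = re.findall(braces, html)
group_matches = re.findall(grouped, html)
print(brace_matches)
print(group_matches)
```

With braces, the second and third alternatives match on their own, which reproduces the two matches reported above; with the group, only the heading content survives the lookahead.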
 
