Parsing optional blocks with RegEx

I have to parse data containing blocks that are not always present. Here's an example:

<block id="aaa"> aaa data </block>

<block id="bbb"> bbb data <block > xxx data </block> </block>

<block id="ccc"> ccc data </block>

I want to capture each block (aaa, bbb and ccc) in a corresponding variable. The problem is that some blocks can be omitted.

I tried this regex:

"(?s)\s*" & _
"(?:<block id=""aaa"">\s*(?<aaa>.*)\s*</block>)?\s*" & _
"(?:<block id=""bbb"">\s*(?<bbb>.*)\s*</block>)?\s*" & _
"(?:<block id=""ccc"">\s*(?<ccc>.*)\s*</block>)?\s*"

But it doesn't work. The first (aaa) block consumes other blocks.

How can I change that?




  • Hydra20010

    Thanks. That really helped.

    P.S. I was parsing <div>-based HTML pages.


  • ManishPPPP

    What are you looking to get from the pattern? Do you want the id values from <block id="aaa">, etc. (that is, "aaa", "bbb", etc.), or the data between the start and end <block> tags (that is, aaa data, bbb data, xxx data, etc.)?

    Imran.

  • J.Douglas

    Your capture groups are greedy. Use non-greedy capturing, replacing * by *? in all three capture groups.
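To illustrate the greedy versus non-greedy difference described above, here is a minimal sketch. It is written in Python only because that is easy to run; the thread itself uses .NET regexes, and Python spells named groups (?P<name>...) where .NET uses (?<name>...). The sample text and variable names are assumptions made for the demonstration.

```python
import re

# Two sibling blocks, shortened from the sample data in the question.
text = '<block id="aaa"> aaa data </block>\n\n<block id="ccc"> ccc data </block>'

# Greedy: .* runs to the end of the input and backtracks to the LAST </block>.
greedy = re.search(r'(?s)<block id="aaa">\s*(?P<aaa>.*)\s*</block>', text)

# Non-greedy: .*? stops at the FIRST </block> it can reach.
lazy = re.search(r'(?s)<block id="aaa">\s*(?P<aaa>.*?)\s*</block>', text)

print(repr(greedy.group("aaa")))  # swallows the ccc block as well
print(repr(lazy.group("aaa")))    # only the text of the aaa block
```

The greedy capture ends just before the last closing tag, which is why the first block appeared to consume the others.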


  • Iainr

    I want to get:
    <aaa>= aaa data
    <bbb>= bbb data <block > xxx data </block>
    <ccc>= ccc data


  • sbogollu

    ? is 1 or 0 times (greedy, so one occurrence is preferred); ?? is 0 or 1 times (lazy, so zero occurrences are preferred). I suggest you read up on regular expressions. A document I have always found very useful is perlre, although there are probably plenty of other manuals and tutorials that are more specific to .NET regexes.

    I didn't notice that your input data contained a nested block. Dealing with those can be quite tricky when using regular expressions... Have a look at this topic to see how to deal with nested elements.
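One way to cope with the nested block, sketched here under stated assumptions, is to stop matching whole blocks with a single pattern and instead scan the opening and closing tags while counting depth. This is not the approach from the linked topic, and it is written in Python rather than .NET; the function name top_level_blocks and the helper patterns TAG and ID are hypothetical.

```python
import re

# Matches either an opening <block ...> tag or a closing </block> tag.
TAG = re.compile(r'<block\b(?P<attrs>[^>]*)>|</block>')
ID = re.compile(r'id="(?P<id>[^"]*)"')

def top_level_blocks(text):
    """Return {id: inner text} for every top-level block.

    Nested blocks stay inside the captured text of their parent.
    A top-level block without an id is stored under the key None.
    """
    blocks = {}
    depth = 0
    start = None
    block_id = None
    for m in TAG.finditer(text):
        if m.group(0).startswith("</"):
            if depth == 0:
                continue  # stray closing tag, ignored in this sketch
            depth -= 1
            if depth == 0:
                blocks[block_id] = text[start:m.start()].strip()
        else:
            if depth == 0:
                id_match = ID.search(m.group("attrs"))
                block_id = id_match.group("id") if id_match else None
                start = m.end()
            depth += 1
    return blocks

sample = (
    '<block id="aaa"> aaa data </block>\n\n'
    '<block id="bbb"> bbb data <block > xxx data </block> </block>\n\n'
    '<block id="ccc"> ccc data </block>'
)
result = top_level_blocks(sample)
print(result.get("bbb"))
```

Because the result is a dictionary, an omitted block simply has no entry, so result.get("bbb") returns None when the bbb block is missing.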



  • parsec

    "(?s)\s*" & _
    "(?:<block id=""aaa"">\s*(?<aaa>.*?)\s*</block>)?\s*" & _
    "(?:<block id=""bbb"">\s*(?<bbb>.*?)\s*</block>)?\s*" & _
    "(?:<block id=""ccc"">\s*(?<ccc>.*?)\s*</block>)?\s*"

    This helps, but only partially. It results in:

    <aaa>="aaa data"
    <bbb>="bbb data <block > xxx data"
    <ccc>=""

    Is there any "greedy ?" which would mean 0 or 1 times, but with 1 preferred?
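The partial result above can be reproduced with the following sketch, which also shows that a plain ? already prefers one occurrence. It is a Python approximation of the .NET pattern, so named groups are written (?P<name>...), and a group that did not take part in the match is reported as None rather than as an empty string. The variable names are assumptions.

```python
import re

text = (
    '<block id="aaa"> aaa data </block>\n\n'
    '<block id="bbb"> bbb data <block > xxx data </block> </block>\n\n'
    '<block id="ccc"> ccc data </block>'
)

pattern = re.compile(
    r'(?s)\s*'
    r'(?:<block id="aaa">\s*(?P<aaa>.*?)\s*</block>)?\s*'
    r'(?:<block id="bbb">\s*(?P<bbb>.*?)\s*</block>)?\s*'
    r'(?:<block id="ccc">\s*(?P<ccc>.*?)\s*</block>)?\s*'
)
m = pattern.match(text)

# The lazy bbb capture stops at the inner </block>. The text that follows
# is the outer </block>, not <block id="ccc">, so the optional ccc group
# matches nothing and the overall match still succeeds.
print(m.group("aaa"), "|", m.group("bbb"), "|", m.group("ccc"))

# ? is greedy (tries one occurrence first); ?? is lazy (tries zero first).
prefers_one = re.match(r'(a)?', 'a').group(1)
prefers_zero = re.match(r'(a)??', 'a').group(1)
```

So the empty ccc capture comes from the nested closing tag, not from the quantifier on the optional groups.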

