Space-Separated Tag Parser
/**
 * Parses a String of Tags
 *
 * Tags are space delimited. Either single or double quotes mark a phrase.
 * Odd quotes will cause everything on their right to reflect as one single
 * tag or phrase. All white-space within a phrase is converted to single
 * space characters. Quotes buried within tags are ignored! Duplicate tags
 * are ignored, even duplicate phrases that are equivalent.
 *
 * Returns an array of tags.
 */
function ParseTagString($sTagString)
{
    $arTags = array();     // Array of Output
    $cPhraseQuote = null;  // Record of the quote that opened the current phrase
    $sPhrase = null;       // Temp storage for the current phrase we are building

    // Define some constants
    static $sTokens = " \r\n\t"; // Space, Return, Newline, Tab
    static $sQuotes = "'\"";     // Single and Double Quotes

    // Start the State Machine
    do
    {
        // Get the next token, which may be the first
        $sToken = isset($sToken) ? strtok($sTokens) : strtok($sTagString, $sTokens);

        // Are there more tokens?
        if ($sToken === false)
        {
            // Ensure that the last phrase is marked as ended
            $cPhraseQuote = null;
        }
        else
        {
            // Are we within a phrase or not?
            if ($cPhraseQuote !== null)
            {
                // Will the current token end the phrase?
                if (substr($sToken, -1, 1) === $cPhraseQuote)
                {
                    // Trim the last character and add to the current phrase, with a single leading space if necessary
                    if (strlen($sToken) > 1) $sPhrase .= ((strlen($sPhrase) > 0) ? ' ' : '') . substr($sToken, 0, -1);
                    $cPhraseQuote = null;
                }
                else
                {
                    // If not, add the token to the phrase, with a single leading space if necessary
                    $sPhrase .= ((strlen($sPhrase) > 0) ? ' ' : '') . $sToken;
                }
            }
            else
            {
                // Will the current token start a phrase?
                if (strpos($sQuotes, $sToken[0]) !== false)
                {
                    // Will the current token end the phrase?
                    if ((strlen($sToken) > 1) && ($sToken[0] === substr($sToken, -1, 1)))
                    {
                        // The current token begins AND ends the phrase, trim the quotes
                        $sPhrase = substr($sToken, 1, -1);
                    }
                    else
                    {
                        // Remove the leading quote
                        $sPhrase = substr($sToken, 1);
                        $cPhraseQuote = $sToken[0];
                    }
                }
                else
                    $sPhrase = $sToken;
            }
        }

        // If, at this point, we are not within a phrase, the prepared phrase is complete and can be added to the array
        if (($cPhraseQuote === null) && ($sPhrase != null))
        {
            $sPhrase = strtolower($sPhrase);
            if (!in_array($sPhrase, $arTags)) $arTags[] = $sPhrase;
            $sPhrase = null;
        }
    }
    while ($sToken !== false); // Stop when we receive FALSE from strtok()
    return $arTags;
}
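The parser relies on strtok() to split the input and to discard runs of white-space, which is why redundant white-space inside a phrase collapses to single spaces. A minimal sketch of just that tokenizing step follows; SplitOnWhitespace() is a hypothetical helper used for illustration and is not part of the original code.

```php
<?php
// Hypothetical helper (not part of the original code): collect every token
// that strtok() yields for the same delimiter set the parser uses.
function SplitOnWhitespace($sInput)
{
    $sDelimiters = " \r\n\t"; // Space, Return, Newline, Tab
    $arTokens = array();

    // The first call passes the string, later calls only the delimiters
    $sToken = strtok($sInput, $sDelimiters);
    while ($sToken !== false)
    {
        $arTokens[] = $sToken;
        $sToken = strtok($sDelimiters);
    }
    return $arTokens;
}

print_r(SplitOnWhitespace("  \"multi-word \t phrases\"\n are  tested "));
```

The sample input yields four tokens; the quotes stay attached to the first two, and it is the state machine in ParseTagString() that later joins them into one phrase.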
The tag string can be rebuilt from the array with this reverse function:
/**
 * Reverses ParseTagString()
 */
function CreateTagString($arTags)
{
    // Prepare each tag to be imploded
    for ($i = 0; $i < sizeof($arTags); $i++)
    {
        // Record findings
        $bContainsWhitespace = false; // Was whitespace found?
        $cRequiredQuote = '"';        // Use double-quote by default
        $cLastChar = null;

        // Search the tag
        for ($j = 0; $j < strlen($arTags[$i]); $j++)
        {
            $c = $arTags[$i][$j];

            // If the current character is a space
            if ($c === ' ')
            {
                $bContainsWhitespace = true;

                // If the previous char was a double quote, we require single quotes round our phrase
                if ($cLastChar === '"')
                {
                    $cRequiredQuote = "'";
                    break; // There is no point in continuing our search, we can't handle double-mixed quotes
                }
            }

            // Record this char as the last char
            $cLastChar = $c;
        }

        // Quote if necessary
        if ($bContainsWhitespace) $arTags[$i] = $cRequiredQuote . $arTags[$i] . $cRequiredQuote;
    }
    return implode(' ', $arTags);
}
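The quoting rule in CreateTagString() can be stated compactly: a tag is quoted only if it contains a space, and single quotes are chosen only when a double quote sits directly before a space. The sketch below expresses that rule for one tag with strpos() instead of a character loop; QuoteTag() is a hypothetical name introduced here, not a function from the original.

```php
<?php
// Hypothetical helper (not in the original): quote a single tag using the
// same rule as CreateTagString(), written with strpos() instead of a loop.
function QuoteTag($sTag)
{
    // Tags without a space need no quotes at all
    if (strpos($sTag, ' ') === false) return $sTag;

    // A double quote directly followed by a space would end a double-quoted
    // phrase early when parsed, so such tags are wrapped in single quotes
    $cQuote = (strpos($sTag, '" ') !== false) ? "'" : '"';
    return $cQuote . $sTag . $cQuote;
}

print QuoteTag('php') . "\n";
print QuoteTag('multi-word phrases') . "\n";
print QuoteTag('double-quotes" may be nested') . "\n";
```

As in the original, a tag that contains both a double quote followed by a space and a single quote followed by a space cannot be quoted safely by this rule.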
To test the whole system, use the following array of test cases:
$arTestInputs = array(
    "this test ensures that words are correctly split",
    "in this test \"phrases\" and \"multi-word phrases\" are tested",
    "this test shows the behaviour if an \"odd quote is detected",
    "this test shows that 'different quotes' work too",
    "but mixed quotes fail: \"test phrase' does not stop on the quote",
    "which can be usefull in some cases where \"the systems' requirements\" state that it is necessary",
    "quotes need not be attached to \" their phrase \"",
    "embedded\"quotes are ignored!",
    "this is also usefull and demonstrates the system's coolness",
    "redundant white-space is removed from \" tags and phrases\"",
    "\"\"double quotes\"\" will result in single quotes!",
    "remember that 'double-quotes\" may be nested within single quotes'",
    "TaGs ArE NOT case SENsITiVE!",
    "a duplicate tag will be removed from the tag list",
    "even a \" complex phrase\" that is equivalent to another 'compleX PHrASe '"
);

foreach ($arTestInputs as $sTest)
{
    print ("<pre>$sTest</pre>");
    print "<pre>";
    print_r (ParseTagString($sTest));
    print "</pre>";
    print "<pre>";
    print CreateTagString(ParseTagString($sTest));
    print "</pre>";
    print "<hr />";
}
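The test loop prints its results for manual inspection. One possible automated check is to parse an input, rebuild the string, parse it again, and compare the two arrays. The sketch below assumes this round trip is a property worth checking; RoundTripIsStable() and the two stub functions are hypothetical and exist only so the example runs on its own.

```php
<?php
// Hypothetical harness (not in the original): $fnParse and $fnCreate stand in
// for ParseTagString() and CreateTagString().
function RoundTripIsStable($fnParse, $fnCreate, $sInput)
{
    $arFirst  = call_user_func($fnParse, $sInput);
    $arSecond = call_user_func($fnParse, call_user_func($fnCreate, $arFirst));

    // Stable means that re-parsing the rebuilt string gives the same tags
    return $arFirst === $arSecond;
}

// Simplified stand-ins (assumptions for this sketch only): they split on
// white-space and lower-case the result, without any quote handling.
function StubParse($sInput)
{
    return preg_split('/\s+/', strtolower($sInput), -1, PREG_SPLIT_NO_EMPTY);
}

function StubCreate($arTags)
{
    return implode(' ', $arTags);
}

var_dump(RoundTripIsStable('StubParse', 'StubCreate', "TaGs  ArE NOT case SENsITiVE!"));
```

With the real functions loaded, the call would pass 'ParseTagString' and 'CreateTagString' instead of the stubs, once for each entry of $arTestInputs.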
2006-03-09, 0.1.0 to 0.2.0: Duplicate phrases are now ignored.
--
Version 0.2.0 - 2006-03-09
STEM: The STEM Cells of PHP
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License
http://creativecommons.org/licenses/by-sa/2.5/