About this user

Stephen Martindale http://blue-wildebeest.blogspot.com


Space-Separated Tag Parser

Here is a function that accepts a string containing tags and returns an array of extracted tags. (Updated to ignore duplicates)

/**
 * Parses a String of Tags
 *
 * Tags are space delimited. Either single or double quotes mark a phrase.
 * An unmatched quote causes everything to its right to be treated as one
 * single tag or phrase. All white-space within a phrase is converted to
 * single space characters. Quotes buried within tags are ignored! Tags are
 * converted to lower case. Duplicate tags are ignored, even duplicate
 * phrases that are equivalent.
 *
 * Returns an array of tags.
 */
function ParseTagString($sTagString)
{
	$arTags = array();		// Array of Output
	$cPhraseQuote = null;	// Record of the quote that opened the current phrase
	$sPhrase = null;		// Temp storage for the current phrase we are building
	
	// Define some constants
	static $sTokens = " \r\n\t";	// Space, Return, Newline, Tab
	static $sQuotes = "'\"";		// Single and Double Quotes
	
	// Start the State Machine
	do
	{
		// Get the next token, which may be the first
		$sToken = isset($sToken)? strtok($sTokens) : strtok($sTagString, $sTokens);
		
		// Are there more tokens?
		if ($sToken === false)
		{
			// Ensure that the last phrase is marked as ended
			$cPhraseQuote = null;
		}
		else
		{
			// Are we within a phrase or not?
			if ($cPhraseQuote !== null)
			{
				// Will the current token end the phrase?
				if (substr($sToken, -1, 1) === $cPhraseQuote)
				{
					// Trim the last character and add to the current phrase, with a single leading space if necessary
					if (strlen($sToken) > 1) $sPhrase .= ((strlen($sPhrase) > 0)? ' ' : '') . substr($sToken, 0, -1);
					$cPhraseQuote = null;
				}
				else
				{
					// If not, add the token to the phrase, with a single leading space if necessary
					$sPhrase .= ((strlen($sPhrase) > 0)? ' ' : '') . $sToken;
				}
			}
			else
			{
				// Will the current token start a phrase?
				if (strpos($sQuotes, $sToken[0]) !== false)
				{
					// Will the current token end the phrase?
					if ((strlen($sToken) > 1) && ($sToken[0] === substr($sToken, -1, 1)))
					{
						// The current token begins AND ends the phrase, trim the quotes
						$sPhrase = substr($sToken, 1, -1);
					}
					else
					{
						// Remove the leading quote
						$sPhrase = substr($sToken, 1);
						$cPhraseQuote = $sToken[0];
					}
				}
				else
					$sPhrase = $sToken;
			}
		}
		
		// If, at this point, we are not within a phrase, the prepared phrase is complete and can be added to the array.
		// The string cast treats null, false and '' as "no phrase" but still accepts a tag such as "0".
		if (($cPhraseQuote === null) && ((string)$sPhrase !== ''))
		{
			$sPhrase = strtolower($sPhrase);
			// Strict comparison, so that numeric-looking tags such as "10" and "1e1" are not treated as duplicates
			if (!in_array($sPhrase, $arTags, true)) $arTags[] = $sPhrase;
			$sPhrase = null;
		}
	}
	while ($sToken !== false);	// Stop when we receive FALSE from strtok()
	return $arTags;
}
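
For comparison, the rules in the doc comment above can also be sketched with a single regular expression. This is not the author's state machine but a swapped-in technique, and the function name `sketchParseTags` is hypothetical. It assumes PHP 7 or later (type hints and the `??` operator) and is only meant to illustrate the rules; it has not been checked against every edge case that ParseTagString() handles.

```php
<?php
// Hypothetical helper (not part of the original snippet): a compact,
// regex-based sketch of the same tagging rules.
function sketchParseTags(string $input): array
{
    // Alternative 1: a quote at the start of a token opens a phrase that runs
    // to the same quote followed by white-space or end of input; if no such
    // closing quote exists, the phrase takes the rest of the input.
    // Alternative 2: any other run of non-white-space characters is a tag.
    $pattern = '/(["\'])(.*?)(?:\1(?=\s|\z)|\z)|(\S+)/s';
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $tags = [];
    foreach ($matches as $m) {
        // Group 3 is only present for plain tags; group 2 holds phrase text
        $raw = $m[3] ?? $m[2];
        // Collapse white-space runs, trim, and lower-case
        $tag = strtolower(trim(preg_replace('/\s+/', ' ', $raw)));
        // Skip empty phrases and duplicates (strict comparison)
        if ($tag !== '' && !in_array($tag, $tags, true)) {
            $tags[] = $tag;
        }
    }
    return $tags;
}

// Yields the tags: php, tag parser, state machine
print_r(sketchParseTags('PHP "Tag   Parser" php \'state machine\''));
```

The regular expression is shorter, but the explicit state machine is easier to step through and to extend with new rules.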


An equivalent, normalized tag string can be rebuilt from the array with this reverse function:

/**
 * Reverses ParseTagString()
 */
function CreateTagString($arTags)
{
	// Prepare each tag to be imploded (foreach also copes with non-sequential array keys)
	foreach ($arTags as $i => $sTag)
	{
		// Record findings
		$bContainsWhitespace = false;	// Was whitespace found?
		$cRequiredQuote = '"';			// Use double-quote by default
		$cLastChar = null;
		$nLength = strlen($sTag);		// Computed once, not on every iteration
	
		// Search the tag
		for ($j = 0; $j < $nLength; $j++)
		{
			$c = $sTag[$j];
			
			// If the current character is a space
			if ($c === ' ')
			{
				$bContainsWhitespace = true;
				
				// If the previous char was a double quote, we require single quotes round our phrase
				if ($cLastChar === '"')
				{
					$cRequiredQuote = "'";
					break;	// There is no point in continuing our search, we cannot handle double-mixed quotes
				}
			}
			
			// Record this char as the last char
			$cLastChar = $c;
		}
		
		// Quote if necessary
		if ($bContainsWhitespace) $arTags[$i] = $cRequiredQuote . $sTag . $cRequiredQuote;
	}
	return implode(' ', $arTags);
}
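
The quoting decision inside the loop above boils down to two substring checks, which the following sketch isolates for a single tag. The name `sketchQuoteTag` is hypothetical, and the sketch assumes PHP 7 or later and tags that have already been normalized by the parser (only single spaces as white-space).

```php
<?php
// Hypothetical helper (not part of the original snippet): applies the same
// quoting rule as CreateTagString() to one tag.
function sketchQuoteTag(string $tag): string
{
    // Single-word tags need no quotes
    if (strpos($tag, ' ') === false) {
        return $tag;
    }
    // A double quote directly before a space would close a double-quoted
    // phrase too early when the string is parsed again, so use single quotes
    $quote = (strpos($tag, '" ') !== false) ? "'" : '"';
    return $quote . $tag . $quote;
}

echo sketchQuoteTag('php'), "\n";           // php
echo sketchQuoteTag('tag parser'), "\n";    // "tag parser"
echo sketchQuoteTag('say "hi" now'), "\n";  // 'say "hi" now'
```

As in the original function, a tag that contains both a double quote and a single quote directly before a space cannot be quoted safely with either character.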


To test the whole system, use the following array of test cases:

$arTestInputs = array(
	"this test ensures that words are correctly split",
	"in this test \"phrases\" and \"multi-word phrases\" are tested",
	"this test shows the behaviour if an \"odd quote is detected",
	"this test shows that 'different quotes' work too",
	"but mixed quotes fail: \"test phrase' does not stop on the quote",
	"which can be useful in some cases where \"the systems' requirements\" state that it is necessary",
	"quotes need not be attached to \" their phrase \"",
	"embedded\"quotes are ignored!",
	"this is also useful and demonstrates the system's coolness",
	"redundant   white-space is   removed from \"  tags    and phrases\"",
	"\"\"double quotes\"\" will result in single quotes!",
	"remember that 'double-quotes\" may be nested within single quotes'",
	"TaGs ArE NOT case SENsITiVE!",
	"a duplicate tag will be removed from the tag list",
	"even a \" complex phrase\" that is equivalent to another 'compleX   PHrASe   '"
);

foreach ($arTestInputs as $sTest)
{
	// Parse once and reuse the result for both outputs
	$arTags = ParseTagString($sTest);

	print "<pre>$sTest</pre>";
	print "<pre>";
	print_r($arTags);
	print "</pre>";
	print "<pre>";
	print CreateTagString($arTags);
	print "</pre>";
	print "<hr />";
}
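
The loop above prints results for visual inspection. A round-trip check can also be automated: parsing the recreated string should give the same tags as the first parse. The sketch below is one way to express that. The name `sketchRoundTripFailures` and the two stand-in callables are hypothetical, and PHP 7 or later is assumed. The stand-ins handle only single-word tags and exist purely so the block runs on its own.

```php
<?php
// Hypothetical helper (not part of the original snippet): returns the inputs
// for which parse -> create -> parse does not reproduce the first parse.
function sketchRoundTripFailures(callable $parse, callable $create, array $inputs): array
{
    $failures = [];
    foreach ($inputs as $input) {
        $first = $parse($input);
        $second = $parse($create($first));
        if ($first !== $second) {
            $failures[] = $input;
        }
    }
    return $failures;
}

// Stand-in callables (assumptions, single-word tags only)
$standInParse = function (string $s): array {
    return preg_split('/\s+/', strtolower($s), -1, PREG_SPLIT_NO_EMPTY);
};
$standInCreate = function (array $tags): string {
    return implode(' ', $tags);
};

// An empty array means every input survived the round trip
var_dump(sketchRoundTripFailures($standInParse, $standInCreate, array('PHP  Tag parser')));
```

With the functions from this snippet loaded, the same helper could be called as `sketchRoundTripFailures('ParseTagString', 'CreateTagString', $arTestInputs)`; whether every test case passes is left for the reader to verify.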


2006-03-09: 0.1.0 to 0.2.0. Duplicate phrases are now ignored.

--
Version 0.2.0 - 2006-03-09
STEM: The STEM Cells of PHP
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License
http://creativecommons.org/licenses/by-sa/2.5/