<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DZone Snippets: unicode code</title>
    <link>http://snippets.dzone.com/posts</link>
    <pubDate>Sun, 27 Jul 2008 00:24:13 GMT</pubDate>
    <description>DZone Snippets: unicode code</description>
    <item>
      <title>Strip accents</title>
      <link>http://snippets.dzone.com/posts/show/5499</link>
      <description>// Strip accents from a string. For example, "Sigur R&#243;s" =&gt; "Sigur Ros".&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;def strip_accents(string):&lt;br /&gt;  import unicodedata&lt;br /&gt;  return unicodedata.normalize('NFKD', unicode(string)).encode('ASCII', 'ignore')&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Fri, 16 May 2008 00:19:40 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/5499</guid>
      <author>pyninja (Adeel Khan)</author>
    </item>
    <item>
      <title>Initial caps with ruby on rails</title>
      <link>http://snippets.dzone.com/posts/show/4972</link>
      <description>&lt;code&gt;&lt;br /&gt;class String&lt;br /&gt;  # unicode_str.initial_caps =&gt; new_str&lt;br /&gt;  # returns a copy of a string with initial capitals&lt;br /&gt;  # "Jules-&#201;douard".initial_caps =&gt; "J.&#201;."&lt;br /&gt;  def initial_caps&lt;br /&gt;    self.tr('-', ' ').split(' ').map { |word| word.chars.first.upcase.to_s + "." }.join&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Thu, 10 Jan 2008 12:41:22 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4972</guid>
      <author>jerome ()</author>
    </item>
    <item>
      <title>Some problems with the UTF-8 charset?</title>
      <link>http://snippets.dzone.com/posts/show/4814</link>
      <description>Run this MySQL query before all others to fix your charset problems:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;...&lt;br /&gt;mysql_query( "SET NAMES 'utf8' " );&lt;br /&gt;...&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.ab-d.fr/"&gt;Source: ab-d.fr&lt;br /&gt;Languages: PHP and MySQL&lt;/a&gt;</description>
      <pubDate>Fri, 23 Nov 2007 22:07:58 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4814</guid>
      <author>ki4ngel (Benoit Asselin)</author>
    </item>
    <item>
      <title>Unicode words from online dictionary</title>
      <link>http://snippets.dzone.com/posts/show/4708</link>
      <description>// list words from unicode dictionary &lt;br /&gt;// you need to add this line in the head section&lt;br /&gt;//   &lt;meta http-equiv="Content-Type" content="text/html; charset=utf-8" /&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;for ( $i = 1; $i &lt;= 45; $i++) {&lt;br /&gt;$url="http://dsal.uchicago.edu/cgi-bin/romadict.pl?page=$i&amp;table=molesworth&amp;display=utf8";&lt;br /&gt;$text=file_get_contents($url);&lt;br /&gt;$myarray = preg_match_all('#&lt;font size="\+1"&gt;(.*?)&lt;/font&gt;#i', $text, $matches);&lt;br /&gt;echo implode(' ',$matches[1]); &lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Sun, 28 Oct 2007 08:55:52 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4708</guid>
      <author>shantanuo (shantanu oak)</author>
    </item>
    <item>
      <title>Punycoded URLs in Ruby</title>
      <link>http://snippets.dzone.com/posts/show/4575</link>
      <description>This is just a proof-of-concept snippet for how to internationalize domain names using &lt;a href="http://raa.ruby-lang.org/project/punycode4r/"&gt;punycode4r&lt;/a&gt; (sudo gem install punycode4r).&lt;br /&gt;&lt;br /&gt;For more information please see:&lt;br /&gt;- &lt;a href="http://en.wikipedia.org/wiki/Punycode"&gt;Punycode&lt;/a&gt;&lt;br /&gt;- &lt;a href="http://en.wikipedia.org/wiki/Internationalizing_Domain_Names_in_Applications"&gt;Internationalized domain name&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;#!/usr/local/bin/ruby -Ku&lt;br /&gt;&lt;br /&gt;# NOTE: The following is not the complete source code by Kazuhiro NISHIYAMA.&lt;br /&gt;#       For the full source code with more features, comments &amp; test cases please see: &lt;br /&gt;#       open -e `gem environment gemdir`/gems/punycode4r-0.2.0/lib/punycode.rb&lt;br /&gt;#&lt;br /&gt;# This is pure Ruby implementing Punycode (RFC 3492).&lt;br /&gt;# (original ANSI C code (C89) implementing Punycode is in RFC 3492)&lt;br /&gt;#&lt;br /&gt;# copyright (c) 2005 Kazuhiro NISHIYAMA&lt;br /&gt;# You can redistribute it and/or modify it under the same terms as Ruby.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;require "unicode"     # sudo gem install unicode&lt;br /&gt;&lt;br /&gt;module Punycode&lt;br /&gt;&lt;br /&gt;  module Status&lt;br /&gt;    class Error &lt; StandardError; end&lt;br /&gt;    class PunycodeSuccess; end&lt;br /&gt;    # Input is invalid.&lt;br /&gt;    class PunycodeBadInput &lt; Error; end&lt;br /&gt;    # Output would exceed the space provided.&lt;br /&gt;    class PunycodeBigOutput&lt; Error; end&lt;br /&gt;    # Input needs wider integers to process.
&lt;br /&gt;    class PunycodeOverflow &lt; Error; end&lt;br /&gt;  end&lt;br /&gt;  include Status&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  BASE = 36; TMIN = 1; TMAX = 26; SKEW = 38; DAMP = 700&lt;br /&gt;  INITIAL_BIAS = 72; INITIAL_N = 0x80; DELIMITER = 0x2D&lt;br /&gt;&lt;br /&gt;  module_function&lt;br /&gt;&lt;br /&gt;  def basic(cp)&lt;br /&gt;    cp &lt; 0x80&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def delim(cp)&lt;br /&gt;    cp == DELIMITER&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def decode_digit(cp)&lt;br /&gt;    cp - 48 &lt; 10 ? cp - 22 :  cp - 65 &lt; 26 ? cp - 65 :&lt;br /&gt;      cp - 97 &lt; 26 ? cp - 97 : BASE&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def encode_digit(d, flag)&lt;br /&gt;    return d + 22 + 75 * ((d &lt; 26) ? 1 : 0) - ((flag ? 1 : 0) &lt;&lt; 5)&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def flagged(bcp)&lt;br /&gt;    (0...26) === (bcp - 65)&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def encode_basic(bcp, flag)&lt;br /&gt;    # bcp -= (bcp - 97 &lt; 26) &lt;&lt; 5;&lt;br /&gt;    if (0...26) === (bcp - 97)&lt;br /&gt;      bcp -= 1 &lt;&lt; 5&lt;br /&gt;    end&lt;br /&gt;    # return bcp + ((!flag &amp;&amp; (bcp - 65 &lt; 26)) &lt;&lt; 5);&lt;br /&gt;    if !flag and (0...26) === (bcp - 65)&lt;br /&gt;      bcp += 1 &lt;&lt; 5&lt;br /&gt;    end&lt;br /&gt;    bcp&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  MAXINT = 1 &lt;&lt; 64&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;  def adapt(delta, numpoints, firsttime)&lt;br /&gt;    delta = firsttime ? 
delta / DAMP : delta &gt;&gt; 1&lt;br /&gt;    delta += delta / numpoints&lt;br /&gt;&lt;br /&gt;    k = 0&lt;br /&gt;    while delta &gt; ((BASE - TMIN) * TMAX) / 2&lt;br /&gt;      delta /= BASE - TMIN&lt;br /&gt;      k += BASE&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    k + (BASE - TMIN + 1) * delta / (delta + SKEW)&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def punycode_encode(input_length, input, case_flags, output_length, output)&lt;br /&gt;&lt;br /&gt;    n = INITIAL_N&lt;br /&gt;    delta = out = 0&lt;br /&gt;    max_out = output_length[0]&lt;br /&gt;    bias = INITIAL_BIAS&lt;br /&gt;&lt;br /&gt;    input_length.times do |j|&lt;br /&gt;      if basic(input[j])&lt;br /&gt;        raise PunycodeBigOutput if max_out - out &lt; 2&lt;br /&gt;        output[out] =&lt;br /&gt;          if case_flags&lt;br /&gt;            encode_basic(input[j], case_flags[j])&lt;br /&gt;          else&lt;br /&gt;            input[j]&lt;br /&gt;          end&lt;br /&gt;        out+=1&lt;br /&gt;      # elsif (input[j] &lt; n)&lt;br /&gt;      #   raise PunycodeBadInput&lt;br /&gt;      # (not needed for Punycode with unsigned code points)&lt;br /&gt;      end&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    h = b = out&lt;br /&gt;&lt;br /&gt;    if b &gt; 0&lt;br /&gt;      output[out] = DELIMITER&lt;br /&gt;      out+=1&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;   while h &lt; input_length&lt;br /&gt;&lt;br /&gt;      m = MAXINT&lt;br /&gt;      input_length.times do |j|&lt;br /&gt;        # next if basic(input[j])&lt;br /&gt;        # (not needed for Punycode)&lt;br /&gt;        m = input[j] if (n...m) === input[j]&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;      raise PunycodeOverflow if m - n &gt; (MAXINT - delta) / (h + 1)&lt;br /&gt;      delta += (m - n) * (h + 1)&lt;br /&gt;      n = m&lt;br /&gt;&lt;br /&gt;      input_length.times do |j|&lt;br /&gt;        # Punycode does not need to check whether input[j] is basic:
&lt;br /&gt;        if input[j] &lt; n # || basic(input[j])&lt;br /&gt;          delta+=1&lt;br /&gt;          raise PunycodeOverflow if delta == 0&lt;br /&gt;        end&lt;br /&gt;&lt;br /&gt;        if input[j] == n&lt;br /&gt;&lt;br /&gt;          q = delta; k = BASE&lt;br /&gt;          while true&lt;br /&gt;            raise PunycodeBigOutput if out &gt;= max_out&lt;br /&gt;            t = if k &lt;= bias # + TMIN # +TMIN not needed&lt;br /&gt;                  TMIN&lt;br /&gt;                elsif k &gt;= bias + TMAX&lt;br /&gt;                  TMAX&lt;br /&gt;                else&lt;br /&gt;                  k - bias&lt;br /&gt;                end&lt;br /&gt;            break if q &lt; t&lt;br /&gt;            output[out] = encode_digit(t + (q - t) % (BASE - t), false)&lt;br /&gt;            out+=1&lt;br /&gt;            q = (q - t) / (BASE - t)&lt;br /&gt;            k += BASE&lt;br /&gt;          end&lt;br /&gt;&lt;br /&gt;          output[out] = encode_digit(q, case_flags &amp;&amp; case_flags[j])&lt;br /&gt;          out+=1&lt;br /&gt;          bias = adapt(delta, h + 1, h == b)&lt;br /&gt;          delta = 0&lt;br /&gt;          h+=1&lt;br /&gt;        end&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;      delta+=1; n+=1&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    output_length[0] = out&lt;br /&gt;    return PunycodeSuccess&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def punycode_decode(input_length, input, output_length, output, case_flags)&lt;br /&gt;&lt;br /&gt;    n = INITIAL_N&lt;br /&gt;&lt;br /&gt;    out = i = 0&lt;br /&gt;    max_out = output_length[0]&lt;br /&gt;    bias = INITIAL_BIAS&lt;br /&gt;&lt;br /&gt;    b = 0&lt;br /&gt;    input_length.times do |j|&lt;br /&gt;      b = j if delim(input[j])&lt;br /&gt;    end&lt;br /&gt;    raise PunycodeBigOutput if b &gt; max_out&lt;br /&gt;&lt;br /&gt;    b.times do |j|&lt;br /&gt;      case_flags[out] = flagged(input[j]) if case_flags&lt;br 
/&gt;      raise PunycodeBadInput unless basic(input[j])&lt;br /&gt;      output[out] = input[j]&lt;br /&gt;      out+=1&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    in_ = b &gt; 0 ? b + 1 : 0&lt;br /&gt;    while in_ &lt; input_length&lt;br /&gt;&lt;br /&gt;      oldi = i; w = 1; k = BASE&lt;br /&gt;      while true&lt;br /&gt;        raise PunycodeBadInput if in_ &gt;= input_length&lt;br /&gt;        digit = decode_digit(input[in_])&lt;br /&gt;        in_+=1&lt;br /&gt;        raise PunycodeBadInput if digit &gt;= BASE&lt;br /&gt;        raise PunycodeOverflow if digit &gt; (MAXINT - i) / w&lt;br /&gt;        i += digit * w&lt;br /&gt;        t = if k &lt;= bias # + TMIN # +TMIN not needed&lt;br /&gt;              TMIN&lt;br /&gt;            elsif k &gt;= bias + TMAX&lt;br /&gt;              TMAX&lt;br /&gt;            else&lt;br /&gt;              k - bias&lt;br /&gt;            end&lt;br /&gt;        break if digit &lt; t&lt;br /&gt;        raise PunycodeOverflow if w &gt; MAXINT / (BASE - t)&lt;br /&gt;        w *= BASE - t&lt;br /&gt;        k += BASE&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;      bias = adapt(i - oldi, out + 1, oldi == 0)&lt;br /&gt;&lt;br /&gt;      raise PunycodeOverflow if i / (out + 1) &gt; MAXINT - n&lt;br /&gt;      n += i / (out + 1)&lt;br /&gt;      i %= out + 1&lt;br /&gt;&lt;br /&gt;      # not needed for Punycode:&lt;br /&gt;      # raise PUNYCODE_INVALID_INPUT if decode_digit(n) &lt;= base&lt;br /&gt;      raise PunycodeBigOutput if out &gt;= max_out&lt;br /&gt;&lt;br /&gt;      if case_flags&lt;br /&gt;        #memmove(case_flags + i + 1, case_flags + i, out - i)&lt;br /&gt;        case_flags[i + 1, out - i] = case_flags[i, out - i]&lt;br /&gt;&lt;br /&gt;        # Case of last character determines uppercase flag:&lt;br /&gt;        case_flags[i] = flagged(input[in_ - 1])&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;      #memmove(output + i + 1, output + i, (out - i) * sizeof 
*output)&lt;br /&gt;      output[i + 1, out - i] = output[i, out - i]&lt;br /&gt;      output[i] = n&lt;br /&gt;      i+=1&lt;br /&gt;&lt;br /&gt;      out+=1&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    output_length[0] = out&lt;br /&gt;    return PunycodeSuccess&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def encode(unicode_string, case_flags=nil, print_ascii_only=false)&lt;br /&gt;    input = unicode_string.unpack('U*')&lt;br /&gt;    output = [0] * (ACE_MAX_LENGTH+1)&lt;br /&gt;    output_length = [ACE_MAX_LENGTH]&lt;br /&gt;&lt;br /&gt;    punycode_encode(input.size, input, case_flags, output_length, output)&lt;br /&gt;&lt;br /&gt;    outlen = output_length[0]&lt;br /&gt;    outlen.times do |j|&lt;br /&gt;      c = output[j]&lt;br /&gt;      unless c &gt;= 0 &amp;&amp; c &lt;= 127&lt;br /&gt;        raise Error, "assertion error: invalid output char"&lt;br /&gt;      end&lt;br /&gt;      unless PRINT_ASCII[c]&lt;br /&gt;        raise PunycodeBadInput&lt;br /&gt;      end&lt;br /&gt;      output[j] = PRINT_ASCII[c] if print_ascii_only&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    output[0..outlen].map{|x|x.chr}.join('').sub(/\0+\z/, '')&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  def decode(punycode, case_flags=[])&lt;br /&gt;    input = []&lt;br /&gt;    output = []&lt;br /&gt;&lt;br /&gt;    if ACE_MAX_LENGTH*2 &lt; punycode.size&lt;br /&gt;      raise PunycodeBigOutput&lt;br /&gt;    end&lt;br /&gt;    punycode.each_byte do |c|&lt;br /&gt;      unless c &gt;= 0 &amp;&amp; c &lt;= 127&lt;br /&gt;        raise PunycodeBadInput&lt;br /&gt;      end&lt;br /&gt;      input.push(c)&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    output_length = [UNICODE_MAX_LENGTH]&lt;br /&gt;    Punycode.punycode_decode(input.length, input, output_length,&lt;br /&gt;                             output, case_flags)&lt;br /&gt;    output.pack('U*')&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;  UNICODE_MAX_LENGTH = 256&lt;br /&gt;  
ACE_MAX_LENGTH = 256&lt;br /&gt;&lt;br /&gt;  # The following string is used to convert printable&lt;br /&gt;  # characters between ASCII and the native charset:&lt;br /&gt;&lt;br /&gt;  PRINT_ASCII =&lt;br /&gt;    "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" \&lt;br /&gt;    "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" \&lt;br /&gt;    " !\"\#$%&amp;'()*+,-./" \&lt;br /&gt;    "0123456789:;&lt;=&gt;?" \&lt;br /&gt;    "@ABCDEFGHIJKLMNO" \&lt;br /&gt;    "PQRSTUVWXYZ[\\]^_" \&lt;br /&gt;    "`abcdefghijklmno" \&lt;br /&gt;    "pqrstuvwxyz{|}~\n"&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# cf. http://snippets.dzone.com/posts/show/4527&lt;br /&gt;&lt;br /&gt;UTF8REGEX = /\A(?:                                                            &lt;br /&gt;              [\x09\x0A\x0D\x20-\x7E]            # ASCII&lt;br /&gt;            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte&lt;br /&gt;            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs&lt;br /&gt;            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte&lt;br /&gt;            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates&lt;br /&gt;            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3&lt;br /&gt;            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15&lt;br /&gt;            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16&lt;br /&gt;            )*\z/mnx&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;UTF8_REGEX_MBYTE = /(?:                                 &lt;br /&gt;                 [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte&lt;br /&gt;               |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs&lt;br /&gt;               | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte&lt;br /&gt;               |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates&lt;br /&gt;               |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3&lt;br /&gt;               | [\xF1-\xF3][\x80-\xBF]{3}          # 
planes 4-15&lt;br /&gt;               |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16&lt;br /&gt;               )/mnx&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# cf. http://demo.icu-project.org/icu-bin/idnbrowser (samples)&lt;br /&gt;# on Mac OS X you can check the Ruby conversions with the GUI app PunyCode, http://software.dibomedia.de/products/show/2&lt;br /&gt;&lt;br /&gt;str = "http://www.&#65201;&#65202;&#65207;.com/"&lt;br /&gt;str = "www.&#1089;&#1076;&#1077;&#1083;&#1072;&#1090; &#1082;&#1072;&#1088;&#1090;&#1080;&#1085;&#1082;&#1080;.com"&lt;br /&gt;str = "http://www.&#1089;&#1076;&#1077;&#1083;&#1072;&#1090;&#1082;&#1072;&#1088;&#1090;&#1080;&#1085;&#1082;&#1080;.com/"&lt;br /&gt;str = "http://t&#363;dali&#326;.lv/"&lt;br /&gt;str = "http://www.z&#252;rich.com/"&lt;br /&gt;str = "http://www.h&#246;ren.at/"&lt;br /&gt;str = "http://www.&#382;lut&#253; k&#367;&#328;.com/"&lt;br /&gt;str = "www.f&#228;rgbolaget.nu"&lt;br /&gt;str = "www.br&#230;ndendek&#230;rlighed.com"&lt;br /&gt;str = "www.m&#228;kitorppa.com"&lt;br /&gt;str = "www.f&#228;rjestadsbk.net"&lt;br /&gt;str = "&#12354;&#12540;&#12427;&#12356;&#12435;.com"&lt;br /&gt;str = "www.&#50696;&#48708;&#44368;&#49324;.com"&lt;br /&gt;str = "www.&#12495;&#12531;&#12489;&#12508;&#12540;&#12523;&#12469;&#12512;&#12474;.com"&lt;br /&gt;str = "www.&#26085;&#26412;&#24179;.jp"&lt;br /&gt;str = "www.r&#228;ksm&#246;rg&#229;s.se"&lt;br /&gt;str = "www.r&#243;&#380;yczka.pl/"&lt;br /&gt;str = "&#29702;&#23481;&#12490;&#12459;&#12512;&#12521;.com"&lt;br /&gt;str = "http://B&#252;cher.ch/"&lt;br /&gt;str = "t&#363;dali&#326;.lv"&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;if str =~ UTF8REGEX &amp;&amp; str =~ UTF8_REGEX_MBYTE&lt;br /&gt;&lt;br /&gt;   s1 = str.gsub(/^(http:\/\/www\.|http:\/\/|).*?\.[^\.\/]+\/?$/n, '\1')&lt;br /&gt;   s2 = str.gsub(/^(?:http:\/\/www\.|http:\/\/|)(www\.|).*?\.[^\.\/]+\/?$/n, '\1')&lt;br /&gt;   s3 = 
str.gsub(/^(?:http:\/\/www\.|http:\/\/|www\.|)(.*?)\.[^\.\/]+\/?$/n, '\1')&lt;br /&gt;   s4 = str.gsub(/^(?:http:\/\/www\.|http:\/\/|www\.|).*?(\.[^\.\/]+\/?)$/n, '\1')&lt;br /&gt;&lt;br /&gt;   if s1.empty? then s1 = 'http://' end&lt;br /&gt;&lt;br /&gt;   s3 = Punycode.encode(Unicode::normalize_KC(Unicode::downcase(s3)))&lt;br /&gt;&lt;br /&gt;   punycoded_url = s1 &lt;&lt; s2 &lt;&lt; "xn--" &lt;&lt; s3 &lt;&lt; s4&lt;br /&gt;&lt;br /&gt;   puts punycoded_url&lt;br /&gt;&lt;br /&gt;   %x{ /usr/bin/open "#{punycoded_url}" }&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;</description>
      <pubDate>Wed, 26 Sep 2007 21:00:18 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4575</guid>
      <author>ntk ()</author>
    </item>
    <item>
      <title>Convert Unicode codepoints to UTF-8 characters with Module#const_missing</title>
      <link>http://snippets.dzone.com/posts/show/4546</link>
      <description>From: http://www.davidflanagan.com/blog/2007_08.html#000136&lt;br /&gt;Author: David Flanagan&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;# This module lazily defines constants of the form Uxxxx for all Unicode&lt;br /&gt;# codepoints from U0000 to U10FFFF. The value of each constant is the&lt;br /&gt;# UTF-8 string for the codepoint.&lt;br /&gt;# Examples:&lt;br /&gt;#   copyright = Unicode::U00A9&lt;br /&gt;#   euro = Unicode::U20AC&lt;br /&gt;#   infinity = Unicode::U221E&lt;br /&gt;#&lt;br /&gt;module Unicode&lt;br /&gt;  def self.const_missing(name)  &lt;br /&gt;    # Check that the constant name is of the right form: U0000 to U10FFFF&lt;br /&gt;    if name.to_s =~ /^U([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/&lt;br /&gt;      # Convert the codepoint to an immutable UTF-8 string,&lt;br /&gt;      # define a real constant for that value and return the value&lt;br /&gt;      #p name, name.class&lt;br /&gt;      const_set(name, [$1.to_i(16)].pack("U").freeze)&lt;br /&gt;    else  # Raise an error for constants that are not Unicode.&lt;br /&gt;      raise NameError, "Uninitialized constant: Unicode::#{name}"&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;puts copyright = Unicode::U00A9&lt;br /&gt;puts euro = Unicode::U20AC&lt;br /&gt;puts euro = Unicode::U20AC&lt;br /&gt;puts infinity = Unicode::U221E&lt;br /&gt;puts Unicode.const_get(:U221E)&lt;br /&gt;p Unicode.constants&lt;br /&gt;puts Unicode.constants&lt;br /&gt;Unicode.constants.each { |u| puts Unicode.const_get(u) }&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;</description>
      <pubDate>Sat, 15 Sep 2007 12:25:16 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4546</guid>
      <author>ntk ()</author>
    </item>
    <item>
      <title>UTF8-aware string methods in Ruby</title>
      <link>http://snippets.dzone.com/posts/show/4527</link>
      <description>Author:  ntk&lt;br /&gt;License:    &lt;a href="http://www.opensource.org/licenses/mit-license.php"&gt;The MIT License&lt;/a&gt;, Copyright (c) 2007 ntk&lt;br /&gt;Description:  some basic UTF8-aware string methods for Ruby's String class (Ruby 1.8.6)&lt;br /&gt;Requirements: save this snippet to a UTF-8 encoded file and set the character set encoding of Terminal.app &lt;br /&gt;              to UTF-8 (on Mac OS X: Terminal menu -&gt; Window Settings -&gt; Display -&gt; Character Set Encoding; to enable additional features see &lt;a href="http://smyck.de/2007/06/06/great-stuff-being-able-to-type-utf-8-characters-in-a-terminal-on-os-x/"&gt;here&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Further tools:&lt;br /&gt;- &lt;a href="http://www.yoshidam.net/Ruby.html"&gt;rbuconv&lt;/a&gt;, a pure Ruby library for Unicode translation&lt;br /&gt;- &lt;a href="http://www.yoshidam.net/unicode.txt"&gt;unicode&lt;/a&gt;, a library for Unicode Normalization (sudo gem install unicode); for a Windows version see &lt;a href="http://www.ruby.org.ee/wiki/Unicode_in_Ruby/Rails"&gt;Unicode in Ruby on Rails&lt;/a&gt;&lt;br /&gt;- &lt;a href="http://icu4r.rubyforge.org"&gt;ICU4R&lt;/a&gt;, a Ruby C-extension binding for the &lt;a href="http://www.icu-project.org"&gt;ICU&lt;/a&gt; library&lt;br /&gt;- &lt;a href="http://billposer.org/Software/msort.html"&gt;Msort&lt;/a&gt;, a command-line sorting program&lt;br /&gt;- &lt;a href="http://raa.ruby-lang.org/project/punycode4r/"&gt;punycode4r&lt;/a&gt;, a pure Ruby implementation of Punycode (RFC 3492; sudo gem install punycode4r)&lt;br /&gt;- &lt;a href="http://www.flexiguided.de/publications.utf8proc.en.html"&gt;utf8proc&lt;/a&gt;, a library for processing UTF-8 encoded Unicode strings (sudo gem install utf8proc)&lt;br /&gt;- &lt;a href="http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt"&gt;Oniguruma&lt;/a&gt;, Ruby's regular expression engine; cf. 
&lt;a href="http://www.igvita.com/blog/2007/04/11/secure-utf-8-input-in-rails/"&gt;Secure UTF-8 Input in Rails&lt;/a&gt; and &lt;a href="http://woss.name/2006/10/25/migrating-your-rails-application-to-unicode/"&gt;Migrating your Rails application to Unicode&lt;/a&gt;&lt;br /&gt;- &lt;a href="http://rubyforge.org/projects/char-encodings/"&gt;character-encodings&lt;/a&gt;, seamless integration of character encodings into Ruby's String class, (sudo gem install character-encodings)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;class String&lt;br /&gt;&lt;br /&gt;   require 'iconv' &lt;br /&gt;   require 'open-uri'      # cf. http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html&lt;br /&gt;&lt;br /&gt;   # taken from: http://www.w3.org/International/questions/qa-forms-utf-8&lt;br /&gt;   UTF8REGEX = /\A(?:                               # ?: non-capturing group (grouping with no back references)&lt;br /&gt;                 [\x09\x0A\x0D\x20-\x7E]            # ASCII&lt;br /&gt;               | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte&lt;br /&gt;               |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs&lt;br /&gt;               | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte&lt;br /&gt;               |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates&lt;br /&gt;               |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3&lt;br /&gt;               | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15&lt;br /&gt;               |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16&lt;br /&gt;               )*\z/mnx&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;#  create UTF-8 character arrays (as class instance variables)&lt;br /&gt;#&lt;br /&gt;#  mapping tables: - http://www.unicode.org/Public/UCA/latest/allkeys.txt&lt;br /&gt;#                  - http://unicode.org/Public/UNIDATA/UnicodeData.txt &lt;br /&gt;#                  - http://unicode.org/Public/UNIDATA/CaseFolding.txt
&lt;br /&gt;#                  - http://www.decodeunicode.org &lt;br /&gt;#                  - ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2&lt;br /&gt;#                  - http://camomile.sourceforge.net&lt;br /&gt;#                  - Character Palette (Mac OS X)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # test data&lt;br /&gt;   @small_letters_utf8 = ["U+00F1", "U+00F4", "U+00E6", "U+00F8", "U+00E0", "U+00E1", "U+00E2", "U+00E4", "U+00E5", "U+00E7", "U+00E8", "U+00E9", "U+00EA", "U+00EB", "U+0153"].map { |x| u = [x[2..-1].hex].pack("U*"); u =~ UTF8REGEX ? u : nil }&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   @capital_letters_utf8 = ["U+00D1", "U+00D4", "U+00C6", "U+00D8", "U+00C0", "U+00C1", "U+00C2", "U+00C4", "U+00C5", "U+00C7", "U+00C8", "U+00C9", "U+00CA", "U+00CB", "U+0152"].map { |x| u = [x[2..-1].hex].pack("U*"); u =~ UTF8REGEX ? u : nil }&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   @other_letters_utf8 = ["U+03A3", "U+0639", "U+0041", "U+F8D0", "U+F8FF", "U+4E2D", "U+F4EE", "U+00FE", "U+10FFFF", "U+00A9", "U+20AC", "U+221E", "U+20AC", "U+FEFF", "U+FFFD", "U+00FF", "U+00FE", "U+FFFE", "U+FEFF"].map { |x| u = [x[2..-1].hex].pack("U*"); u =~ UTF8REGEX ? u : nil }&lt;br /&gt;&lt;br /&gt;   if @small_letters_utf8.size != @small_letters_utf8.nitems then raise "Invalid UTF-8 char in @small_letters_utf8!" end&lt;br /&gt;   if @capital_letters_utf8.size != @capital_letters_utf8.nitems then raise "Invalid UTF-8 char in @capital_letters_utf8!" end&lt;br /&gt;   if @other_letters_utf8.size != @other_letters_utf8.nitems then raise "Invalid UTF-8 char in @other_letters_utf8!" 
end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   @unicode_array = []&lt;br /&gt;   #open('http://unicode.org/Public/UNIDATA/UnicodeData.txt') do |f| f.each(nil) { |line| line.scan(/^[^;]+/) { |u| @unicode_array &lt;&lt; u } }  end&lt;br /&gt;   #open('http://unicode.org/Public/UNIDATA/UnicodeData.txt') do |f|                                                                               &lt;br /&gt;   #   f.each do |line| line =~ /LATIN|GREEK|CYRILLIC/  ?  ( line.scan(/^[^;]+/) { |u| @unicode_array &lt;&lt; u } )  :  next  end&lt;br /&gt;   #end&lt;br /&gt;&lt;br /&gt;   #@letters_utf8 = @unicode_array.map { |x| u = [x.hex].pack("U*"); u =~ UTF8REGEX ? u : nil }.compact   # code points from UnicodeData.txt&lt;br /&gt;   @letters_utf8 = @small_letters_utf8 + @capital_letters_utf8 + @other_letters_utf8                      # test data only&lt;br /&gt;&lt;br /&gt;   # Hash[*array_with_keys.zip(array_with_values).flatten]&lt;br /&gt;   @downcase_table_utf8 = Hash[*@capital_letters_utf8.zip(@small_letters_utf8).flatten]&lt;br /&gt;   @upcase_table_utf8 = Hash[*@small_letters_utf8.zip(@capital_letters_utf8).flatten]&lt;br /&gt;   @letters_utf8_hash = Hash[*@letters_utf8.zip([]).flatten]    #=&gt; ... "\341\272\242"=&gt;nil ...
&lt;br /&gt;&lt;br /&gt;   class &lt;&lt; self &lt;br /&gt;      attr_accessor :small_letters_utf8&lt;br /&gt;      attr_accessor :capital_letters_utf8&lt;br /&gt;      attr_accessor :other_letters_utf8&lt;br /&gt;      attr_accessor :letters_utf8&lt;br /&gt;      attr_accessor :letters_utf8_hash&lt;br /&gt;      attr_accessor :unicode_array&lt;br /&gt;      attr_accessor :downcase_table_utf8&lt;br /&gt;      attr_accessor :upcase_table_utf8&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def each_utf8_char&lt;br /&gt;      scan(/./mu) { |c| yield c }&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def each_utf8_char_with_index&lt;br /&gt;      i = -1&lt;br /&gt;      scan(/./mu) { |c| i+=1; yield(c, i) }&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def length_utf8&lt;br /&gt;      #scan(/./mu).size&lt;br /&gt;      count = 0&lt;br /&gt;      scan(/./mu) { count += 1 }&lt;br /&gt;      count&lt;br /&gt;   end&lt;br /&gt;   alias :size_utf8 :length_utf8&lt;br /&gt;&lt;br /&gt;   def reverse_utf8&lt;br /&gt;      split(//mu).reverse.join&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def reverse_utf8!&lt;br /&gt;      split(//mu).reverse!.join&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def swapcase_utf8&lt;br /&gt;     gsub(/./mu) do |char|  &lt;br /&gt;         if !String.downcase_table_utf8[char].nil? then String.downcase_table_utf8[char]&lt;br /&gt;         elsif !String.upcase_table_utf8[char].nil? then String.upcase_table_utf8[char]&lt;br /&gt;         else char.swapcase&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def swapcase_utf8!&lt;br /&gt;      gsub!(/./mu) do |char|  &lt;br /&gt;         if !String.downcase_table_utf8[char].nil? then String.downcase_table_utf8[char]&lt;br /&gt;         elsif !String.upcase_table_utf8[char].nil? 
then String.upcase_table_utf8[char]&lt;br /&gt;         else ret = char.swapcase end&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def downcase_utf8&lt;br /&gt;      gsub(/./mu) do |char|  &lt;br /&gt;         small_char = String.downcase_table_utf8[char]&lt;br /&gt;         small_char.nil? ? char.downcase : small_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def downcase_utf8!&lt;br /&gt;      gsub!(/./mu) do |char|  &lt;br /&gt;         small_char = String.downcase_table_utf8[char]&lt;br /&gt;         small_char.nil? ? char.downcase : small_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def upcase_utf8&lt;br /&gt;      gsub(/./mu) do |char|  &lt;br /&gt;         capital_char = String.upcase_table_utf8[char]&lt;br /&gt;         capital_char.nil? ? char.upcase : capital_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def upcase_utf8!&lt;br /&gt;      gsub!(/./mu) do |char|  &lt;br /&gt;         capital_char = String.upcase_table_utf8[char]&lt;br /&gt;         capital_char.nil? ? char.upcase : capital_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def count_utf8(c)&lt;br /&gt;      return nil if c.empty?&lt;br /&gt;      r = %r{[#{c}]}mu&lt;br /&gt;      scan(r).size&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def delete_utf8(c)&lt;br /&gt;      return self if c.empty?&lt;br /&gt;      r = %r{[#{c}]}mu&lt;br /&gt;      gsub(r, '')&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def delete_utf8!(c)&lt;br /&gt;      return self if c.empty?
&lt;br /&gt;      r = %r{[#{c}]}mu&lt;br /&gt;      gsub!(r, '')&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def first_utf8&lt;br /&gt;      self[/\A./mu]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def last_utf8&lt;br /&gt;      self[/.\z/mu]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def capitalize_utf8&lt;br /&gt;     return self if self =~ /\A[[:space:]]*\z/m&lt;br /&gt;     ret = ""&lt;br /&gt;     split(/\x20/).each do |w| &lt;br /&gt;         count = 0&lt;br /&gt;         w.gsub(/./mu) do |char|  &lt;br /&gt;            count += 1&lt;br /&gt;            capital_char = String.upcase_table_utf8[char]&lt;br /&gt;            if count == 1 then &lt;br /&gt;               capital_char.nil? ? char.upcase : char.upcase_utf8&lt;br /&gt;            else&lt;br /&gt;               capital_char.nil? ? char.downcase : char.downcase_utf8&lt;br /&gt;            end&lt;br /&gt;         end&lt;br /&gt;         ret &lt;&lt; w + ' '&lt;br /&gt;     end&lt;br /&gt;     ret =~ /\x20\z/ ? ret.sub!(/\x20\z/, '') : ret  &lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def capitalize_utf8!&lt;br /&gt;     return self if self =~ /\A[[:space:]]*\z/m &lt;br /&gt;     ret = ""&lt;br /&gt;     split(/\x20/).each do |w| &lt;br /&gt;         count = 0&lt;br /&gt;         w.gsub!(/./mu) do |char|  &lt;br /&gt;            count += 1&lt;br /&gt;            capital_char = String.upcase_table_utf8[char]&lt;br /&gt;            if count == 1 then &lt;br /&gt;               capital_char.nil? ? char.upcase : char.upcase_utf8&lt;br /&gt;            else&lt;br /&gt;               capital_char.nil? ? char.downcase : char.downcase_utf8&lt;br /&gt;            end&lt;br /&gt;         end&lt;br /&gt;         ret &lt;&lt; w + ' '&lt;br /&gt;     end&lt;br /&gt;     ret =~ /\x20\z/ ? ret.sub!(/\x20\z/, '') : ret&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def index_utf8(s)&lt;br /&gt;&lt;br /&gt;      return nil unless !self.empty? 
&amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;      #raise(ArgumentError, "Wrong argument for method index_utf8!", caller) unless !self.empty? &amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;&lt;br /&gt;      if s.class == Regexp&lt;br /&gt;         opts = s.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;         if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;         str = s.source&lt;br /&gt;         return nil if str.empty?&lt;br /&gt;         str = "%r{#{str}}" + opts&lt;br /&gt;         r = eval(str)&lt;br /&gt;         l = nil   # stays nil when there is no match, so a match at index 0 (empty $`) is still reported as 0&lt;br /&gt;         sub(r) { l = $`; " " }  # $`: The string to the left of the last successful match (cf. http://www.zenspider.com/Languages/Ruby/QuickRef.html)&lt;br /&gt;         l.nil? ? nil : l.length_utf8&lt;br /&gt;&lt;br /&gt;      else&lt;br /&gt;&lt;br /&gt;         return nil if s.empty?&lt;br /&gt;         r = %r{#{Regexp.escape(s)}}mu   # escape regex metacharacters so the string is matched literally&lt;br /&gt;         l = nil&lt;br /&gt;         sub(r) { l = $`; " " }&lt;br /&gt;         l.nil? ? nil : l.length_utf8&lt;br /&gt;&lt;br /&gt;# this would be a non-regex solution&lt;br /&gt;=begin &lt;br /&gt;         return nil if s.empty?&lt;br /&gt;         return nil unless self =~ %r{#{s}}mu&lt;br /&gt;         indices = []&lt;br /&gt;         s.split(//mu).each do |x|&lt;br /&gt;            ar = []&lt;br /&gt;            self.each_utf8_char_with_index { |c,i| if c == x then ar &lt;&lt; i end  }   # first get all matching indices c == x&lt;br /&gt;            indices &lt;&lt; ar unless ar.empty?&lt;br /&gt;         end&lt;br /&gt;         if indices.empty?
&lt;br /&gt;            return nil&lt;br /&gt;         elsif indices.size == 1 &lt;br /&gt;            indices.first.first&lt;br /&gt;         else &lt;br /&gt;            #p indices&lt;br /&gt;            ret = []&lt;br /&gt;            a0 = indices.shift&lt;br /&gt;            a0.each do |i|&lt;br /&gt;               ret &lt;&lt; i&lt;br /&gt;               indices.each { |a| if a.include?(i+1) then i += 1; ret &lt;&lt; i else ret = []; break end  }&lt;br /&gt;               return ret.first unless ret.empty?&lt;br /&gt;            end&lt;br /&gt;            ret.empty? ? nil : ret.first&lt;br /&gt;         end&lt;br /&gt;=end&lt;br /&gt;&lt;br /&gt;      end&lt;br /&gt;   end   &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def rindex_utf8(s)&lt;br /&gt;&lt;br /&gt;      return nil unless !self.empty? &amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;      #raise(ArgumentError, "Wrong argument for method index_utf8!", caller) unless !self.empty? &amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;&lt;br /&gt;      if s.class == Regexp&lt;br /&gt;         opts = s.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;         if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;         str = s.source&lt;br /&gt;         return nil if str.empty?&lt;br /&gt;         str = "%r{#{str}}" + opts&lt;br /&gt;         r = eval(str)&lt;br /&gt;         l = ""&lt;br /&gt;         scan(r) { l = $` }  &lt;br /&gt;         #gsub(r) { l = $`; " " }  &lt;br /&gt;         l.empty? ? nil : l.length_utf8&lt;br /&gt;      else&lt;br /&gt;         return nil if s.empty?&lt;br /&gt;         r = %r{#{s}}mu&lt;br /&gt;         l = ""&lt;br /&gt;         scan(r) { l = $` }  &lt;br /&gt;         #gsub(r) { l = $`; " " }&lt;br /&gt;         l.empty? ? 
nil : l.length_utf8&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;   end   &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # note that the i option does not work in special cases with back references&lt;br /&gt;   # example: "&#224;&#192;".slice_utf8(/(.).*?\1/i) returns nil whereas "aA".slice(/(.).*?\1/i) returns "aA"&lt;br /&gt;   def slice_utf8(regex)   &lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      slice(r)&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def slice_utf8!(regex)   &lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      slice!(r)&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def cut_utf8(p,l)    # (index) position, length&lt;br /&gt;      raise(ArgumentError, "Error: argument is not Fixnum", caller) if p.class != Fixnum or l.class != Fixnum&lt;br /&gt;      s = self.length_utf8&lt;br /&gt;      #if p &lt; 0 then p = s - p.abs end&lt;br /&gt;      if p &lt; 0 then p.abs &gt; s ? (p = 0) : (p = s - p.abs) end      #  or:  ... p.abs &gt; s ? (return nil) : ...&lt;br /&gt;      return nil if l &gt; s or p &gt; (s - 1)&lt;br /&gt;      ret = ""&lt;br /&gt;      count = 0&lt;br /&gt;      each_utf8_char_with_index do |c,i| &lt;br /&gt;         break if count &gt;= l&lt;br /&gt;         if i &gt;= p &amp;&amp; count &lt; l then count += 1; ret &lt;&lt; c; end&lt;br /&gt;      end&lt;br /&gt;      ret&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def starts_with_utf8?(s)&lt;br /&gt;      return nil if self.empty? or s.empty?
&lt;br /&gt;      cut_utf8(0, s.size_utf8) == s &lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def ends_with_utf8?(s)&lt;br /&gt;      return nil if self.empty? or s.empty?&lt;br /&gt;      cut_utf8(-(s.size_utf8), s.size_utf8) == s&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def insert_utf8(i,s)                                  # insert_utf8(index, string)&lt;br /&gt;      return self if s.empty?&lt;br /&gt;      l = self.length_utf8&lt;br /&gt;      if l == 0 then return s end&lt;br /&gt;      if i &lt; 0 then i.abs &gt; l ? (i = 0) : (i = l - i.abs) end          #  or:  ... i.abs &gt; l ? (return nil) : ...&lt;br /&gt;      #return nil if i &gt; (l - 1)                         # return nil ...&lt;br /&gt;      spaces = ""&lt;br /&gt;      if i &gt; (l-1) then spaces = " " * (i - (l-1)) end   # ... or add spaces&lt;br /&gt;      str = self + spaces                                # + builds a new string; &lt;&lt; would append the padding to the receiver itself&lt;br /&gt;      s1 = str.cut_utf8(0, i)&lt;br /&gt;      s2 = str.cut_utf8(i, l - s1.length_utf8)&lt;br /&gt;      s1 &lt;&lt; s &lt;&lt; s2&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def split_utf8(regex)&lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      split(r)&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def scan_utf8(regex)&lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      if block_given? 
then scan(r) { |a,*m| yield(a,*m) } else scan(r) end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def range_utf8(r)&lt;br /&gt;&lt;br /&gt;      return nil if r.class != Range&lt;br /&gt;      #raise(ArgumentError, "No Range object given!", caller) if r.class != Range&lt;br /&gt;&lt;br /&gt;      a = r.to_s[/^[\+\-]?\d+/].to_i&lt;br /&gt;      b = r.to_s[/[\+\-]?\d+$/].to_i&lt;br /&gt;      d = r.to_s[/\.+/]&lt;br /&gt;&lt;br /&gt;      if d.size == 2 then d = 2 else d = d.size end &lt;br /&gt;&lt;br /&gt;      l = self.length_utf8&lt;br /&gt;&lt;br /&gt;      return nil if b.abs &gt; l || a.abs &gt; l || d &lt; 2 || d &gt; 3&lt;br /&gt;&lt;br /&gt;      if a &lt; 0 then a = l - a.abs end&lt;br /&gt;      if b &lt; 0 then b = l - b.abs end&lt;br /&gt;      &lt;br /&gt;      return nil if a &gt; b&lt;br /&gt;&lt;br /&gt;      str = ""&lt;br /&gt;&lt;br /&gt;      each_utf8_char_with_index do |c,i|&lt;br /&gt;         break if i &gt; b&lt;br /&gt;         if d == 2&lt;br /&gt;            (i &gt;= a &amp;&amp; i &lt;= b) ? str &lt;&lt; c : next&lt;br /&gt;         else&lt;br /&gt;            (i &gt;= a &amp;&amp; i &lt; b) ? str &lt;&lt; c : next&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;      str&lt;br /&gt;&lt;br /&gt;   end&lt;br /&gt; &lt;br /&gt;   def utf8?&lt;br /&gt;     self =~ UTF8REGEX&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def clean_utf8&lt;br /&gt;       t = ""&lt;br /&gt;       self.scan(/./um) { |c| t &lt;&lt; c if c =~ UTF8REGEX }&lt;br /&gt;       t&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_encoded_file?   # check (or rather guess) if (HTML) file encoding is UTF-8 (experimental, so use at your own risk!)
&lt;br /&gt;&lt;br /&gt;      file = self&lt;br /&gt;      str = ""&lt;br /&gt;&lt;br /&gt;      if file =~ /^http:\/\//&lt;br /&gt;&lt;br /&gt;         url = file&lt;br /&gt;&lt;br /&gt;         if RUBY_PLATFORM =~ /darwin/i   # Mac OS X 10.4.10&lt;br /&gt;          &lt;br /&gt;            seconds = 30  &lt;br /&gt;&lt;br /&gt;            # check if web site is reachable&lt;br /&gt;            # on Windows try to use curb, http://curb.rubyforge.org (sudo gem install curb)&lt;br /&gt;            var = %x{ /usr/bin/curl -I -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url}; /bin/echo -n $? }.to_i&lt;br /&gt;&lt;br /&gt;            #return false unless var == 0&lt;br /&gt;            raise "Failed to create connection to web site: #{url}  --  curl error code: #{var}  --  " unless var == 0&lt;br /&gt;&lt;br /&gt;            str = %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} | \&lt;br /&gt;                      /usr/bin/grep -Eo -m 1 \"(charset|encoding)=[\\"']?[^\\"'&gt;]+\" | /usr/bin/grep -Eo \"[^=\\"'&gt;]+$\" }&lt;br /&gt;            p str&lt;br /&gt;            return true if str =~ /utf-?8/i&lt;br /&gt;            return false if !str.empty? &amp;&amp; str !~ /utf-?8/i&lt;br /&gt;&lt;br /&gt;            # solutions with downloaded file&lt;br /&gt;&lt;br /&gt;            # download HTML file&lt;br /&gt;            #downloaded_file = "/tmp/html"&lt;br /&gt;            downloaded_file = "~/Desktop/html"&lt;br /&gt;            downloaded_file = File.expand_path(downloaded_file)&lt;br /&gt;            %x{ /usr/bin/touch #{downloaded_file} 2&gt;/dev/null }&lt;br /&gt;            raise "No valid HTML download file (path) specified!" 
unless File.file?(downloaded_file)&lt;br /&gt;            %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} -o #{downloaded_file} #{url} }&lt;br /&gt;            &lt;br /&gt;            simple_test = %x{ /usr/bin/file -ik #{downloaded_file} }    #  cf. man file&lt;br /&gt;            p simple_test &lt;br /&gt;&lt;br /&gt;            # read entire file into a string&lt;br /&gt;            File.open(downloaded_file).read.each(nil) do |str| &lt;br /&gt;               #return true if str =~ /(charset|encoding) *= *["']? *utf-?8/i&lt;br /&gt;               str.utf8? ? (return true) : (return false) &lt;br /&gt;            end &lt;br /&gt;&lt;br /&gt;            #check each line of the downloaded file&lt;br /&gt;            #count_lines = 0&lt;br /&gt;            #count_utf8 = 0&lt;br /&gt;            #File.foreach(downloaded_file) { |line| return true if line =~ /(charset|encoding) *= *["']? *utf-?8/i; count_lines += 1;  count_utf8 += 1 if line.clean_utf8.utf8?; break if count_lines != count_utf8 }&lt;br /&gt;            #count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;            &lt;br /&gt;&lt;br /&gt;            # in-memory solutions&lt;br /&gt;&lt;br /&gt;            #html_file_cleaned_utf8 = %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} }.clean_utf8&lt;br /&gt;            #p html_file_cleaned_utf8.utf8?&lt;br /&gt;&lt;br /&gt;            count_lines = 0&lt;br /&gt;            count_utf8 = 0&lt;br /&gt;            #%x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} }.each(nil) do |line|    # read entire file into string&lt;br /&gt;            %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} }.each("\n") do |line|    # double quotes needed: '\n' is a literal backslash followed by n, not a newline&lt;br /&gt;               #return true if line =~ /(charset|encoding) *= *["']? 
*utf-?8/i&lt;br /&gt;               count_lines += 1 &lt;br /&gt;               count_utf8 += 1 if line.utf8?&lt;br /&gt;               break if count_lines != count_utf8&lt;br /&gt;            end&lt;br /&gt;            count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;&lt;br /&gt;         else&lt;br /&gt;&lt;br /&gt;            # check each line of the HTML file (or the entire HTML file at once)&lt;br /&gt;            # cf. http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html&lt;br /&gt;            count_lines = 0&lt;br /&gt;            count_utf8 = 0&lt;br /&gt;            open(url) do |f|   &lt;br /&gt;               # p f.meta, f.content_encoding, f.content_type&lt;br /&gt;               cs = f.charset&lt;br /&gt;               return true if cs =~ /utf-?8/i&lt;br /&gt;               #f.each(nil) do |str| str.utf8? ? (return true) : (return false) end  # read entire file into string&lt;br /&gt;               f.each_line do |line| &lt;br /&gt;                  count_lines += 1 &lt;br /&gt;                  count_utf8 += 1 if line.utf8?&lt;br /&gt;                  break unless count_lines == count_utf8&lt;br /&gt;               end&lt;br /&gt;            end&lt;br /&gt;            count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;&lt;br /&gt;         end&lt;br /&gt;&lt;br /&gt;      else&lt;br /&gt;&lt;br /&gt;         return false unless File.file?(file)&lt;br /&gt;&lt;br /&gt;         if RUBY_PLATFORM =~ /darwin/i then str = %x{ /usr/bin/file -ik #{file} }; return true if str =~ /utf-?8/i end&lt;br /&gt;&lt;br /&gt;         # read entire file into a string&lt;br /&gt;         #File.open(file).read.each(nil) do |str| return true if str =~ /(charset|encoding) *= *["']? *utf-?8/i; str.utf8? ? 
(return true) : (return false) end &lt;br /&gt;&lt;br /&gt;         # check each line of the file&lt;br /&gt;         count_lines = 0&lt;br /&gt;         count_utf8 = 0&lt;br /&gt;         File.foreach(file) do |line| &lt;br /&gt;            return true if line =~ /(charset|encoding) *= *["']? *utf-?8/i&lt;br /&gt;            count_lines += 1;  &lt;br /&gt;            count_utf8 += 1 if line.utf8?; &lt;br /&gt;            break if count_lines != count_utf8 &lt;br /&gt;         end&lt;br /&gt;&lt;br /&gt;         count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;         &lt;br /&gt;      end   &lt;br /&gt;&lt;br /&gt;      str =~ /utf-?8/i ? true : false&lt;br /&gt;&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # cf. Paul Battley, http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/&lt;br /&gt;   def validate_utf8&lt;br /&gt;      Iconv.iconv('UTF-8//IGNORE', 'UTF-8', (self + ' ') ).first[0..-2]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   # cf. Paul Battley, http://www.ruby-forum.com/topic/70357&lt;br /&gt;   def asciify_utf8&lt;br /&gt;       return nil unless self.utf8?&lt;br /&gt;       #Iconv.iconv('US-ASCII//IGNORE//TRANSLIT', 'UTF-8', (self + ' ') ).first[0..-2]&lt;br /&gt;       # delete all punctuation characters inside words except "-" in words such as up-to-date&lt;br /&gt;       Iconv.iconv('US-ASCII//IGNORE//TRANSLIT', 'UTF-8', (self + ' ') ).first[0..-2].gsub(/(?!-.*)\b[[:punct:]]+\b/, '')&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def latin1_to_utf8     # ISO-8859-1 to UTF-8&lt;br /&gt;      ret = Iconv.iconv("UTF-8//IGNORE", "ISO-8859-1", (self + "\x20") ).first[0..-2]&lt;br /&gt;      ret.utf8? ? ret : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def cp1252_to_utf8     # CP1252 (WINDOWS-1252) to UTF-8&lt;br /&gt;      ret = Iconv.iconv("UTF-8//IGNORE", "CP1252", (self + "\x20") ).first[0..-2]&lt;br /&gt;      ret.utf8? ? 
ret : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   # cf. Paul Battley, http://www.ruby-forum.com/topic/70357 &lt;br /&gt;   def utf16le_to_utf8&lt;br /&gt;       ret = Iconv.iconv('UTF-8//IGNORE', 'UTF-16LE', (self[0,(self.length/2*2)] + "\000\000") ).first[0..-2]&lt;br /&gt;       ret =~ /\x00\z/ ?  ret.sub!(/\x00\z/, '') : ret&lt;br /&gt;       ret.utf8? ? ret : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def utf8_to_utf16le&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      ret = Iconv.iconv('UTF-16LE//IGNORE', 'UTF-8', self ).first&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def utf8_to_unicode&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      scan(/./mu) { |c| str &lt;&lt; "U+" &lt;&lt; sprintf("%04X", c.unpack("U*").first) }&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def unicode_to_utf8&lt;br /&gt;      return self if self =~ /\A[[:space:]]*\z/m&lt;br /&gt;      str = ""&lt;br /&gt;      #scan(/U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})/) { |u| str &lt;&lt; [u.first.hex].pack("U*") }&lt;br /&gt;      #scan(/U\+([[:digit:][:xdigit:]]{4,5}|10[[:digit:][:xdigit:]]{4})/) { |u| str &lt;&lt; [u.first.hex].pack("U*") }&lt;br /&gt;      scan(/(U\+(?:[[:digit:][:xdigit:]]{4,5}|10[[:digit:][:xdigit:]]{4})|.)/mu) do        # for mixed strings such as "U+00bfHabla espaU+00f1ol?"&lt;br /&gt;         c = $1&lt;br /&gt;         if c =~ /^U\+/&lt;br /&gt;            str &lt;&lt; [c[2..-1].hex].pack("U*")&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c&lt;br /&gt;         end       &lt;br /&gt;      end&lt;br /&gt;      str.utf8? ? str : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # dec, hex, oct conversions (experimental!)&lt;br /&gt;&lt;br /&gt;   def utf8_to_dec&lt;br /&gt;      return nil unless self.utf8?
&lt;br /&gt;      str = ""&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00"  # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; sprintf("%04X", c.unpack("U*").first).hex.to_s &lt;&lt; "\x00"   # convert to decimal&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-2]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def dec_to_utf8   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?&lt;br /&gt;      return nil unless self =~ /\A[[:digit:]]+\x00/ &amp;&amp; self =~ /\A[a[:digit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00/).each do |c|&lt;br /&gt;         if c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; [c.to_i].pack("U*")&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_to_dec_2&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      tmpstr = ""&lt;br /&gt;      null_str = "\x00"&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00\x00"  # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            tmpstr = ""&lt;br /&gt;            c.each_byte { |x| tmpstr &lt;&lt; x.to_s &lt;&lt; null_str }      # convert to decimal&lt;br /&gt;            str &lt;&lt; tmpstr &lt;&lt; null_str&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-3]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def dec_to_utf8_2   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?
&lt;br /&gt;      return nil unless self =~ /\A[[:digit:]]+\x00/ &amp;&amp; self =~ /[[:digit:]]+\x00\x00/ &amp;&amp; self =~ /\A[a[:digit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00\x00/).each do |c|&lt;br /&gt;         if c =~ /\x00/&lt;br /&gt;            c.split(/\x00/).each { |x| str &lt;&lt; x.to_i.chr }&lt;br /&gt;         elsif c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c.to_i.chr&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_to_hex&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      tmpstr = ""&lt;br /&gt;      null_str = "\x00"&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00\x00"    # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            tmpstr = ""&lt;br /&gt;            c.each_byte { |x| tmpstr &lt;&lt; sprintf("%X", x) &lt;&lt; null_str }      # convert to hexadecimal&lt;br /&gt;            str &lt;&lt; tmpstr &lt;&lt; null_str&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-3]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def hex_to_utf8   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?
&lt;br /&gt;      return nil unless self =~ /\A[[:xdigit:]]+\x00/ &amp;&amp; self =~ /[[:xdigit:]]+\x00\x00/ &amp;&amp; self =~ /\A[a[:xdigit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00\x00/).each do |c|&lt;br /&gt;         if c =~ /\x00/&lt;br /&gt;            c.split(/\x00/).each { |x| str &lt;&lt; x.hex.chr }&lt;br /&gt;         elsif c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c.hex.chr&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_to_oct&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      tmpstr = ""&lt;br /&gt;      null_str = "\x00"&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00\x00"   # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            tmpstr = ""&lt;br /&gt;            c.each_byte { |x| tmpstr &lt;&lt; sprintf("%o", x) &lt;&lt; null_str }      # convert to octal&lt;br /&gt;            str &lt;&lt; tmpstr &lt;&lt; null_str&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-3]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def oct_to_utf8   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?
&lt;br /&gt;      return nil unless self =~ /\A[[:digit:]]+\x00/ &amp;&amp; self =~ /[[:digit:]]+\x00\x00/ &amp;&amp; self =~ /\A[a[:digit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00\x00/).each do |c|&lt;br /&gt;         if c =~ /\x00/&lt;br /&gt;            c.split(/\x00/).each { |x| str &lt;&lt; x.oct.chr }&lt;br /&gt;         elsif c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c.oct.chr&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   # cf. http://node-0.mneisen.org/2007/03/13/email-subjects-in-utf-8-mit-ruby-kodieren/&lt;br /&gt;   def email_subject_utf8&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      "=?utf-8?b?#{[self].pack("m").delete("\n")}?="&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts String.downcase_table_utf8.to_s&lt;br /&gt;&lt;br /&gt;#puts String.letters_utf8.to_s&lt;br /&gt;#String.letters_utf8.each { |c| puts "#{c.inspect} ::  #{c}" }&lt;br /&gt;&lt;br /&gt;str = "&#338;uvres Compl&#232;tes"&lt;br /&gt;str = "&#338;uvres \000Compl&#232;tes"&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;str = str.validate_utf8; p str&lt;br /&gt;str = str.clean_utf8; p str&lt;br /&gt;str.utf8?  ? 
"#{str}: UTF-8 string seems OK!\n".display : "#{str}: No valid UTF-8 string!\n".display&lt;br /&gt;puts str.asciify_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;str_in_utf8 = "\303\251"&lt;br /&gt;print "UTF-16:   "; p Iconv.iconv('UTF-16', 'UTF-8', str_in_utf8 ).first&lt;br /&gt;print "UTF-16BE: "; p Iconv.iconv('UTF-16BE', 'UTF-8', str_in_utf8 ).first&lt;br /&gt;print "UTF-16LE: "; p str_in_utf8.utf8_to_utf16le&lt;br /&gt;str_in_utf16le = "c\000a\000f\000\351\000"&lt;br /&gt;puts str_in_utf16le.utf16le_to_utf8&lt;br /&gt;puts str_in_utf16le.utf16le_to_utf8.asciify_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts str.upcase_utf8&lt;br /&gt;puts str.downcase_utf8&lt;br /&gt;puts str.capitalize_utf8&lt;br /&gt;puts str.capitalize_utf8!&lt;br /&gt;puts str.swapcase_utf8&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".swapcase_utf8&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".swapcase_utf8!&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts str.slice_utf8(/../i)&lt;br /&gt;puts str.slice_utf8(/(.).*?\1/i)&lt;br /&gt;puts "&#224;&#192;".slice_utf8(/(.).*?\1/i)   # =&gt; nil despite the i option!
&lt;br /&gt;puts "aA".slice(/(.).*?\1/i)        # =&gt; aA&lt;br /&gt;puts "&#224;&#192; &#224;&#192;".slice_utf8!(/([&#224;&#192;]).*?\1/i)&lt;br /&gt;puts "&#224;&#192; &#224;&#192;".slice_utf8!(/(.).*?\1/ium)&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".slice_utf8!(/(.).*?\1/ium)&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;str.capitalize_utf8.each_utf8_char_with_index { |c,i| puts "#{i}: #{c}" }&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts str.range_utf8(0..2)&lt;br /&gt;puts str.range_utf8(0..-2)&lt;br /&gt;puts str.range_utf8(-4..-1)&lt;br /&gt;puts str.range_utf8(-3..-1)&lt;br /&gt;puts str.range_utf8(-3...-1)&lt;br /&gt;puts str.range_utf8([-3..-1])&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p str.scan_utf8(/./)&lt;br /&gt;"&#224;cA&#32459;f&#233;&#224;".scan_utf8(/./) { |c| puts c }&lt;br /&gt;"&#224;cA&#32459;f&#233;&#224;".scan_utf8(/(.)(.)?/) { |a,b| print a,b,"\n" }&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "&#224;cA&#32459;f&#233;&#224;".index_utf8('&#32459;')&lt;br /&gt;p "&#224;cA&#32459;f&#233;&#224;".index_utf8('&#32459;f')&lt;br /&gt;p "&#224;cA&#32459;f&#233;&#224;".index_utf8('z')&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!fz".index_utf8('9&#32459;!fz')&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz".index_utf8(/9&#32459;!fz/)&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".index_utf8(//)&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8('9&#32459;!fz')
&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8(/9&#32459;!fz/)&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8(/9..fz/)&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8(//)&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".utf8_to_utf16le.utf16le_to_utf8&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".utf8_to_utf16le.utf16le_to_utf8.asciify_utf8&lt;br /&gt;puts "&#224;&#192;".slice_utf8(/../i)&lt;br /&gt;puts "&#224;&#192;".slice_utf8!(/../i)&lt;br /&gt;&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".count_utf8('&#32459;')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".count_utf8('&#224;&#192;')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".count_utf8('z')&lt;br /&gt;puts "&#32459; &#224;&#192;/ ^&#32459; &#224;&#192;".count_utf8('/&#32459;^')&lt;br /&gt;puts "&#32459; &#224;&#192;/ ^&#32459; &#224;&#192;".count_utf8('^/&#32459;^')  # count all chars except those specified; note that the leading ^ will result in the regex: /[^\/&#32459;^]/u&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".delete_utf8('&#224;&#192; ')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192; &#32459; &#224;&#192; &#32459; &#224;&#192;".delete_utf8!('&#607;&#32459;&#224; &#230;&#165;')&lt;br /&gt;&lt;br /&gt;puts str.cut_utf8(0,5)&lt;br /&gt;puts str.cut_utf8(-5,5)
&lt;br /&gt;puts str.cut_utf8(-10,50)&lt;br /&gt;&lt;br /&gt;puts str.length_utf8&lt;br /&gt;puts str.size_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".first_utf8&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".last_utf8&lt;br /&gt;p "&#32459; &#224;&#192; &#32459; &#224;&#192;\n".last_utf8&lt;br /&gt;puts "".first_utf8&lt;br /&gt;&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".starts_with_utf8?('&#32459;')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".ends_with_utf8?('k')&lt;br /&gt;puts "".ends_with_utf8?('k')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".ends_with_utf8?('')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;".starts_with_utf8?('&#32459; &#224;&#192; &#32459; &#224;&#192;')&lt;br /&gt;&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;".insert_utf8(20, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(2, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(-2, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(-200, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(200, "abc")&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "Hello, world!".utf8_to_unicode&lt;br /&gt;p "&#32459;&#224;&#192;&#32459;&#224;".utf8_to_unicode&lt;br /&gt;p "&#32459;&#224;&#192;&#32459;&#224;&#66374;".utf8_to_unicode&lt;br /&gt;&lt;br /&gt;puts "Hello, world!".utf8_to_unicode.unicode_to_utf8&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;&#66374;".utf8_to_unicode.unicode_to_utf8&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;&#66374;".size_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;encoded_file = "/ISO-8859-Latin-1.txt"&lt;br /&gt;encoded_file = "/cp1252.txt"&lt;br /&gt;&lt;br /&gt;File.open(encoded_file).read.each(nil) do |str| &lt;br /&gt;   p str&lt;br /&gt;   #str = str.latin1_to_utf8&lt;br /&gt;   str = str.cp1252_to_utf8
&lt;br /&gt;   p str&lt;br /&gt;   puts str&lt;br /&gt;   str.utf8? ? (puts "UTF-8 conversion - YES") : (puts "UTF-8 conversion - NO") &lt;br /&gt;end &lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "U+00bfHabla espaU+00f1ol?".unicode_to_utf8&lt;br /&gt;&lt;br /&gt;# cf. http://www.decodeunicode.org/en/miscellaneous_symbols&lt;br /&gt;code_points = &lt;&lt;-EOS&lt;br /&gt;U+2603   SNOWMAN&lt;br /&gt;U+2708   AIRPLANE&lt;br /&gt;U+00a9   COPYRIGHT SIGN&lt;br /&gt;U+2615   HOT BEVERAGE&lt;br /&gt;U+2602   UMBRELLA&lt;br /&gt;U+2614   UMBRELLA WITH RAIN DROPS&lt;br /&gt;U+261D   WHITE UP POINTING INDEX&lt;br /&gt;U+2620   SKULL AND CROSSBONES&lt;br /&gt;U+262F   YIN YANG&lt;br /&gt;U+262E   PEACE SYMBOL&lt;br /&gt;U+263A   WHITE SMILING FACE&lt;br /&gt;EOS&lt;br /&gt;&lt;br /&gt;puts code_points.unicode_to_utf8&lt;br /&gt;&lt;br /&gt;# see:&lt;br /&gt;# - http://intertwingly.net/stories/2004/04/14/i18n.html (I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n)&lt;br /&gt;# - http://www.intertwingly.net/blog/1763.html (Unicode and weblogs)&lt;br /&gt;# - http://www.intertwingly.net/blog/1768.html (UTF-8 musings)&lt;br /&gt;&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".asciify_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".utf8_to_unicode&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".utf8_to_unicode.unicode_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".size_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".upcase_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;# NOTE: To convert the following UTF-8 strings containing a \x00 to dec, hex or oct you have to add \x00 to UTF8REGEX:  [\x00\x09\x0A\x0D\x20-\x7E]            # ASCII &lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec&lt;br /&gt;puts 
"I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec.dec_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec.dec_to_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex.hex_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex.hex_to_utf8&lt;br /&gt;    &lt;br /&gt;puts&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct.oct_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct.oct_to_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts '"Hello, world" in Portuguese: "Ol&#225; Mundo" or "Al&#244; Mundo" (Portugu&#234;s)'.email_subject_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;file = "http://www.ruby-forum.com"&lt;br /&gt;file = "http://blade.nagaokaut.ac.jp"&lt;br /&gt;file = "http://blade.nagaokaut.ac.jp/ruby/ruby-talk/index.shtml"&lt;br /&gt;file = "http://www.columbia.edu/kermit/utf8.html"   #  UTF-8 SAMPLER&lt;br /&gt;&lt;br /&gt;p file.utf8_encoded_file?
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;require 'open-uri'  &lt;br /&gt;  &lt;br /&gt;# UnicodeData.txt&lt;br /&gt;unicode_array = []&lt;br /&gt;&lt;br /&gt;open('http://unicode.org/Public/UNIDATA/UnicodeData.txt') do |f| &lt;br /&gt;   #f.each(nil) do |line| line.scan(/^[^;]+/) { |u| unicode_array &lt;&lt; u } end       # all code points&lt;br /&gt;   f.each do |line| line =~ /LATIN|GREEK|CYRILLIC/ ?  ( line.scan(/^[^;]+/) { |u| unicode_array &lt;&lt; u } ) : next end&lt;br /&gt;end&lt;br /&gt;unicode_array.each { |x| u = [x.hex].pack("U*"); u.utf8? ? (puts "U+#{x} ::  #{u.inspect}  ::  #{u}") : (puts "U+#{x} ::  #{u.inspect}  ::  #{u}  :: NO!") } &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;class Array&lt;br /&gt;   def dups_indices   # cf. http://www.ruby-forum.com/topic/122008 and http://snippets.dzone.com/posts/show/4148&lt;br /&gt;      (0...self.size).to_a - self.uniq.map{ |x| index(x) }&lt;br /&gt;   end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;#  CaseFolding.txt&lt;br /&gt;capital_letters_utf8 = []&lt;br /&gt;small_letters_utf8 = []&lt;br /&gt;&lt;br /&gt;open('http://www.unicode.org/Public/UNIDATA/CaseFolding.txt') do |f| &lt;br /&gt;   f.each do |line| &lt;br /&gt;      if line =~ /.*/ &lt;br /&gt;      #if line =~ /LATIN|GREEK|CYRILLIC/ &lt;br /&gt;         line.scan(/^([^;#]+); +\S+ ([^;\s]+)/) { capital_letters_utf8 &lt;&lt; [$1.hex].pack("U*"); small_letters_utf8 &lt;&lt; [$2.hex].pack("U*") }&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;puts small_letters_utf8.size, capital_letters_utf8.size&lt;br /&gt;deleted_pairs = []&lt;br /&gt;small_letters_utf8.dups_indices.reverse.each do |i|   # small_letters_utf8 will be array_with_keys below&lt;br /&gt;   deleted_pairs &lt;&lt; [small_letters_utf8.at(i), capital_letters_utf8.at(i)]&lt;br /&gt;   small_letters_utf8.delete_at(i); capital_letters_utf8.delete_at(i)&lt;br /&gt;end&lt;br /&gt;puts small_letters_utf8.size, capital_letters_utf8.size&lt;br 
/&gt;&lt;br /&gt;# Hash[*array_with_keys.zip(array_with_values).flatten]&lt;br /&gt;upcase_table_utf8 = Hash[*small_letters_utf8.zip(capital_letters_utf8).flatten]&lt;br /&gt;#upcase_table_utf8.each_pair { |k,v| puts "#{k} :: #{v}" }&lt;br /&gt;&lt;br /&gt;puts upcase_table_utf8["a"]&lt;br /&gt;puts upcase_table_utf8["&#7834;"]&lt;br /&gt;puts upcase_table_utf8.value?("A")&lt;br /&gt;&lt;br /&gt;deleted_pairs.each { |s,c| puts "deleted:  #{s}   ::   #{c}" }&lt;br /&gt;&lt;br /&gt;upcase_table_utf8.size.times do |i|&lt;br /&gt;#20.times do |i|&lt;br /&gt;   puts "array index #{i}  ::  #{small_letters_utf8.at(i)}  ::  #{capital_letters_utf8.at(i)}"&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;</description>
      <pubDate>Tue, 11 Sep 2007 18:09:13 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4527</guid>
      <author>ntk ()</author>
    </item>
    <item>
      <title>Unicode chart</title>
      <link>http://snippets.dzone.com/posts/show/4260</link>
      <description>This PHP-generated HTML page displays the first 4,096 Unicode code points (U+0000 through U+0FFF) in a 16-column table; raise the outer loop bound to show more. How well the characters render depends on your browser and installed fonts.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;HTML&gt;&lt;br /&gt;&lt;HEAD&gt;&lt;br /&gt;&lt;TITLE&gt;Unicode Chart&lt;/TITLE&gt;&lt;br /&gt;&lt;LINK REL="Stylesheet" TYPE="text/css" HREF="styles.css"&gt;&lt;br /&gt;&lt;STYLE TYPE="text/css"&gt;&lt;br /&gt;TH {text-align: center; }&lt;br /&gt;TD {text-align: center; }&lt;br /&gt;&lt;/STYLE&gt;&lt;br /&gt;&lt;/HEAD&gt;&lt;br /&gt;&lt;BODY&gt;&lt;br /&gt;&lt;TABLE ALIGN=CENTER BORDER=1&gt;&lt;br /&gt;&lt;TR&gt;&lt;TH&gt; &lt;/TH&gt;&lt;TH&gt;0&lt;/TH&gt;&lt;TH&gt;1&lt;/TH&gt;&lt;TH&gt;2&lt;/TH&gt;&lt;TH&gt;3&lt;/TH&gt;&lt;TH&gt;4&lt;/TH&gt;&lt;TH&gt;5&lt;/TH&gt;&lt;TH&gt;6&lt;/TH&gt;&lt;TH&gt;7&lt;/TH&gt;&lt;TH&gt;8&lt;/TH&gt;&lt;TH&gt;9&lt;/TH&gt;&lt;TH&gt;A&lt;/TH&gt;&lt;TH&gt;B&lt;/TH&gt;&lt;TH&gt;C&lt;/TH&gt;&lt;TH&gt;D&lt;/TH&gt;&lt;TH&gt;E&lt;/TH&gt;&lt;TH&gt;F&lt;/TH&gt;&lt;/TR&gt;&lt;br /&gt;&lt;?PHP&lt;br /&gt; for ($i=0; $i&lt;256; $i++) { //DON'T try to generate the whole chart&lt;br /&gt;  printf('&lt;TR&gt;&lt;TD&gt;%04X&lt;/TD&gt;', $i);&lt;br /&gt;  for ($j=0; $j&lt;16; $j++) {&lt;br /&gt;   printf('&lt;TD&gt;&amp;#x%X%X;&lt;/TD&gt;', $i, $j);&lt;br /&gt;  }&lt;br /&gt;  echo "&lt;/TR&gt;\n";&lt;br /&gt; }&lt;br /&gt;?&gt;&lt;br /&gt;&lt;/TABLE&gt;&lt;br /&gt;&lt;/BODY&gt;&lt;br /&gt;&lt;/HTML&gt;&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Thu, 05 Jul 2007 04:12:52 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4260</guid>
      <author>Minimiscience (Guildorn Tanaleth)</author>
    </item>
    <item>
      <title>Cutting off part of a Unicode string</title>
      <link>http://snippets.dzone.com/posts/show/3066</link>
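      <description>// Return the first $count characters (not bytes) of a UTF-8 subject, drop a trailing word fragment of up to five characters, and append '...'.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;sub left_subject{&lt;br /&gt;	my $self=shift;&lt;br /&gt;	my $count=shift;&lt;br /&gt;	use utf8;&lt;br /&gt;	my $topic=$self-&gt;subject;&lt;br /&gt;	# decode UTF-8 bytes so the regex below counts characters&lt;br /&gt;	utf8::decode($topic);&lt;br /&gt;	utf8::upgrade($topic);&lt;br /&gt;	my ($subtopic)=($topic=~/(.{0,$count})/);&lt;br /&gt;#	utf8::decode($subtopic);&lt;br /&gt;	# drop a trailing word fragment of one to five characters&lt;br /&gt;	$subtopic=~s/\b\w{1,5}$//;&lt;br /&gt;	# encode back to UTF-8 bytes; utf8::downgrade dies on characters above U+00FF&lt;br /&gt;	utf8::encode($subtopic);&lt;br /&gt;	return $subtopic.'...';&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;</description>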
      <pubDate>Fri, 01 Dec 2006 18:52:04 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/3066</guid>
      <author>gugu (Andrey Kostenko)</author>
    </item>
    <item>
      <title>Adding UTF8 methods to class String in Ruby</title>
      <link>http://snippets.dzone.com/posts/show/2786</link>
      <description>From: http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html (in the comments)&lt;br /&gt;Requirement: sudo gem install character-encodings --remote&lt;br /&gt;&lt;br /&gt;For the module Encoding::Character::UTF8::Methods see the file called utf-8.rb &lt;br /&gt;in the source code of character-encodings.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;require('encoding/character/utf-8')&lt;br /&gt;&lt;br /&gt;class Proc&lt;br /&gt;&lt;br /&gt;  def uStringMethods()&lt;br /&gt;    umethods = []&lt;br /&gt;    Encoding::Character::UTF8.methods.each do |m|    &lt;br /&gt;      umethods.push(%!&lt;br /&gt;          define_method("u#{m}") do |*args|&lt;br /&gt;            Encoding::Character::UTF8.#{m}(self, *args)&lt;br /&gt;          end  # unless instance_methods.include?("u#{m}")&lt;br /&gt;        !)&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    #puts umethods&lt;br /&gt;    umethods = umethods.reject { |m| m =~ /taguri/ }&lt;br /&gt;&lt;br /&gt;    String.class_eval(umethods.join)&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Proc.new {}.uStringMethods()     #  adds methods defined in module Encoding::Character::UTF8::Methods to class String&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;puts "caf\303\251".length      #=&gt;  5&lt;br /&gt;puts "caf\303\251".ulength     #=&gt;  4&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;#puts String.public_methods.select { |x| x =~ /^u/ }.sort&lt;br /&gt;#puts String.new.public_methods.select { |x| x =~ /^u/ }.sort&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Thu, 05 Oct 2006 19:06:11 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/2786</guid>
      <author>ntk ()</author>
    </item>
  </channel>
</rss>
