<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DZone Snippets: utf-8 code</title>
    <link>http://snippets.dzone.com/posts</link>
    <pubDate>Fri, 16 May 2008 20:35:31 GMT</pubDate>
    <description>DZone Snippets: utf-8 code</description>
    <item>
      <title>Convert cp1252-&gt; utf-8 character set (python and ruby)</title>
      <link>http://snippets.dzone.com/posts/show/5367</link>
      <description>Oooh, I hate character sets. Specifically, that there is more than one of them. Here is a Ruby version of a Python script I found for converting cp1252 (aka windows-1252) to utf-8. Note that both versions treat each byte as its ISO-8859-1 code point, so the extra cp1252 characters in the 0x80-0x9F range (curly quotes, the euro sign and so on) are not mapped to their real Unicode code points.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;  def clean_up dirty_text&lt;br /&gt;    newstr = ""&lt;br /&gt;    dirty_text.length.times do |i|&lt;br /&gt;      character = dirty_text[i]&lt;br /&gt;      newstr += if character &lt; 0x80&lt;br /&gt;        character.chr&lt;br /&gt;      elsif character &lt; 0xC0&lt;br /&gt;        "\xC2" + character.chr&lt;br /&gt;      else&lt;br /&gt;        "\xC3" + (character - 64).chr&lt;br /&gt;      end&lt;br /&gt;    end&lt;br /&gt;    newstr&lt;br /&gt;  end&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;The original Python script (from http://miscoranda.com/96) was:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#!/usr/bin/python&lt;br /&gt;import sys&lt;br /&gt;for c in sys.stdin.read(): &lt;br /&gt;   if ord(c) &lt; 0x80: sys.stdout.write(c)&lt;br /&gt;   elif ord(c) &lt; 0xC0: sys.stdout.write('\xC2' + c)&lt;br /&gt;   else: sys.stdout.write('\xC3' + chr(ord(c) - 64))&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Wed, 16 Apr 2008 11:39:47 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/5367</guid>
      <author>nicwilliams (Dr Nic Williams)</author>
    </item>
    <item>
      <title>Convert a UTF-8 string to ISO-8859-1</title>
      <link>http://snippets.dzone.com/posts/show/5019</link>
      <description>Convert a UTF-8 string to ISO-8859-1. I used this when generating a PDF with pdf-writer in Rails: all my text is UTF-8, but pdf-writer does not support it.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#add this to environment.rb&lt;br /&gt;#call to_iso on any UTF-8 string to get an ISO-8859-1 string back&lt;br /&gt;#example : "C&#233;dez le passage aux fran&#231;ais".to_iso&lt;br /&gt;&lt;br /&gt;class String&lt;br /&gt;  require 'iconv' #this line is not needed in rails !&lt;br /&gt;  def to_iso&lt;br /&gt;    Iconv.conv('ISO-8859-1', 'utf-8', self)&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Mon, 21 Jan 2008 14:35:26 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/5019</guid>
      <author>drcorbeille (Christian Meichtry)</author>
    </item>
    <item>
      <title>Some problems with the UTF-8 charset?</title>
      <link>http://snippets.dzone.com/posts/show/4814</link>
      <description>Run this MySQL query before any others to fix your charset problems:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;...&lt;br /&gt;mysql_query( "SET NAMES 'utf8' " );&lt;br /&gt;...&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.ab-d.fr/"&gt;Source: ab-d.fr&lt;br /&gt;Languages: PHP and MySQL&lt;/a&gt;</description>
      <pubDate>Fri, 23 Nov 2007 22:07:58 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4814</guid>
      <author>ki4ngel (Benoit Asselin)</author>
    </item>
    <item>
      <title>Match UTF-8 characters</title>
      <link>http://snippets.dzone.com/posts/show/4731</link>
      <description>&lt;code&gt;&lt;br /&gt;var string = 'abcde &#261;b&#263;d&#281;';&lt;br /&gt;&lt;br /&gt;// this wont find anythin&lt;br /&gt;string.match( /^[a-z]*$/i );&lt;br /&gt;&lt;br /&gt;// and this one will work fine :)&lt;br /&gt;string.match( /^[a-z\u00A1-\uFFFF]*$/i );&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Sat, 03 Nov 2007 11:33:37 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4731</guid>
      <author>pawik (Paul)</author>
    </item>
    <item>
      <title>Convert Unicode codepoints to UTF-8 characters with Module#const_missing</title>
      <link>http://snippets.dzone.com/posts/show/4546</link>
      <description>From: http://www.davidflanagan.com/blog/2007_08.html#000136&lt;br /&gt;Author: David Flanagan&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;# This module lazily defines constants of the form Uxxxx for all Unicode&lt;br /&gt;# codepoints from U0000 to U10FFFF. The value of each constant is the&lt;br /&gt;# UTF-8 string for the codepoint.&lt;br /&gt;# Examples:&lt;br /&gt;#   copyright = Unicode::U00A9&lt;br /&gt;#   euro = Unicode::U20AC&lt;br /&gt;#   infinity = Unicode::U221E&lt;br /&gt;#&lt;br /&gt;module Unicode&lt;br /&gt;  def self.const_missing(name)  &lt;br /&gt;    # Check that the constant name is of the right form: U0000 to U10FFFF&lt;br /&gt;    if name.to_s =~ /^U([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/&lt;br /&gt;      # Convert the codepoint to an immutable UTF-8 string,&lt;br /&gt;      # define a real constant for that value and return the value&lt;br /&gt;      #p name, name.class&lt;br /&gt;      const_set(name, [$1.to_i(16)].pack("U").freeze)&lt;br /&gt;    else  # Raise an error for constants that are not Unicode.&lt;br /&gt;      raise NameError, "Uninitialized constant: Unicode::#{name}"&lt;br /&gt;    end&lt;br /&gt;  end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;puts copyright = Unicode::U00A9&lt;br /&gt;puts euro = Unicode::U20AC&lt;br /&gt;puts euro = Unicode::U20AC&lt;br /&gt;puts infinity = Unicode::U221E&lt;br /&gt;puts Unicode.const_get(:U221E)&lt;br /&gt;p Unicode.constants&lt;br /&gt;puts Unicode.constants&lt;br /&gt;Unicode.constants.each { |u| puts Unicode.const_get(u) }&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;</description>
      <pubDate>Sat, 15 Sep 2007 12:25:16 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4546</guid>
      <author>ntk ()</author>
    </item>
    <item>
      <title>UTF8-aware string methods in Ruby</title>
      <link>http://snippets.dzone.com/posts/show/4527</link>
      <description>Author:  ntk&lt;br /&gt;License:    &lt;a href="http://www.opensource.org/licenses/mit-license.php"&gt;The MIT License&lt;/a&gt;, Copyright (c) 2007 ntk&lt;br /&gt;Description:  some basic UTF8-aware string methods for Ruby's String class (Ruby 1.8.6)&lt;br /&gt;Requirements: save this snippet to a UTF-8 encoded file and set the character set encoding of Terminal.app &lt;br /&gt;              to UTF-8 (on Mac OS X: Terminal menu -&gt; Window Settings -&gt; Display -&gt; Character Set Encoding; to enable additional features see &lt;a href="http://smyck.de/2007/06/06/great-stuff-being-able-to-type-utf-8-characters-in-a-terminal-on-os-x/"&gt;here&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Further tools:&lt;br /&gt;- &lt;a href="http://www.yoshidam.net/Ruby.html"&gt;rbuconv&lt;/a&gt;, a pure Ruby library for Unicode translation&lt;br /&gt;- &lt;a href="http://www.yoshidam.net/unicode.txt"&gt;unicode&lt;/a&gt;, a library for Unicode Normalization (sudo gem install unicode); for a Windows version see &lt;a href="http://www.ruby.org.ee/wiki/Unicode_in_Ruby/Rails"&gt;Unicode in Ruby on Rails&lt;/a&gt;&lt;br /&gt;- &lt;a href="http://icu4r.rubyforge.org"&gt;ICU4R&lt;/a&gt;, a Ruby C-extension binding for the &lt;a href="http://www.icu-project.org"&gt;ICU&lt;/a&gt; library&lt;br /&gt;- &lt;a href="http://billposer.org/Software/msort.html"&gt;Msort&lt;/a&gt;, a command-line sorting program&lt;br /&gt;- &lt;a href="http://raa.ruby-lang.org/project/punycode4r/"&gt;punycode4r&lt;/a&gt;, a pure Ruby implementation of Punycode (RFC 3492; sudo gem install punycode4r)&lt;br /&gt;- &lt;a href="http://www.flexiguided.de/publications.utf8proc.en.html"&gt;utf8proc&lt;/a&gt;, a library for processing UTF-8 encoded Unicode strings (sudo gem install utf8proc)&lt;br /&gt;- &lt;a href="http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt"&gt;Oniguruma&lt;/a&gt;, Ruby's regular expression engine; cf. 
&lt;a href="http://www.igvita.com/blog/2007/04/11/secure-utf-8-input-in-rails/"&gt;Secure UTF-8 Input in Rails&lt;/a&gt; and &lt;a href="http://woss.name/2006/10/25/migrating-your-rails-application-to-unicode/"&gt;Migrating your Rails application to Unicode&lt;/a&gt;&lt;br /&gt;- &lt;a href="http://rubyforge.org/projects/char-encodings/"&gt;character-encodings&lt;/a&gt;, seamless integration of character encodings into Ruby's String class, (sudo gem install character-encodings)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;class String&lt;br /&gt;&lt;br /&gt;   require 'iconv' &lt;br /&gt;   require 'open-uri'      # cf. http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html&lt;br /&gt;&lt;br /&gt;   # taken from: http://www.w3.org/International/questions/qa-forms-utf-8&lt;br /&gt;   UTF8REGEX = /\A(?:                               # ?: non-capturing group (grouping with no back references)&lt;br /&gt;                 [\x09\x0A\x0D\x20-\x7E]            # ASCII&lt;br /&gt;               | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte&lt;br /&gt;               |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs&lt;br /&gt;               | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte&lt;br /&gt;               |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates&lt;br /&gt;               |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3&lt;br /&gt;               | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15&lt;br /&gt;               |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16&lt;br /&gt;               )*\z/mnx&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;#  create UTF-8 character arrays (as class instance variables)&lt;br /&gt;#&lt;br /&gt;#  mapping tables: - http://www.unicode.org/Public/UCA/latest/allkeys.txt&lt;br /&gt;#                  - http://unicode.org/Public/UNIDATA/UnicodeData.txt &lt;br /&gt;#                  - http://unicode.org/Public/UNIDATA/CaseFolding.txt
&lt;br /&gt;#                  - http://www.decodeunicode.org &lt;br /&gt;#                  - ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2&lt;br /&gt;#                  - http://camomile.sourceforge.net&lt;br /&gt;#                  - Character Palette (Mac OS X)&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # test data&lt;br /&gt;   @small_letters_utf8 = ["U+00F1", "U+00F4", "U+00E6", "U+00F8", "U+00E0", "U+00E1", "U+00E2", "U+00E4", "U+00E5", "U+00E7", "U+00E8", "U+00E9", "U+00EA", "U+00EB", "U+0153"].map { |x| u = [x[2..-1].hex].pack("U*"); u =~ UTF8REGEX ? u : nil }&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   @capital_letters_utf8 = ["U+00D1", "U+00D4", "U+00C6", "U+00D8", "U+00C0", "U+00C1", "U+00C2", "U+00C4", "U+00C5", "U+00C7", "U+00C8", "U+00C9", "U+00CA", "U+00CB", "U+0152"].map { |x| u = [x[2..-1].hex].pack("U*"); u =~ UTF8REGEX ? u : nil }&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   @other_letters_utf8 = ["U+03A3", "U+0639", "U+0041", "U+F8D0", "U+F8FF", "U+4E2D", "U+F4EE", "U+00FE", "U+10FFFF", "U+00A9", "U+20AC", "U+221E", "U+20AC", "U+FEFF", "U+FFFD", "U+00FF", "U+00FE", "U+FFFE", "U+FEFF"].map { |x| u = [x[2..-1].hex].pack("U*"); u =~ UTF8REGEX ? u : nil }&lt;br /&gt;&lt;br /&gt;   if @small_letters_utf8.size != @small_letters_utf8.nitems then raise "Invalid UTF-8 char in @small_letters_utf8!" end&lt;br /&gt;   if @capital_letters_utf8.size != @capital_letters_utf8.nitems then raise "Invalid UTF-8 char in @capital_letters_utf8!" end&lt;br /&gt;   if @other_letters_utf8.size != @other_letters_utf8.nitems then raise "Invalid UTF-8 char in @other_letters_utf8!" 
end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   @unicode_array = []&lt;br /&gt;   #open('http://unicode.org/Public/UNIDATA/UnicodeData.txt') do |f| f.each(nil) { |line| line.scan(/^[^;]+/) { |u| @unicode_array &lt;&lt; u } }  end&lt;br /&gt;   #open('http://unicode.org/Public/UNIDATA/UnicodeData.txt') do |f|                                                                               &lt;br /&gt;   #   f.each do |line| line =~ /LATIN|GREEK|CYRILLIC/  ?  ( line.scan(/^[^;]+/) { |u| @unicode_array &lt;&lt; u } )  :  next  end&lt;br /&gt;   #end&lt;br /&gt;&lt;br /&gt;   #@letters_utf8 = @unicode_array.map { |x| u = [x.hex].pack("U*"); u =~ UTF8REGEX ? u : nil }.compact   # code points from UnicodeData.txt&lt;br /&gt;   @letters_utf8 = @small_letters_utf8 + @capital_letters_utf8 + @other_letters_utf8                      # test data only&lt;br /&gt;&lt;br /&gt;   # Hash[*array_with_keys.zip(array_with_values).flatten]&lt;br /&gt;   @downcase_table_utf8 = Hash[*@capital_letters_utf8.zip(@small_letters_utf8).flatten]&lt;br /&gt;   @upcase_table_utf8 = Hash[*@small_letters_utf8.zip(@capital_letters_utf8).flatten]&lt;br /&gt;   @letters_utf8_hash = Hash[*@letters_utf8.zip([]).flatten]    #=&gt; ... "\341\272\242"=&gt;nil ...
&lt;br /&gt;&lt;br /&gt;   class &lt;&lt; self &lt;br /&gt;      attr_accessor :small_letters_utf8&lt;br /&gt;      attr_accessor :capital_letters_utf8&lt;br /&gt;      attr_accessor :other_letters_utf8&lt;br /&gt;      attr_accessor :letters_utf8&lt;br /&gt;      attr_accessor :letters_utf8_hash&lt;br /&gt;      attr_accessor :unicode_array&lt;br /&gt;      attr_accessor :downcase_table_utf8&lt;br /&gt;      attr_accessor :upcase_table_utf8&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def each_utf8_char&lt;br /&gt;      scan(/./mu) { |c| yield c }&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def each_utf8_char_with_index&lt;br /&gt;      i = -1&lt;br /&gt;      scan(/./mu) { |c| i+=1; yield(c, i) }&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def length_utf8&lt;br /&gt;      #scan(/./mu).size&lt;br /&gt;      count = 0&lt;br /&gt;      scan(/./mu) { count += 1 }&lt;br /&gt;      count&lt;br /&gt;   end&lt;br /&gt;   alias :size_utf8 :length_utf8&lt;br /&gt;&lt;br /&gt;   def reverse_utf8&lt;br /&gt;      split(//mu).reverse.join&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def reverse_utf8!&lt;br /&gt;      split(//mu).reverse!.join&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def swapcase_utf8&lt;br /&gt;     gsub(/./mu) do |char|  &lt;br /&gt;         if !String.downcase_table_utf8[char].nil? then String.downcase_table_utf8[char]&lt;br /&gt;         elsif !String.upcase_table_utf8[char].nil? then String.upcase_table_utf8[char]&lt;br /&gt;         else char.swapcase&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def swapcase_utf8!&lt;br /&gt;      gsub!(/./mu) do |char|  &lt;br /&gt;         if !String.downcase_table_utf8[char].nil? then String.downcase_table_utf8[char]&lt;br /&gt;         elsif !String.upcase_table_utf8[char].nil? 
then String.upcase_table_utf8[char]&lt;br /&gt;         else ret = char.swapcase end&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def downcase_utf8&lt;br /&gt;      gsub(/./mu) do |char|  &lt;br /&gt;         small_char = String.downcase_table_utf8[char]&lt;br /&gt;         small_char.nil? ? char.downcase : small_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def downcase_utf8!&lt;br /&gt;      gsub!(/./mu) do |char|  &lt;br /&gt;         small_char = String.downcase_table_utf8[char]&lt;br /&gt;         small_char.nil? ? char.downcase : small_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def upcase_utf8&lt;br /&gt;      gsub(/./mu) do |char|  &lt;br /&gt;         capital_char = String.upcase_table_utf8[char]&lt;br /&gt;         capital_char.nil? ? char.upcase : capital_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def upcase_utf8!&lt;br /&gt;      gsub!(/./mu) do |char|  &lt;br /&gt;         capital_char = String.upcase_table_utf8[char]&lt;br /&gt;         capital_char.nil? ? char.upcase : capital_char&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def count_utf8(c)&lt;br /&gt;      return nil if c.empty?&lt;br /&gt;      r = %r{[#{c}]}mu&lt;br /&gt;      scan(r).size&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def delete_utf8(c)&lt;br /&gt;      return self if c.empty?&lt;br /&gt;      r = %r{[#{c}]}mu&lt;br /&gt;      gsub(r, '')&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def delete_utf8!(c)&lt;br /&gt;      return self if c.empty?
&lt;br /&gt;      r = %r{[#{c}]}mu&lt;br /&gt;      gsub!(r, '')&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def first_utf8&lt;br /&gt;      self[/\A./mu]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def last_utf8&lt;br /&gt;      self[/.\z/mu]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def capitalize_utf8&lt;br /&gt;     return self if self =~ /\A[[:space:]]*\z/m&lt;br /&gt;     ret = ""&lt;br /&gt;     split(/\x20/).each do |w| &lt;br /&gt;         count = 0&lt;br /&gt;         w.gsub(/./mu) do |char|  &lt;br /&gt;            count += 1&lt;br /&gt;            capital_char = String.upcase_table_utf8[char]&lt;br /&gt;            if count == 1 then &lt;br /&gt;               capital_char.nil? ? char.upcase : char.upcase_utf8&lt;br /&gt;            else&lt;br /&gt;               capital_char.nil? ? char.downcase : char.downcase_utf8&lt;br /&gt;            end&lt;br /&gt;         end&lt;br /&gt;         ret &lt;&lt; w + ' '&lt;br /&gt;     end&lt;br /&gt;     ret =~ /\x20\z/ ? ret.sub!(/\x20\z/, '') : ret  &lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def capitalize_utf8!&lt;br /&gt;     return self if self =~ /\A[[:space:]]*\z/m &lt;br /&gt;     ret = ""&lt;br /&gt;     split(/\x20/).each do |w| &lt;br /&gt;         count = 0&lt;br /&gt;         w.gsub!(/./mu) do |char|  &lt;br /&gt;            count += 1&lt;br /&gt;            capital_char = String.upcase_table_utf8[char]&lt;br /&gt;            if count == 1 then &lt;br /&gt;               capital_char.nil? ? char.upcase : char.upcase_utf8&lt;br /&gt;            else&lt;br /&gt;               capital_char.nil? ? char.downcase : char.downcase_utf8&lt;br /&gt;            end&lt;br /&gt;         end&lt;br /&gt;         ret &lt;&lt; w + ' '&lt;br /&gt;     end&lt;br /&gt;     ret =~ /\x20\z/ ? ret.sub!(/\x20\z/, '') : ret&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def index_utf8(s)&lt;br /&gt;&lt;br /&gt;      return nil unless !self.empty? 
&amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;      #raise(ArgumentError, "Wrong argument for method index_utf8!", caller) unless !self.empty? &amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;&lt;br /&gt;      if s.class == Regexp&lt;br /&gt;         opts = s.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;         if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;         str = s.source&lt;br /&gt;         return nil if str.empty?&lt;br /&gt;         str = "%r{#{str}}" + opts&lt;br /&gt;         r = eval(str)&lt;br /&gt;         l = ""&lt;br /&gt;         sub(r) { l &lt;&lt; $`; " " }  # $`: The string to the left of the last successful match (cf. http://www.zenspider.com/Languages/Ruby/QuickRef.html)&lt;br /&gt;         l.empty? ? nil : l.length_utf8&lt;br /&gt;&lt;br /&gt;      else&lt;br /&gt;&lt;br /&gt;         return nil if s.empty?&lt;br /&gt;         r = %r{#{s}}mu&lt;br /&gt;         l = ""&lt;br /&gt;         sub(r) { l &lt;&lt; $`; " " }&lt;br /&gt;         l.empty? ? nil : l.length_utf8&lt;br /&gt;&lt;br /&gt;# this would be a non-regex solution&lt;br /&gt;=begin &lt;br /&gt;         return nil if s.empty?&lt;br /&gt;         return nil unless self =~ %r{#{s}}mu&lt;br /&gt;         indices = []&lt;br /&gt;         s.split(//mu).each do |x|&lt;br /&gt;            ar = []&lt;br /&gt;            self.each_utf8_char_with_index { |c,i| if c == x then ar &lt;&lt; i end  }   # first get all matching indices c == x&lt;br /&gt;            indices &lt;&lt; ar unless ar.empty?&lt;br /&gt;         end&lt;br /&gt;         if indices.empty?
&lt;br /&gt;            return nil&lt;br /&gt;         elsif indices.size == 1 &lt;br /&gt;            indices.first.first&lt;br /&gt;         else &lt;br /&gt;            #p indices&lt;br /&gt;            ret = []&lt;br /&gt;            a0 = indices.shift&lt;br /&gt;            a0.each do |i|&lt;br /&gt;               ret &lt;&lt; i&lt;br /&gt;               indices.each { |a| if a.include?(i+1) then i += 1; ret &lt;&lt; i else ret = []; break end  }&lt;br /&gt;               return ret.first unless ret.empty?&lt;br /&gt;            end&lt;br /&gt;            ret.empty? ? nil : ret.first&lt;br /&gt;         end&lt;br /&gt;=end&lt;br /&gt;&lt;br /&gt;      end&lt;br /&gt;   end   &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def rindex_utf8(s)&lt;br /&gt;&lt;br /&gt;      return nil unless !self.empty? &amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;      #raise(ArgumentError, "Wrong argument for method index_utf8!", caller) unless !self.empty? &amp;&amp; (s.class == Regexp || s.class == String)&lt;br /&gt;&lt;br /&gt;      if s.class == Regexp&lt;br /&gt;         opts = s.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;         if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;         str = s.source&lt;br /&gt;         return nil if str.empty?&lt;br /&gt;         str = "%r{#{str}}" + opts&lt;br /&gt;         r = eval(str)&lt;br /&gt;         l = ""&lt;br /&gt;         scan(r) { l = $` }  &lt;br /&gt;         #gsub(r) { l = $`; " " }  &lt;br /&gt;         l.empty? ? nil : l.length_utf8&lt;br /&gt;      else&lt;br /&gt;         return nil if s.empty?&lt;br /&gt;         r = %r{#{s}}mu&lt;br /&gt;         l = ""&lt;br /&gt;         scan(r) { l = $` }  &lt;br /&gt;         #gsub(r) { l = $`; " " }&lt;br /&gt;         l.empty? ? 
nil : l.length_utf8&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;   end   &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # note that the i option does not work in special cases with back references&lt;br /&gt;   # example: "&#224;&#192;".slice_utf8(/(.).*?\1/i) returns nil whereas "aA".slice(/(.).*?\1/i) returns "aA"&lt;br /&gt;   def slice_utf8(regex)   &lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      slice(r)&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def slice_utf8!(regex)   &lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      slice!(r)&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def cut_utf8(p,l)    # (index) position, length&lt;br /&gt;      raise(ArgumentError, "Error: argument is not Fixnum", caller) if p.class != Fixnum or l.class != Fixnum&lt;br /&gt;      s = self.length_utf8&lt;br /&gt;      #if p &lt; 0 then p = s - p.abs end&lt;br /&gt;      if p &lt; 0 then p.abs &gt; s ? (p = 0) : (p = s - p.abs) end      #  or:  ... p.abs &gt; s ? (return nil) : ...&lt;br /&gt;      return nil if l &gt; s or p &gt; (s - 1)&lt;br /&gt;      ret = ""&lt;br /&gt;      count = 0&lt;br /&gt;      each_utf8_char_with_index do |c,i| &lt;br /&gt;         break if count &gt;= l&lt;br /&gt;         if i &gt;= p &amp;&amp; count &lt; l then count += 1; ret &lt;&lt; c; end&lt;br /&gt;      end&lt;br /&gt;      ret&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def starts_with_utf8?(s)&lt;br /&gt;      return nil if self.empty? or s.empty?
&lt;br /&gt;      cut_utf8(0, s.size_utf8) == s &lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def ends_with_utf8?(s)&lt;br /&gt;      return nil if self.empty? or s.empty?&lt;br /&gt;      cut_utf8(-(s.size_utf8), s.size_utf8) == s&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def insert_utf8(i,s)                                  # insert_utf8(index, string)&lt;br /&gt;      return self if s.empty?&lt;br /&gt;      l = self.length_utf8&lt;br /&gt;      if l == 0 then return s end&lt;br /&gt;      if i &lt; 0 then i.abs &gt; l ? (i = 0) : (i = l - i.abs) end          #  or:  ... i.abs &gt; l ? (return nil) : ...&lt;br /&gt;      #return nil if i &gt; (l - 1)                         # return nil ...&lt;br /&gt;      spaces = ""&lt;br /&gt;      if i &gt; (l-1) then spaces = " " * (i - (l-1)) end   # ... or add spaces&lt;br /&gt;      str = self &lt;&lt; spaces&lt;br /&gt;      s1 = str.cut_utf8(0, i)&lt;br /&gt;      s2 = str.cut_utf8(i, l - s1.length_utf8)&lt;br /&gt;      s1 &lt;&lt; s &lt;&lt; s2&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def split_utf8(regex)&lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      split(r)&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def scan_utf8(regex)&lt;br /&gt;      opts = regex.inspect.gsub(/\A(.).*\1([eimnosux]*)\z/mu, '\2')&lt;br /&gt;      if  opts.count('u') == 0 then opts = opts + "u" end&lt;br /&gt;      s = regex.source&lt;br /&gt;      str = "%r{#{s}}" + opts&lt;br /&gt;      r = eval(str)&lt;br /&gt;      if block_given? 
then scan(r) { |a,*m| yield(a,*m) } else scan(r) end&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def range_utf8(r)&lt;br /&gt;&lt;br /&gt;      return nil if r.class != Range&lt;br /&gt;      #raise(ArgumentError, "No Range object given!", caller) if r.class != Range&lt;br /&gt;&lt;br /&gt;      a = r.to_s[/^[\+\-]?\d+/].to_i&lt;br /&gt;      b = r.to_s[/[\+\-]?\d+$/].to_i&lt;br /&gt;      d = r.to_s[/\.+/]&lt;br /&gt;&lt;br /&gt;      if d.size == 2 then d = 2 else d = d.size end &lt;br /&gt;&lt;br /&gt;      l = self.length_utf8&lt;br /&gt;&lt;br /&gt;      return nil if b.abs &gt; l || a.abs &gt; l || d &lt; 2 || d &gt; 3&lt;br /&gt;&lt;br /&gt;      if a &lt; 0 then a = l - a.abs end&lt;br /&gt;      if b &lt; 0 then b = l - b.abs end&lt;br /&gt;      &lt;br /&gt;      return nil if a &gt; b&lt;br /&gt;&lt;br /&gt;      str = ""&lt;br /&gt;&lt;br /&gt;      each_utf8_char_with_index do |c,i|&lt;br /&gt;         break if i &gt; b&lt;br /&gt;         if d == 2&lt;br /&gt;            (i &gt;= a &amp;&amp; i &lt;= b) ? str &lt;&lt; c : next&lt;br /&gt;         else&lt;br /&gt;            (i &gt;= a &amp;&amp; i &lt; b) ? str &lt;&lt; c : next&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;&lt;br /&gt;      str&lt;br /&gt;&lt;br /&gt;   end&lt;br /&gt; &lt;br /&gt;   def utf8?&lt;br /&gt;     self =~ UTF8REGEX&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def clean_utf8&lt;br /&gt;       t = ""&lt;br /&gt;       self.scan(/./um) { |c| t &lt;&lt; c if c =~ UTF8REGEX }&lt;br /&gt;       t&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_encoded_file?   # check (or rather guess) if (HTML) file encoding is UTF-8 (experimental, so use at your own risk!)
&lt;br /&gt;&lt;br /&gt;      file = self&lt;br /&gt;      str = ""&lt;br /&gt;&lt;br /&gt;      if file =~ /^http:\/\//&lt;br /&gt;&lt;br /&gt;         url = file&lt;br /&gt;&lt;br /&gt;         if RUBY_PLATFORM =~ /darwin/i   # Mac OS X 10.4.10&lt;br /&gt;          &lt;br /&gt;            seconds = 30  &lt;br /&gt;&lt;br /&gt;            # check if web site is reachable&lt;br /&gt;            # on Windows try to use curb, http://curb.rubyforge.org (sudo gem install curb)&lt;br /&gt;            var = %x{ /usr/bin/curl -I -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url}; /bin/echo -n $? }.to_i&lt;br /&gt;&lt;br /&gt;            #return false unless var == 0&lt;br /&gt;            raise "Failed to create connection to web site: #{url}  --  curl error code: #{var}  --  " unless var == 0&lt;br /&gt;&lt;br /&gt;            str = %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} | \&lt;br /&gt;                      /usr/bin/grep -Eo -m 1 \"(charset|encoding)=[\\"']?[^\\"'&gt;]+\" | /usr/bin/grep -Eo \"[^=\\"'&gt;]+$\" }&lt;br /&gt;            p str&lt;br /&gt;            return true if str =~ /utf-?8/i&lt;br /&gt;            return false if !str.empty? &amp;&amp; str !~ /utf-?8/i&lt;br /&gt;&lt;br /&gt;            # solutions with downloaded file&lt;br /&gt;&lt;br /&gt;            # download HTML file&lt;br /&gt;            #downloaded_file = "/tmp/html"&lt;br /&gt;            downloaded_file = "~/Desktop/html"&lt;br /&gt;            downloaded_file = File.expand_path(downloaded_file)&lt;br /&gt;            %x{ /usr/bin/touch #{downloaded_file} 2&gt;/dev/null }&lt;br /&gt;            raise "No valid HTML download file (path) specified!" 
unless File.file?(downloaded_file)&lt;br /&gt;            %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} -o #{downloaded_file} #{url} }&lt;br /&gt;            &lt;br /&gt;            simple_test = %x{ /usr/bin/file -ik #{downloaded_file} }    #  cf. man file&lt;br /&gt;            p simple_test &lt;br /&gt;&lt;br /&gt;            # read entire file into a string&lt;br /&gt;            File.open(downloaded_file).read.each(nil) do |str| &lt;br /&gt;               #return true if str =~ /(charset|encoding) *= *["']? *utf-?8/i&lt;br /&gt;               str.utf8? ? (return true) : (return false) &lt;br /&gt;            end &lt;br /&gt;&lt;br /&gt;            #check each line of the downloaded file&lt;br /&gt;            #count_lines = 0&lt;br /&gt;            #count_utf8 = 0&lt;br /&gt;            #File.foreach(downloaded_file) { |line| return true if line =~ /(charset|encoding) *= *["']? *utf-?8/i; count_lines += 1;  count_utf8 += 1 if line.clean_utf8.utf8?; break if count_lines != count_utf8 }&lt;br /&gt;            #count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;            &lt;br /&gt;&lt;br /&gt;            # in-memory solutions&lt;br /&gt;&lt;br /&gt;            #html_file_cleaned_utf8 = %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} }.clean_utf8&lt;br /&gt;            #p html_file_cleaned_utf8.utf8?&lt;br /&gt;&lt;br /&gt;            count_lines = 0&lt;br /&gt;            count_utf8 = 0&lt;br /&gt;            #%x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} }.each(nil) do |line|    # read entire file into string&lt;br /&gt;            %x{ /usr/bin/curl -L --fail --silent --connect-timeout #{seconds} --max-time #{seconds+10} #{url} }.each('\n') do |line| &lt;br /&gt;               #return true if line =~ /(charset|encoding) *= *["']? 
*utf-?8/i&lt;br /&gt;               count_lines += 1 &lt;br /&gt;               count_utf8 += 1 if line.utf8?&lt;br /&gt;               break if count_lines != count_utf8&lt;br /&gt;            end&lt;br /&gt;            count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;&lt;br /&gt;         else&lt;br /&gt;&lt;br /&gt;            # check each line of the HTML file (or the entire HTML file at once)&lt;br /&gt;            # cf. http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/index.html&lt;br /&gt;            count_lines = 0&lt;br /&gt;            count_utf8 = 0&lt;br /&gt;            open(url) do |f|   &lt;br /&gt;               # p f.meta, f.content_encoding, f.content_type&lt;br /&gt;               cs = f.charset&lt;br /&gt;               return true if cs =~ /utf-?8/i&lt;br /&gt;               #f.each(nil) do |str| str.utf8? ? (return true) : (return false) end  # read entire file into string&lt;br /&gt;               f.each_line do |line| &lt;br /&gt;                  count_lines += 1 &lt;br /&gt;                  count_utf8 += 1 if line.utf8?&lt;br /&gt;                  break unless count_lines == count_utf8&lt;br /&gt;               end&lt;br /&gt;            end&lt;br /&gt;            count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;&lt;br /&gt;         end&lt;br /&gt;&lt;br /&gt;      else&lt;br /&gt;&lt;br /&gt;         return false unless File.file?(file)&lt;br /&gt;&lt;br /&gt;         if RUBY_PLATFORM =~ /darwin/i then str = %x{ /usr/bin/file -ik #{file} }; return true if str =~ /utf-?8/i end&lt;br /&gt;&lt;br /&gt;         # read entire file into a string&lt;br /&gt;         #File.open(file).read.each(nil) do |str| return true if str =~ /(charset|encoding) *= *["']? *utf-?8/i; str.utf8? ? 
(return true) : (return false) end &lt;br /&gt;&lt;br /&gt;         # check each line of the file&lt;br /&gt;         count_lines = 0&lt;br /&gt;         count_utf8 = 0&lt;br /&gt;         File.foreach(file) do |line| &lt;br /&gt;            return true if line =~ /(charset|encoding) *= *["']? *utf-?8/i&lt;br /&gt;            count_lines += 1;  &lt;br /&gt;            count_utf8 += 1 if line.utf8?; &lt;br /&gt;            break if count_lines != count_utf8 &lt;br /&gt;         end&lt;br /&gt;&lt;br /&gt;         count_lines == count_utf8 ? (return true) : (return false)&lt;br /&gt;         &lt;br /&gt;      end   &lt;br /&gt;&lt;br /&gt;      str =~ /utf-?8/i ? true : false&lt;br /&gt;&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # cf. Paul Battley, http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/&lt;br /&gt;   def validate_utf8&lt;br /&gt;      Iconv.iconv('UTF-8//IGNORE', 'UTF-8', (self + ' ') ).first[0..-2]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   # cf. Paul Battley, http://www.ruby-forum.com/topic/70357&lt;br /&gt;   def asciify_utf8&lt;br /&gt;       return nil unless self.utf8?&lt;br /&gt;       #Iconv.iconv('US-ASCII//IGNORE//TRANSLIT', 'UTF-8', (self + ' ') ).first[0..-2]&lt;br /&gt;       # delete all punctuation characters inside words except "-" in words such as up-to-date&lt;br /&gt;       Iconv.iconv('US-ASCII//IGNORE//TRANSLIT', 'UTF-8', (self + ' ') ).first[0..-2].gsub(/(?!-.*)\b[[:punct:]]+\b/, '')&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def latin1_to_utf8     # ISO-8859-1 to UTF-8&lt;br /&gt;      ret = Iconv.iconv("UTF-8//IGNORE", "ISO-8859-1", (self + "\x20") ).first[0..-2]&lt;br /&gt;      ret.utf8? ? ret : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def cp1252_to_utf8     # CP1252 (WINDOWS-1252) to UTF-8&lt;br /&gt;      ret = Iconv.iconv("UTF-8//IGNORE", "CP1252", (self + "\x20") ).first[0..-2]&lt;br /&gt;      ret.utf8? ? 
ret : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   # cf. Paul Battley, http://www.ruby-forum.com/topic/70357 &lt;br /&gt;   def utf16le_to_utf8&lt;br /&gt;       ret = Iconv.iconv('UTF-8//IGNORE', 'UTF-16LE', (self[0,(self.length/2*2)] + "\000\000") ).first[0..-2]&lt;br /&gt;       ret =~ /\x00\z/ ?  ret.sub!(/\x00\z/, '') : ret&lt;br /&gt;       ret.utf8? ? ret : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def utf8_to_utf16le&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      ret = Iconv.iconv('UTF-16LE//IGNORE', 'UTF-8', self ).first&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def utf8_to_unicode&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      scan(/./mu) { |c| str &lt;&lt; "U+" &lt;&lt; sprintf("%04X", c.unpack("U*").first) }&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def unicode_to_utf8&lt;br /&gt;      return self if self =~ /\A[[:space:]]*\z/m&lt;br /&gt;      str = ""&lt;br /&gt;      #scan(/U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})/) { |u| str &lt;&lt; [u.first.hex].pack("U*") }&lt;br /&gt;      #scan(/U\+([[:digit:][:xdigit:]]{4,5}|10[[:digit:][:xdigit:]]{4})/) { |u| str &lt;&lt; [u.first.hex].pack("U*") }&lt;br /&gt;      scan(/(U\+(?:[[:digit:][:xdigit:]]{4,5}|10[[:digit:][:xdigit:]]{4})|.)/mu) do        # for mixed strings such as "U+00bfHabla espaU+00f1ol?"&lt;br /&gt;         c = $1&lt;br /&gt;         if c =~ /^U\+/&lt;br /&gt;            str &lt;&lt; [c[2..-1].hex].pack("U*")&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c&lt;br /&gt;         end       &lt;br /&gt;      end&lt;br /&gt;      str.utf8? ? str : nil&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   # dec, hex, oct conversions (experimental!)&lt;br /&gt;&lt;br /&gt;   def utf8_to_dec&lt;br /&gt;      return nil unless self.utf8?
&lt;br /&gt;      str = ""&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00"  # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; sprintf("%04X", c.unpack("U*").first).hex.to_s &lt;&lt; "\x00"   # convert to decimal&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-2]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def dec_to_utf8   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?&lt;br /&gt;      return nil unless self =~ /\A[[:digit:]]+\x00/ &amp;&amp; self =~ /\A[a[:digit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00/).each do |c|&lt;br /&gt;         if c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; [c.to_i].pack("U*")&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_to_dec_2&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      tmpstr = ""&lt;br /&gt;      null_str = "\x00"&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00\x00"  # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            tmpstr = ""&lt;br /&gt;            c.each_byte { |x| tmpstr &lt;&lt; x.to_s &lt;&lt; null_str }      # convert to decimal&lt;br /&gt;            str &lt;&lt; tmpstr &lt;&lt; null_str&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-3]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def dec_to_utf8_2   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?
&lt;br /&gt;      return nil unless self =~ /\A[[:digit:]]+\x00/ &amp;&amp; self =~ /[[:digit:]]+\x00\x00/ &amp;&amp; self =~ /\A[a[:digit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00\x00/).each do |c|&lt;br /&gt;         if c =~ /\x00/&lt;br /&gt;            c.split(/\x00/).each { |x| str &lt;&lt; x.to_i.chr }&lt;br /&gt;         elsif c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c.to_i.chr&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_to_hex&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      tmpstr = ""&lt;br /&gt;      null_str = "\x00"&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00\x00"    # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            tmpstr = ""&lt;br /&gt;            c.each_byte { |x| tmpstr &lt;&lt; sprintf("%X", x) &lt;&lt; null_str }      # convert to hexadecimal&lt;br /&gt;            str &lt;&lt; tmpstr &lt;&lt; null_str&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-3]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def hex_to_utf8   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?
&lt;br /&gt;      return nil unless self =~ /\A[[:xdigit:]]+\x00/ &amp;&amp; self =~ /[[:xdigit:]]+\x00\x00/ &amp;&amp; self =~ /\A[a[:xdigit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00\x00/).each do |c|&lt;br /&gt;         if c =~ /\x00/&lt;br /&gt;            c.split(/\x00/).each { |x| str &lt;&lt; x.hex.chr }&lt;br /&gt;         elsif c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c.hex.chr&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;   def utf8_to_oct&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      str = ""&lt;br /&gt;      tmpstr = ""&lt;br /&gt;      null_str = "\x00"&lt;br /&gt;      scan(/./mu) do |c| &lt;br /&gt;         if c =~ /^\x00$/&lt;br /&gt;            str &lt;&lt; "aaa\x00\x00"   # encode \x00 as "aaa"&lt;br /&gt;         else&lt;br /&gt;            tmpstr = ""&lt;br /&gt;            c.each_byte { |x| tmpstr &lt;&lt; sprintf("%o", x) &lt;&lt; null_str }      # convert to octal&lt;br /&gt;            str &lt;&lt; tmpstr &lt;&lt; null_str&lt;br /&gt;         end&lt;br /&gt;      end     &lt;br /&gt;      str[0..-3]&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   def oct_to_utf8   # \x00 is encoded as "aaa"&lt;br /&gt;      return self if self.empty?
&lt;br /&gt;      return nil unless self =~ /\A[[:digit:]]+\x00/ &amp;&amp; self =~ /[[:digit:]]+\x00\x00/ &amp;&amp; self =~ /\A[a[:digit:]\x00]+\z/&lt;br /&gt;      str = ""&lt;br /&gt;      split(/\x00\x00/).each do |c|&lt;br /&gt;         if c =~ /\x00/&lt;br /&gt;            c.split(/\x00/).each { |x| str &lt;&lt; x.oct.chr }&lt;br /&gt;         elsif c.eql?("aaa")&lt;br /&gt;            str &lt;&lt; "\x00"&lt;br /&gt;         else&lt;br /&gt;            str &lt;&lt; c.oct.chr&lt;br /&gt;         end&lt;br /&gt;      end&lt;br /&gt;      str&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   # cf. http://node-0.mneisen.org/2007/03/13/email-subjects-in-utf-8-mit-ruby-kodieren/&lt;br /&gt;   def email_subject_utf8&lt;br /&gt;      return nil unless self.utf8?&lt;br /&gt;      "=?utf-8?b?#{[self].pack("m").delete("\n")}?="&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts String.downcase_table_utf8.to_s&lt;br /&gt;&lt;br /&gt;#puts String.letters_utf8.to_s&lt;br /&gt;#String.letters_utf8.each { |c| puts "#{c.inspect} ::  #{c}" }&lt;br /&gt;&lt;br /&gt;str = "&#338;uvres Compl&#232;tes"&lt;br /&gt;str = "&#338;uvres \000Compl&#232;tes"&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;str = str.validate_utf8; p str&lt;br /&gt;str = str.clean_utf8; p str&lt;br /&gt;str.utf8?  ? 
"#{str}: UTF-8 string seems OK!\n".display : "#{str}: No valid UTF-8 string!\n".display&lt;br /&gt;puts str.asciify_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;str_in_utf8 = "\303\251"&lt;br /&gt;print "UTF-16:   "; p Iconv.iconv('UTF-16', 'UTF-8', str_in_utf8 ).first&lt;br /&gt;print "UTF-16BE: "; p Iconv.iconv('UTF-16BE', 'UTF-8', str_in_utf8 ).first&lt;br /&gt;print "UTF-16LE: "; p str_in_utf8.utf8_to_utf16le&lt;br /&gt;str_in_utf16le = "c\000a\000f\000\351\000"&lt;br /&gt;puts str_in_utf16le.utf16le_to_utf8&lt;br /&gt;puts str_in_utf16le.utf16le_to_utf8.asciify_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts str.upcase_utf8&lt;br /&gt;puts str.downcase_utf8&lt;br /&gt;puts str.capitalize_utf8&lt;br /&gt;puts str.capitalize_utf8!&lt;br /&gt;puts str.swapcase_utf8&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".swapcase_utf8&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".swapcase_utf8!&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts str.slice_utf8(/../i)&lt;br /&gt;puts str.slice_utf8(/(.).*?\1/i)&lt;br /&gt;puts "&#224;&#192;".slice_utf8(/(.).*?\1/i)   # =&gt; nil despite the i option!
&lt;br /&gt;puts "aA".slice(/(.).*?\1/i)        # =&gt; aA&lt;br /&gt;puts "&#224;&#192; &#224;&#192;".slice_utf8!(/([&#224;&#192;]).*?\1/i)&lt;br /&gt;puts "&#224;&#192; &#224;&#192;".slice_utf8!(/(.).*?\1/ium)&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".slice_utf8!(/(.).*?\1/ium)&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;str.capitalize_utf8.each_utf8_char_with_index { |c,i| puts "#{i}: #{c}" }&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts str.range_utf8(0..2)&lt;br /&gt;puts str.range_utf8(0..-2)&lt;br /&gt;puts str.range_utf8(-4..-1)&lt;br /&gt;puts str.range_utf8(-3..-1)&lt;br /&gt;puts str.range_utf8(-3...-1)&lt;br /&gt;puts str.range_utf8([-3..-1])&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p str.scan_utf8(/./)&lt;br /&gt;"&#224;cA&#32459;f&#233;&#224;".scan_utf8(/./) { |c| puts c }&lt;br /&gt;"&#224;cA&#32459;f&#233;&#224;".scan_utf8(/(.)(.)?/) { |a,b| print a,b,"\n" }&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "&#224;cA&#32459;f&#233;&#224;".index_utf8('&#32459;')&lt;br /&gt;p "&#224;cA&#32459;f&#233;&#224;".index_utf8('&#32459;f')&lt;br /&gt;p "&#224;cA&#32459;f&#233;&#224;".index_utf8('z')&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!fz".index_utf8('9&#32459;!fz')&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz".index_utf8(/9&#32459;!fz/)&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".index_utf8(//)&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8('9&#32459;!fz')
&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8(/9&#32459;!fz/)&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8(/9..fz/)&lt;br /&gt;p "kf&#233;&#224; &#32459;f &#224;c &#32459; 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f 9&#32459;!ofz 9&#32459;!fz kf&#233;&#224; &#32459;f &#224;c &#32459; 9h&#32459;!fz 9&#32459;!fz A&#32459;kf&#233;&#224; &#32459;f".rindex_utf8(//)&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".utf8_to_utf16le.utf16le_to_utf8&lt;br /&gt;puts "&#224;cA&#32459;f&#233;&#224;".utf8_to_utf16le.utf16le_to_utf8.asciify_utf8&lt;br /&gt;puts "&#224;&#192;".slice_utf8(/../i)&lt;br /&gt;puts "&#224;&#192;".slice_utf8!(/../i)&lt;br /&gt;&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".count_utf8('&#32459;')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".count_utf8('&#224;&#192;')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".count_utf8('z')&lt;br /&gt;puts "&#32459; &#224;&#192;/ ^&#32459; &#224;&#192;".count_utf8('/&#32459;^')&lt;br /&gt;puts "&#32459; &#224;&#192;/ ^&#32459; &#224;&#192;".count_utf8('^/&#32459;^')  # count all chars except those specified; note that the leading ^ will result in the regex: /[^\/&#32459;^]/u&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".delete_utf8('&#224;&#192; ')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192; &#32459; &#224;&#192; &#32459; &#224;&#192;".delete_utf8!('&#607;&#32459;&#224; &#230;&#165;')&lt;br /&gt;&lt;br /&gt;puts str.cut_utf8(0,5)&lt;br /&gt;puts str.cut_utf8(-5,5)
&lt;br /&gt;puts str.cut_utf8(-10,50)&lt;br /&gt;&lt;br /&gt;puts str.length_utf8&lt;br /&gt;puts str.size_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".first_utf8&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".last_utf8&lt;br /&gt;p "&#32459; &#224;&#192; &#32459; &#224;&#192;\n".last_utf8&lt;br /&gt;puts "".first_utf8&lt;br /&gt;&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".starts_with_utf8?('&#32459;')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".ends_with_utf8?('k')&lt;br /&gt;puts "".ends_with_utf8?('k')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;&#192;".ends_with_utf8?('')&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;".starts_with_utf8?('&#32459; &#224;&#192; &#32459; &#224;&#192;')&lt;br /&gt;&lt;br /&gt;puts "&#32459; &#224;&#192; &#32459; &#224;".insert_utf8(20, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(2, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(-2, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(-200, "abc")&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;".insert_utf8(200, "abc")&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "Hello, world!".utf8_to_unicode&lt;br /&gt;p "&#32459;&#224;&#192;&#32459;&#224;".utf8_to_unicode&lt;br /&gt;p "&#32459;&#224;&#192;&#32459;&#224;&#66374;".utf8_to_unicode&lt;br /&gt;&lt;br /&gt;puts "Hello, world!".utf8_to_unicode.unicode_to_utf8&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;&#66374;".utf8_to_unicode.unicode_to_utf8&lt;br /&gt;puts "&#32459;&#224;&#192;&#32459;&#224;&#66374;".size_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;encoded_file = "/ISO-8859-Latin-1.txt"&lt;br /&gt;encoded_file = "/cp1252.txt"&lt;br /&gt;&lt;br /&gt;File.open(encoded_file).read.each(nil) do |str| &lt;br /&gt;   p str&lt;br /&gt;   #str = str.latin1_to_utf8&lt;br /&gt;   str = str.cp1252_to_utf8
&lt;br /&gt;   p str&lt;br /&gt;   puts str&lt;br /&gt;   str.utf8? ? (puts "UTF-8 conversion - YES") : (puts "UTF-8 conversion - NO") &lt;br /&gt;end &lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts "U+00bfHabla espaU+00f1ol?".unicode_to_utf8&lt;br /&gt;&lt;br /&gt;# cf. http://www.decodeunicode.org/en/miscellaneous_symbols&lt;br /&gt;code_points = &lt;&lt;-EOS&lt;br /&gt;U+2603   SNOWMAN&lt;br /&gt;U+2708   AIRPLANE&lt;br /&gt;U+00a9   COPYRIGHT SIGN&lt;br /&gt;U+2615   HOT BEVERAGE&lt;br /&gt;U+2602   UMBRELLA&lt;br /&gt;U+2614   UMBRELLA WITH RAIN DROPS&lt;br /&gt;U+261D   WHITE UP POINTING INDEX&lt;br /&gt;U+2620   SKULL AND CROSSBONES&lt;br /&gt;U+262F   YIN YANG&lt;br /&gt;U+262E   PEACE SYMBOL&lt;br /&gt;U+263A   WHITE SMILING FACE&lt;br /&gt;EOS&lt;br /&gt;&lt;br /&gt;puts code_points.unicode_to_utf8&lt;br /&gt;&lt;br /&gt;# see:&lt;br /&gt;# - http://intertwingly.net/stories/2004/04/14/i18n.html (I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n)&lt;br /&gt;# - http://www.intertwingly.net/blog/1763.html (Unicode and weblogs)&lt;br /&gt;# - http://www.intertwingly.net/blog/1768.html (UTF-8 musings)&lt;br /&gt;&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".asciify_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".utf8_to_unicode&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".utf8_to_unicode.unicode_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".size_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n".upcase_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;# NOTE: To convert the following UTF-8 strings containing a \x00 to dec, hex or oct you have to add \x00 to UTF8REGEX:  [\x00\x09\x0A\x0D\x20-\x7E]            # ASCII &lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec&lt;br /&gt;puts 
"I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec.dec_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_dec.dec_to_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex.hex_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_hex.hex_to_utf8&lt;br /&gt;    &lt;br /&gt;puts&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct&lt;br /&gt;p "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct.oct_to_utf8&lt;br /&gt;puts "I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;\x00n".utf8_to_oct.oct_to_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;puts '"Hello, world" in Portuguese: "Ol&#225; Mundo" or "Al&#244; Mundo" (Portugu&#234;s)'.email_subject_utf8&lt;br /&gt;&lt;br /&gt;puts&lt;br /&gt;file = "http://www.ruby-forum.com"&lt;br /&gt;file = "http://blade.nagaokaut.ac.jp"&lt;br /&gt;file = "http://blade.nagaokaut.ac.jp/ruby/ruby-talk/index.shtml"&lt;br /&gt;file = "http://www.columbia.edu/kermit/utf8.html"   #  UTF-8 SAMPLER&lt;br /&gt;&lt;br /&gt;p file.utf8_encoded_file?
&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;require 'open-uri'  &lt;br /&gt;  &lt;br /&gt;# UnicodeData.txt&lt;br /&gt;unicode_array = []&lt;br /&gt;&lt;br /&gt;open('http://unicode.org/Public/UNIDATA/UnicodeData.txt') do |f| &lt;br /&gt;   #f.each(nil) do |line| line.scan(/^[^;]+/) { |u| unicode_array &lt;&lt; u } end       # all code points&lt;br /&gt;   f.each do |line| line =~ /LATIN|GREEK|CYRILLIC/ ?  ( line.scan(/^[^;]+/) { |u| unicode_array &lt;&lt; u } ) : next end&lt;br /&gt;end&lt;br /&gt;unicode_array.each { |x| u = [x.hex].pack("U*"); u.utf8? ? (puts "U+#{x} ::  #{u.inspect}  ::  #{u}") : (puts "U+#{x} ::  #{u.inspect}  ::  #{u}  :: NO!") } &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;class Array&lt;br /&gt;   def dups_indices   # cf. http://www.ruby-forum.com/topic/122008 and http://snippets.dzone.com/posts/show/4148&lt;br /&gt;      (0...self.size).to_a - self.uniq.map{ |x| index(x) }&lt;br /&gt;   end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;#  CaseFolding.txt&lt;br /&gt;capital_letters_utf8 = []&lt;br /&gt;small_letters_utf8 = []&lt;br /&gt;&lt;br /&gt;open('http://www.unicode.org/Public/UNIDATA/CaseFolding.txt') do |f| &lt;br /&gt;   f.each do |line| &lt;br /&gt;      if line =~ /.*/ &lt;br /&gt;      #if line =~ /LATIN|GREEK|CYRILLIC/ &lt;br /&gt;         line.scan(/^([^;#]+); +\S+ ([^;\s]+)/) { capital_letters_utf8 &lt;&lt; [$1.hex].pack("U*"); small_letters_utf8 &lt;&lt; [$2.hex].pack("U*") }&lt;br /&gt;      end&lt;br /&gt;   end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;puts small_letters_utf8.size, capital_letters_utf8.size&lt;br /&gt;deleted_pairs = []&lt;br /&gt;small_letters_utf8.dups_indices.reverse.each do |i|   # small_letters_utf8 will be array_with_keys below&lt;br /&gt;   deleted_pairs &lt;&lt; [small_letters_utf8.at(i), capital_letters_utf8.at(i)]&lt;br /&gt;   small_letters_utf8.delete_at(i); capital_letters_utf8.delete_at(i)&lt;br /&gt;end&lt;br /&gt;puts small_letters_utf8.size, capital_letters_utf8.size&lt;br 
/&gt;&lt;br /&gt;# Hash[*array_with_keys.zip(array_with_values).flatten]&lt;br /&gt;upcase_table_utf8 = Hash[*small_letters_utf8.zip(capital_letters_utf8).flatten]&lt;br /&gt;#upcase_table_utf8.each_pair { |k,v| puts "#{k} :: #{v}" }&lt;br /&gt;&lt;br /&gt;puts upcase_table_utf8["a"]&lt;br /&gt;puts upcase_table_utf8["&#7834;"]&lt;br /&gt;puts upcase_table_utf8.value?("A")&lt;br /&gt;&lt;br /&gt;deleted_pairs.each { |s,c| puts "deleted:  #{s}   ::   #{c}" }&lt;br /&gt;&lt;br /&gt;upcase_table_utf8.size.times do |i|&lt;br /&gt;#20.times do |i|&lt;br /&gt;   puts "array index #{i}  ::  #{small_letters_utf8.at(i)}  ::  #{capital_letters_utf8.at(i)}"&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;</description>
      <pubDate>Tue, 11 Sep 2007 18:09:13 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4527</guid>
      <author>ntk ()</author>
    </item>
    <item>
      <title>PHP: multibyte function to get string length</title>
      <link>http://snippets.dzone.com/posts/show/4145</link>
      <description>&lt;code&gt;&lt;br /&gt;	/**&lt;br /&gt;	 * Get string length, multibyte.&lt;br /&gt;	 * Fallback definition, used only when the mbstring extension (--enable-mbstring) is not available.&lt;br /&gt;	 *&lt;br /&gt;	 * @param   string  $t Any string content&lt;br /&gt;	 * @param   string  $encoding Charset encoding (ignored by this fallback, which assumes UTF-8)&lt;br /&gt;	 * @return  int     String length&lt;br /&gt;	 */&lt;br /&gt;	if (!function_exists('mb_strlen'))&lt;br /&gt;	{&lt;br /&gt;		function mb_strlen($t, $encoding = 'UTF-8')&lt;br /&gt;		{&lt;br /&gt;			/* utf8_decode() turns each UTF-8 character into a single byte ('?' for anything outside ISO-8859-1), so strlen() counts characters */&lt;br /&gt;			return strlen(utf8_decode($t));&lt;br /&gt;		}&lt;br /&gt;	}&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Fri, 15 Jun 2007 05:35:44 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4145</guid>
      <author>Dmitry-Sh (Dmitry Shilnikov)</author>
    </item>
    <item>
      <title>Adding UTF8 methods to class String in Ruby</title>
      <link>http://snippets.dzone.com/posts/show/2786</link>
      <description>From: http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html (in the comments)&lt;br /&gt;Requirement: sudo gem install character-encodings --remote&lt;br /&gt;&lt;br /&gt;For the module Encoding::Character::UTF8::Methods see the file called utf-8.rb &lt;br /&gt;in the source code of character-encodings.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;require('encoding/character/utf-8')&lt;br /&gt;&lt;br /&gt;class Proc&lt;br /&gt;&lt;br /&gt;  def uStringMethods()&lt;br /&gt;    umethods = []&lt;br /&gt;    Encoding::Character::UTF8.methods.each do |m|    &lt;br /&gt;      umethods.push(%!&lt;br /&gt;          define_method("u#{m}") do |*args|&lt;br /&gt;            Encoding::Character::UTF8.#{m}(self, *args)&lt;br /&gt;          end  # unless instance_methods.include?("u#{m}")&lt;br /&gt;        !)&lt;br /&gt;    end&lt;br /&gt;&lt;br /&gt;    #puts umethods&lt;br /&gt;    umethods = umethods.reject { |m| m =~ /taguri/ }&lt;br /&gt;&lt;br /&gt;    String.class_eval(umethods.join)&lt;br /&gt;  end&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Proc.new {}.uStringMethods()     #  adds methods defined in module Encoding::Character::UTF8::Methods to class String&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;puts "caf\303\251".length      #=&gt;  5&lt;br /&gt;puts "caf\303\251".ulength     #=&gt;  4&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;#puts String.public_methods.select { |x| x =~ /^u/ }.sort&lt;br /&gt;#puts String.new.public_methods.select { |x| x =~ /^u/ }.sort&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Thu, 05 Oct 2006 19:06:11 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/2786</guid>
      <author>ntk ()</author>
    </item>
    <item>
      <title>UTF-8 compatible String ranges in Ruby</title>
      <link>http://snippets.dzone.com/posts/show/2553</link>
      <description>As found at &lt;a href="http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/123935"&gt;http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/123935&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;class String&lt;br /&gt;        def [] (*params)&lt;br /&gt;                if params.all? { |p| Integer===p } ||&lt;br /&gt;                   params.size==1 &amp;&amp; Range===params[0]&lt;br /&gt;                        res = self.unpack("U*").[](*params)&lt;br /&gt;                        res = [res] unless Array===res&lt;br /&gt;                        return res.pack("U*")&lt;br /&gt;                end&lt;br /&gt;                super&lt;br /&gt;        end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Wed, 06 Sep 2006 06:25:55 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/2553</guid>
      <author>jswizard (JavaScript Wizard)</author>
    </item>
    <item>
      <title>Parsing UTF-8 encoded strings in Ruby</title>
      <link>http://snippets.dzone.com/posts/show/1659</link>
      <description>Instead of setting $KCODE = 'UTF8' together with require 'jcode', you can use the /u regex option&lt;br /&gt;to parse UTF-8 strings containing multibyte characters.&lt;br /&gt;&lt;br /&gt;By the way, a Latin1 &lt;-&gt; UTF-8 conversion hack can be found here: &lt;br /&gt;http://rubyforge.org/pipermail/fxruby-users/2005-September/000480.html&lt;br /&gt;&lt;br /&gt;For comparison, just drop the u option!&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;br /&gt;string = "abc\303\244"  #  \303\244 stands for &#228;&lt;br /&gt;&lt;br /&gt;puts string.scan(/./u).size          # =&gt; 4 (characters, not bytes)&lt;br /&gt;&lt;br /&gt;puts string.split(//u).reverse.join  # =&gt; &#228;cba&lt;br /&gt;&lt;br /&gt;puts string.gsub(/.$/u, '')          # =&gt; abc&lt;br /&gt;&lt;br /&gt;regex = Regexp.new(/..../u)&lt;br /&gt;md = regex.match(string)&lt;br /&gt;puts md[0].inspect&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Wed, 08 Mar 2006 16:28:11 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/1659</guid>
      <author>ntk ()</author>
    </item>
  </channel>
</rss>
