<?xml version="1.0"?>
<rss version="2.0" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:yt="http://gdata.youtube.com/schemas/2007">
   <channel>
      <title>hadoop mailing list atom feed</title>
      <description>Pipes Output</description>
      <link>http://pipes.yahoo.com/pipes/pipe.info?_id=qtPlGpw33hGuPsNH6ycw5g</link>
      <pubDate>Sun, 22 Nov 2009 15:45:34 -0800</pubDate>
      <generator>http://pipes.yahoo.com/pipes/</generator>
      <item>
         <title>building the eclipse plugin</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4A482ADC.3060604@gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
hi all -- Just wondering how to build the eclipse plugin. &quot;ant binary&quot; does not seem to catch it. I would like to experiment with a few changes. thanks!
Brien &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>brien colwell</author>
         <guid isPermaLink="false">urn:uuid:%3c4A482ADC-3060604@gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 19:45:48 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3ce76221610906281532g5ce298ean38eab4155e013c32@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Marcus,

The code that needs to be patched is in the tasktracker, because the
tasktracker is what starts the child JVM that runs user code.

Thanks,
Mikhail

On Sun, Jun 28, 2009 at 6:14 PM, Marcus Herou &amp;lt;marcus.herou@tailsweep.com&amp;gt; wrote:

&amp;gt; Hi.
&amp;gt;
&amp;gt; Just to be clear. It is the jobtracker that needs the patched code right ?
&amp;gt; Or is it the tasktrackers ?
&amp;gt;
&amp;gt; Kindly
&amp;gt;
&amp;gt; //Marcus
&amp;gt;
&amp;gt; On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin &amp;lt;mbautin@gmail.com&amp;gt;
&amp;gt; wrote:
&amp;gt;
&amp;gt; &amp;gt; Marcus,
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; We currently use 0.20.0 but this patch just inserts 8 lines of code into
&amp;gt; &amp;gt; TaskRunner.java, which could certainly be done with 0.18.3.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Yes, this patch just appends additional jars to the child JVM classpath.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; I've never really used tmpjars myself, but if it involves uploading
&amp;gt; &amp;gt; multiple
&amp;gt; &amp;gt; jar files into HDFS every time a job is started, I see how it can be
&amp;gt; really
&amp;gt; &amp;gt; slow. On our ~80-job workflow this would have really slowed things down.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Thanks,
&amp;gt; &amp;gt; Mikhail
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou &amp;lt;
&amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Makes sense... I will try both rsync and NFS but I think rsync will
&amp;gt; beat
&amp;gt; &amp;gt; &amp;gt; NFS
&amp;gt; &amp;gt; &amp;gt; since NFS can be slow as hell sometimes but what the heck we already
&amp;gt; have
&amp;gt; &amp;gt; &amp;gt; our maven2 repo on NFS so why not :)
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Are you saying that this patch makes the client able to configure which
&amp;gt; &amp;gt; &amp;gt; &quot;extra&quot; local jar files to add as classpath when firing up the
&amp;gt; &amp;gt; &amp;gt; TaskTrackerChild ?
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; To be explicit: Do you confirm that using tmpjars like I do is a
&amp;gt; costly
&amp;gt; &amp;gt; &amp;gt; slow operation ?
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; To what branch do you apply the patch (we use 0.18.3) ?
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Cheers
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin &amp;lt;mbautin@gmail.com&amp;gt;
&amp;gt; &amp;gt; &amp;gt; wrote:
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; This is the way we deal with this problem, too. We put our jar files
&amp;gt; on
&amp;gt; &amp;gt; &amp;gt; &amp;gt; NFS, and the attached patch makes it possible to add those jar files to
&amp;gt; &amp;gt; the
&amp;gt; &amp;gt; &amp;gt; &amp;gt; tasktracker classpath through a configuration property.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; Thanks,
&amp;gt; &amp;gt; &amp;gt; &amp;gt; Mikhail
&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 5:21 PM, Stuart White &amp;lt;
&amp;gt; stuart.white1@gmail.com
&amp;gt; &amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; Although I've never done it, I believe you could manually copy your
&amp;gt; &amp;gt; jar
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; files out to your cluster somewhere in hadoop's classpath, and that
&amp;gt; &amp;gt; &amp;gt; would
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; remove the need for you to copy them to your cluster at the start
of
&amp;gt; &amp;gt; &amp;gt; each
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; job.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou &amp;lt;
&amp;gt; &amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Running without a jobtracker makes the job start almost instantly.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; I think it is due to something with the classloader. I use a
huge
&amp;gt; &amp;gt; &amp;gt; amount
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; of
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; jarfiles jobConf.set(&quot;tmpjars&quot;, &quot;jar1.jar,jar2.jar&quot;)... which
need
&amp;gt; &amp;gt; to
&amp;gt; &amp;gt; &amp;gt; be
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; loaded every time I guess.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; By issuing conf.setNumTasksToExecutePerJvm(-1); will the
&amp;gt; TaskTracker
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; child
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; live forever then ?
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Cheers
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:54 PM, tim robertson &amp;lt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; timrobertson100@gmail.com
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; How long does it take to start the code locally in a single
&amp;gt; &amp;gt; thread?
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Can you reuse the JVM so it only starts once per node per
job?
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; conf.setNumTasksToExecutePerJvm(-1)
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Cheers,
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Tim
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou&amp;lt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; wrote:
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Wonder how one should improve the startup times of
a hadoop
&amp;gt; job.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; Some
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; of
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; my
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; jobs which have a lot of dependencies in terms of many
jar
&amp;gt; files
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; take a
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; long
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; time to start in hadoop up to 2 minutes some times.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; The data input amounts in these cases are negligible
so it seems
&amp;gt; &amp;gt; &amp;gt; that
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Hadoop
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; have a really high setup cost, which I can live with
but this
&amp;gt; &amp;gt; &amp;gt; seems
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; to
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; much.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Let's say a job takes 10 minutes to complete then it
is bad if
&amp;gt; &amp;gt; it
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; takes
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; 2
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; mins to set it up... 20-30 sec max would be a lot more
&amp;gt; &amp;gt; reasonable.
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hints ?
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; --
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; --
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; +46702561312
&amp;gt; marcus.herou@tailsweep.com
&amp;gt; http://www.tailsweep.com/
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Mikhail Bautin</author>
         <guid isPermaLink="false">urn:uuid:%3ce76221610906281532g5ce298ean38eab4155e013c32@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 15:32:40 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7e536b1f0906281514j4806727fs65e66ef11885f216@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hi.

Just to be clear. It is the jobtracker that needs the patched code right ?
Or is it the tasktrackers ?

Kindly

//Marcus

On Mon, Jun 29, 2009 at 12:08 AM, Mikhail Bautin &amp;lt;mbautin@gmail.com&amp;gt; wrote:

&amp;gt; Marcus,
&amp;gt;
&amp;gt; We currently use 0.20.0 but this patch just inserts 8 lines of code into
&amp;gt; TaskRunner.java, which could certainly be done with 0.18.3.
&amp;gt;
&amp;gt; Yes, this patch just appends additional jars to the child JVM classpath.
&amp;gt;
&amp;gt; I've never really used tmpjars myself, but if it involves uploading
&amp;gt; multiple
&amp;gt; jar files into HDFS every time a job is started, I see how it can be really
&amp;gt; slow. On our ~80-job workflow this would have really slowed things down.
&amp;gt;
&amp;gt; Thanks,
&amp;gt; Mikhail
&amp;gt;
&amp;gt; On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou &amp;lt;marcus.herou@tailsweep.com
&amp;gt; &amp;gt;wrote:
&amp;gt;
&amp;gt; &amp;gt; Makes sense... I will try both rsync and NFS but I think rsync will beat
&amp;gt; &amp;gt; NFS
&amp;gt; &amp;gt; since NFS can be slow as hell sometimes but what the heck we already have
&amp;gt; &amp;gt; our maven2 repo on NFS so why not :)
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Are you saying that this patch makes the client able to configure which
&amp;gt; &amp;gt; &quot;extra&quot; local jar files to add as classpath when firing up the
&amp;gt; &amp;gt; TaskTrackerChild ?
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; To be explicit: Do you confirm that using tmpjars like I do is a costly
&amp;gt; &amp;gt; slow operation ?
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; To what branch do you apply the patch (we use 0.18.3) ?
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Cheers
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin &amp;lt;mbautin@gmail.com&amp;gt;
&amp;gt; &amp;gt; wrote:
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; This is the way we deal with this problem, too. We put our jar files on
&amp;gt; &amp;gt; &amp;gt; NFS, and the attached patch makes it possible to add those jar files to
&amp;gt; the
&amp;gt; &amp;gt; &amp;gt; tasktracker classpath through a configuration property.
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Thanks,
&amp;gt; &amp;gt; &amp;gt; Mikhail
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 5:21 PM, Stuart White &amp;lt;stuart.white1@gmail.com
&amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Although I've never done it, I believe you could manually copy your
&amp;gt; jar
&amp;gt; &amp;gt; &amp;gt;&amp;gt; files out to your cluster somewhere in hadoop's classpath, and that
&amp;gt; &amp;gt; would
&amp;gt; &amp;gt; &amp;gt;&amp;gt; remove the need for you to copy them to your cluster at the start of
&amp;gt; &amp;gt; each
&amp;gt; &amp;gt; &amp;gt;&amp;gt; job.
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou &amp;lt;
&amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Running without a jobtracker makes the job start almost instantly.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; I think it is due to something with the classloader. I use a huge
&amp;gt; &amp;gt; amount
&amp;gt; &amp;gt; &amp;gt;&amp;gt; of
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; jarfiles jobConf.set(&quot;tmpjars&quot;, &quot;jar1.jar,jar2.jar&quot;)... which need
&amp;gt; to
&amp;gt; &amp;gt; be
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; loaded every time I guess.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; By issuing conf.setNumTasksToExecutePerJvm(-1); will the TaskTracker
&amp;gt; &amp;gt; &amp;gt;&amp;gt; child
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; live forever then ?
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Cheers
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:54 PM, tim robertson &amp;lt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; timrobertson100@gmail.com
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; How long does it take to start the code locally in a single
&amp;gt; thread?
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Can you reuse the JVM so it only starts once per node per job?
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; conf.setNumTasksToExecutePerJvm(-1)
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Cheers,
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Tim
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou&amp;lt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; wrote:
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Wonder how one should improve the startup times of a hadoop
job.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Some
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; of
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; my
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; jobs which have a lot of dependencies in terms of many jar
files
&amp;gt; &amp;gt; &amp;gt;&amp;gt; take a
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; long
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; time to start in hadoop up to 2 minutes some times.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; The data input amounts in these cases are negligible so it
seems
&amp;gt; &amp;gt; that
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Hadoop
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; have a really high setup cost, which I can live with but
this
&amp;gt; &amp;gt; seems
&amp;gt; &amp;gt; &amp;gt;&amp;gt; to
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; much.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Let's say a job takes 10 minutes to complete then it is
bad if
&amp;gt; it
&amp;gt; &amp;gt; &amp;gt;&amp;gt; takes
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; 2
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; mins to set it up... 20-30 sec max would be a lot more
&amp;gt; reasonable.
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hints ?
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; --
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; --
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt;

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Marcus Herou</author>
         <guid isPermaLink="false">urn:uuid:%3c7e536b1f0906281514j4806727fs65e66ef11885f216@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 15:14:33 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3ce76221610906281508h40b82bebq6d8b6bbfcf401bc4@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Marcus,

We currently use 0.20.0 but this patch just inserts 8 lines of code into
TaskRunner.java, which could certainly be done with 0.18.3.

Yes, this patch just appends additional jars to the child JVM classpath.

I've never really used tmpjars myself, but if it involves uploading multiple
jar files into HDFS every time a job is started, I see how it can be really
slow. On our ~80-job workflow this would have really slowed things down.

Thanks,
Mikhail

On Sun, Jun 28, 2009 at 5:40 PM, Marcus Herou &amp;lt;marcus.herou@tailsweep.com&amp;gt; wrote:

&amp;gt; Makes sense... I will try both rsync and NFS but I think rsync will beat
&amp;gt; NFS
&amp;gt; since NFS can be slow as hell sometimes but what the heck we already have
&amp;gt; our maven2 repo on NFS so why not :)
&amp;gt;
&amp;gt; Are you saying that this patch makes the client able to configure which
&amp;gt; &quot;extra&quot; local jar files to add as classpath when firing up the
&amp;gt; TaskTrackerChild ?
&amp;gt;
&amp;gt; To be explicit: Do you confirm that using tmpjars like I do is a costly
&amp;gt; slow operation ?
&amp;gt;
&amp;gt; To what branch do you apply the patch (we use 0.18.3) ?
&amp;gt;
&amp;gt; Cheers
&amp;gt;
&amp;gt; //Marcus
&amp;gt;
&amp;gt;
&amp;gt; On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin &amp;lt;mbautin@gmail.com&amp;gt;
&amp;gt; wrote:
&amp;gt;
&amp;gt; &amp;gt; This is the way we deal with this problem, too. We put our jar files on
&amp;gt; &amp;gt; NFS, and the attached patch makes it possible to add those jar files to the
&amp;gt; &amp;gt; tasktracker classpath through a configuration property.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Thanks,
&amp;gt; &amp;gt; Mikhail
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 5:21 PM, Stuart White &amp;lt;stuart.white1@gmail.com
&amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; Although I've never done it, I believe you could manually copy your jar
&amp;gt; &amp;gt;&amp;gt; files out to your cluster somewhere in hadoop's classpath, and that
&amp;gt; would
&amp;gt; &amp;gt;&amp;gt; remove the need for you to copy them to your cluster at the start of
&amp;gt; each
&amp;gt; &amp;gt;&amp;gt; job.
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou &amp;lt;
&amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt;&amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; Running without a jobtracker makes the job start almost instantly.
&amp;gt; &amp;gt;&amp;gt; &amp;gt; I think it is due to something with the classloader. I use a huge
&amp;gt; amount
&amp;gt; &amp;gt;&amp;gt; of
&amp;gt; &amp;gt;&amp;gt; &amp;gt; jarfiles jobConf.set(&quot;tmpjars&quot;, &quot;jar1.jar,jar2.jar&quot;)... which need to
&amp;gt; be
&amp;gt; &amp;gt;&amp;gt; &amp;gt; loaded every time I guess.
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; By issuing conf.setNumTasksToExecutePerJvm(-1); will the TaskTracker
&amp;gt; &amp;gt;&amp;gt; child
&amp;gt; &amp;gt;&amp;gt; &amp;gt; live forever then ?
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; Cheers
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:54 PM, tim robertson &amp;lt;
&amp;gt; &amp;gt;&amp;gt; timrobertson100@gmail.com
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; How long does it take to start the code locally in a single thread?
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Can you reuse the JVM so it only starts once per node per job?
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; conf.setNumTasksToExecutePerJvm(-1)
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Cheers,
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Tim
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou&amp;lt;
&amp;gt; &amp;gt;&amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; wrote:
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Wonder how one should improve the startup times of a hadoop job.
&amp;gt; &amp;gt;&amp;gt; Some
&amp;gt; &amp;gt;&amp;gt; &amp;gt; of
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; my
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; jobs which have a lot of dependencies in terms of many jar files
&amp;gt; &amp;gt;&amp;gt; take a
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; long
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; time to start in hadoop up to 2 minutes some times.
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; The data input amounts in these cases are negligible so it seems
&amp;gt; that
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; Hadoop
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; have a really high setup cost, which I can live with but this
&amp;gt; seems
&amp;gt; &amp;gt;&amp;gt; to
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; much.
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Let's say a job takes 10 minutes to complete then it is bad if
it
&amp;gt; &amp;gt;&amp;gt; takes
&amp;gt; &amp;gt;&amp;gt; &amp;gt; 2
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; mins to set it up... 20-30 sec max would be a lot more reasonable.
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hints ?
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; --
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; &amp;gt; --
&amp;gt; &amp;gt;&amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt;&amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt;&amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt;&amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt;&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt;
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Mikhail Bautin</author>
         <guid isPermaLink="false">urn:uuid:%3ce76221610906281508h40b82bebq6d8b6bbfcf401bc4@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 15:08:08 -0700</pubDate>
      </item>
      <item>
         <title>Re: FSDataInputStream.read(byte[]) only reads to a block boundary?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c56a973050906281507l77df81l1f9bb70390798db8@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
On Sun, Jun 28, 2009 at 3:01 PM, Matei Zaharia &amp;lt;matei@cloudera.com&amp;gt; wrote:

&amp;gt; This kind of partial read is often used by the OS to return to your
&amp;gt; application as soon as possible if trying to read more data would block, in
&amp;gt; case you can begin computing on the partial data. In some applications, it's
&amp;gt; not useful, but when you can begin computing on partial data, it allows the
&amp;gt; OS to overlap IO with your computation, improving throughput. I think
&amp;gt; FSDataInputStream returns at the block boundary for the same reason.
&amp;gt;

It is very unusual, nay, unexpected to the point of bizarre, for the OS to
do so on a regular file. Typically only seen on network fds.

&amp;gt;
&amp;gt; On Sun, Jun 28, 2009 at 11:16 AM, Raghu Angadi &amp;lt;rangadi@yahoo-inc.com&amp;gt; wrote:
&amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; This seems to be the case. I don't think there is any specific reason not
&amp;gt; &amp;gt; to read across the block boundary...
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Even if HDFS does read across the blocks, it is still not a good idea to
&amp;gt; &amp;gt; ignore the JavaDoc for read(). If you want all the bytes read, then you
&amp;gt; &amp;gt; should have a while loop or one of the readFully() variants. For example, if you
&amp;gt; &amp;gt; later change your code by wrapping a BufferedInputStream around 'in', you
&amp;gt; &amp;gt; would still get partial reads even if HDFS reads all the data.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Raghu.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; forbbs forbbs wrote:
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;&amp;gt; The hadoop version is 0.19.0.
&amp;gt; &amp;gt;&amp;gt; My file is larger than 64MB, and the block size is 64MB.
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; The output of the code below is '10'. May I read across the block
&amp;gt; &amp;gt;&amp;gt; boundary? Or I should use 'while (left..){}' style code?
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; public static void main(String[] args) throws IOException
&amp;gt; &amp;gt;&amp;gt; {
&amp;gt; &amp;gt;&amp;gt; Configuration conf = new Configuration();
&amp;gt; &amp;gt;&amp;gt; FileSystem fs = FileSystem.get(conf);
&amp;gt; &amp;gt;&amp;gt; FSDataInputStream fin = fs.open(new Path(args[0]));
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; fin.seek(64*1024*1024 - 10);
&amp;gt; &amp;gt;&amp;gt; byte[] buffer = new byte[32*1024];
&amp;gt; &amp;gt;&amp;gt; int len = fin.read(buffer);
&amp;gt; &amp;gt;&amp;gt; //int len = fin.read(buffer, 0, 128);
&amp;gt; &amp;gt;&amp;gt; System.out.println(len);
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; fin.close();
&amp;gt; &amp;gt;&amp;gt; }
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>&quot;M. C. Srivas&quot;</author>
         <guid isPermaLink="false">urn:uuid:%3c56a973050906281507l77df81l1f9bb70390798db8@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 15:07:46 -0700</pubDate>
      </item>
      <item>
         <title>Re: FSDataInputStream.read(byte[]) only reads to a block boundary?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3ce4eb02ce0906281501w16675c19u9a5b96bcad4f95f7@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
This kind of partial read is often used by the OS to return to your
application as soon as possible if trying to read more data would block, in
case you can begin computing on the partial data. In some applications, it's
not useful, but when you can begin computing on partial data, it allows the
OS to overlap IO with your computation, improving throughput. I think
FSDataInputStream returns at the block boundary for the same reason.

On Sun, Jun 28, 2009 at 11:16 AM, Raghu Angadi &amp;lt;rangadi@yahoo-inc.com&amp;gt; wrote:

&amp;gt;
&amp;gt; This seems to be the case. I don't think there is any specific reason not
&amp;gt; to read across the block boundary...
&amp;gt;
&amp;gt; Even if HDFS does read across the blocks, it is still not a good idea to
&amp;gt; ignore the JavaDoc for read(). If you want all the bytes read, then you
&amp;gt; should have a while loop or one of the readFully() variants. For e.g. if you
&amp;gt; later change your code by wrapping a BufferedInputStream around 'in', you
&amp;gt; would still get partial reads even if HDFS reads all the data.
&amp;gt;
&amp;gt; Raghu.
&amp;gt;
&amp;gt;
&amp;gt; forbbs forbbs wrote:
&amp;gt;
&amp;gt;&amp;gt; The hadoop version is 0.19.0.
&amp;gt;&amp;gt; My file is larger than 64MB, and the block size is 64MB.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; The output of the code below is '10'. May I read across the block
&amp;gt;&amp;gt; boundary? Or I should use 'while (left..){}' style code?
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; public static void main(String[] args) throws IOException
&amp;gt;&amp;gt; {
&amp;gt;&amp;gt; Configuration conf = new Configuration();
&amp;gt;&amp;gt; FileSystem fs = FileSystem.get(conf);
&amp;gt;&amp;gt; FSDataInputStream fin = fs.open(new Path(args[0]));
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; fin.seek(64*1024*1024 - 10);
&amp;gt;&amp;gt; byte[] buffer = new byte[32*1024];
&amp;gt;&amp;gt; int len = fin.read(buffer);
&amp;gt;&amp;gt; //int len = fin.read(buffer, 0, 128);
&amp;gt;&amp;gt; System.out.println(len);
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; fin.close();
&amp;gt;&amp;gt; }
&amp;gt;&amp;gt;
&amp;gt;
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Matei Zaharia</author>
         <guid isPermaLink="false">urn:uuid:%3ce4eb02ce0906281501w16675c19u9a5b96bcad4f95f7@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 15:01:37 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7e536b1f0906281440l4d3bccc5u9e401bba1600db20@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Makes sense... I will try both rsync and NFS, but I think rsync will beat NFS
since NFS can be slow as hell sometimes. But what the heck, we already have
our maven2 repo on NFS, so why not :)

Are you saying that this patch makes the client able to configure which
&quot;extra&quot; local jar files to add to the classpath when firing up the
TaskTrackerChild?

To be explicit: do you confirm that using tmpjars like I do is a costly,
slow operation?

To what branch do you apply the patch (we use 0.18.3)?

Cheers

//Marcus

On Sun, Jun 28, 2009 at 11:26 PM, Mikhail Bautin &amp;lt;mbautin@gmail.com&amp;gt; wrote:

&amp;gt; This is the way we deal with this problem, too. We put our jar files on
&amp;gt; NFS, and the attached patch makes it possible to add those jar files to the
&amp;gt; tasktracker classpath through a configuration property.
&amp;gt;
&amp;gt; Thanks,
&amp;gt; Mikhail
&amp;gt;
&amp;gt; On Sun, Jun 28, 2009 at 5:21 PM, Stuart White &amp;lt;stuart.white1@gmail.com&amp;gt;wrote:
&amp;gt;
&amp;gt;&amp;gt; Although I've never done it, I believe you could manually copy your jar
&amp;gt;&amp;gt; files out to your cluster somewhere in hadoop's classpath, and that would
&amp;gt;&amp;gt; remove the need for you to copy them to your cluster at the start of each
&amp;gt;&amp;gt; job.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou &amp;lt;marcus.herou@tailsweep.com
&amp;gt;&amp;gt; &amp;gt;wrote:
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; &amp;gt; Hi.
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; Running without a jobtracker makes the job start almost instantly.
&amp;gt;&amp;gt; &amp;gt; I think it is due to something with the classloader. I use a huge amount
&amp;gt;&amp;gt; of
&amp;gt;&amp;gt; &amp;gt; jarfiles jobConf.set(&quot;tmpjars&quot;, &quot;jar1.jar,jar2.jar&quot;)... which need to be
&amp;gt;&amp;gt; &amp;gt; loaded every time I guess.
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; By issuing conf.setNumTasksToExecutePerJvm(-1); will the TaskTracker
&amp;gt;&amp;gt; child
&amp;gt;&amp;gt; &amp;gt; live forever then ?
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; Cheers
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; //Marcus
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:54 PM, tim robertson &amp;lt;
&amp;gt;&amp;gt; timrobertson100@gmail.com
&amp;gt;&amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; How long does it take to start the code locally in a single thread?
&amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; Can you reuse the JVM so it only starts once per node per job?
&amp;gt;&amp;gt; &amp;gt; &amp;gt; conf.setNumTasksToExecutePerJvm(-1)
&amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; Cheers,
&amp;gt;&amp;gt; &amp;gt; &amp;gt; Tim
&amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou&amp;lt;
&amp;gt;&amp;gt; marcus.herou@tailsweep.com
&amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; wrote:
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hi.
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Wonder how one should improve the startup times of a hadoop job.
&amp;gt;&amp;gt; Some
&amp;gt;&amp;gt; &amp;gt; of
&amp;gt;&amp;gt; &amp;gt; &amp;gt; my
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; jobs which have a lot of dependencies in terms of many jar files
&amp;gt;&amp;gt; take a
&amp;gt;&amp;gt; &amp;gt; &amp;gt; long
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; time to start in hadoop up to 2 minutes some times.
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; The data input amounts in these cases are neglible so it seems that
&amp;gt;&amp;gt; &amp;gt; &amp;gt; Hadoop
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; have a really high setup cost, which I can live with but this seems
&amp;gt;&amp;gt; to
&amp;gt;&amp;gt; &amp;gt; &amp;gt; much.
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Let's say a job takes 10 minutes to complete then it is bad if it
&amp;gt;&amp;gt; takes
&amp;gt;&amp;gt; &amp;gt; 2
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; mins to set it up... 20-30 sec max would be a lot more reasonable.
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Hints ?
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; //Marcus
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; --
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; +46702561312
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt;&amp;gt; &amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; --
&amp;gt;&amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt;&amp;gt; &amp;gt; +46702561312
&amp;gt;&amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt;&amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt;
&amp;gt;
&amp;gt; -- Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Marcus Herou</author>
         <guid isPermaLink="false">urn:uuid:%3c7e536b1f0906281440l4d3bccc5u9e401bba1600db20@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 14:40:01 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3ce76221610906281426s1c832010s3786bd32ac25b27d@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
An error occurred while fetching this message, sorry!
&lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Mikhail Bautin</author>
         <guid isPermaLink="false">urn:uuid:%3ce76221610906281426s1c832010s3786bd32ac25b27d@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 14:26:02 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4af5cd780906281421r7c0aa045g863c9007fe364947@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Although I've never done it, I believe you could manually copy your jar
files out to your cluster somewhere in hadoop's classpath, and that would
remove the need for you to copy them to your cluster at the start of each
job.

On Sun, Jun 28, 2009 at 4:08 PM, Marcus Herou &amp;lt;marcus.herou@tailsweep.com&amp;gt; wrote:

&amp;gt; Hi.
&amp;gt;
&amp;gt; Running without a jobtracker makes the job start almost instantly.
&amp;gt; I think it is due to something with the classloader. I use a huge amount of
&amp;gt; jarfiles jobConf.set(&quot;tmpjars&quot;, &quot;jar1.jar,jar2.jar&quot;)... which need to be
&amp;gt; loaded every time I guess.
&amp;gt;
&amp;gt; By issuing conf.setNumTasksToExecutePerJvm(-1); will the TaskTracker child
&amp;gt; live forever then ?
&amp;gt;
&amp;gt; Cheers
&amp;gt;
&amp;gt; //Marcus
&amp;gt;
&amp;gt; On Sun, Jun 28, 2009 at 9:54 PM, tim robertson &amp;lt;timrobertson100@gmail.com
&amp;gt; &amp;gt;wrote:
&amp;gt;
&amp;gt; &amp;gt; How long does it take to start the code locally in a single thread?
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Can you reuse the JVM so it only starts once per node per job?
&amp;gt; &amp;gt; conf.setNumTasksToExecutePerJvm(-1)
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Cheers,
&amp;gt; &amp;gt; Tim
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou&amp;lt;marcus.herou@tailsweep.com
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; wrote:
&amp;gt; &amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Wonder how one should improve the startup times of a hadoop job. Some
&amp;gt; of
&amp;gt; &amp;gt; my
&amp;gt; &amp;gt; &amp;gt; jobs which have a lot of dependencies in terms of many jar files take a
&amp;gt; &amp;gt; long
&amp;gt; &amp;gt; &amp;gt; time to start in hadoop up to 2 minutes some times.
&amp;gt; &amp;gt; &amp;gt; The data input amounts in these cases are neglible so it seems that
&amp;gt; &amp;gt; Hadoop
&amp;gt; &amp;gt; &amp;gt; have a really high setup cost, which I can live with but this seems to
&amp;gt; &amp;gt; much.
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Let's say a job takes 10 minutes to complete then it is bad if it takes
&amp;gt; 2
&amp;gt; &amp;gt; &amp;gt; mins to set it up... 20-30 sec max would be a lot more reasonable.
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Hints ?
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; --
&amp;gt; &amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; +46702561312
&amp;gt; marcus.herou@tailsweep.com
&amp;gt; http://www.tailsweep.com/
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Stuart White</author>
         <guid isPermaLink="false">urn:uuid:%3c4af5cd780906281421r7c0aa045g863c9007fe364947@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 14:21:35 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7e536b1f0906281408n1c2484bfve6dc1ea339110e9d@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hi.

Running without a jobtracker makes the job start almost instantly.
I think it is due to something with the classloader. I use a huge number of
jar files, jobConf.set(&quot;tmpjars&quot;, &quot;jar1.jar,jar2.jar&quot;)..., which need to be
loaded every time, I guess.

By issuing conf.setNumTasksToExecutePerJvm(-1); will the TaskTracker child
live forever then?

Cheers

//Marcus

On Sun, Jun 28, 2009 at 9:54 PM, tim robertson &amp;lt;timrobertson100@gmail.com&amp;gt; wrote:

&amp;gt; How long does it take to start the code locally in a single thread?
&amp;gt;
&amp;gt; Can you reuse the JVM so it only starts once per node per job?
&amp;gt; conf.setNumTasksToExecutePerJvm(-1)
&amp;gt;
&amp;gt; Cheers,
&amp;gt; Tim
&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou&amp;lt;marcus.herou@tailsweep.com&amp;gt;
&amp;gt; wrote:
&amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Wonder how one should improve the startup times of a hadoop job. Some of my
&amp;gt; &amp;gt; jobs which have a lot of dependencies in terms of many jar files take a long
&amp;gt; &amp;gt; time to start in hadoop up to 2 minutes some times.
&amp;gt; &amp;gt; The data input amounts in these cases are negligible so it seems that Hadoop
&amp;gt; &amp;gt; have a really high setup cost, which I can live with but this seems too much.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Let's say a job takes 10 minutes to complete then it is bad if it takes 2
&amp;gt; &amp;gt; mins to set it up... 20-30 sec max would be a lot more reasonable.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Hints ?
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; --
&amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt;
&amp;gt; -- Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Marcus Herou</author>
         <guid isPermaLink="false">urn:uuid:%3c7e536b1f0906281408n1c2484bfve6dc1ea339110e9d@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 14:08:15 -0700</pubDate>
      </item>
      <item>
         <title>Re: Scaling out/up or a mix</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7e536b1f0906281349x5419567djfc3992206c0fb718@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hi.

The crawlers are _very_ threaded, but no, we use our own threading framework
since it was not available at the time in hadoop-core.

Crawlers normally just wait a lot on clients, inducing very little CPU load, but
consume some memory due to the parallelism.

//Marcus

On Sat, Jun 27, 2009 at 6:10 PM, jason hadoop &amp;lt;jason.hadoop@gmail.com&amp;gt; wrote:

&amp;gt; How about multi-threaded mappers?
&amp;gt; Multi-Threaded mappers are ideal for map tasks that are non locally io bound
&amp;gt; with many distinct endpoints.
&amp;gt; You can also control the thread count on a per job basis.
&amp;gt;
&amp;gt; On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou &amp;lt;marcus.herou@tailsweep.com
&amp;gt; &amp;gt;wrote:
&amp;gt;
&amp;gt; &amp;gt; The argument currently against increasing num-mappers is that the
&amp;gt; machines
&amp;gt; &amp;gt; will get into oom and since a lot of the jobs are crawlers I need more
&amp;gt; &amp;gt; ip-numbers so I don't get banned :)
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Thing is that we currently have solr on the very same machines and
&amp;gt; &amp;gt; data-nodes as well so I can only give the MR nodes about 1G memory since
&amp;gt; I
&amp;gt; &amp;gt; need SOLR to have 4G...
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Now I see that I should get some obvious and juste critique about the
&amp;gt; &amp;gt; layout
&amp;gt; &amp;gt; of this arch but I'm a little limited in budget and so is then the arch
&amp;gt; :)
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; However is it wise to have the MR tasks on the same nodes as the
&amp;gt; data-nodes
&amp;gt; &amp;gt; or should I split the arch ? I mean the data-nodes perhaps need more
&amp;gt; &amp;gt; disk-IO
&amp;gt; &amp;gt; and the MR more memory and CPU ?
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Trying to find a sweetspot hardware spec of those two roles.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; //Marcus
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman &amp;lt;bbockelm@cse.unl.edu
&amp;gt; &amp;gt; &amp;gt;wrote:
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Hey Marcus,
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Are you recording the data rates coming out of HDFS? Since you have
&amp;gt; such
&amp;gt; &amp;gt; a
&amp;gt; &amp;gt; &amp;gt; low CPU utilizations, I'd look at boxes utterly packed with big hard
&amp;gt; &amp;gt; drives
&amp;gt; &amp;gt; &amp;gt; (also, why are you using RAID1 for Hadoop??).
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
&amp;gt; &amp;gt; &amp;gt; Based on the data rates you see, make the call.
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; On the other hand, what's the argument against running 3x more mappers
&amp;gt; &amp;gt; per
&amp;gt; &amp;gt; &amp;gt; box? It seems that your boxes still have more overhead to use --
&amp;gt; there's
&amp;gt; &amp;gt; no
&amp;gt; &amp;gt; &amp;gt; I/O wait.
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Brian
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; We have a deployment of 10 hadoop servers and I now need more mapping
&amp;gt; &amp;gt; &amp;gt;&amp;gt; capability (no not just add more mappers per instance) since I have so
&amp;gt; &amp;gt; &amp;gt;&amp;gt; many
&amp;gt; &amp;gt; &amp;gt;&amp;gt; jobs running. Now I am wondering what I should aim on...
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Memory, cpu or disk... How long is a rope perhaps you would say ?
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; A typical server is currently using about 15-20% cpu today on a
&amp;gt; &amp;gt; quad-core
&amp;gt; &amp;gt; &amp;gt;&amp;gt; 2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Some specs below.
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt; mpstat 2 5
&amp;gt; &amp;gt; &amp;gt;&amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Linux 2.6.24-19-server (mapreduce2) 06/26/2009
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; 11:36:13 PM  CPU  %user  %nice   %sys  %iowait   %irq  %soft  %steal  %idle   intr/s
&amp;gt; &amp;gt; &amp;gt;&amp;gt; 11:36:15 PM  all  22.82   0.00   3.24     1.37   0.62   2.49    0.00  69.45  8572.50
&amp;gt; &amp;gt; &amp;gt;&amp;gt; 11:36:17 PM  all  13.56   0.00   1.74     1.99   0.62   2.61    0.00  79.48  8075.50
&amp;gt; &amp;gt; &amp;gt;&amp;gt; 11:36:19 PM  all  14.32   0.00   2.24     1.12   1.12   2.24    0.00  78.95  9219.00
&amp;gt; &amp;gt; &amp;gt;&amp;gt; 11:36:21 PM  all  14.71   0.00   0.87     1.62   0.25   1.75    0.00  80.80  8489.50
&amp;gt; &amp;gt; &amp;gt;&amp;gt; 11:36:23 PM  all  12.69   0.00   0.87     1.24   0.50   0.75    0.00  83.96  5495.00
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Average:     all  15.62   0.00   1.79     1.47   0.62   1.97    0.00  78.53  7970.30
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; What I am thinking is... Is it wiser to go for many of these cheap
&amp;gt; boxes
&amp;gt; &amp;gt; &amp;gt;&amp;gt; with 8GB of RAM or should I for instance focus on machines which can
&amp;gt; &amp;gt; give
&amp;gt; &amp;gt; &amp;gt;&amp;gt; more I|O throughput ?
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; I know that these things are hard but perhaps someone have draw some
&amp;gt; &amp;gt; &amp;gt;&amp;gt; conclusions before the pragmatic way.
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Kindly
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; //Marcus
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;&amp;gt; --
&amp;gt; &amp;gt; &amp;gt;&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; &amp;gt;&amp;gt; +46702561312
&amp;gt; &amp;gt; &amp;gt;&amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; &amp;gt;&amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; --
&amp;gt; &amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt; +46702561312
&amp;gt; &amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt;
&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; Pro Hadoop, a book to guide you from beginner to hadoop mastery,
&amp;gt; http://www.amazon.com/dp/1430219424?tag=jewlerymall
&amp;gt; www.prohadoopbook.com a community for Hadoop Professionals
&amp;gt; -- Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Marcus Herou</author>
         <guid isPermaLink="false">urn:uuid:%3c7e536b1f0906281349x5419567djfc3992206c0fb718@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 13:49:11 -0700</pubDate>
      </item>
      <item>
         <title>Re: combine two map tasks</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c314098690906281306n5ae39fc5y759b681bba0bb9be@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
The ChainMapper class introduced in Hadoop 0.19 will provide you with the
ability to have an arbitrary number of mappers run one after the other,
in the context of a single job.
The one issue to be aware of is that each mapper in the chain only sees the
output of the previous map in the chain.

There is a nice discussion of this in chapter 8 of Pro Hadoop, by Apress.com

On Sun, Jun 28, 2009 at 5:04 AM, bharath vissapragada &amp;lt;bharathvissapragada1990@gmail.com&amp;gt; wrote:

&amp;gt; See this .. hope this answers your question .
&amp;gt;
&amp;gt; http://developer.yahoo.com/hadoop/tutorial/module4.html#tips
&amp;gt;
&amp;gt; On Sun, Jun 28, 2009 at 5:28 PM, bonito &amp;lt;bonito.perdo@gmail.com&amp;gt; wrote:
&amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Hello!
&amp;gt; &amp;gt; I am a new hadoop user and my question may sound naive..
&amp;gt; &amp;gt; However, I would like to ask if there is a way to combine the results of two
&amp;gt; &amp;gt; map tasks that may &quot;run&quot; simultaneously.
&amp;gt; &amp;gt; I use the MultipleInput class and thus I have two different mappers.
&amp;gt; &amp;gt; I want the result/output of the one map (associated with one input file)
&amp;gt; to
&amp;gt; &amp;gt; be used in the process of the second map (associated with the second
&amp;gt; input
&amp;gt; &amp;gt; file).
&amp;gt; &amp;gt; I have thought of storing the map1 output in the hdfs and retrieving it
&amp;gt; &amp;gt; using the map2.
&amp;gt; &amp;gt; However, I have no clue whether this is possible. I mean...what about
&amp;gt; &amp;gt; time-executing issues? map2 has to wait until map1 is completed...
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; The thought of executing them in a serial manner is not the one I really
&amp;gt; &amp;gt; want...
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Any suggestion would be appreciated.
&amp;gt; &amp;gt; Thank you in advance :)
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; --
&amp;gt; &amp;gt; View this message in context:
&amp;gt; &amp;gt; http://www.nabble.com/combine-two-map-tasks-tp24240928p24240928.html
&amp;gt; &amp;gt; Sent from the Hadoop core-user mailing list archive at Nabble.com.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; -- Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>jason hadoop</author>
         <guid isPermaLink="false">urn:uuid:%3c314098690906281306n5ae39fc5y759b681bba0bb9be@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 13:06:55 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c32120a6a0906281254l3de4890avde9c272773d14745@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
How long does it take to start the code locally in a single thread?

Can you reuse the JVM so it only starts once per node per job?
conf.setNumTasksToExecutePerJvm(-1)

Cheers,
Tim

On Sun, Jun 28, 2009 at 9:43 PM, Marcus Herou&amp;lt;marcus.herou@tailsweep.com&amp;gt; wrote:
&amp;gt; Hi.
&amp;gt;
&amp;gt; Wonder how one should improve the startup times of a hadoop job. Some of my
&amp;gt; jobs which have a lot of dependencies in terms of many jar files take a long
&amp;gt; time to start in hadoop up to 2 minutes some times.
&amp;gt; The data input amounts in these cases are negligible so it seems that Hadoop
&amp;gt; have a really high setup cost, which I can live with but this seems too much.
&amp;gt;
&amp;gt; Let's say a job takes 10 minutes to complete then it is bad if it takes 2
&amp;gt; mins to set it up... 20-30 sec max would be a lot more reasonable.
&amp;gt;
&amp;gt; Hints ?
&amp;gt;
&amp;gt; //Marcus
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; +46702561312
&amp;gt; marcus.herou@tailsweep.com
&amp;gt; http://www.tailsweep.com/
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>tim robertson</author>
         <guid isPermaLink="false">urn:uuid:%3c32120a6a0906281254l3de4890avde9c272773d14745@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 12:54:15 -0700</pubDate>
      </item>
      <item>
         <title>hadoop jobs take long time to setup</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7e536b1f0906281243i3436983ehd1b1fb209e51f5c@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hi.

Wonder how one should improve the startup times of a hadoop job. Some of my
jobs, which have a lot of dependencies in terms of many jar files, take a long
time to start in hadoop, up to 2 minutes sometimes.
The data input amounts in these cases are negligible, so it seems that Hadoop
has a really high setup cost, which I can live with, but this seems too much.

Let's say a job takes 10 minutes to complete; then it is bad if it takes 2
mins to set it up... 20-30 sec max would be a lot more reasonable.

Hints?

//Marcus

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Marcus Herou</author>
         <guid isPermaLink="false">urn:uuid:%3c7e536b1f0906281243i3436983ehd1b1fb209e51f5c@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 12:43:35 -0700</pubDate>
      </item>
      <item>
         <title>Re: FSDataInputStream.read(byte[]) only reads to a block boundary?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4A47B365.20103@yahoo-inc.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
This seems to be the case. I don't think there is any specific reason not
to read across the block boundary...

Even if HDFS does read across the blocks, it is still not a good idea to
ignore the JavaDoc for read(). If you want all the bytes read, then you
should have a while loop or one of the readFully() variants. For e.g. if you
later change your code by wrapping a BufferedInputStream around 'in', you
would still get partial reads even if HDFS reads all the data.

Raghu.

forbbs forbbs wrote:
&amp;gt; The hadoop version is 0.19.0.
&amp;gt; My file is larger than 64MB, and the block size is 64MB.
&amp;gt;
&amp;gt; The output of the code below is '10'. May I read across the block
&amp;gt; boundary? Or I should use 'while (left..){}' style code?
&amp;gt;
&amp;gt; public static void main(String[] args) throws IOException
&amp;gt; {
&amp;gt;     Configuration conf = new Configuration();
&amp;gt;     FileSystem fs = FileSystem.get(conf);
&amp;gt;     FSDataInputStream fin = fs.open(new Path(args[0]));
&amp;gt;
&amp;gt;     fin.seek(64*1024*1024 - 10);
&amp;gt;     byte[] buffer = new byte[32*1024];
&amp;gt;     int len = fin.read(buffer);
&amp;gt;     //int len = fin.read(buffer, 0, 128);
&amp;gt;     System.out.println(len);
&amp;gt;
&amp;gt;     fin.close();
&amp;gt; }
&lt;/pre&gt;
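&lt;pre&gt;
Sketch (not part of the original message): the while-loop approach described
above could look roughly like this. It is a minimal illustration that assumes
only java.io. ReadFullySketch and ChunkedStream are hypothetical names invented
here to imitate a stream that returns early, the way the read in the question
stopped at the block boundary; they are not Hadoop classes.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullySketch {

    /** Hypothetical stand-in for a stream that returns at most `chunk` bytes per call. */
    static class ChunkedStream extends InputStream {
        private final InputStream in;
        private final int chunk;

        ChunkedStream(InputStream in, int chunk) {
            this.in = in;
            this.chunk = chunk;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Deliberately hand back a partial result, as read() is allowed to do.
            return in.read(b, off, Math.min(len, chunk));
        }
    }

    /** Calls read() until `len` bytes have arrived; throws EOFException if the stream ends first. */
    static void readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
        int total = 0;
        while (total != len) {
            int n = in.read(buf, off + total, len - total);
            if (n == -1) {
                throw new EOFException("stream ended after " + total + " of " + len + " bytes");
            }
            total += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        InputStream in = new ChunkedStream(new ByteArrayInputStream(data), 10);
        byte[] buffer = new byte[64];

        // A single read() call may fill only part of the buffer.
        int first = in.read(buffer);
        System.out.println(first);

        // The loop fetches the remainder regardless of how the stream chunks it.
        readFully(in, buffer, first, buffer.length - first);
    }
}
```

With a real FSDataInputStream the inherited DataInputStream.readFully(byte[], int, int)
should do the same job, so a hand-written loop is mainly useful when you want
your own handling of a stream that ends early.
&lt;/pre&gt;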
&lt;/div&gt;</description>
         <author>Raghu Angadi</author>
         <guid isPermaLink="false">urn:uuid:%3c4A47B365-20103@yahoo-inc-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 11:16:05 -0700</pubDate>
      </item>
      <item>
         <title>Re: combine two map tasks</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c73d592f60906280504o2c17d134ja5703c5093de4053@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
See this .. hope this answers your question .

http://developer.yahoo.com/hadoop/tutorial/module4.html#tips

On Sun, Jun 28, 2009 at 5:28 PM, bonito &amp;lt;bonito.perdo@gmail.com&amp;gt; wrote:

&amp;gt;
&amp;gt; Hello!
&amp;gt; I am a new hadoop user and my question may sound naive..
&amp;gt; However, I would like to ask if there is a way to combine the results of
&amp;gt; two
&amp;gt; map tasks that may &quot;run&quot; simultaneously.
&amp;gt; I use the MultipleInput class and thus I have two different mappers.
&amp;gt; I want the result/output of the one map (associated with one input file) to
&amp;gt; be used in the process of the second map (associated with the second input
&amp;gt; file).
&amp;gt; I have thought of storing the map1 output in the hdfs and retrieving it
&amp;gt; using the map2.
&amp;gt; However, I have no clue whether this is possible. I mean...what about
&amp;gt; time-executing issues? map2 has to wait until map1 is completed...
&amp;gt;
&amp;gt; The thought of executing them in a serial manner is not the one I really
&amp;gt; want...
&amp;gt;
&amp;gt; Any suggestion would be appreciated.
&amp;gt; Thank you in advance :)
&amp;gt;
&amp;gt; --
&amp;gt; View this message in context:
&amp;gt; http://www.nabble.com/combine-two-map-tasks-tp24240928p24240928.html
&amp;gt; Sent from the Hadoop core-user mailing list archive at Nabble.com.
&amp;gt;
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>bharath vissapragada</author>
         <guid isPermaLink="false">urn:uuid:%3c73d592f60906280504o2c17d134ja5703c5093de4053@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 05:04:21 -0700</pubDate>
      </item>
      <item>
         <title>combine two map tasks</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c24240928.post@talk.nabble.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt; Hello!

I am a new hadoop user and my question may sound naive..
However, I would like to ask if there is a way to combine the results of two
map tasks that may &quot;run&quot; simultaneously.

I use the MultipleInput class and thus I have two different mappers.
I want the result/output of the one map (associated with one input file) to
be used in the process of the second map (associated with the second input
file).

I have thought of storing the map1 output in the hdfs and retrieving it
using the map2.
However, I have no clue whether this is possible. I mean...what about
time-executing issues? map2 has to wait until map1 is completed...

The thought of executing them in a serial manner is not the one I really
want...

Any suggestion would be appreciated.
Thank you in advance :)

--
View this message in context: http://www.nabble.com/combine-two-map-tasks-tp24240928p24240928.html
Sent from the Hadoop core-user mailing list archive at Nabble.com. &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>bonito</author>
         <guid isPermaLink="false">urn:uuid:%3c24240928-post@talk-nabble-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 04:58:12 -0700</pubDate>
      </item>
      <item>
         <title>FSDataInputStream.read(byte[]) only reads to a block boundary?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c6241f7fc0906280208of05528dmc9f80f99cd1d8781@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
The hadoop version is 0.19.0.
My file is larger than 64MB, and the block size is 64MB.

The output of the code below is '10'. May I read across the block
boundary? Or I should use 'while (left..){}' style code?

public static void main(String[] args) throws IOException
{
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream fin = fs.open(new Path(args[0]));

    fin.seek(64*1024*1024 - 10);
    byte[] buffer = new byte[32*1024];
    int len = fin.read(buffer);
    //int len = fin.read(buffer, 0, 128);
    System.out.println(len);

    fin.close();
} &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>forbbs forbbs</author>
         <guid isPermaLink="false">urn:uuid:%3c6241f7fc0906280208of05528dmc9f80f99cd1d8781@mail-gmail-com%3e</guid>
         <pubDate>Sun, 28 Jun 2009 02:08:35 -0700</pubDate>
      </item>
      <item>
         <title>Re: hadoop lucene integration</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cfa03480d0906271101y401cc23aqa9cf5ce14514c64d@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
katta ?

On Fri, Jun 26, 2009 at 1:01 PM, Nick Cen &amp;lt;cenyongh@gmail.com&amp;gt; wrote:

&amp;gt; there is sample index code under the contrib directory, maybe you can take
&amp;gt; a
&amp;gt; see.
&amp;gt;
&amp;gt; 2009/6/26 m.harig &amp;lt;m.harig@gmail.com&amp;gt;
&amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; hi all
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; I've work experience with lucene , but am new to hadoop , i
&amp;gt; &amp;gt; created a index by lucene , please any1 tell me how to use hadoop for my
&amp;gt; &amp;gt; lucene index for distributed file system , if possible can any1 send me
&amp;gt; an
&amp;gt; &amp;gt; example or the link in which i can use it for my index. Please .
&amp;gt; &amp;gt; --
&amp;gt; &amp;gt; View this message in context:
&amp;gt; &amp;gt; http://www.nabble.com/hadoop-lucene-integration-tp24214951p24214951.html
&amp;gt; &amp;gt; Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; http://daily.appspot.com/food/
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>zsongbo</author>
         <guid isPermaLink="false">urn:uuid:%3cfa03480d0906271101y401cc23aqa9cf5ce14514c64d@mail-gmail-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 11:01:47 -0700</pubDate>
      </item>
      <item>
         <title>Re: HDFS Random Access</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4A464BC1.6060605@yahoo-inc.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt; Yes, FSDataInputStream allows random access. There are two ways to read x bytes at a position p:
1) in.seek(p); in.read(buf, 0, x);
2) in.read(p, buf, 0, x);
These two have slightly different semantics. The second one is preferred and is easier for HDFS to optimize further.

Random access should be pretty good with HDFS and it is increasingly getting more users and thus more importance. HBase is one of the users.

Just yesterday I attached a benchmark and comparisons with random access on a native filesystem to https://issues.apache.org/jira/browse/HDFS-236 . As of now, the overhead on average is about 2 ms over the 9-10 ms it takes for a native read. There are a few fairly simple fixes possible to reduce this gap.

I think getFileStatus() is the way to find the length, though there might have been a call added to FSDataInputStream recently. I am not sure.

Raghu.

tsuraan wrote:
&amp;gt; All the documentation for HDFS says that it's for large streaming
&amp;gt; jobs, but I couldn't find an explicit answer to this, so I'll try
&amp;gt; asking here. How is HDFS's random seek performance within an
&amp;gt; FSDataInputStream? I use lucene with a lot of indices (potentially
&amp;gt; thousands), so I was thinking of putting them into HDFS and
&amp;gt; reimplementing my search as a Hadoop map-reduce. I've noticed that
&amp;gt; lucene tends to do a bit of random seeking when searching though; I
&amp;gt; don't believe that it guarantees that all seeks be to increasing file
&amp;gt; positions either.
&amp;gt;
&amp;gt; Would HDFS be a bad fit for an access pattern that involves seeks to
&amp;gt; random positions within a stream?
&amp;gt;
&amp;gt; Also, is getFileStatus the typical way of getting the length of a file
&amp;gt; in HDFS, or is there some method on FSDataInputStream that I'm not
&amp;gt; seeing?
&amp;gt;
&amp;gt; Please cc: me on any reply; I'm not on the hadoop list. Thanks! &lt;/pre&gt;
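A rough local-file analogue of the two access styles listed above, as a sketch only. java.io.RandomAccessFile and java.nio FileChannel stand in for FSDataInputStream here so that the example runs without Hadoop, and the class and method names are hypothetical. Style 1 moves the stream's own position before reading; style 2 passes the position with the call and leaves the stream position alone, which is one way the two can differ in semantics.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class PositionedReadDemo {

    // Style 1: seek to p (this moves the file pointer), then read x bytes from there.
    static byte[] seekThenRead(RandomAccessFile file, long p, int x) throws IOException {
        byte[] buf = new byte[x];
        file.seek(p);
        file.readFully(buf, 0, x);
        return buf;
    }

    // Style 2: read x bytes at absolute position p; the channel position is not modified.
    static byte[] positionalRead(FileChannel channel, long p, int x) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(x);
        while (buf.hasRemaining()) {
            int n = channel.read(buf, p + buf.position());
            if (n == -1) {
                throw new EOFException("file ended before " + x + " bytes were read");
            }
        }
        return buf.array();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("positioned-read", ".bin");
        try {
            Files.write(tmp, "0123456789abcdef".getBytes("US-ASCII"));
            try (RandomAccessFile file = new RandomAccessFile(tmp.toFile(), "r")) {
                FileChannel channel = file.getChannel();
                System.out.println(new String(seekThenRead(file, 4, 3), "US-ASCII"));
                System.out.println("pointer after seek+read: " + file.getFilePointer());
                System.out.println(new String(positionalRead(channel, 10, 3), "US-ASCII"));
                System.out.println("pointer after positional read: " + file.getFilePointer());
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the positional call carries its own offset, several callers can share one open file without coordinating seeks; whether that matters depends on the access pattern.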
&lt;/div&gt;</description>
         <author>Raghu Angadi</author>
         <guid isPermaLink="false">urn:uuid:%3c4A464BC1-6060605@yahoo-inc-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 09:41:37 -0700</pubDate>
      </item>
      <item>
         <title>Re: Scaling out/up or a mix</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c314098690906270910k491ba1b9x4a474c138252bea5@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
How about multi-threaded mappers?
Multi-threaded mappers are ideal for map tasks that are non-locally I/O bound
with many distinct endpoints.
You can also control the thread count on a per-job basis.

On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou &amp;lt;marcus.herou@tailsweep.com&amp;gt;wrote:

&amp;gt; The argument currently against increasing num-mappers is that the machines
&amp;gt; will get into oom and since a lot of the jobs are crawlers I need more
&amp;gt; ip-numbers so I don't get banned :)
&amp;gt;
&amp;gt; Thing is that we currently have solr on the very same machines and
&amp;gt; data-nodes as well so I can only give the MR nodes about 1G memory since I
&amp;gt; need SOLR to have 4G...
&amp;gt;
&amp;gt; Now I see that I should get some obvious and juste critique about the
&amp;gt; layout
&amp;gt; of this arch but I'm a little limited in budget and so is then the arch :)
&amp;gt;
&amp;gt; However is it wise to have the MR tasks on the same nodes as the data-nodes
&amp;gt; or should I split the arch ? I mean the data-nodes perhaps need more
&amp;gt; disk-IO
&amp;gt; and the MR more memory and CPU ?
&amp;gt;
&amp;gt; Trying to find a sweetspot hardware spec of those two roles.
&amp;gt;
&amp;gt; //Marcus
&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman &amp;lt;bbockelm@cse.unl.edu
&amp;gt; &amp;gt;wrote:
&amp;gt;
&amp;gt; &amp;gt; Hey Marcus,
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Are you recording the data rates coming out of HDFS? Since you have such
&amp;gt; a
&amp;gt; &amp;gt; low CPU utilizations, I'd look at boxes utterly packed with big hard
&amp;gt; drives
&amp;gt; &amp;gt; (also, why are you using RAID1 for Hadoop??).
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
&amp;gt; &amp;gt; Based on the data rates you see, make the call.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; On the other hand, what's the argument against running 3x more mappers
&amp;gt; per
&amp;gt; &amp;gt; box? It seems that your boxes still have more overhead to use -- there's
&amp;gt; no
&amp;gt; &amp;gt; I/O wait.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Brian
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Hi.
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; We have a deployment of 10 hadoop servers and I now need more mapping
&amp;gt; &amp;gt;&amp;gt; capability (no not just add more mappers per instance) since I have so
&amp;gt; &amp;gt;&amp;gt; many
&amp;gt; &amp;gt;&amp;gt; jobs running. Now I am wondering what I should aim on...
&amp;gt; &amp;gt;&amp;gt; Memory, cpu or disk... How long is a rope perhaps you would say ?
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; A typical server is currently using about 15-20% cpu today on a
&amp;gt; quad-core
&amp;gt; &amp;gt;&amp;gt; 2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; Some specs below.
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt;&amp;gt; mpstat 2 5
&amp;gt; &amp;gt;&amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; Linux 2.6.24-19-server (mapreduce2) 06/26/2009
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; 11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
&amp;gt; &amp;gt;&amp;gt; 11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
&amp;gt; &amp;gt;&amp;gt; 11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
&amp;gt; &amp;gt;&amp;gt; 11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
&amp;gt; &amp;gt;&amp;gt; 11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
&amp;gt; &amp;gt;&amp;gt; 11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
&amp;gt; &amp;gt;&amp;gt; Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; What I am thinking is... Is it wiser to go for many of these cheap boxes
&amp;gt; &amp;gt;&amp;gt; with 8GB of RAM or should I for instance focus on machines which can
&amp;gt; give
&amp;gt; &amp;gt;&amp;gt; more I|O throughput ?
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; I know that these things are hard but perhaps someone have draw some
&amp;gt; &amp;gt;&amp;gt; conclusions before the pragmatic way.
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; Kindly
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; //Marcus
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;&amp;gt; --
&amp;gt; &amp;gt;&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; &amp;gt;&amp;gt; +46702561312
&amp;gt; &amp;gt;&amp;gt; marcus.herou@tailsweep.com
&amp;gt; &amp;gt;&amp;gt; http://www.tailsweep.com/
&amp;gt; &amp;gt;&amp;gt;
&amp;gt; &amp;gt;
&amp;gt; &amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; +46702561312
&amp;gt; marcus.herou@tailsweep.com
&amp;gt; http://www.tailsweep.com/
&amp;gt;

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>jason hadoop</author>
         <guid isPermaLink="false">urn:uuid:%3c314098690906270910k491ba1b9x4a474c138252bea5@mail-gmail-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 09:10:25 -0700</pubDate>
      </item>
      <item>
         <title>Re: Confused about partitioning and reducers</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4af5cd780906270830r58d4d009r5e5ba258629e163b@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Please disregard this question. I think I'm mistaken.

On Sat, Jun 27, 2009 at 10:25 AM, Stuart White &amp;lt;stuart.white1@gmail.com&amp;gt;wrote:

&amp;gt; If I call HashPartitioner.getPartition(), passing a key of 4 and a
&amp;gt; numPartitions of 5, it returns a partition of 4. (Which is what I would
&amp;gt; expect.)
&amp;gt;
&amp;gt; However, if I have a mapred job, and in my mapper I emit a record with key
&amp;gt; 4, I'm configured to use the HashPartitioner, I have 5 Reducers configured,
&amp;gt; and I'm using the IdentityReducer, the record with key 4 gets handled by
&amp;gt; Reducer #0 (because it gets written out to part-00000).
&amp;gt;
&amp;gt; I would have expected a record with key 4 to be handled by reducer #4 (and
&amp;gt; therefore written to part-00004) because the HashPartitioner returns 4 for a
&amp;gt; key of 4 and a numPartitions of 5.
&amp;gt;
&amp;gt; Obviously I'm missing something here. What is the logic for deciding which
&amp;gt; partition of records is handled by which reducer instance?
&amp;gt;
&amp;gt; It can't be random, otherwise mapside join wouldn't work.
&amp;gt;
&amp;gt; Thanks.
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Stuart White</author>
         <guid isPermaLink="false">urn:uuid:%3c4af5cd780906270830r58d4d009r5e5ba258629e163b@mail-gmail-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 08:30:17 -0700</pubDate>
      </item>
      <item>
         <title>Re: Scaling out/up or a mix</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7e536b1f0906270826l64f96c0dg3db4d96089bd73f5@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
The argument currently against increasing num-mappers is that the machines
will get into oom and since a lot of the jobs are crawlers I need more
ip-numbers so I don't get banned :)

Thing is that we currently have solr on the very same machines and
data-nodes as well so I can only give the MR nodes about 1G memory since I
need SOLR to have 4G...

Now I see that I should get some obvious and just critique about the layout
of this arch but I'm a little limited in budget and so is then the arch :)

However is it wise to have the MR tasks on the same nodes as the data-nodes
or should I split the arch ? I mean the data-nodes perhaps need more disk-IO
and the MR more memory and CPU ?

Trying to find a sweetspot hardware spec of those two roles.

//Marcus

On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman &amp;lt;bbockelm@cse.unl.edu&amp;gt;wrote:

&amp;gt; Hey Marcus,
&amp;gt;
&amp;gt; Are you recording the data rates coming out of HDFS? Since you have such a
&amp;gt; low CPU utilizations, I'd look at boxes utterly packed with big hard drives
&amp;gt; (also, why are you using RAID1 for Hadoop??).
&amp;gt;
&amp;gt; You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
&amp;gt; Based on the data rates you see, make the call.
&amp;gt;
&amp;gt; On the other hand, what's the argument against running 3x more mappers per
&amp;gt; box? It seems that your boxes still have more overhead to use -- there's no
&amp;gt; I/O wait.
&amp;gt;
&amp;gt; Brian
&amp;gt;
&amp;gt;
&amp;gt; On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
&amp;gt;
&amp;gt; Hi.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; We have a deployment of 10 hadoop servers and I now need more mapping
&amp;gt;&amp;gt; capability (no not just add more mappers per instance) since I have so
&amp;gt;&amp;gt; many
&amp;gt;&amp;gt; jobs running. Now I am wondering what I should aim on...
&amp;gt;&amp;gt; Memory, cpu or disk... How long is a rope perhaps you would say ?
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; A typical server is currently using about 15-20% cpu today on a quad-core
&amp;gt;&amp;gt; 2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; Some specs below.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt; mpstat 2 5
&amp;gt;&amp;gt;&amp;gt;
&amp;gt;&amp;gt; Linux 2.6.24-19-server (mapreduce2) 06/26/2009
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; 11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
&amp;gt;&amp;gt; 11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
&amp;gt;&amp;gt; 11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
&amp;gt;&amp;gt; 11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
&amp;gt;&amp;gt; 11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
&amp;gt;&amp;gt; 11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
&amp;gt;&amp;gt; Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; What I am thinking is... Is it wiser to go for many of these cheap boxes
&amp;gt;&amp;gt; with 8GB of RAM or should I for instance focus on machines which can give
&amp;gt;&amp;gt; more I|O throughput ?
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; I know that these things are hard but perhaps someone have draw some
&amp;gt;&amp;gt; conclusions before the pragmatic way.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; Kindly
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; //Marcus
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; --
&amp;gt;&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt;&amp;gt; +46702561312
&amp;gt;&amp;gt; marcus.herou@tailsweep.com
&amp;gt;&amp;gt; http://www.tailsweep.com/
&amp;gt;&amp;gt;
&amp;gt;
&amp;gt;

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Marcus Herou</author>
         <guid isPermaLink="false">urn:uuid:%3c7e536b1f0906270826l64f96c0dg3db4d96089bd73f5@mail-gmail-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 08:26:31 -0700</pubDate>
      </item>
      <item>
         <title>Confused about partitioning and reducers</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4af5cd780906270825w1242a689p96d23e293d2cedb0@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
If I call HashPartitioner.getPartition(), passing a key of 4 and a
numPartitions of 5, it returns a partition of 4. (Which is what I would
expect.)

However, if I have a mapred job, and in my mapper I emit a record with key
4, I'm configured to use the HashPartitioner, I have 5 Reducers configured,
and I'm using the IdentityReducer, the record with key 4 gets handled by
Reducer #0 (because it gets written out to part-00000).

I would have expected a record with key 4 to be handled by reducer #4 (and
therefore written to part-00004) because the HashPartitioner returns 4 for a
key of 4 and a numPartitions of 5.

Obviously I'm missing something here. What is the logic for deciding which
partition of records is handled by which reducer instance?

It can't be random, otherwise mapside join wouldn't work.

Thanks. &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Stuart White</author>
         <guid isPermaLink="false">urn:uuid:%3c4af5cd780906270825w1242a689p96d23e293d2cedb0@mail-gmail-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 08:25:15 -0700</pubDate>
      </item>
      <item>
         <title>Archives not getting unarchived at tasktrackers</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c24233281.post@talk.nabble.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt; Hi All,

I am using DistributedCache.addCacheArchives() to distribute a tar file to
the tasktrackers using the following statement.

DistributedCache.addCacheArchives(new URI(&quot;/home/akhil1988/sample.tar&quot;), conf);

According to the documentation it should get unarchived at the tasktrackers.
But the statement:

DistributedCache.getLocalCacheArchives(conf);

returns the following Path

/hadoop/tmp/hadoop/mapred/local/taskTracker/archive/cn1.cloud.cs.illinois.edu/home/akhil1988/sample.tar

That means sample.tar did not get unarchived.
Nor am I able to access file sample.txt in the above folder.

Can anyone tell where I am going wrong?

I tarred the file sample.txt using the following command:

tar -cvf sample.tar sample.txt

Thanks,
Akhil

--
View this message in context: http://www.nabble.com/Archives-not-getting-unarchived-at-tasktrackers-tp24233281p24233281.html
Sent from the Hadoop core-user mailing list archive at Nabble.com. &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>akhil1988</author>
         <guid isPermaLink="false">urn:uuid:%3c24233281-post@talk-nabble-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 06:36:40 -0700</pubDate>
      </item>
      <item>
         <title>Re: Using the Stanford NLP with hadoop</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c24231349.post@talk.nabble.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt; by 'ClassName', which class are you actually referring to?
the class in which the LexicalParser is invoked?

in my code, the class that implements the parser is named 'parse'
and this is the code that i used.

lp = new LexicalizedParser(new ObjectInputStream(new
GZIPInputStream(parse.class.getResourceAsStream(&quot;/englishPCFG.ser.gz&quot;))));

the program runs to completion and map-reduce process is declared as
successfully completed every time even if the code is changed to

lp = new LexicalizedParser(new ObjectInputStream(new
GZIPInputStream(parse.class.getResourceAsStream(&quot;/englishPCF_G.ser.gz&quot;))));

this indicates that the getResourceAsStream does not throw an exception even if
the file is not present, i guess.

any ideas? :confused:

Kevin Peterson-3 wrote:
&amp;gt; &amp;gt; On Sat, Apr 18, 2009 at 5:18 AM, hari939 wrote:
&amp;gt; &amp;gt;&amp;gt;
&amp;gt;&amp;gt; My project of parsing through material for a semantic search engine
&amp;gt;&amp;gt; requires
&amp;gt;&amp;gt; me to use the http://nlp.stanford.edu/software/lex-parser.shtml Stanford
&amp;gt;&amp;gt; NLP parser on hadoop cluster.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; To use the Stanford NLP parser, one must create a lexical parser object
&amp;gt;&amp;gt; using a englishPCFG.ser.gz file as a constructor's parameter.
&amp;gt;&amp;gt; i have tried loading the file onto the Hadoop dfs in the /user/root/
&amp;gt;&amp;gt; folder
&amp;gt;&amp;gt; and have also tried packing the file along with the jar of the java
&amp;gt;&amp;gt; program.
&amp;gt;
&amp;gt;
&amp;gt; Use getResourceAsStream to read it from the jar.
&amp;gt;
&amp;gt; Use the ObjectInputStream constructor.
&amp;gt;
&amp;gt; That is, new LexicalizedParser(new ObjectInputStream(new
&amp;gt; GZIPInputStream(ClassName.class.getResourceAsStream(&quot;/englishPCFG.ser.gz&quot;))))
&amp;gt;
&amp;gt; I'm interested to know if you have found any other open source parsers in
&amp;gt; Java or at least have java bindings.
&amp;gt;
&amp;gt;

--
View this message in context: http://www.nabble.com/Using-the-Stanford-NLP-with-hadoop-tp23112316p24231349.html
Sent from the Hadoop core-user mailing list archive at Nabble.com. &lt;/pre&gt;
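For reference, Class.getResourceAsStream returns null rather than throwing when the named resource is absent from the classpath. A minimal sketch of checking for that case follows; the class ResourceCheckDemo and the helper openResource are hypothetical names for illustration, not part of Hadoop or the Stanford parser.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

public class ResourceCheckDemo {

    // Returns the classpath resource as a stream, or fails with a clear message if it is absent.
    static InputStream openResource(Class owner, String name) throws FileNotFoundException {
        InputStream in = owner.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException("classpath resource not found: " + name);
        }
        return in;
    }

    public static void main(String[] args) {
        try {
            // The misspelled name from the message above is used here on purpose.
            InputStream in = openResource(ResourceCheckDemo.class, "/englishPCF_G.ser.gz");
            in.close();
            System.out.println("resource found");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, a misspelled or unpackaged model file surfaces as an explicit error at load time instead of being passed along as a null stream.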
&lt;/div&gt;</description>
         <author>hari939</author>
         <guid isPermaLink="false">urn:uuid:%3c24231349-post@talk-nabble-com%3e</guid>
         <pubDate>Sat, 27 Jun 2009 01:35:09 -0700</pubDate>
      </item>
      <item>
         <title>Re: Scaling out/up or a mix</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4B0CD558-03AD-427F-8A35-28956D3514CB@cse.unl.edu%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hey Marcus,

Are you recording the data rates coming out of HDFS? Since you have such low CPU utilization, I'd look at boxes utterly packed with big hard drives (also, why are you using RAID1 for Hadoop??).

You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays. Based on the data rates you see, make the call.

On the other hand, what's the argument against running 3x more mappers per box? It seems that your boxes still have more overhead to use -- there's no I/O wait.

Brian

On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:

&amp;gt; Hi.
&amp;gt;
&amp;gt; We have a deployment of 10 hadoop servers and I now need more mapping
&amp;gt; capability (no not just add more mappers per instance) since I have so many
&amp;gt; jobs running. Now I am wondering what I should aim on...
&amp;gt; Memory, cpu or disk... How long is a rope perhaps you would say ?
&amp;gt;
&amp;gt; A typical server is currently using about 15-20% cpu today on a quad-core
&amp;gt; 2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.
&amp;gt;
&amp;gt; Some specs below.
&amp;gt;&amp;gt; mpstat 2 5
&amp;gt; Linux 2.6.24-19-server (mapreduce2) 06/26/2009
&amp;gt;
&amp;gt; 11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
&amp;gt; 11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
&amp;gt; 11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
&amp;gt; 11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
&amp;gt; 11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
&amp;gt; 11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
&amp;gt; Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30
&amp;gt;
&amp;gt; What I am thinking is... Is it wiser to go for many of these cheap boxes
&amp;gt; with 8GB of RAM or should I for instance focus on machines which can give
&amp;gt; more I|O throughput ?
&amp;gt;
&amp;gt; I know that these things are hard but perhaps someone have draw some
&amp;gt; conclusions before the pragmatic way.
&amp;gt;
&amp;gt; Kindly
&amp;gt;
&amp;gt; //Marcus
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; Marcus Herou CTO and co-founder Tailsweep AB
&amp;gt; +46702561312
&amp;gt; marcus.herou@tailsweep.com
&amp;gt; http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Brian Bockelman</author>
         <guid isPermaLink="false">urn:uuid:%3c4B0CD558-03AD-427F-8A35-28956D3514CB@cse-unl-edu%3e</guid>
         <pubDate>Fri, 26 Jun 2009 19:24:31 -0700</pubDate>
      </item>
      <item>
         <title>Map/Reduce Errors</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cop.uv5vlk0uaanks1@dell-pc.oslo.opera.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hi All,

I had posted a question earlier regarding some not so intuitive error messages that I was getting on one of the clusters when trying to map/reduce. After many hours of googling :) I found a post that solved my problem:
http://www.mail-archive.com/core-user@hadoop.apache.org/msg07202.html

One of our engineers ran way too many jobs that created enormous subdirs in $HADOOP_HOME/logs/userlogs. Deleting these subdirs under $HADOOP_HOME/logs/userlogs/ on the datanodes solved the problem. You can also set the cleanup in the hadoop-default.xml file by setting the cleanup time to x hours instead of 24. The specific param is mapred.userlog.retain.hours.

Just wanted to share this with you all.

Thanks,
Usman

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>&quot;Usman Waheed&quot;</author>
         <guid isPermaLink="false">urn:uuid:%3cop-uv5vlk0uaanks1@dell-pc-oslo-opera-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 19:23:34 -0700</pubDate>
      </item>
      <item>
         <title>Re: Hadoop0.20 - Class Not Found exception</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c8131791a0906261858t2baa4272u340aad680700f799@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
I ran into this problem too and resolved it by passing a class to the JobConf
constructor. If you create a new JobConf, you must pass a class to it.

2009/6/27 Amandeep Khurana &amp;lt;amansk@gmail.com&amp;gt;

&amp;gt; I'm getting the following error while starting a MR job:
&amp;gt;
&amp;gt; Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
&amp;gt;     at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297)
&amp;gt;     ... 21 more
&amp;gt; Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
&amp;gt;     at java.net.URLClassLoader$1.run(Unknown Source)
&amp;gt;     at java.security.AccessController.doPrivileged(Native Method)
&amp;gt;     at java.net.URLClassLoader.findClass(Unknown Source)
&amp;gt;     at java.lang.ClassLoader.loadClass(Unknown Source)
&amp;gt;     at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
&amp;gt;     at java.lang.ClassLoader.loadClass(Unknown Source)
&amp;gt;     at java.lang.ClassLoader.loadClassInternal(Unknown Source)
&amp;gt;     at java.lang.Class.forName0(Native Method)
&amp;gt;     at java.lang.Class.forName(Unknown Source)
&amp;gt;     at org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123)
&amp;gt;     at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292)
&amp;gt;     ... 21 more
&amp;gt;
&amp;gt; Interestingly, the relevant jar is bundled into the MR job jar and its also
&amp;gt; there in the $HADOOP_HOME/lib directory.
&amp;gt;
&amp;gt; Exactly same thing worked with 0.19.. Not sure what could have changed or I
&amp;gt; broke to cause this error...
&amp;gt;
&amp;gt; Amandeep
&amp;gt;
&amp;gt;
&amp;gt; Amandeep Khurana
&amp;gt; Computer Science Graduate Student
&amp;gt; University of California, Santa Cruz
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>imcaptor</author>
         <guid isPermaLink="false">urn:uuid:%3c8131791a0906261858t2baa4272u340aad680700f799@mail-gmail-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 18:58:25 -0700</pubDate>
      </item>
      <item>
         <title>Re: Can I post pig questions on this forum?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cFB7B9AE5-CAF0-4DDC-9D83-0000B609FE3E@yahoo-inc.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
pig-user@hadoop.apache.org is the right place for pig questions.

Alan.

On Jun 26, 2009, at 4:34 PM, pmg wrote:

&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; View this message in context: http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html
&amp;gt; Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Alan Gates</author>
         <guid isPermaLink="false">urn:uuid:%3cFB7B9AE5-CAF0-4DDC-9D83-0000B609FE3E@yahoo-inc-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 17:57:14 -0700</pubDate>
      </item>
      <item>
         <title>Re: Using addCacheArchive</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c24229338.post@talk.nabble.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt; Thanks Chris for your reply!

Well, I could not understand much of what has been discussed on that forum.
I am unaware of Cascading.

My problem is simple - I want a directory to be present in the local working
directory of tasks so that I can access it from my map task in the following
manner:

FileInputStream fin = new FileInputStream(&quot;Config/file1.config&quot;);

where Config is a directory which contains many files/directories, one of which is
file1.config.

It would be helpful to me if you can tell me what statements to use to
distribute a directory to the tasktrackers.
The API doc http://hadoop.apache.org/core/docs/r0.20.0/api/index.html says
that archives are unzipped on the tasktrackers but I want an example of how
to use this in case of a directory.

Thanks,
Akhil

Chris Curtin-2 wrote:
&amp;gt; &amp;gt; Hi,
&amp;gt; &amp;gt; I've found it much easier to write the file to HDFS using the API, then pass
&amp;gt; the 'path' to the file in HDFS as a property. You'll need to remember to
&amp;gt; clean up the file after you're done with it.
&amp;gt; &amp;gt; Example details are in this thread:
&amp;gt; http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6#
&amp;gt; &amp;gt; Hope this helps,
&amp;gt; &amp;gt; Chris
&amp;gt; &amp;gt; On Thu, Jun 25, 2009 at 4:50 PM, akhil1988 &amp;lt;akhilanger@gmail.com&amp;gt; wrote:
&amp;gt; &amp;gt;&amp;gt;
&amp;gt;&amp;gt; Please ask any questions if I am not clear above about the problem I am
&amp;gt;&amp;gt; facing.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; Thanks,
&amp;gt;&amp;gt; Akhil
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; akhil1988 wrote:
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; Hi All!
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; I want a directory to be present in the local working directory of the
&amp;gt;&amp;gt; &amp;gt; task for which I am using the following statements:
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; DistributedCache.addCacheArchive(new URI(&quot;/home/akhil1988/Config.zip&quot;),
&amp;gt;&amp;gt; &amp;gt; conf);
&amp;gt;&amp;gt; &amp;gt; DistributedCache.createSymlink(conf);
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; Here Config is a directory which I have zipped and put at the given
&amp;gt;&amp;gt; &amp;gt;&amp;gt;&amp;gt; location in HDFS
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; I have zipped the directory because the API doc of DistributedCache
&amp;gt;&amp;gt; &amp;gt; (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that
&amp;gt;&amp;gt; the
&amp;gt;&amp;gt; &amp;gt; archive files are unzipped in the local cache directory :
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; DistributedCache can be used to distribute simple, read-only data/text
&amp;gt;&amp;gt; &amp;gt; files and/or more complex types such as archives, jars etc. Archives
&amp;gt;&amp;gt; (zip,
&amp;gt;&amp;gt; &amp;gt; tar and tgz/tar.gz files) are un-archived at the slave nodes.
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; So, from my understanding of the API docs I expect that the Config.zip
&amp;gt;&amp;gt; &amp;gt; file will be unzipped to Config directory and since I have SymLinked
&amp;gt;&amp;gt; them
&amp;gt;&amp;gt; &amp;gt; I can access the directory in the following manner from my map
&amp;gt;&amp;gt; function:
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; FileInputStream fin = new FileInputStream(&quot;Config/file1.config&quot;);
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; But I get the FileNotFoundException on the execution of this statement.
&amp;gt;&amp;gt; &amp;gt; Please let me know where I am going wrong.
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt; &amp;gt; Thanks,
&amp;gt;&amp;gt; &amp;gt; Akhil
&amp;gt;&amp;gt; &amp;gt;
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; --
&amp;gt;&amp;gt; View this message in context:
&amp;gt;&amp;gt; http://www.nabble.com/Using-addCacheArchive-tp24207739p24210836.html
&amp;gt;&amp;gt; Sent from the Hadoop core-user mailing list archive at Nabble.com.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;
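
One thing worth checking, going by the DistributedCache javadoc: when symlinks are enabled, the link in the task's working directory is named after the fragment of the cache URI (the part after '#'). The URI in the quoted code has no fragment, so there may be no link called Config to open. The sketch below uses only java.net.URI to show the difference; CacheLinkName and linkName are illustrative names, not Hadoop API.

```java
import java.net.URI;

final class CacheLinkName {
    /** Returns the URI fragment, which would serve as the symlink name, or null if absent. */
    static String linkName(String uri) {
        return URI.create(uri).getFragment();
    }

    public static void main(String[] args) {
        // URI as used in the quoted code: no fragment, so no link name.
        System.out.println("without fragment: " + linkName("/home/akhil1988/Config.zip"));
        // Same archive with a fragment naming the desired link.
        System.out.println("with fragment: " + linkName("/home/akhil1988/Config.zip#Config"));
    }
}
```

If that is the cause, passing the second form of the URI to DistributedCache.addCacheArchive would be the change to try. This is untested against 0.20.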
&amp;gt; &amp;gt; -- View this message in context: http://www.nabble.com/Using-addCacheArchive-tp24207739p24229338.html
Sent from the Hadoop core-user mailing list archive at Nabble.com. &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>akhil1988</author>
         <guid isPermaLink="false">urn:uuid:%3c24229338-post@talk-nabble-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 17:55:39 -0700</pubDate>
      </item>
      <item>
         <title>Re: FileStatus.getLen(): bug in documentation or bug in implememtation?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c4A455EEF.5030303@yahoo-inc.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Documentation is wrong. Implementation wins.
Could you please file a bug.

Thanks,
--Konstantin

Dima Rzhevskiy wrote:
&amp;gt; Hi all
&amp;gt; I am trying to get the length of a file in Hadoop (RawFilesystem or HDFS).
&amp;gt; The javadoc for org.apache.hadoop.fs.FileStatus.getLen() says that
&amp;gt; this method &quot;return the length of this file, in blocks&quot;
&amp;gt; But the method returns the size in bytes.
&amp;gt;
&amp;gt; Is this a bug in the documentation or the implementation?
&amp;gt; I use hadoop-0.18.3.
&amp;gt;
&amp;gt; Dmitry Rzhevskiy.
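
Since getLen() reports bytes, a block count has to be derived from it. Below is a minimal sketch of that arithmetic, assuming the block size is known (in Hadoop it could come from FileStatus.getBlockSize(); here both values are plain longs so the snippet runs without Hadoop on the classpath). BlockCount and blocksFor are illustrative names.

```java
final class BlockCount {
    /** Number of blocks needed to hold lengthBytes, rounding up to a whole block. */
    static long blocksFor(long lengthBytes, long blockSizeBytes) {
        if (Long.signum(blockSizeBytes) != 1 || Long.signum(lengthBytes) == -1) {
            throw new IllegalArgumentException("length must be non-negative and block size positive");
        }
        // Ceiling division: a partially filled last block still counts as one block.
        return (lengthBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 150 MB file with a 64 MB block size (both values chosen for illustration).
        System.out.println("blocks: " + blocksFor(150 * mb, 64 * mb));
    }
}
```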
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Konstantin Shvachko</author>
         <guid isPermaLink="false">urn:uuid:%3c4A455EEF-5030303@yahoo-inc-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 16:51:11 -0700</pubDate>
      </item>
      <item>
         <title>Hadoop0.20 - Class Not Found exception</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c35a22e220906261643m310d0e5mf8a1418083af8f01@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
I'm getting the following error while starting a MR job:

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
oracle.jdbc.driver.OracleDriver
at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:297)
... 21 more
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.hadoop.mapred.lib.db.DBConfiguration.getConnection(DBConfiguration.java:123)
at org.apache.hadoop.mapred.lib.db.DBInputFormat.configure(DBInputFormat.java:292)
... 21 more

Interestingly, the relevant jar is bundled into the MR job jar and it's also
there in the $HADOOP_HOME/lib directory.

Exactly the same thing worked with 0.19. Not sure what could have changed or what I
broke to cause this error...

Amandeep

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Amandeep Khurana</author>
         <guid isPermaLink="false">urn:uuid:%3c35a22e220906261643m310d0e5mf8a1418083af8f01@mail-gmail-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 16:43:04 -0700</pubDate>
      </item>
      <item>
         <title>Re: Can I post pig questions on this forum?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c69035570906261636n52e9c462sa1092fead9b291c5@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
pig-user@hadoop.apache.org

On Fri, Jun 26, 2009 at 4:34 PM, pmg &amp;lt;parmod.mehta@gmail.com&amp;gt; wrote:
&amp;gt;
&amp;gt;
&amp;gt; --
&amp;gt; View this message in context: http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html
&amp;gt; Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
&amp;gt;
&amp;gt;

--
get hadoop: cloudera.com/hadoop
online training: cloudera.com/hadoop-training
blog: cloudera.com/blog
twitter: twitter.com/cloudera &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Christophe Bisciglia</author>
         <guid isPermaLink="false">urn:uuid:%3c69035570906261636n52e9c462sa1092fead9b291c5@mail-gmail-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 16:36:34 -0700</pubDate>
      </item>
      <item>
         <title>Can I post pig questions on this forum?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c24228728.post@talk.nabble.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt; -- View this message in context: http://www.nabble.com/Can-I-post-pig-questions-on-this-forum--tp24228728p24228728.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com. &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>pmg</author>
         <guid isPermaLink="false">urn:uuid:%3c24228728-post@talk-nabble-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 16:34:05 -0700</pubDate>
      </item>
      <item>
         <title>FileStatus.getLen(): bug in documentation or bug in implememtation?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7ec9beea0906261618t2af0a262tedd70aa9993657b5@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hi all
I am trying to get the length of a file in Hadoop (RawFilesystem or HDFS).
The javadoc for org.apache.hadoop.fs.FileStatus.getLen() says that
this method &quot;return the length of this file, in blocks&quot;
But the method returns the size in bytes.

Is this a bug in the documentation or the implementation?
I use hadoop-0.18.3.

Dmitry Rzhevskiy. &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Dima Rzhevskiy</author>
         <guid isPermaLink="false">urn:uuid:%3c7ec9beea0906261618t2af0a262tedd70aa9993657b5@mail-gmail-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 16:18:51 -0700</pubDate>
      </item>
      <item>
         <title>Re: Permissions needed to run RandomWriter ?</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c623d9cf40906261606x56525151wa1b958f772a503fd@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Have you tried to run the example job as the superuser? It seems like this
might be an issue where hadoop.tmp.dir doesn't have the correct
permissions. hadoop.tmp.dir and dfs.data.dir should be owned by the unix
user running your Hadoop daemons and be owner-writable and readable.

Can you confirm this is the case?

Thanks,

Alex

On Fri, Jun 26, 2009 at 1:29 PM, Mulcahy, Stephen
&amp;lt;stephen.mulcahy@deri.org&amp;gt; wrote:

&amp;gt; [Apologies for the top-post, sending this from a dodgy webmail client]
&amp;gt;
&amp;gt; Hi Alex,
&amp;gt;
&amp;gt; My hadoop-site.xml is as follows,
&amp;gt;
&amp;gt; &amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;
&amp;gt; &amp;lt;?xml-stylesheet type=&quot;text/xsl&quot; href=&quot;configuration.xsl&quot;?&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;!-- Put site-specific property overrides in this file. --&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;configuration&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;hadoop01:9001&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;hdfs://hadoop01:9000&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;hadoop.tmp.dir&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;/data1/hadoop-tmp/&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;/data1/hdfs,/data2/hdfs&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt; &amp;lt;/configuration&amp;gt;
&amp;gt;
&amp;gt; Any comments welcome,
&amp;gt;
&amp;gt; -stephen
&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; -----Original Message-----
&amp;gt; From: Alex Loddengaard [mailto:alex@cloudera.com]
&amp;gt; Sent: Fri 26/06/2009 18:32
&amp;gt; To: core-user@hadoop.apache.org
&amp;gt; Subject: Re: Permissions needed to run RandomWriter ?
&amp;gt;
&amp;gt; Hey Stephen,
&amp;gt;
&amp;gt; What does your hadoop-site.xml look like? The Exception is in
&amp;gt; java.io.UnixFileSystem, which makes me think that you're actually creating
&amp;gt; and modifying directories on your local file system instead of HDFS. Make
&amp;gt; sure &quot;fs.default.name&quot; looks like &quot;hdfs://your-namenode.domain.com:PORT&quot;.
&amp;gt;
&amp;gt; Alex
&amp;gt;
&amp;gt; On Fri, Jun 26, 2009 at 4:40 AM, stephen mulcahy
&amp;gt; &amp;lt;stephen.mulcahy@deri.org&amp;gt;wrote:
&amp;gt;
&amp;gt; &amp;gt; Hi,
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; I've just installed a new test cluster and I'm trying to give it a quick
&amp;gt; &amp;gt; smoke test with RandomWriter and Sort.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; I can run these fine with the superuser account. When I try to run them
&amp;gt; as
&amp;gt; &amp;gt; another user I run into problems even though I've created the output
&amp;gt; &amp;gt; directory and given permissions to the other user to write to this
&amp;gt; &amp;gt; directory. i.e.
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; 1. smulcahy@hadoop01:~$ hadoop fs -mkdir /foo
&amp;gt; &amp;gt; mkdir: org.apache.hadoop.fs.permission.AccessControlException: Permission
&amp;gt; &amp;gt; denied: user=smulcahy, access=WRITE, inode=&quot;&quot;:hadoop:supergroup:rwxr-xr-x
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; OK - we don't have permissions anyways
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; 2. hadoop@hadoop01:/$ hadoop fs -mkdir /foo
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; OK
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; 3. hadoop fs -chown -R smulcahy /foo
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; OK
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; 4. smulcahy@hadoop01:~$ hadoop fs -mkdir /foo/test
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; OK
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; 5. smulcahy@hadoop01:~$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar
&amp;gt; &amp;gt; randomwriter /foo
&amp;gt; &amp;gt; java.io.IOException: Permission denied
&amp;gt; &amp;gt; at java.io.UnixFileSystem.createFileExclusively(Native Method)
&amp;gt; &amp;gt; at java.io.File.checkAndCreate(File.java:1704)
&amp;gt; &amp;gt; at java.io.File.createTempFile(File.java:1793)
&amp;gt; &amp;gt; at org.apache.hadoop.util.RunJar.main(RunJar.java:115)
&amp;gt; &amp;gt; at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
&amp;gt; &amp;gt; at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
&amp;gt; &amp;gt; at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
&amp;gt; &amp;gt; at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Any suggestions on why step 5. is failing even though I have write
&amp;gt; &amp;gt; permissions to /foo - do I need permissions on some other directory also
&amp;gt; or
&amp;gt; &amp;gt; ... ?
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; Thanks,
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; -stephen
&amp;gt; &amp;gt;
&amp;gt; &amp;gt; --
&amp;gt; &amp;gt; Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
&amp;gt; &amp;gt; NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
&amp;gt; &amp;gt; http://di2.deri.ie http://webstar.deri.ie http://sindice.com
&amp;gt; &amp;gt;
&amp;gt;
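
The stack trace in the original post fails inside File.createTempFile, which is a local file system call, so one way to test the permissions theory is to try the same call as the affected user. The sketch below is a hypothetical probe (TmpDirProbe and canCreateTempFile are illustrative names); the directory to check is passed as the first argument and is assumed to be whatever hadoop.tmp.dir points at, for example /data1/hadoop-tmp/ from the quoted config.

```java
import java.io.File;
import java.io.IOException;

final class TmpDirProbe {
    /** Returns true if a temp file can be created and then removed in dir. */
    static boolean canCreateTempFile(File dir) {
        try {
            File probe = File.createTempFile("probe", ".tmp", dir);
            return probe.delete();
        } catch (IOException e) {
            // Thrown when the directory is missing or not writable by this user.
            return false;
        }
    }

    public static void main(String[] args) {
        // Falls back to the JVM temp directory when no path is given.
        String path = args.length == 0 ? System.getProperty("java.io.tmpdir") : args[0];
        System.out.println(path + " usable for temp files: " + canCreateTempFile(new File(path)));
    }
}
```

Running it once as the superuser and once as the other user should show whether the two accounts differ on that directory.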
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Alex Loddengaard</author>
         <guid isPermaLink="false">urn:uuid:%3c623d9cf40906261606x56525151wa1b958f772a503fd@mail-gmail-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 16:06:28 -0700</pubDate>
      </item>
      <item>
         <title>difference between 'hadoop.tmp.dir' &amp; 'mapred.temp.dir'</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cBLU132-W24F5519614E0594BB1F9D0A0350@phx.gbl%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt; Hi,

Can somebody kindly explain the difference between 'hadoop.tmp.dir' &amp;amp; 'mapred.temp.dir'?
I am trying to figure out where the intermediate temporary files are stored for a mapreduce
job.

Thanks,
--umer &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>umer arshad</author>
         <guid isPermaLink="false">urn:uuid:%3cBLU132-W24F5519614E0594BB1F9D0A0350@phx-gbl%3e</guid>
         <pubDate>Fri, 26 Jun 2009 15:03:44 -0700</pubDate>
      </item>
      <item>
         <title>Re: Error in Cluster Startup: NameNode is not formatted</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c0BE223DB-0149-4597-BA74-1A13BDC9C386@cloudera.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
The property &quot;dfs.name.dir&quot; allows you to control where Hadoop writes NameNode metadata.

You should have a property like

&amp;lt;property&amp;gt;
&amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;
&amp;lt;value&amp;gt;/data/zhang/hadoop/name/data&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;

to make sure the NameNode data isn't being deleted when you delete the files in /tmp.

-Matt

On Jun 26, 2009, at 2:33 PM, Boyu Zhang wrote:

&amp;gt; Matt,
&amp;gt;
&amp;gt; Thanks a lot for your reply! I did format the namenode. But I got the
&amp;gt; same error again. And actually I successfully ran the example jar file once,
&amp;gt; but after that one time, I couldn't get it to run again. I clean the /tmp dir
&amp;gt; every time before I format the namenode again (I am just testing it, so I don't
&amp;gt; worry about losing data :). Still, I got the same error when I execute
&amp;gt; bin/start-dfs.sh. I checked my conf, and I can't figure out why. Here is my
&amp;gt; conf file:
&amp;gt;
&amp;gt; I would really appreciate it if you could take a look at it. Thanks a lot.
&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;configuration&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;hdfs://hostname1:9000&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;hostname2:9001&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;dfs.data.dir&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;/data/zhang/hadoop/dfs/data&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;description&amp;gt;Determines where on the local filesystem an DFS data node
&amp;gt; should store its blocks. If this is a comma-delimited
&amp;gt; list of directories, then data will be stored in all named
&amp;gt; directories, typically on different devices.
&amp;gt; Directories that do not exist are ignored.
&amp;gt; &amp;lt;/description&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; &amp;lt;property&amp;gt;
&amp;gt; &amp;lt;name&amp;gt;mapred.local.dir&amp;lt;/name&amp;gt;
&amp;gt; &amp;lt;value&amp;gt;/data/zhang/hadoop/mapred/local&amp;lt;/value&amp;gt;
&amp;gt; &amp;lt;description&amp;gt;The local directory where MapReduce stores intermediate
&amp;gt; data files. May be a comma-separated list of
&amp;gt; directories on different devices in order to spread disk i/o.
&amp;gt; Directories that do not exist are ignored.
&amp;gt; &amp;lt;/description&amp;gt;
&amp;gt; &amp;lt;/property&amp;gt;
&amp;gt; &amp;lt;/configuration&amp;gt;
&amp;gt;
&amp;gt;
&amp;gt; -----Original Message-----
&amp;gt; From: Matt Massie [mailto:matt@cloudera.com]
&amp;gt; Sent: Friday, June 26, 2009 4:31 PM
&amp;gt; To: core-user@hadoop.apache.org
&amp;gt; Subject: Re: Error in Cluster Startup: NameNode is not formatted
&amp;gt;
&amp;gt; Boyu-
&amp;gt;
&amp;gt; You didn't do anything stupid. I've forgotten to format a NameNode
&amp;gt; too myself.
&amp;gt;
&amp;gt; If you check the QuickStart guide at
&amp;gt; http://hadoop.apache.org/core/docs/current/quickstart.html
&amp;gt; you'll see that formatting the NameNode is the first of the
&amp;gt; Execution section (near the bottom of the page).
&amp;gt;
&amp;gt; The command to format the NameNode is:
&amp;gt;
&amp;gt; hadoop namenode -format
&amp;gt;
&amp;gt; A warning though, you should only format your NameNode once. Just
&amp;gt; like formatting any filesystem, you can lose data if you (re)format.
&amp;gt;
&amp;gt; Good luck.
&amp;gt;
&amp;gt; -Matt
&amp;gt;
&amp;gt; On Jun 26, 2009, at 1:25 PM, Boyu Zhang wrote:
&amp;gt;
&amp;gt;&amp;gt; Hi all,
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; I am a student and I am trying to install the Hadoop on a cluster, I
&amp;gt;&amp;gt; have
&amp;gt;&amp;gt; one machine running namenode, one running jobtracker, two slaves.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; When I run the /bin/start-dfs.sh , there is something wrong with my
&amp;gt;&amp;gt; namenode, it won't start. Here is the error message in the log file:
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization
&amp;gt;&amp;gt; failed.
&amp;gt;&amp;gt; java.io.IOException: NameNode is not formatted.
&amp;gt;&amp;gt; at
&amp;gt;&amp;gt; org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:243)
&amp;gt;&amp;gt; at
&amp;gt;&amp;gt; org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
&amp;gt;&amp;gt; at
&amp;gt;&amp;gt; org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
&amp;gt;&amp;gt; at
&amp;gt;&amp;gt; org.apache.hadoop.dfs.FSNamesystem.&amp;lt;init&amp;gt;(FSNamesystem.java:273)
&amp;gt;&amp;gt; at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
&amp;gt;&amp;gt; at org.apache.hadoop.dfs.NameNode.&amp;lt;init&amp;gt;(NameNode.java:193)
&amp;gt;&amp;gt; at org.apache.hadoop.dfs.NameNode.&amp;lt;init&amp;gt;(NameNode.java:179)
&amp;gt;&amp;gt; at
&amp;gt;&amp;gt; org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
&amp;gt;&amp;gt; at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; I think it is something stupid i did, could somebody help me out?
&amp;gt;&amp;gt; Thanks a
&amp;gt;&amp;gt; lot!
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; Sincerely,
&amp;gt;&amp;gt;
&amp;gt;&amp;gt; Boyu Zhang
&amp;gt;
&amp;gt; &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Matt Massie</author>
         <guid isPermaLink="false">urn:uuid:%3c0BE223DB-0149-4597-BA74-1A13BDC9C386@cloudera-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 14:45:13 -0700</pubDate>
      </item>
      <item>
         <title>Scaling out/up or a mix</title>
         <link>http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3c7e536b1f0906261443w644fe56dl6ee4407f38ff67bd@mail.gmail.com%3e</link>
         <description>&lt;div&gt;
&lt;pre&gt;
Hi.

We have a deployment of 10 hadoop servers and I now need more mapping
capability (no, not just adding more mappers per instance) since I have so many
jobs running. Now I am wondering what I should aim for...
Memory, cpu or disk... How long is a rope perhaps you would say ?

A typical server is currently using about 15-20% cpu today on a quad-core
2.4GHz 8GB RAM machine with 2 RAID1 SATA 500GB disks.

Some specs below.
&amp;gt; mpstat 2 5
Linux 2.6.24-19-server (mapreduce2) 06/26/2009

11:36:13 PM  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
11:36:15 PM  all  22.82   0.00  3.24     1.37  0.62   2.49    0.00  69.45  8572.50
11:36:17 PM  all  13.56   0.00  1.74     1.99  0.62   2.61    0.00  79.48  8075.50
11:36:19 PM  all  14.32   0.00  2.24     1.12  1.12   2.24    0.00  78.95  9219.00
11:36:21 PM  all  14.71   0.00  0.87     1.62  0.25   1.75    0.00  80.80  8489.50
11:36:23 PM  all  12.69   0.00  0.87     1.24  0.50   0.75    0.00  83.96  5495.00
Average:     all  15.62   0.00  1.79     1.47  0.62   1.97    0.00  78.53  7970.30

What I am thinking is... Is it wiser to go for many of these cheap boxes
with 8GB of RAM or should I for instance focus on machines which can give
more I/O throughput ?

I know that these things are hard but perhaps someone has already drawn some
conclusions the pragmatic way.

Kindly

//Marcus

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/ &lt;/pre&gt;
&lt;/div&gt;</description>
         <author>Marcus Herou</author>
         <guid isPermaLink="false">urn:uuid:%3c7e536b1f0906261443w644fe56dl6ee4407f38ff67bd@mail-gmail-com%3e</guid>
         <pubDate>Fri, 26 Jun 2009 14:43:42 -0700</pubDate>
      </item>
   </channel>
</rss>
<!-- fe4.pipes.sp1.yahoo.com uncompressed/chunked Sun Nov 22 15:45:33 PST 2009 -->
