
I would like to use the hadoop-streaming functionality with Perl scripts as the mapper and reducer. I found an explanation that partially answers my question, but it does not cover the reducer receiving all values for each key together.

For example, the mapper might extract (product, category) pairs, and the reducer would output the list of categories for each product. This is of course possible by saving all reducer data in memory (as in the example I mentioned above), but in many cases that is not scalable. Is there a way to let the Perl script get all values for each key at once (as in normal map-reduce jobs)?
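For reference, hadoop-streaming sorts the mapper output by key before it reaches the reducer, so all values for one key arrive on contiguous lines and a plain Perl reducer can stream through them without buffering the whole data set. A minimal sketch, assuming tab-separated `product\tcategory` input lines (the helper name `stream_reduce` is my own, not from the question):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hadoop streaming delivers reducer input sorted by key, so all
# values for one key are contiguous. Only the current key's state
# is held in memory, never the whole data set.
sub stream_reduce {
    my @out;
    my ( $current_key, @categories );
    for my $line (@_) {
        chomp( my $l = $line );
        my ( $key, $value ) = split /\t/, $l, 2;
        if ( defined $current_key && $key ne $current_key ) {
            # Key changed: flush the finished key's category list.
            push @out, join( "\t", $current_key, join( ",", @categories ) );
            @categories = ();
        }
        $current_key = $key;
        push @categories, $value;
    }
    # Flush the final key, if any input was seen.
    push @out, join( "\t", $current_key, join( ",", @categories ) )
        if defined $current_key;
    return @out;
}

# Emit one "product<TAB>category1,category2,..." line per product.
print "$_\n" for stream_reduce(<STDIN>);
```

This is the same streaming pattern the CPAN module's value iterator wraps for you; the script above just does the key-change bookkeeping by hand.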

1 Answer


You can use the CPAN library Hadoop::Streaming:

sub reduce {
    my ( $self, $key, $value_iterator ) = @_;
    ...
    while ( $value_iterator->has_next() ) { ... }
    $self->emit( $key, $composite_value );
}

2 Comments

Just to make sure I understand your answer: I should add `use Hadoop::Streaming;` to my script, then put my Perl code in the `reduce` sub (handling the relevant key and values). I assume the default key-value separator is a tab. Is this correct?
Yes, you are correct. Note: the Hadoop::Streaming Perl library must be installed on all tasktracker nodes.
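For completeness, a job submission might look like the following (the streaming jar location and HDFS paths are hypothetical; `-file` ships the scripts to the tasktracker nodes, but the Hadoop::Streaming CPAN module itself must still be installed on each node):

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /user/me/products \
    -output /user/me/categories \
    -mapper  mapper.pl \
    -reducer reducer.pl \
    -file mapper.pl \
    -file reducer.pl
```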

